Power flow computation using field programmable gate arrays by Vachranukunkiet, Petya
Power Flow Computation 
Using Field Programmable Gate Arrays 
 
A Thesis 
Submitted to the Faculty 
of 
Drexel University 
by 
Petya Vachranukunkiet 
in partial fulfillment of the 
requirements for the degree 
of 
Doctor of Philosophy 
June 2007
© Copyright 2007 
Petya Vachranukunkiet.  All Rights Reserved. 
ii 
Dedications 
 
 
To my loving wife for her continuous support 
iii 
Acknowledgments 
 
First and foremost I would like to express my heart felt gratitude to my advisors Prawat 
Nagvajara and Jeremy Johnson for providing the opportunity and the guidance necessary 
to complete this work.  Their teaching, assistance, and encouragement have been 
instrumental in the research that I accomplished throughout my graduate life at Drexel 
University.   
I would like to thank my wife for enduring with me and supporting all of my efforts on 
a daily basis.  Her love and constant encouragement gave me the drive to continue 
onward. 
Special thanks go out to Dr. Chika Nwankpa for providing additional knowledge, 
direction, and insight during my time working on the Drexel PowerGrid project.  His aid 
was invaluable and I used many of his suggestions in shaping the form of my work. 
Finally I would like to thank my thesis advisory committee for participating in my 
dissertation defense. 
iv 
TABLE OF CONTENTS 
 
LIST OF TABLES............................................................................................................ vii 
LIST OF FIGURES ......................................................................................................... viii 
ABSTRACT...................................................................................................................... xii 
CHAPTER 1. INTRODUCTION ....................................................................................... 1 
1.1 Motivation........................................................................................................... 1 
1.1.1 Contingency Analysis ................................................................................. 2 
1.2 Background......................................................................................................... 5 
1.2.1 Power Flow ................................................................................................. 5 
1.2.2 Computation Technologies ......................................................................... 9 
1.2.3 FPGA Architecture ................................................................................... 13 
1.3 Objective ........................................................................................................... 18 
1.4 Approach........................................................................................................... 18 
1.5 Contribution ...................................................................................................... 19 
1.6 Organization...................................................................................................... 20 
CHAPTER 2. POWER FLOW COMPUTATION........................................................... 21 
2.1 Introduction....................................................................................................... 21 
2.2 Gauss Power Flow Method............................................................................... 21 
2.3 Newton Power Flow Method............................................................................ 24 
CHAPTER 3. SPARSE LINEAR SOLVERS .................................................................. 32 
3.1 Introduction....................................................................................................... 32 
3.2 Iterative Linear Solvers..................................................................................... 33 
3.3 Direct Linear Solvers ........................................................................................ 37 
v 
3.4 Direct Sparse LU Decomposition ..................................................................... 42 
3.4.1 Sparse Matrix Storage............................................................................... 43 
3.4.2 Ordering Sparse Matrices for Fill-In Reduction ....................................... 45 
3.4.3 Ordering Sparse Matrices for Parallelism................................................. 52 
CHAPTER 4. HARDWARE DESIGN............................................................................. 57 
4.1 Introduction....................................................................................................... 57 
4.2 Design Specification ......................................................................................... 59 
4.3 Analysis of LU Decomposition ........................................................................ 62 
4.4 Top Level Design.............................................................................................. 75 
4.5 Floating Point Units .......................................................................................... 76 
4.5.1 Floating point addition.............................................................................. 79 
4.5.2 Floating point multiplication..................................................................... 80 
4.5.3 Floating point division .............................................................................. 82 
4.6 Cache................................................................................................................. 85 
4.7 Pivot Logic........................................................................................................ 94 
4.8 Sub-matrix Update Logic.................................................................................. 98 
4.9 Hardware Debugging ...................................................................................... 107 
CHAPTER 5. BENCHMARK RESULTS ..................................................................... 108 
5.1 Introduction..................................................................................................... 108 
5.2 Benchmark System ......................................................................................... 108 
5.3 Hardware Prototype ........................................................................................ 109 
5.4 Performance Modeling.................................................................................... 118 
5.5 Parallelism....................................................................................................... 125 
vi 
5.6 Application...................................................................................................... 133 
CHAPTER 6. CONCLUSIONS AND FUTURE WORK.............................................. 138 
6.1 Summary of Contributions.............................................................................. 138 
6.2 Future Work .................................................................................................... 138 
LIST OF REFERENCES................................................................................................ 140 
Vita.................................................................................................................................. 148 
vii 
LIST OF TABLES 
 
Table 1. Power Flow Computation Applications................................................................ 2 
Table 2.  Sparse LU Decomposition Software Efficiency................................................ 58 
Table 3.  Benchmark Power System Jacobian Matrix Summary...................................... 63 
Table 4.  NNZ in L + U with Column Ordering ............................................................... 66 
Table 5.  NNZ in L + U with Symmetrical Ordering ....................................................... 66 
Table 6.  NNZ in L + U with Symmetrical Ordering and Diagonal Pivoting .................. 66 
Table 7.  Floating Point Operation Counts ....................................................................... 68 
Table 8.  Number of Entries Rejected for Pivot Column.................................................. 69 
Table 9.  Pivot Row Size Statistics ................................................................................... 73 
Table 10.  Pivot Column Size Statistics............................................................................ 73 
Table 11.  Compilation Results for Floating Point Units on Altera EP1S25F780C5 FPGA
........................................................................................................................................... 78 
Table 12.  Solve Time in Seconds for Sparse Linear Solver Packages .......................... 109 
Table 13.  Number of Clock Cycles to Perform Sparse LU Decomposition with Two 
Merge Units .................................................................................................................... 116 
Table 14.  Number of Clock Cycles to Perform Sparse LU Decomposition with One 
Merge Unit ...................................................................................................................... 116 
 
viii 
LIST OF FIGURES 
 
Figure 1.  Major Components of On-Line Security Analysis............................................. 4 
Figure 2. Comparison of VLSI Technologies................................................................... 10 
Figure 3.  VLSI Computational Technology Design Flow Summary .............................. 10 
Figure 4. General Field Programmable Gate Array Architecture..................................... 14 
Figure 5. Six-input LUT with Fast Carry Logic and Flip-Flop ........................................ 16 
Figure 6. Four Input Lookup Table with SRAM Configuration....................................... 17 
Figure 7. Graphical Illustration of Newton’s Method for an Arbitrary Function............. 25 
Figure 8.  Newton Power Flow Computation Flow Chart and Breakdown...................... 31 
Figure 9. Basic Gaussian Elimination Pseudocode........................................................... 38 
Figure 10. jki Gaussian Elimination Pseudocode and Inner Loop Data Dependency ...... 39 
Figure 11. kji Gaussian Elimination Pseudocode and Inner Loop Data Dependency ...... 40 
Figure 12. ijk Gaussian Elimination Pseudocode and Inner Loop Data Dependency Part 1
........................................................................................................................................... 41 
Figure 13. ijk Gaussian Elimination Pseudocode and Inner Loop Data Dependency Part 2
........................................................................................................................................... 42 
Figure 14. Sparse Matrix and Corresponding Compressed Storage ................................. 45 
Figure 15.  Sparse Matrix, Graph Representation, and Post LU Filled Matrix ................ 49 
Figure 16. Basic Minimum Degree Algorithm................................................................. 50 
Figure 17. Minimum Degree Ordered Sparse LU Decomposition ................................... 51 
Figure 18. Matrix in BBD Form ....................................................................................... 54 
Figure 19.  Ordered 1648 Bus Jacobian Matrices Before LU Decomposition ................. 63 
Figure 20.  L+U Matrix for Ordered 1648 Bus Jacobian Matrices................................... 64 
ix 
Figure 21.  Ratio of Matching Row Indices for Successive Updates to Total Pivot 
Column Size...................................................................................................................... 70 
Figure 22.  Pivot Row Size for Sub-Matrix Update.......................................................... 71 
Figure 23.  Pivot Column Size for Sub-Matrix Update .................................................... 72 
Figure 24.  Jacobian Matrix Data Transfer Time Relative to Pentium 4 Solve Time ...... 74 
Figure 25.  LU Result Data Transfer Time Relative to Pentium 4 Solve Time................ 75 
Figure 26.  High Level Block Diagram of Sparse LU Hardware ..................................... 75 
Figure 27.  IEEE-754 Single Precision 32-bit Format...................................................... 77 
Figure 28.  Floating Point Addition Unit .......................................................................... 80 
Figure 29.  Floating Point Multiplication Unit ................................................................. 82 
Figure 30.  Floating Point Division Unit .......................................................................... 85 
Figure 31.  Top Level Cache Block Diagram................................................................... 89 
Figure 32.  Cache Controller............................................................................................. 90 
Figure 33.  Cache Tag Registers and Tag Array............................................................... 91 
Figure 34.  Cache MSHR.................................................................................................. 93 
Figure 35.  Cache Serial/Parallel Conversion Units ......................................................... 94 
Figure 36.  Colmap Unit ................................................................................................... 96 
Figure 37.  Swap Unit ....................................................................................................... 97 
Figure 38.  Pivot Unit........................................................................................................ 98 
Figure 39.  Sub-Matrix Update Filter and Merge_Memory Units.................................... 99 
Figure 40.  LU Floating Point Multiply and Divide Units.............................................. 101 
Figure 41.  Sub-Matrix Update Merge Unit.................................................................... 103 
Figure 42.  Sub-Matrix Update Test Unit ....................................................................... 104 
x 
Figure 43.  First-In-First-Out Buffer............................................................................... 105 
Figure 44.  Sub-Matrix Update Select Unit .................................................................... 106 
Figure 45.  Tsunami Sparse LU Hardware Prototype..................................................... 111 
Figure 46.  Sparse LU Hardware Prototype Operation................................................... 113 
Figure 47.  Sparse LU Hardware Prototype Resource Utilization.................................. 114 
Figure 48.  Sparse LU Hardware Computation Efficiency............................................. 117 
Figure 49.  Frequency Scaling of Sparse LU Hardware Prototype with Dual Merge Units
......................................................................................................................................... 118 
Figure 50.  Sparse LU Hardware Prototype Performance Normalized to Software Model
......................................................................................................................................... 119 
Figure 51.  Projected Multiple Merge Unit Scaling (a) 1648 Bus (b) 7917 Bus (c) 10278 
Bus .................................................................................................................................. 120 
Figure 52.  Relative Time Spent During LU Decomposition ......................................... 121 
Figure 53.  Diagonal Pivoting Speedup vs. Number of Merge Units by Power System 122 
Figure 54.  Scaling of UMFPACK and LU Hardware with Respect to Matrix Size...... 123 
Figure 55.  Cache Row Hit Rate vs. Number of Rows in the Cache by Power System 
Matrix.............................................................................................................................. 124 
Figure 56.  Flops vs. Number of Clock Cycles for LU Decomposition ......................... 125 
Figure 57.  Sparse LU Hardware Projected Speedup Relative to Single Merge Unit .... 127 
Figure 58.  Two Level Recursively Partitioned BBD Sparse Matrix ............................. 129 
Figure 59.  Tree Breakdown of Recursively Partitioned BBD Sparse Matrix................ 130 
Figure 60.  Coarse-grained Parallel Decomposition Speedup for Sparse LU Hardware 131 
xi 
Figure 61.  Speedup of Combined Coarse-grain and Fine-grain Parallel LU 
Decomposition ................................................................................................................ 132 
Figure 62.  Breakdown of Newton Power Flow Computation Using FPGA Based 
Accelerator...................................................................................................................... 134 
Figure 63.  Power Flow Computation without BBD ...................................................... 136 
Figure 64.  Power Flow Computation with BBD ........................................................... 136 
Figure 65.  Speedup in Power Flow without Host Bottlenecks ...................................... 137 
 
 
xii 
ABSTRACT 
Power Flow Computation  
Using Field Programmable Gate Arrays 
Petya Vachranukunkiet 
Prawat Nagvajara, Ph.D. 
Jeremy Johnson, Ph.D. 
 
Power flow computation is ubiquitous in the operation and planning of power systems.  
Traditional power flow computation is performed using commodity general purpose 
processors that are commonly found in today’s personal computers.  These general 
purpose processors however, are not necessarily designed to provide optimal 
performance for power flow computation.  The goal of these general purpose processors 
is to provide good performance across a wide range of problems, but not necessarily for 
all problems.  Recently, field programmable gate arrays (FPGA) have been identified as 
having the capability to compete with these high frequency processors for floating point 
throughput despite operating at a much lower clock speed.  This research presents the 
methodology used to design an implementation of an FPGA based accelerator for power 
flow computation that provides increased performance over the standard general purpose 
processor solution.  The design follows a fine-grained study of the sparse LU 
decomposition of several benchmark power system matrices.  Performance results for this 
hardware design indicate an order of magnitude improvement in solve time compared to 
a state of the art sparse LU solver package for personal computers.  The limitations of 
fine-grained parallelism are explored and additional performance through higher-level 
parallelism is also presented. 
 
 
 
1 
CHAPTER 1.  INTRODUCTION 
 
1.1 Motivation 
 
Power flow is one of the most common and important computations performed on a 
power transmission system.  The calculation is as basic and essential to the understanding 
of the steady state behavior of a power system as circuit analysis.  The power flow 
computation is ubiquitous in the operation of existing and future planning of power 
systems and finds applications in many practical power system operations, as 
summarized in Table 1.  Power flow is a numerical analysis based on the application of 
Kirchoff’s Current Law (KCL) to a mathematical model of the power system.  Unlike 
traditional circuit analysis, due to the complexity of a power system, power flow 
computation is performed using a simplified representation of the power system.  The 
goal of power flow computation is the calculation of the steady state AC voltage 
magnitude and phase angle at each node and the real and reactive power flowing through 
each line.  Once these values have been determined, other properties of the power system 
can be easily calculated.  Current power flow computational methods are typically 
performed using an off the shelf computer system that is based on a commodity general 
purpose processor as the main computation device.  The time required for a single power 
flow computation is quite modest, less than one second for power systems with thousands 
of buses on today’s personal computer.  In certain applications however, such as real-
time or interactive programs and contingency analysis, many power flow computations 
are required to be completed within a fixed window of time.  As the size of power 
systems have increased over time, the increasing computational load has placed pressure 
2 
on the capabilities of these computer systems to meet the required performance.  This 
research proposes the use of a hardware accelerator based on field programmable gate 
array (FPGA) technology to address the performance needs of power flow computation.  
Compared to traditional methods which only use a general purpose processor, this 
research considers the use of application specific hardware designed to address the 
performance bottlenecks found in load flow computations. 
 
Table 1. Power Flow Computation Applications 
Application Description 
Transmission Planning Check for overloads, voltage problems, and 
to identify locations of network 
reinforcements. 
Contingency Analysis Test for effect of line and/or generator 
outages. 
VAR/Voltage Analysis Evaluate effectiveness of VAR/voltage 
devices. 
Transfer Capability Analysis Test for inter-utility power transfer limits. 
On-Line Control, Security Enhancement Analyze effectiveness of corrective 
measures to alleviate emergencies. 
 
 
1.1.1 Contingency Analysis 
 
Contingency analysis is an important investigation of a power system that is used in 
design, operation, and planning in order to maintain system security (Stott, Alsac et al. 
1987; Balu, Bertram et al. 1992).  Security of a power system entails uninterrupted 
operation of the power system within acceptable limits despite changes in demand or in 
the power distribution network.  In a power system the outage of a single piece of 
3 
equipment, such a transmission line or generator, can cause a cascading outage of other 
equipment the end result of which could be a large outage in the power system.  The idea 
behind contingency analysis is to provide operators with information that allows them to 
avoid or mitigate the effects of a loss of a component in the power system due to 
unforeseen circumstances such as weather or due to planned outages such as equipment 
maintenance. 
A single contingency case consists of the analysis of an outage event such as the loss of 
a generator or a transmission component.  The sheer number of possible events in today’s 
power systems, consisting of thousands of buses and branches, is significant and does not 
allow power system operators to consider all possible multiple outage combinations 
within a reasonable time frame.  In order to reduce the computational load a subset of the 
possible outages is used instead.  A complete analysis of all contingency cases that are 
the result of an outage of a single piece of equipment is commonly referred to as N-1 
contingency analysis, and is one possible criteria used to define system stability.  N-1 
contingency analysis is currently used in the operation of many of today’s power systems. 
Today’s power system security analysis techniques use static “snapshots” of the power 
system to monitor and determine system security during on-line operation.  These static 
pictures of the system are taken at regular intervals and are used to determine if the power 
system is operating in a secure state or if corrective actions are required.  The time 
window for these computations ranges from a few minutes to fifteen minutes.  Any 
longer and the data gathered is stale and the results arising from this data could be 
unreliable.  The use of this quasi-static technique is necessary due to the current 
computational intractability of dynamic analysis which is orders of magnitude more 
4 
complex.  Figure 1 illustrates the computation of on-line power system security analysis, 
and is explained below.   
 
Measurements
Network
Topology
Bus Load
Forecast
State
Estimation
External
Network
Modeling
On - Line
Load
Flow
Contingency
Evaluation
Contingency
Selection
Limit
Checking
Bad Data
Processing
Observability
Analysis
Filtering
Emergency State
Restorative State
Normal
State
Secure State Insecure State  
Figure 1.  Major Components of On-Line Security Analysis 
 
 
The static state of a power system is defined by the bus voltage magnitudes and phase 
angles.  State estimation is a mathematical process used to take the data from noisy real-
time measurements and estimate the value of the power system state variables.  This state 
information is used to perform contingency analysis to determine whether or not the 
system is operating in a secure state.  The results of contingency analysis provides 
operators a scenario of what would happen if a particular event would occur, i.e. the loss 
5 
of lines and/or generators.  This information allows them to take proactive measures that 
would mitigate or eliminate an undesirable state, or could provide information that would 
be used to generate a corrective plan of action should a particular event occur.  
Unfortunately the number of variables in current power systems does not allow 
computation of all possible contingencies within the necessary window of time using 
current computational techniques. 
The approach used for attacking the problem of on-line contingency analysis is 
comprised of three stages: contingency definition, selection, and evaluation.  
Contingency definition is the creation of a list of contingencies with the highest 
possibility of occurring.  This list varies with system topology and load and is normally 
very large.  Contingency selection is the screening of the list generated by contingency 
definition to provide a shorter list of ranked contingency events.  This screening 
eliminates contingency events that cause no violations in the operation of the power 
system and uses approximate techniques in order to provide rapid results.  In the final 
stage the ranked and screened list of contingencies are computed using power flow in 
decreasing order of severity until all contingencies have been calculated or the time 
window has been reached.  In some cases, the final two stages are combined into a single 
process.  The core computation which is the basis and most computationally intensive 
part of contingency analysis is power flow, also commonly called load flow. 
 
1.2 Background 
 
1.2.1 Power Flow 
 
6 
In power flow analysis the power system is modeled as a set of nodes, also called 
buses, interconnected by transmission links, also called branches.  In order to explain 
power flow a brief background on the modeling of the power system as a collection of 
buses and branches is provided.  The bus admittance matrix, or Ybus matrix, is a 
numerical model of the power system that characterizes the behavior of the network 
components such as power lines, transformers, and loads using the complex injected 
currents (I) at the nodes and their relation to the complex node voltages (V), based on 
Kirchoff’s current law (KCL). 
VYI bus=  
 
For a power system with n buses, the Ybus matrix is a symmetric n-by-n matrix.  The 
non-zero pattern of the entries in the Ybus is directly related to the structure of the power 
system.  The diagonal terms of the matrix represent the sum of admittances connected to 
each node, while the off-diagonal terms represent the sum of admittances connected 
between two nodes.  Since the amount of interconnection between nodes in the network 
is low, typically less than four branches per bus, for large power systems the Ybus matrix 
is a sparse matrix.  A sparse matrix has very few non-zero entries O(N) relative to the 
total number of entries possible for the matrix O(N2).  There are many numerical 
techniques that can be utilized to take advantage of this sparsity to reduce computations 
as compared to full dense matrix methods.  Methods for the construction of the Ybus 
matrix can be found in (Brown 1985; Bergen 1986; Heydt 1986). 
In power flow analysis, the quantities used to describe the power injected into and 
removed from the system are the complex power generated at bus i (SGi) and the complex 
7 
power removed at bus i (SDi).  In addition to the power injected and removed at each bus 
in the power system, there is also power flowing between buses i and k through the 
branches (Sik).  Using the principle of conservation of power, the complex bus power at 
the ith bus (Si) is  
∑
=
=−=
n
k
ikDiGii SSSS
1
 
 
Calculating the ith bus power using the relation S=VI* and the Ybus matrix we get 
∑
=
=
n
k
kikii VYVS
1
**  
 
Defining the following relations 
ikikik
kiik
j
ii
jBGY
eVV i
+=
−=
=
θθθ
θ
 
 
and using those definitions in the above ith bus power equation, the result is the real and 
complex power defined as a function of the Ybus and the complex bus voltages for the ith 
bus in the power network. 
( ) ( )[ ]
( ) ( )[ ]kiikkiikn
k
kii
kiikkiik
n
k
kii
iii
BGVVQ
BGVVP
jQPS
θθθθ
θθθθ
−−−=
−+−=
+=
∑
∑
=
=
cossin
sincos
1
1
 
 
The goal of power flow is to solve for the steady state complex power (Si) and complex 
voltage (Vi) at each bus of the power system given some known values.  In the power 
8 
system some buses are supplied power by generators, while others buses do not have 
generators, as a result different information about the power system is considered known 
and unknown at each bus.  At generator buses typically the real power (P) and voltage 
magnitude (|V|) is defined as a known quantity since they can be controlled by the 
generator, also called PV buses.  At buses without generators, or load buses, the real (P) 
and imaginary power (Q) are typically specified based on measurements and statistics, 
also called PQ buses. 
Another necessary consideration for power flow calculation is the lossy nature of the 
power network.  Since we can’t calculate the power transmission losses without first 
knowing the flow through the branches of the power system, one of the generator buses is 
designated to pick up this slack in power generation.  This bus is commonly referred to as 
the slack bus, swing bus, or voltage reference bus.  In addition to the other known 
quantities at generator buses, the voltage phase angle for this bus is assumed to be zero 
degrees, which is nothing more than setting a time reference.  Thus in a given power 
system for power flow there are three different types of buses, the slack bus, PV buses, 
and PQ buses. 
The mathematical formulation of the power flow problem is a set of non-linear sparse 
equations relating complex power to complex voltage.  The problem is sparse due to the 
structure of the power system (Ybus), and non-linear because power is expressed as a 
function of voltage.  The power flow equations are generated using the ith bus complex 
power equation as a function of the complex bus voltages, excluding the slack bus since 
for this bus we assume known quantities for the complex voltage magnitude and phase 
angle.  Once the solution for the complex voltage values has been determined, the real 
9 
and reactive power for the remaining buses can be calculated and the power flow solution 
is determined.  A review of various load flow calculation methods up to the mid 1970s is 
contained in (Stott 1974).  A more detailed overview of the two most commonly used 
methods is provided in chapter 2. 
 
1.2.2 Computation Technologies 
 
There are several classes of competing technologies that fall under the umbrella of very 
large scale integration (VLSI) technology for high performance computing (HPC).  All of 
these technologies allow designers to implement specific algorithms to be computed on 
an integrated circuit or chip.  The selection of a particular technology depends on several 
factors including performance requirements, time to implementation, and associated costs 
such as manufacturing.  A summary of the trade-offs with respect to these factors is 
provided in Figure 2.  Figure 3 provides a summary of the design process for each of the 
competing technologies.  These technologies are not necessarily mutually exclusive and a 
hybrid computational system could utilize one or all of these technologies in concert.  A 
more detailed explanation of the aforementioned technologies follows these figures. 
 
 
10 
Cost
Design Time
Performance
Cheaper
Shorter
Lower
Software FPGA ASIC
More Expensive
Higher
Longer
 
Figure 2. Comparison of VLSI Technologies 
 
 
 
Functional Specification
Coding
Compilation
Verification
Software
Functional Specification
HDL
Synthesis
Place and Route
FPGA
Functional Specification
HDL
Synthesis
Layout
ASIC
Download and Verify Manufacture
Verify in Circuit
 
Figure 3.  VLSI Computational Technology Design Flow Summary 
 
 
 
One class of technology for computing is the general purpose processor, denoted 
software in Figure 2 and Figure 3.  This paradigm for computation is not a VLSI 
11 
implementation of a computation but rather an application of a VLSI device to a 
particular application.  These devices are commodity parts and are easier to design for 
relative to other VLSI technologies.  Implementing a computation on a general purpose 
processor consists of writing software which implements the desired functionality in a 
programming language such as C which is then compiled into machine code which is 
understood by the processor.  The processor which executes this machine code is an 
application specific integrated circuit (ASIC), but the underlying hardware design is 
general purpose which places it in a different category than a fully ASIC VLSI design 
which is only designed to perform one or very few tasks.  These general purpose 
processors are designed to perform well for a variety of tasks but are not optimized at the 
physical level for specific computations, which puts them at the bottom of the 
performance ladder.  Writing software to achieve optimal performance requires tailoring 
the algorithm to fit the performance specifics of the processor that executes the machine 
code.  The same algorithm may perform better or worse on different processors 
depending on the hardware implementation, how the program was written, and the 
quality of the compiler used to generate the machine code, so care must be taken to make 
sure that the software is written appropriately for the processor.  Multi-processor 
computer systems and cluster computing systems are designed to exploit the 
computational power of general purpose processors by using an interconnected array of 
general purpose processors.  Software must be specially written for these multi-processor 
systems in order to fully take advantage of the potential parallelism available in the 
processing array. 
12 
At the high end of the spectrum for cost and performance are application specific VLSI 
devices, “ASIC” in Figure 2.  Devices which fall under this category are designed to 
implement only one particular computation or logic functionality at the physical design 
level.  Utilization of this technology is both costly and time consuming from an 
engineering standpoint, even for a few working prototype devices.  Initial costs to design 
and manufacture a design using cutting a cutting edge manufacturing technology can 
exceed a million dollars which limits the application of this technology to high volume 
markets.  Once the initial engineering costs are paid, ASIC manufacturing technology 
benefits from economies of scale which distributes the large initial cost of production 
across the sale of many units.  ASIC design typically begins with description of the 
desired functionality using a hardware description language (HDL).  This hardware 
description is then converted to a physical equivalent which is then physically laid out for 
integrated circuit (IC) manufacturing.  IC manufacturing is a process that builds the 
device layer by layer using masking technology.  Dozens of masks are required to build 
the final product and are specific to the design.  The manufacturing process, not including 
ASIC design time, takes 1-3 months to manufacture the desired device.  At the end of the 
process, device verification is required and any errors detected in the design are corrected 
by a respin the design.  A respin is a revision of the original IC design with new masks 
that contains any corrections that are necessary to fix errors in the original design.  For 
designs of significant complexity, several respins may be required for full functionality to 
be obtained (Synopsys 2006). 
Fitting in the middle ground are programmable logic devices such as the field 
programmable gate array, “FPGA” in Figure 2.  FPGAs are reconfigurable logic devices 
13 
that allow designers to electrically configure device operation through software. 
Hardware design using FPGAs begins with a description of the desired hardware in a 
HDL such as VHDL or Verilog.  This description is then decomposed via a process 
called synthesis and then mapped to a specific FPGA device family through a process 
called place and route.  The result of the place and route operation is a binary image file 
which can be used to program the target FPGA with the desired functionality.  FPGAs 
can be programmed multiple times with different binary images which makes them well 
suited for tasks such as prototyping.  Performance for these devices is better than 
software since the underlying hardware is targeted to a particular application, but not as 
good as ASIC due to the overhead associated with re-configurability.  Unlike ASIC 
design however, there is not a long lead time between design, test, and revision due to the 
re-configurability of the FPGA.  A designer can simply compile the revised design and 
reprogram the FPGA device without having to wait for revised masks and a new device 
to be manufactured (Thirumalai and Krishnan 2005). 
 
1.2.3 FPGA Architecture 
 
FPGAs are composed of an array of configurable logic blocks (CLBs) connected by a 
programmable interconnection network, and configurable inputs/outputs (Kang and 
Leblebici 2003).  The sum of these parts forms the basic reconfigurable fabric of an 
FPGA, illustrated by Figure 4.  Since the same manufacturing technologies used to make 
integrated circuits are also used to manufacture FPGAs, FPGAs benefit from a periodic 
doubling of device density as predicted by Moore’s Law.  FPGA designers have used 
14 
these increased resources not only for additional reprogrammable logic but have also 
expanded CLB functionality, and incorporated high performance fixed function blocks 
such as binary multipliers, embedded memory blocks, high speed programmable 
input/output devices, and fully functional microprocessors into the reconfigurable fabric.  
(Altera 2006; Xilinx 2007) 
 
Configurable 
Logic Block
Programmable 
Switch Matrix
Vertical 
Routing 
Channel
Horizontal 
Routing 
Channel
 
Figure 4. General Field Programmable Gate Array Architecture 
 
 
 
Configurable logic block (CLB) composition and design varies between different 
FPGA vendors and FPGA families however they share the same basic components and 
design for implementing electrical re-programmability.  A typical CLB, which contains 
one or more lookup tables (LUTs), routing and configuration, enhancements such as a 
fast carry propagation chain, and a flip-flop that can be used to register data 
15 
synchronously.  An illustration of a CLB by Altera is provided in Figure 5.  When 
compared to ASICs, FPGAs are not able to achieve comparable operating frequencies 
when implementing an equivalent logic function due to the higher delays associated with 
re-configurability.  An ASIC design is able to achieve high operating frequency for many 
levels of cascaded logic, but FPGAs can only afford a few levels without degrading 
maximum frequency of operation.  In order to be able to achieve a higher operating 
frequency, FPGA based hardware designs rely on pipelining, the decomposition of a task 
into several smaller tasks, using the registers that are in the CLBs.  Pipelining is not 
exclusive to FPGAs and is commonly used in high performance microprocessors to 
increase performance.  Heavy pipelining however can be detrimental to performance 
when the latency of an operation, which is equal the number of pipeline stages, grows to 
a point where performance suffers due to stalling while waiting for results to exit the 
pipeline.  Key to a successful high performance FPGA based design is balancing the 
trade-off between maximum frequency of operation and latency.  The maximum speed of 
operation is governed by the speed of the CLBs and the routing fabric, and no amount of 
pipelining will allow a design to exceed this limitation.  Today’s highest performance 
FPGAs can operate at clock speeds in excess of 500 MHz. 
FPGAs do have some advantages when compared to ASICs.  Embedded in the 
reconfigurable fabric of today’s FPGAs are blocks of SRAMs.  The implementation of 
many smaller memories in ASICs requires the addition of routing and addressing wires 
which already exist in the reconfigurable FPGA fabric.  The use of registers for 
pipelining is at no additional cost in a FPGA because the registers are already included in 
the architecture “for free”.  The registers in the CLBs can also be used to implement 
16 
small memories, whereas in an ASIC design the insertion of lots of small memories can 
have a significant impact on routing and design size due to the addition of address and 
data lines necessary for memories, which already exist in the FPGA architecture. 
 
 
Figure 5. Six-input LUT with Fast Carry Logic and Flip-Flop 
 
 
 
  The LUT forms the basis for the electrical re-programmability of a CLB.  A LUT is a 
programmable logic device that is able to perform any arbitrary N-bit input logic 
functions.  Today’s FPGAs typically use SRAM to configure the LUTs for 
programmability.  In SRAM based configuration LUTs, the logic function is determined 
by programming a set of SRAM bits called the LUT mask.  To implement an N-bit input 
LUT device, 2N RAM bits and a 2N:1 multiplexer are required.  A 4-bit LUT is illustrated 
in Figure 6 which consists of 16 SRAM bits and a 16:1 multiplexer implemented as a tree 
of 2:1 multiplexers.  The inputs to the LUT (A, B, C, D) control a series of multiplexers 
(MUXs) that are used to decode and select one of the SRAM bits which drives the LUT 
output (Y).  Larger LUTs can also be constructed by using multiple smaller LUTs and 
multiplexers.  The general requirement is that the total number of ram bits in the smaller 
17 
LUTs equals the number required to implement an N-bit LUT.  On FPGAs, complex 
logic functions that require more inputs than a single LUT can provide are implemented 
by combining several LUTs or even several CLBs. 
 
  
Figure 6. Four Input Lookup Table with SRAM Configuration 
 
 
 
The application of FPGAs to high performance floating point computing has been 
previously reported (Shirazi, Walters et al. 1995; Ligon, McMillan et al. 1998; Caliga and 
Barker 2001; Jian, Tessier et al. 2003; Gokul, Zhuo et al. 2004).  Underwood reports that 
scaling trends project that FPGAs will exceed the floating point computational ability of 
general purpose processors, which is a promising trend for the usage of FPGAs in the 
area of floating point computation (Underwood 2004).  The emergence of high 
performance scientific computing cores on FPGA has also been reported (Govindu, Choi 
18 
et al. 2004; Morris and Prasanna 2005; Smith, Vetter et al. 2005; Zhuo and Prasanna 
2005).  Compared to previous work, this research explores the use of FPGA technology 
to implement direct sparse LU decomposition, which has not been previously reported. 
 
1.3 Objective 
 
The objective of this research is to investigate the use of FPGA based application 
specific hardware design to accelerate power flow computation. 
  
1.4 Approach 
 
The most computationally intensive part of power flow computation is the solution of a 
sparse set of linear equations, which is the focus of this effort.  The hardware design is 
based on an empirical study of the direct sparse LU decomposition of power system 
Jacobian matrices.  Direct methods are used due to their high performance and robustness 
compared to other methods.  Since the target application of this hardware is power flow 
computation, the hardware design will be tailored to these particular matrices and may 
not perform well on matrices from other applications.  Based on this empirical study, a 
hardware design will be generated and a model of the hardware’s operation will be 
created.  This model will be used to predict the expected performance of the design and 
will aid in the debug of the hardware implementation.  Having an accurate model of the 
hardware will allow the projection of results without requiring the design and debug of 
additional hardware.  Finally a prototype will be created and implemented on an FPGA 
19 
prototyping board to verify the performance reported by the model and provide a proof of 
concept. 
 
1.5 Contribution 
 
This research is the first application specific hardware design created to perform sparse 
LU decomposition.  As a result of this research the following contributions have been 
made: 
• Methodology to evaluate potential computations for acceleration using an 
FPGA  
o Identification of key algorithm properties useful in the hardware design  
• Prototype hardware accelerator design, implemented on an FPGA, for the LU 
decomposition of sparse matrices using sparse direct methods 
o Partial pivoting and diagonal pivoting 
o Design decisions related to the FPGA implementation 
• Empirical results of Minimum Degree ordering effectiveness for larger (>1000 
buses) power system Jacobian matrices 
• Empirical study of the LU decomposition profile of power system Jacobian 
matrices 
o Fine-grained parallelism 
o Coarse-grained parallelism using recursive dissection 
20 
o Combination fine-grained and coarse-grained parallelism 
 
1.6 Organization 
 
The study and development of a high performance power flow computation requires 
knowledge of the various power flow calculation methods available and the underlying 
computations that are required.  Basic power flow computation methods will be presented 
in chapter 2.  The sparse linear solver techniques applicable to power flow computation 
will be discussed in chapter 3.  In chapter 4, a hardware design for power flow specific 
LU decomposition will be presented including a discussion of the methodology used as 
well as design trade-offs.  Benchmark results for the prototype implementation of this 
hardware design and performance modeling of the design will be provided in chapter 5.  
Finally, in chapter 6, the conclusions of this work and future research will be discussed.   
 
21 
CHAPTER 2.  POWER FLOW COMPUTATION 
 
2.1 Introduction 
 
The two most widely accepted numerical methods for power flow computation are 
Gauss-Jacobi and Newton-Raphson.  Both techniques are iterative methods that converge 
to the desired solution as long as certain conditions are met.  The Gauss-Jacobi method 
will be presented first, including a variation of this method called the Gauss-Seidel 
method.  The second method presented, Newton-Raphson power flow, is the more 
commonly used method due to its superior convergence characteristics.  Performance of 
Newton-Raphson power flow is highly dependent on the performance of the underlying 
linear solver and these will be covered in greater detail in chapter 3. 
 
2.2 Gauss Power Flow Method 
 
The Gauss-Jacobi method or Jacobi method is a method for solving the matrix equation 
Ax=b, for a matrix that is diagonally dominant (Acton 1970).  In this equation, A 
represents a matrix of coefficients with a non-zero diagonal, x represents a vector of 
unknown values, and b is a vector representing the right hand side of the equations.  
Derivation of the Jacobi method begins by examining each of the n equations in isolation.  
The ith equation in the matrix is defined as 
∑
=
=
n
j
ijij bxa
1
 
 
22 
Solving for xi, while assuming the other unknown entries remain fixed results in the 
following equation which describes the Jacobi method. 
[ ]∑ ≠+ −= ij vjiji
ii
v
i xaba
x 11  
 
For a given iteration, indicated by the superscript v, the vth vector of unknowns xv is used 
to generate a new vector containing a refinement of the original values xv+1, until the 
change in the vector of unknowns between iterations is within a specified tolerance.  
Equations are isolated and treated separately which means that the order in which the 
equations are evaluated is irrelevant, allowing highly parallel computation of the 
individual elements in the vector of unknown values.  A modification of this method, 
called the Gauss-Seidel method, uses the updated values immediately after calculation 
within the current iteration.  Since the updated value is typically closer to the solution 
than the previous value, this modification results in a reduced number of iterations for 
convergence.  The reduced number of iterations comes at the price of reduced parallelism 
since the method introduces dependencies that are not present in the Gauss-Jacobi 
method. 
The power flow equations can be solved by application of the Gauss-Jacobi or Gauss-
Seidel method (Bergen 1986).  Solving the ith bus power flow equation for the unknown 
voltage Vi results in the following equation 
⎥⎦
⎤⎢⎣
⎡ −= ∑ ≠ jij ij
i
i
ii
i VYV
S
Y
V *
*1  
 
23 
This equation is in a similar form to the Gauss-Jacobi equation and can be used to 
iteratively solve for the unknown voltages in the power system. 
The biggest benefit to using Gauss-Jacobi iteration is the parallel examination of the 
equations within a given iteration.  Gauss-Seidel also enjoys a high level of parallelism, 
however not as much as exists in an equivalent Gauss-Jacobi iteration due to the 
introduced dependencies designed to reduce the number of iterations required for 
solution.  Another tradeoff in the Gauss-Jacobi method is the storage of two separate 
vectors representing the current and updated vector of unknowns.  Gauss-Jacobi only 
requires the storage of one vector since the updated values are used immediately. 
Both of these Gaussian methods require significantly more iterations, and consequently 
more time when compared to the Newton power flow method.  Even though a single 
iteration for the Newton power flow method takes roughly seven times longer to compute 
than a single iteration of the Gaussian methods, this is more than offset by the number of 
iterations required for convergence.  For the Gauss method the number of iterations is 
approximately O(# of buses), while the number of iterations required by the Newton 
power flow method is generally invariant of system size.  This makes the Newton power 
flow method particularly attractive for larger sized power systems (Tinney and Hart 
1967).  Another issue affecting convergence of the Gauss-Jacobi method was reported by 
Tinney.  He concludes that the Gauss-Jacobi iterative methods have convergence issues 
for what he describes as ill-conditioned systems.  The convergence of the Gaussian 
methods is affected by the diagonal dominance of the matrix which describes the system.  
Matrices which are not diagonally dominant may fail to converge when solved by the 
24 
Gaussian methods.  The use of the more robust Newton power flow algorithm as a 
solution method for these types of systems is suggested. 
One of the first applications of digital computers to iterative power flow computation 
was presented by Ward and Hale (Ward and Hale 1956).  A later study performed on 
several small IEEE test systems, up to 118 buses, using personal computers concludes 
that Gauss-Seidel is slower than Newton-Raphson algorithms (Keyhani, Abur et al. 
1989).  Previous work (Chassin, Armstrong et al. 2006) on the application of FPGAs to 
Gauss-Seidel computation has been reported however the author concludes that their 
approach is not feasible for implementation on today’s FPGAs for large power systems 
due to a lack of sufficient resources.  Another study (Foertsch, Johnson et al. 2005) which 
proposes the use of Gauss-Jacobi power flow on FPGA did not project a significant 
performance improvement over a Newton-Raphson based software solver. 
 
2.3 Newton Power Flow Method 
 
The Newton-Raphson method is an iterative technique for calculating the roots of a 
function by linearizing the function using the derivative of the function and the value of 
the function at a specific point (Acton 1970).  For this method to work the function must 
meet certain conditions near the root, for example the derivative of the function must 
exist and be non-zero.  The attractiveness of the Newton-Raphson method comes from its 
convergence properties.  Close to the root, the Newton-Raphson method provides 
Quadratic convergence, however a starting point that is far form the root can lead to poor 
convergence.  Beginning with an approximation to the desired solution, the method can 
25 
be applied repeatedly until reaching the level of accuracy desired for the solution and is 
illustrated graphically in Figure 7. 
 
 
f(x) tangent line
x xn+1 xn
 
Figure 7. Graphical Illustration of Newton’s Method for an Arbitrary Function 
 
 
 
By extending the tangent of the function, evaluated at some initial value (xn), and 
calculating the zero of this tangent, we receive a new value (xn+1) which is closer to the 
root (x) than the initial value.  Mathematically this is equivalent to calculating the 
intercept of the derivative of the function at that point.  By repeating this process using 
the newly generated value, we can continually improve the accuracy of our initial 
approximation and iteratively converge to the desired result. 
Algebraically the Newton-Raphson method can be derived by using the Taylor-Series 
expansion for a function around a point. 
26 
( ) ( ) ( ) ( ) ( ) ( ) KK +Δ++Δ+Δ+=Δ+
!2
2
'''
n
xxfxxfxxfxfxxf
n
n  
 
Grouping the higher order terms (h.o.t.) beyond the linear term we get 
( ) ( ) ( ) ...' tohxxfxfxxf +Δ+=Δ+  
 
For small Δx, the higher order terms are insignificant and can be ignored.  Since we are 
solving for a root (i.e. f(x) = 0), setting the left hand size to zero we get. 
( )
( )xf
xfx '−=Δ  
 
Using this result we can derive the Newton-Raphson iterative formula, where the 
superscript v denotes the iteration number. 
( )( )v
v
vv
xf
xfxx '
1 −=+  
 
This scalar derivation can be generalized to vector functions as well, where the vector 
Taylor series is a system of scalar Taylor series.  For the vector Taylor expansion we get 
( ) ( ) ( ) ... tohJ ++=+ ΔxxxfΔxxf  
 
where 
( )
( )
( )
( )
( ) ( )
( ) ( ) ⎥⎥
⎥
⎦
⎤
⎢⎢
⎢
⎣
⎡
Δ
Δ
=
⎥⎥
⎥⎥
⎥
⎦
⎤
⎢⎢
⎢⎢
⎢
⎣
⎡
∂
∂
∂
∂
∂
∂
∂
∂
=
⎥⎥
⎥
⎦
⎤
⎢⎢
⎢
⎣
⎡
=
n
n
nn
n
nn
n
x
x
x
xf
x
xf
x
xf
x
xf
xxf
xxf
M
L
MM
L
K
M
K 1
1
1
1
1
1
11
,,
,,
ΔxxJxf  
 
27 
J(x) is the matrix of partial derivatives of f evaluated at x, also called the Jacobian matrix.  
Using this result the Newton-Raphson equation for vector quantities is 
( )[ ] ( )v1vv1v xfxJxx −+ −=  
 
A slightly modified form of this equation using 
v1vv xxΔx −= +  
 
where Δx is the correction factor for each iteration is 
( )vvv xfΔxJ −=⋅  
 
Instead of computing the inverse of the Jacobian matrix as the Newton-Raphson formula 
states, we can reduce the amount of computation by using the modified form and solving 
the system of linear equations. 
The application of Newton-Raphson iteration for solving the power flow equations was 
first reported by (Ness 1959; Ness and Griffin 1961; Tinney and Hart 1967), followed by 
an optimally ordered solution method by (Tinney and Walker 1967).  For Newton power 
flow computation we define three vectors to be a vector of the unknown voltage 
magnitudes (|V|), the unknown phase angles (θ), and the another vector (x) which 
comprises both the unknown voltage magnitudes and phase angles. 
⎥⎦
⎤⎢⎣
⎡=
⎥⎥
⎥
⎦
⎤
⎢⎢
⎢
⎣
⎡
=
⎥⎥
⎥
⎦
⎤
⎢⎢
⎢
⎣
⎡
=
V
θ
xVθ
nn V
V
MM
22
θ
θ
 
 
The complex bus power equations (Pi and Qi) can be expressed as a function of x. 
28 
( ) ( ) ( )[ ]
( ) ( ) ( )[ ]kiikkiikn
k
kii
kiikkiik
n
k
kii
BGVVQ
BGVVP
θθθθ
θθθθ
−−−=
−+−=
∑
∑
=
=
cossin
sincos
1
1
x
x
 
 
We define power mismatch vectors (ΔP(x) and ΔQ(x)) to be the difference between the 
known real and reactive ith bus power and the ith bus power expressed as a function of 
the vector of unknown voltage magnitudes and phase angles (x). 
( )
( )
( )
( )
( )
( )
( ) ( )( )⎥⎦
⎤⎢⎣
⎡=−
⎥⎥
⎥
⎦
⎤
⎢⎢
⎢
⎣
⎡
−
−
=
⎥⎥
⎥
⎦
⎤
⎢⎢
⎢
⎣
⎡
−
−
=
xΔQ
xΔP
xf
x
x
xΔQ
x
x
xΔP
nnnn QQ
QQ
PP
PP
MM
2222
 
 
Since the calculated power should be equal to the known values, we solve for the values 
in the unknown vector x until the mismatch between the known power and the calculated 
power is approximately zero, i.e. f(x) ≈ 0.  The unknown vector of voltage magnitudes 
and phase angles can be iteratively solved for by using the previously derived Newton-
Raphson method until the desired mismatch accuracy is obtained. 
For convenience we partition the Jacobian matrix of the complex power equations into 
four parts.  Placing the equations into Newton-Raphson equation form we get 
( )( )⎥⎦
⎤⎢⎣
⎡=⎥⎦
⎤⎢⎣
⎡⎥⎦
⎤⎢⎣
⎡
v
v
v
v
v
22
v
21
v
12
v
11
xΔQ
xΔP
VΔ
Δθ
JJ
JJ
 
 
The elements of the Jacobian matrix are calculated using the partial derivatives of the 
complex bus power equations.  The Jacobian matrix is a sparse matrix due to its 
formulation being based upon the Ybus matrix, which is itself a sparse matrix.  We use the 
notation 11pqJ  to represent the pq element of the J11 Jacobian sub-matrix.  We also define 
29 
the following relation, qppq θθθ −= , which results in the following equations that 
describe the calculations necessary to determine the entries in the Jacobian matrix. 
( ) ( )
( ) ( )
( ) ( )
( ) ( )pqpqpqpqp
q
p
pq
pqpqpqpqp
q
p
pq
pqpqpqpqqp
q
p
pq
pqpqpqpqqp
q
p
pq
BGV
V
Q
J
BGV
V
P
J
BGVV
Q
J
BGVV
P
J
qp
θθ
θθ
θθθ
θθθ
cossin
sincos
sincos
cossin
 indicesFor 
22
12
21
11
−=∂
∂=
+=∂
∂=
+−=∂
∂=
−=∂
∂=
≠
x
x
x
x
 
 
and 
( )
( )
( )
( )
ppp
p
p
p
p
pp
ppp
p
p
p
p
pp
pppp
p
p
pp
pppp
p
p
pp
VB
V
Q
V
Q
J
VG
V
P
V
P
J
VGP
Q
J
VBQ
P
J
qp
−=∂
∂=
+=∂
∂=
−=∂
∂=
−−=∂
∂=
=
x
x
x
x
22
12
221
211
 indicesFor 
θ
θ
 
 
The calculation of the entries in the Jacobian matrix is based upon the current values 
for the voltage magnitude and phase angle.  These entries are updated based on the new 
values for the voltage magnitude and phase angles after each iteration of Newton-
Raphson power flow for the subsequent iteration.  Several variations on the Newton 
method for power flow have arisen including re-use of the lower and upper triangular 
30 
factors for multiple iterations (Tinney and Hart 1967).  Decoupled Newton power flow 
(Stott 1972; Stott and Alsac 1974) takes advantage of neglecting relatively small 
numerical values by sacrifices convergence rate for a reduced computational burden.  The 
result is a simplified problem where the P-θ and Q-V components of the Jacobian matrix 
are treated separately.  All of these variations still rely on the fundamental Newton power 
flow method, and any improvements that speed up the base Newton power flow 
calculation, also called full-AC Newton power flow, can also be applied to the other 
Newton-based power flow methods. 
Benchmark studies of standard full-AC Newton power flow, see Figure 8, computation 
using a direct linear solution method to solve the Jacobian matrix indicates that the most 
time consuming portion, roughly 85% of the computation time, is in the sparse linear 
solver step of the algorithm.  The data in this figure was generated using the 1648 Bus 
benchmark system presented in chapter 4.  A similar result has also been reported by 
other sources (Feng and Flueck 2002).  Consequently much attention has been given to 
improving the sparse linear solver methods used in Newton power flow in order to reduce 
the computation time.  Unfortunately the serial nature of the algorithm combined with the 
very sparse structure of power system Jacobian matrices has led to an inability to 
parallelize the problem effectively (Feng and Flueck 2002).  This is due to a variety of 
reasons including inefficient processor utilization and the limited amount of work, both 
of which are tied to matrix sparsity. 
 
31 
1. Problem Formulation &
Symbolic Analysis
2. Jacobian Construction
3. LU Decomposition
4. Solve Forward &
Backward Substitution
Data Input
5. Does Solution
Converge ?
Load Flow Solution
YES
Host Processor
(Pentium 4)
Software
LU
88.0%
Solve
8.8%
Symbolic
3.2%
Jacobian
0.0%
 
Figure 8.  Newton Power Flow Computation Flow Chart and Breakdown 
 
 
 
32 
CHAPTER 3.  SPARSE LINEAR SOLVERS 
 
3.1 Introduction 
 
Two classes of techniques exist for solving sparse linear systems of equations, direct 
solvers or iterative solvers.  Iterative solvers rely on a repeated calculation method 
designed to refine an initial starting point until the desired accuracy is achieved.  The 
performance of iterative methods relies highly on pre-conditioning and the iterative 
method used to descend to the desired solution.  High performance iterative solvers 
typically fall under the class of preconditioned Krylov subspace solvers.  These methods 
have varying amounts of computational parallelism available depending on their 
formulation.  Iterative methods do not suffer from performance reducing phenomena such 
as fill-in that can occur in direct methods.  Unfortunately convergence to a solution is not 
guaranteed for iterative methods.  A brief overview of iterative methods and application 
of these methods to power flow will be provided as well as previously reported work. 
Unlike iterative methods, direct solution methods use calculations designed to calculate 
the intended solution.  Direct solvers for the most part are variations on Gaussian 
elimination, the most common algorithm being Lower/Upper triangular (LU) 
decomposition (George and Liu 1981; Duff, Erisman et al. 1989).  LU decomposition is 
competitive with Gaussian elimination regarding the number of floating point 
computations required for solution with the added benefit of being independent of the 
coefficients on the right hand side.  The LU solution can be reused to repeatedly solve 
systems of equations with the same set of coefficients in the unknowns on the left hand 
side of the equation.  Compared to other direct techniques such as matrix inversion, LU 
33 
decomposition can take advantage of sparse matrix methods and also requires fewer 
floating point computations.  Direct LU decomposition of sparse matrices can achieve 
high performance by taking advantage of sparse matrix methods such as using a sparse 
matrix data storage format to reduce storage requirements and skipping calculations 
where the result is known to be zero to reduce the amount of calculation necessary to 
solve the sparse linear system of equations.  Unfortunately, sparse matrix methods do not 
necessarily have large amounts of computational parallelism. 
 
3.2 Iterative Linear Solvers 
 
Iterative solvers use successive approximations to converge to the solution of a linear 
system of equations.  The two classes of iterative methods are stationary and non-
stationary.  Stationary solvers such as the Gauss-Seidel method, which was previously 
detailed as a method used to solve the non-linear power flow equations, rely on 
computations using information that does not vary from iteration to iteration.  These 
methods are typically easy to implement than non-stationary methods, but do not yield 
good performance.  Non-stationary solvers, such as the Conjugate Gradient (CG) method, 
involve computations using information that varies at each iteration.  The key to 
obtaining good convergence for Iterative solvers depends on a transformation matrix, 
called the preconditioner, which is used to modify the original system to obtain favorable 
characteristics.  Iterative solvers such as the CG method typically function by using 
mainly matrix-vector multiplications or matrix transpose-vector multiplications.  Sparse 
matrix-vector multiplication (SpMxV) using FPGAs for the implementation of this 
34 
computation has been previously reported (deLorimier, Andr et al. 2005; Zhuo and 
Prasanna 2005). 
There are many different methods which fall under the umbrella of iterative linear 
solvers.  The conjugate gradient method is presented as an example of an iterative linear 
solver which can be used to solve the system of linear equations found Newton power 
flow.  Use of the conjugate gradient method and other iterative methods in Newton power 
flow has been previously reported (Semlyen 1996; Pai and Dag 1997; Alves, Asada et al. 
1999).  Implementation of a FPGA based Jacobi iterative solver on FPGA has also been 
reported (Morris and Prasanna 2005). 
The conjugate gradient method is an algorithm that can be used to solve the symmetric 
positive definite matrix of a system of linear equations.  This method uses conjugate 
directions to descend to the nearest local minimum of a function.  The function to be 
minimized is 
( ) xbxAx
2
1xf ⋅−⋅⋅=  
 
which is minimal when its gradient 
bxAf −⋅=∇  
 
is zero.  This is equivalent to solving the system of equations Ax=b.  For the solution of 
linear but not necessarily positive definite or symmetric equations, such as those in 
Newton power flow, a generalization the conjugate gradient method called the 
biconjugate gradient method can be used.  The biconjugate gradient method begins with 
an initial guess x1 of the vector solution.  The initial residual r1 is chosen to be  
35 
11 xAbr ⋅−=  
 
based on the coefficient matrix A and the initial guess x1, with =r1.  The initial guess is 
refined using 
vvv1v pxx α+=+  
 
and iterating the following equations 
  
v1v1v
v1v1v
vv
vv
vv1v
vv1v
vv
vv
prp
prp
rr
rr
pArr
pArr
pAp
rr
v
v
v
v
v
v
β
β
β
α
α
α
+=
+=
⋅
⋅=
⋅−=
⋅−=
⋅⋅
⋅=
++
++
++
+
+
11
 
 
For the first iteration 
11
11
11
rp
rp
rr
=
=
=
 
 
As previously mentioned, the use of a preconditioner 
1~
A
−
 to modify the matrix of 
coefficients A can be applied to improve the rate of convergence, otherwise known as 
reducing the condition number.  The idea behind using a preconditioner is that the 
solution for a preconditioned system may be easier to find, i.e. take fewer iterations, than 
36 
applying the iterative solver to the original system of equations.  Solving the 
preconditioned system 
bAxAA
1~1~ ⋅=⋅⎟⎟⎠
⎞
⎜⎜⎝
⎛ ⋅
−−
 
 
is equivalent to solving the original system of equations.  The ideal preconditioner is the 
inverse of the original matrix, however calculating the inverse would result in no 
computational savings.  The calculation and application of the preconditioner must be 
less than the subsequent reduction in iterations in order to improve the performance of the 
iterative solution method. 
There are a variety of iterative solvers beyond the conjugate gradient method.  Practice 
has shown that selection of the best iterative solver for a specific application is usually an 
empirical study.  For a more complete treatment of conjugate gradient and other iterative 
methods see (Barrett, Berry et al. 1994; Saad 2003).  De Leon has benchmarked various 
iterative solvers for power flow computation relative to a traditional direct solver using 
Matlab’s built in iterative solver routines (de Leon and Sernlyen 2002).  Their work uses 
floating point operations per second (FLOPs) to compare the different methods, but this 
is not a true measure of performance, and details about the cost of the preconditioning 
method were not included.  In (Pai and Dag 1997), Pai concluded that iterative methods 
still can’t compete with direct methods due to possible issues with convergence.  In 
(Alves, Asada et al. 1999), Alves concluded that iterative methods are slower than direct 
methods for the solution of power flow.  Dag presents a pre-conditioner for use with 
conjugate gradient power flow (Dag and Semlyen 2003).  The application of iterative 
linear solvers for power flow computation is not well established.  Instead, typical 
37 
Newton power flow solvers use more robust direct methods for solving the sparse linear 
system of equations. 
 
3.3 Direct Linear Solvers 
 
Lower/Upper triangular decomposition is the factorization of a matrix A into the 
product of a lower triangular matrix L and an upper triangular matrix U.  For direct 
methods, during the factorization it may become necessary to perform pivoting, the 
swapping or a row/column with another, in order to maintain numerical stability as well 
as accuracy in the solution.  This amounts to nothing more than changing the order of the 
elimination of the equations and variables.  Direct LU solvers for dense matrices, where a 
dense matrix is one that has very few non-zero entries, requires O(n3) floating point 
operations for LU decomposition by Gaussian elimination.  In the general case, with P for 
row and Q for column pivoting, the solution equation is. 
( ) PBxLUxQPAQ T == ~  
 
 Using the upper, lower, and permutation matrices for a given system of linear equations, 
the solution x can be obtained by forward and backward substitution. 
~
~
xQx
yxU
PBLy
=
=
=
 
 
The basic direct LU decomposition algorithm based on Gaussian elimination is 
comprised of three embedded loops, illustrated by the pseudo-code in Figure 9.  In this 
38 
figure, the looping variable “i” denotes a row index, “j” denotes a column index, and “k” 
denotes an equation or variable to be eliminated.  The subscripts denote a particular 
element to be accessed in the matrix, for example Aij is the matrix element in the i’th row 
and the j’th column. 
 
 
for _____ loop
for _____ loop
for _____ loop
Aij = Aij – (Aik * Akj) / Akk
end loop
end loop
end loop
 
Figure 9. Basic Gaussian Elimination Pseudocode 
 
 
 
The labels of the loops are left out because the order of the loop variables can be re-
arranged.  This results in six possible ordering sequences for performing LU 
decomposition by Gaussian elimination.  The division operation is typically performed 
outside of the inner loop which leaves a multiply and subtract as the arithmetic operation 
in the innermost loop.  Selection of a particular loop ordering does not affect the lower 
and upper triangular matrices or the number of floating point operations for LU 
decomposition, omitting any differences that may result due to pivoting or rounding 
error.  The difference that does arise due to the different orderings is in the data access 
and computation pattern, which could have a significant impact on performance of the 
selected method depending on the platform used for computation.  The six methods can 
be broken down into three row based eliminations and three column based eliminations, 
39 
the difference is merely a switch in the role of the row and column during the LU 
decomposition.  Historically there has been an emphasis on column-based algorithms 
rather than row-based algorithms.  This is mainly due to the use of FORTRAN, a column 
major programming language, for the implementation of high performance computing 
software. 
 
 
  
i=k+1..N
jk
k for j = 2 to n loop
for s = j to n loop
Ls,j-1 = As,j-1 / Aj-1,j-1
end loop
for k = 1 to j-1 loop
for i = k+1 to n loop
Ai,j = Ai,j – Li,k * Ak,j
end loop
end loop
for s = 1 to j loop
Us,j = As,j
end loop
end loop
 
Figure 10. jki Gaussian Elimination Pseudocode and Inner Loop Data Dependency 
 
 
 
The jki method, illustrated in Figure 10, is commonly referred to as delayed column or 
“left looking” Gaussian elimination.  This ordering constructs the lower and upper 
triangular result one column at a time.  At the jth iteration, the jth column in the lower 
and upper triangular matrix is calculated using columns in the previously determined 
40 
lower triangular matrix and values in the jth column.  It has been proven that this method 
can be solved in time proportional to the number of FLOPS (Gilbert and Peierls 1986). 
 
 
k
k
i=k+1..N
j
for k = 1 to n-1 loop
for s = k+1 to n loop
Ls,k = As,k / Ak,k
end loop
for j = k+1 to n loop
for i = k+1 to n loop
Ai,j = Ai,j – Li,k * Ak,j
end loop
end loop
for s = k to n loop
Uk,s = Ak,s
end loop
end loop
Un,n = An,n
 
Figure 11. kji Gaussian Elimination Pseudocode and Inner Loop Data Dependency 
 
 
 
The kji method, illustrated in Figure 11, is commonly referred to as column sub-matrix 
or “right-looking” Gaussian elimination.  This ordering constructs the lower triangular 
matrix one column at a time and the upper triangular matrix one row at a time.  At the jth 
step, the remaining portion of the matrix, called the sub-matrix, is updated using the jth 
column in the lower matrix and the jth row in the upper triangular matrix. 
 
 
41 
k=i..n
for i = 1 to n loop
for j = 1 to i-1 loop
for k = i to n  loop
Ai,k = Ai,k – Ai,j * Aj,k
end loop
end loop
(continued)
i
j
j
 
Figure 12. ijk Gaussian Elimination Pseudocode and Inner Loop Data Dependency Part 1 
 
 
 
The ijk method is commonly referred to as Crout or dot-product Gaussian elimination 
due to the use of inner products during calculation.  This method is comprised of two 
steps.  At the ith step, illustrated in Figure 12, the ith row of the upper triangular matrix is 
calculated using rows in the previously computed upper triangular matrix and the ith row 
of the lower result matrix. 
 
 
42 
i
k=i+1..n
(continued)
for j = 1 to i-1 loop
for k = i+1 to n loop
Ak,i = Ak,i – Aj,i * Ak,j
end loop
end loop
for s = i+1 to n loop
As,i=As,i/Ai,i
end loop
end loop
j
j
 
Figure 13. ijk Gaussian Elimination Pseudocode and Inner Loop Data Dependency Part 2 
 
 
 
The second part of the ijk method, illustrated in Figure 13, calculates the ith column of 
the lower triangular matrix using the columns of the previously calculated lower 
triangular matrix and the ith column of the upper triangular result matrix. 
 
3.4 Direct Sparse LU Decomposition 
 
Sparse matrices have a low number of non-zero entries (NNZ), roughly O(N),  
compared to O(N2) NNZ in dense matrices.  Direct dense LU decomposition methods 
also have a computational complexity for solution of roughly O(N3) FLOPS compared to 
O(N) for direct sparse matrix methods (Press, Teukolsky et al. 1992).  Sparse matrix 
methods reduce the number of FLOPs required by skipping obviously zero results and 
43 
can reduce the amount of storage space required for solution by allocating storage space 
only for the non-zero entries in the matrix.  Unlike direct dense LU decomposition, which 
has fairly regular data access and computational patterns, direct sparse LU decomposition 
using sparse matrix methods suffers from irregular data access and computational 
patterns due to the non-zero structure of the matrix.  Despite these issues, the benefits of 
direct sparse LU decomposition using sparse matrix methods more than compensates for 
the additional burdens of using sparse matrix techniques. 
Using a compressed storage format to represent a sparse matrix reduces the amount of 
storage space required to store sparse matrices relative to a dense representation.  Dense 
matrix representations use a two dimensional array (2D) with each element in the array 
representing a value.  This method of representing a matrix requires space for N2 values, 
given a square matrix of size N.  Row and column indices are not implicitly stored for a 
dense representation but instead are inferred through indexing of the 2D array of values.  
Since there are O(N) NNZ in a sparse matrix there is potential for significant storage 
savings by only allocating space for the non-zero values in the matrix, especially as the 
size of the sparse matrix becomes large. 
 
3.4.1 Sparse Matrix Storage 
 
To reduce the amount of memory required to store a sparse matrix, an indexed storage 
method is commonly used.  There are many different types of compressed storage 
schemes for use with sparse matrices (George and Liu 1981; Kincaid, Respess et al. 
1982; Oppe, Joubert et al. 1988; Saad 1990; Gilbert, Moler et al. 1992; Press, Teukolsky 
44 
et al. 1992; Davis 2006).  The most basic compressed storage format for sparse matrices 
is called compressed row for row-wise storage or compressed column for column-wise 
storage.  These two formats are generic and do not rely on any specific feature of the 
sparse matrices that they represent, such as a specific non-zero pattern.  Storage space for 
these formats is allocated in the form of a one dimensional data vector containing 
contiguous blocks of entire rows or columns of the matrix elements and a corresponding 
size vector which stores the index of the last element for each block in the vector 
containing the matrix elements.  The elements of the matrix are represented by a floating 
point value and the corresponding index for that value, row index for compressed column 
storage or vice versa.  The individual elements in the row or column blocks are not 
required to be in any particular order; however the blocks themselves must be stored in 
ascending order in the data vector.  The amount of storage required for these two formats 
is NNZ indices and values + N indices, which is potentially much less than the N2 space 
required for a dense matrix representation.  An example sparse matrix and its 
corresponding compressed storage vectors are illustrated in Figure 14. 
 
 
45 
1 2 3 4 5 10 1 2 7 8
1
2
3
4
5
6
7
8
9
10
1 3 6 1 4 1 5 6 3 5 6 2 7 9 2 8 7 9 1 10
5 9 12 14 17 20 23 25 27 29
Index
Value
Last 
Entry
 
Figure 14. Sparse Matrix and Corresponding Compressed Storage 
 
 
 
 In the example shown, for the 10x10 sparse matrix with 30 non-zero elements, the 
compressed storage format allocates space for 30 values and 40 indices.  The same matrix 
stored using a dense matrix representation would require 100 values to be stored, the 
indices are implicit.  There are however two major issues with using a compressed sparse 
matrix representation.  The first is inserting and deleting entries, which can result in a 
large amount of data movement.  The second issue is accessing the data in the matrix.  
Depending on the encoding used, row or column, accessing the matrix with the opposite 
pattern could require searching through a portion of the vector of matrix elements.  
Choosing a compressed sparse matrix storage format is often influenced by the data 
access patterns of the algorithm to be applied to the sparse matrix. 
 
3.4.2 Ordering Sparse Matrices for Fill-In Reduction 
 
46 
Direct sparse LU decomposition also suffers from a phenomenon referred to as fill-in.  
Fill-in occurs when a previously zero entry becomes non-zero during factorization and 
has to be inserted into the sparse matrix.  This results in a change in the matrix structure 
which effects future computations and increases the amount of computation necessary for 
solution.  Large amounts of fill-in can degrade the performance of sparse LU solvers 
significantly by increasing the number of mathematical operations required for the LU 
decomposition as well as potentially requiring modifications to the sparse data storage 
structure.  Fill-in during sparse LU decomposition is caused by the non-zero structure of 
the matrix prior to and during the LU decomposition process.  In order to limit the 
amount of fill-in that occurs, the non-zero structure of the sparse matrix can be modified 
by re-ordering the rows and/or columns of the matrix prior to LU decomposition.  Re-
ordering only affects the order of the variables in the system of equations, or the order in 
which the equations are eliminated during LU decomposition. 
Yannakakis proved that finding the best ordering which results in minimum fill-in is an 
NP-complete problem (Yannakakis 1981).  NP-complete problems are those problems 
which can’t be solved using a polynomial time algorithm.  Thus, the use of heuristic 
algorithms to determine a suitable matrix ordering for sparse LU decomposition is 
required.  Sparse matrix ordering algorithms based on heuristic techniques, such as the 
minimum degree algorithm, use graph theoretical methods to determine the next node to 
be eliminated in the LU factorization process without taking into account the actual 
values that would be computed.  Since values are not computed, they do not consider 
cancellations that may occur and also do not take into account numerical accuracy, which 
could be compromised as a result of the re-ordering.  The computation of the sparse 
47 
matrix ordering using these heuristics typically takes less time than performing the LU 
decomposition on the re-ordered matrix.  Re-use of the ordering for matrices with the 
same non-zero structure, such as in Newton-Raphson power flow, is possible potentially 
allowing amortization of the ordering cost across several LU decompositions. 
There are in general three classes of sparse matrix orderings typically used to reduce 
fill-in for sparse matrix LU decomposition.  Bandwidth reducing techniques such as 
Cuthill-McKee (Cuthill and McKee 1969) attempt to reduce the matrix profile, i.e. how 
close non-zero entries are to the diagonal, of the matrix.  Minimum degree algorithms 
(George and Liu 1989) use a local greedy strategy to select the next element to be 
eliminated.  This strategy attempts to locally minimize fill-in by selecting the next 
element to be eliminated as the one with minimum degree, i.e. the one with the least off-
diagonal non-zero entries, and is quite effective for a wide range of problems including 
power flow.  Dissection algorithms such as Nested Dissection (Kernighan and Lin 1970) 
attempt to reduce fill-in by finding a node separator that divides the sparse matrix into 
two equal partitions when the nodes and edges in the separator are removed from the 
graph.  The method is then recursively applied to the generated partitions.  Compared to 
minimum degree, dissection algorithms have a more global view of the sparse matrix 
structure. 
Sato and Tinney first proposed matrix renumbering by number of branches in (Sato and 
Tinney 1963).  The use of a minimum degree ordering algorithm for power flow 
computation, referred to as optimally ordered decomposition, was later reported by 
Tinney and Walker (Tinney and Walker 1967).  The result of this work were several 
ordering methods historically referred to as Tinney-1, 2, and 3.  Ogbuobiri concluded that 
48 
partial pivoting is compatible with optimal ordering and also proposed renumbering by 
clusters using a node tearing scheme (Ogbuobiri, Tinney et al. 1970).   
The Tinney algorithms are the symmetric analogue of an algorithm proposed by 
Markowitz (Markowitz 1957) for linear programming applications.  Markowitz proposed 
performing row and column permutations to minimize the product of the number of off-
diagonal non-zeros in the pivot column and pivot row.  This has the effect of minimizing 
the work at each step in the elimination, which also indirectly reduces the number of fill-
in by creating a local bound on the maximum amount of fill-in that can be created.  Rose 
later developed a graph theoretic model of Tinney and Walker’s optimal ordering strategy 
and renamed it to the minimum degree algorithm (Rose 1970).  Since then, the use of 
minimum degree ordering for sparse matrices has been the subject of significant research 
(George and Liu 1989), the results of which have been improvements to the performance 
of the algorithm, as well as improvements to the quality of the orderings generated. 
Research in the area of power system matrices by Wendel has shown that while 
minimum degree does not necessarily determine the minimum possible fill-in ordering, it 
is close to minimum possible fill-in for sparse LU decomposition of power system 
matrices (Wendel 1987).  Due to the speed and quality of the orderings generated for 
power system matrices, the Minimum Degree ordering algorithms are widely used when 
performing direct sparse LU decomposition of power system matrices. 
The numerical structure of a sparse matrix can be represented by using a graph.  One 
graph representation for a sparse matrix is as set of nodes interconnected by a set of 
directed or undirected edges.  Use of a graph to represent a sparse matrix allows 
identification and manipulation of useful features of the sparse matrix such as the non-
49 
zero structure of the sparse matrix entries without having to perform numerical 
computations.  For each row/column in the sparse matrix there exists a node.  For each 
off-diagonal non-zero entry aij, where i and j are the row and column indices of a given 
non-zero entry in the sparse matrix, a directed edge from node i to node j exists in the 
graph.  In the case of symmetrical matrices if aij is non-zero then aji must also be non-
zero, which allows the representation of a non-zero entry by a single undirected edge in 
lieu of a pair of directed edges.  The number of nodes adjacent to a particular node is also 
called the degree of that node. 
The information about the non-zero structure of the matrix conveyed by the graph 
representation can be used to determine the next node that should be chosen for 
elimination by using a heuristic algorithm.  The graph can then be modified to reflect the 
impact of that choice on the non-zero structure of the remaining matrix.  Given the graph 
of a sparse matrix, the elimination of a particular node will introduce new edges to the 
graph between all nodes connected to the eliminated node, except where an edge already 
exists, and then all edges in the graph connected to the eliminated node are removed. 
 
 
1
2
3
456
78 9
10
1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
8
9
10  
Figure 15.  Sparse Matrix, Graph Representation, and Post LU Filled Matrix  
 
 
 
50 
Figure 15 illustrates a symmetrical sparse matrix, its associated undirected graph 
representation, and the filled matrix after LU decomposition assuming the nodes are 
eliminated as they are initially numbered (1, 2, 3, 4, 5, 6, 7, 8, 9, 10).  In an undirected 
graph, nodes are represented by an encircled number which corresponds to the numerical 
index of that node.  A non-zero entry in a particular row and column is indicated by a link 
between the nodes representing that row and column.  In the post-LU decomposition 
filled matrix, fill-in entries are indicated by empty circles, the original non-zero entries 
are indicated by filled circles.  
The basic minimum degree algorithm for ordering a symmetrical matrix is summarized 
in Figure 16. 
 
Begin with the symmetric graph G of a sparse matrix
While G is not empty loop
Select a node y of minimum degree in G
Add edges to G between all nodes that were adjacent to y
Remove y and all of its edges from the graph
End loop
 
Figure 16. Basic Minimum Degree Algorithm 
 
 
 
51 
1
2
3
456
7 9
101
2
3
4
5
6
7
8
9
10
1
2
3
456
7
101
2
3
4
5
6
7
8
9
10  
1
2
3
56
7
101
2
3
4
5
6
7
8
9
10
1
2
3
56
7
1
2
3
4
5
6
7
8
9
10
 
1
2
3
56
1
2
3
4
5
6
7
8
9
10
1
3
56
1
2
3
4
5
6
7
8
9
10
 
1
3
5
1
2
3
4
5
6
7
8
9
10
15
1
2
3
4
5
6
7
8
9
10
 
1
1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
8
9
10
 
Figure 17. Minimum Degree Ordered Sparse LU Decomposition 
 
52 
Figure 17 demonstrates for the same matrix, a minimum degree ordering algorithm 
with a graph representation of the sparse matrix at each step in the elimination process.  
The node elimination order determined is (8, 9, 4, 10, 7, 2, 6, 3, 5, 1).  For a symmetrical 
re-ordering the selection of a particular node for elimination can be viewed as an atomic 
column pivot and row pivot operation between to the selected node and the node in the 
current row/column being eliminated.  The only fill-in that occurs is in the example is a 
result of the elimination of node 6, which introduces a fill-in edge between 3 and 5.  This 
example shows how an ordered factorization can lead to a significant reduction, from 50 
down to 2, in the amount of fill-in for the lower and upper triangular factors after LU 
decomposition.  Not explicitly shown in this example is the resulting reduction in the 
amount of computation required for the LU decomposition.  The unordered LU 
decomposition requires the calculation of 35 divisions and 161 multiplications, while the 
ordered LU decomposition requires 11 divisions and 15 multiplications.  A 3x reduction 
in divisions and over 10x reduction in multiplications required for LU decomposition. 
 
3.4.3 Ordering Sparse Matrices for Parallelism 
 
In addition to ordering sparse matrices for fill-in reduction, there has been some work 
in ordering sparse matrices for parallelism in order to obtain higher performance by using 
multiple processors (Duff 2000).  For dense matrix LU decomposition methods, 
parallelism is easily extracted by using a pre-determined matrix blocking algorithm due 
to a fairly regular computational and data access pattern.  For sparse LU decomposition 
however, parallelism by matrix blocking is harder to extract due to irregular 
53 
computational and data access patterns caused by the sparse matrix non-zero structure 
which leads to load balancing and scaling issues.  There are however special sparse 
matrix forms and ordering algorithms that result in these structures that are amenable to 
coarse-grained parallel sparse LU decomposition. 
These coarse-grained parallel techniques are different from parallel optimally ordered 
factorization (POOF) (Erisman 1977; Wallach 1986).  POOF attempts to identify within 
the optimally ordered matrix, i.e. after minimum degree ordering, diagonal elements 
which are deemed compatible pivots.  Compatible pivots are those that share no off-
diagonal elements.  Since these pivots do not share off-diagonal elements they can be 
eliminated in parallel.  There are two identifiable shortcomings with these methods.  The 
first is load balancing since there is no guarantee that the elimination of the compatible 
pivots requires the same amount of work.  The second is the elimination of those pivots 
which are not compatible which could lead to a bottleneck in the computation. 
A bordered block diagonal (BBD) or block diagonal bordered (BDB) structured sparse 
matrix, illustrated in Figure 18, has features which allows high level parallelism to be 
taken advantage of.  The main structural features of a sparse matrix in BBD form is a 
series of square blocks along the diagonal, and a border along the right and bottom edges 
of the matrix.  The only non-zero entries in the sparse matrix are found in the diagonal 
blocks and along the border (A11, A22, A13, A23, A33, A31, A32, A33), all other 
entries in the matrix are zero (A12, A21).  The diagonal blocks are independent of each 
other since there are no non-zero matrix entries outside of the diagonal blocks with the 
exception of the border.  The BBD sparse matrix form allows parallel computation of the 
diagonal blocks in the sparse matrix, assuming no pivoting is required or pivoting is 
54 
restricted to within the diagonal blocks.  As each of the diagonal blocks is computed, 
updates are made to the final diagonal block that is also a part of the border.  The 
computation of the final block in the sparse matrix can’t begin until all of the 
contributions from the diagonal blocks have been accumulated. 
 
 
A11
A22
A12
A21
A13
A23
A33A32A31
 
Figure 18. Matrix in BBD Form 
 
 
 
In graph theoretic terms, a matrix in BBD form is a collection of independent sub-
matrices interconnected by a set of coupling equations.  The set of nodes in the 
independent sub-matrices represent the indices of the diagonal blocks and the nodes in 
the coupling equations represent the indices of the border of the BBD matrix.  Ordering a 
matrix into a BBD form involves balancing the amount of work required in each of the 
55 
diagonal blocks while simultaneously minimizing the number of coupling equations.  For 
sparse LU decomposition, balancing the work required in each of the diagonal blocks is 
not as easy as dividing the nodes into equal numbers, since the amount of work required 
to perform the decomposition is influenced not only by the number of non-zeros but by 
the structure of these non-zeros as well.  Determining the optimal BBD ordering for 
partitioning a sparse matrix, which can be thought of as optimal graph partitioning, is an 
NP complete problem.  After determining the BBD ordering for a sparse matrix, 
additional ordering of the matrix is required without disturbing the BBD non-zero 
structure, in order to limit the amount of fill-in generated in the border and the diagonal 
blocks.  Ordering the BBD matrix for fill-in reduction can be accomplished by using one 
of the previously mentioned fill-reducing heuristics, restricting the scope of the algorithm 
to the sub-matrices within the BBD matrix. 
There are many algorithms that can be used to generate a node ordering of a sparse 
matrix that results in a BBD structure (Kernighan and Lin 1970; Lucas, Blank et al. 1987; 
Khaira, Miller et al. 1992; Zecevic and Siljak 1994; Karypis and Kumar 1995; Aykanat, 
Pinar et al. 2002; Hu and Scott 2003).  Recursive partitioning of a sparse matrix using a 
dissection technique is one method that is widely used due to the simplicity of the 
method.  Dissection techniques use a variety of algorithms to bisect a graph into two 
equal partitions.  The partitions are connected by a set of nodes referred to as the node 
separator.  The edges connecting the two partitions which would be cut as a result of the 
removal of the node separator are referred to as the edge-cut.  In terms of a BBD 
structure, each partition would represent an independent sub-matrix with the nodes in the 
node separator forming the set of coupling equations.  Recursively applying the 
56 
dissection method leads to a power-of-two number of independent sub-matrices and a set 
of coupling equations. 
One of the most widely used methods for graph partitioning is referred to as multilevel 
graph partitioning and is used by packages such as METIS (Karypis and Kumar 1995).  
Multilevel schemes for graph partitioning seek to balance the time required to determine 
the partition with the quality of the partitions generated compared to other graph 
partitioning methods.  These methods are called multilevel because they operate by 
repeatedly simplifying the original graph and using the simplified graph to generate 
partitions in the original graph.  The basic steps in a multilevel scheme are broken down 
into three parts: coarsening, partitioning, and refinement.  During coarsening, the 
algorithm simplifies the original graph by collapsing vertices and edges to create a 
smaller graph.  In the next step, partitioning, the simplified graph is bisected such that 
two partitions are created which contain roughly equal vertex weights with a small edge-
cut.  In the final stage, refinement, the bisected simplified graph is projected back onto 
the original set of vertices and edges.  Since the original graph has a greater amount of 
freedom in selecting nodes, a refinement of the coarse bisection is typically also 
performed which further decreases the size of the edge-cut.   
Application of coarse-grained parallel decomposition techniques to power flow 
computation has been previously reported (Erisman 1977; Lau, Tylavsky et al. 1991; 
Koester, Ranka et al. 1993; Shyan-Lung and Van Ness 1994; Chan, Dunn et al. 1995; Jun 
Qiang and Bose 1995; Montagna, Granelli et al. 1996; Chan 2001; Feng and Flueck 
2002).  Application of parallel FPGA-based decomposition using soft-core processors 
and coarse-grained parallelism has also been reported (Xiaofang and Ziavras 2003). 
57 
CHAPTER 4.  HARDWARE DESIGN 
 
4.1 Introduction 
 
The trend in the design of modern high performance sparse LU decomposition software 
for implementation on general purpose processor based computer systems is to design a 
computation algorithm that maximizes the amount of high-level basic linear algebra 
system (BLAS) computations performed while minimizing the number of scalar 
computations.  High frequency superscalar processors such today’s Intel Pentium 4 rely 
on constantly feeding the computation pipeline using the integrated low latency caches on 
these processors to achieve high performance.  Due to the increasing gap in memory 
performance relative to processor performance, commonly referred to as the “memory 
wall”,  there is a performance penalty incurred by having to fetch data from off-chip 
memories instead of accessing the higher performance on-chip caches.  The increased 
latencies as a result of a cache miss can result in hundreds of idle cycles while waiting for 
the desired data to be returned by external memory.  BLAS operations are efficient and 
cache friendly, providing a stream of arithmetic operations that can be computed 
effectively on superscalar processors.  Scalar computations on the other hand may not 
effectively exploit locality and regularity which can result in poor performance due to 
cache misses and idle computational resources. 
Even with the use of higher level BLAS operations, high performance sparse LU 
decomposition software can fail to effectively utilize the computational resources 
available.  Table 2 summarizes the reported FLOP rate for numerical factorization using 
UMFPACK and the efficiency of the software as the ratio of FLOPS to the clock rate of 
58 
the processor.  Details of the properties of the benchmark systems used are provided 
below in Table 3.  The low utilization of the processor’s floating point units indicates the 
possibility that a more efficient application specific design could lead to a significant 
improvement in performance.  Instead of adapting the problem to the general purpose 
hardware, the design of hardware that specifically addresses the needs of sparse LU 
decomposition is proposed as an alternative.  Mapping a particular algorithm to hardware 
involves designing logic that will efficiently utilize the available hardware resources, and 
if necessary overcome any shortcomings that could be found in the hardware fabric. 
 
Table 2.  Sparse LU Decomposition Software Efficiency 
Power 
System 
Mflops reported by 
UMFPACK 
Efficiency % (obtained flop * 100 
/clock rate) 
1648Bus 27.42 1.05% 
7917Bus 34.69 1.33% 
10278Bus 24.90 0.96% 
26829Bus 89.74 3.45% 
 
 
 
FPGAs by design are inherently parallel devices.  The reconfigurable fabric is 
composed of many independent CLB which all operate in parallel at the bit level.  To 
maximize performance of hardware implemented on the FPGA the design should take 
advantage of the fine-grained parallelism that can be extracted at this low level in order to 
efficiently use the resources available.  Typical FPGA hardware designs take advantage 
of the inherently parallel nature of FPGAs by instantiating many similar processing 
59 
elements which all operate in concert to complete a specific task.  For some problems 
with less than ideal computational patterns, such as sparse LU decomposition, simply 
generating a large number of identical processing elements may not be appropriate.  For 
sparse LU decomposition the choices of the particular algorithm used, e.g. right-looking 
or left-looking as well as a matrix representation scheme, will significantly impact the 
hardware design.  In this section the design and operation of a sparse LU decomposition 
hardware prototype is discussed. 
 
4.2 Design Specification 
 
To get maximum performance out of a hardware design, it should exploit parallelism 
wherever possible in the algorithm to be implemented while balancing latency with high 
frequency of operation.  Data parallelism can be maximized through data layout in 
memory and by using multiple memory banks and buffers that could allow large amounts 
of concurrent data transfers to occur.  Computational parallelism can be maximized by 
using fully pipelined logic units designed to operate at a high frequency.  By pipelining 
the computation, i.e. dividing it into smaller pieces which can be computed faster, the 
frequency of operation can be increased.  Pipelining also allows many independent 
calculations to be solved concurrently within the pipeline, resulting in high computational 
throughput.  Multiple computation units can also be operated in parallel, increasing 
computational throughput further. 
Examination of the inner loop of the three basic forms of direct sparse LU 
decomposition gives insight into the amount potential parallelism available for these 
60 
three methods.  While the total number of computations is the same for the three forms of 
LU decomposition, the order of the computation and the inherent data dependencies are 
different which will affect the amount of parallelism available.  In the right-looking 
decomposition form, all of the elements in the inner-most loop can be operated on 
independently.  The other two forms of LU decomposition require element reduction 
calculations inside the inner-loop which introduces data dependencies between values 
that are being computed.  This reduces the amount of parallelism available in the inner-
loop, making the right-looking algorithm a more suitable algorithm for extracting fine-
grained parallelism assuming data access patterns and not an issue. 
In a right-looking sparse LU algorithm with row-based partial pivoting, the overall 
flow of computation follows two serial steps which are repeated until the entire matrix 
has been reduced and the lower and upper triangular factors have been found.  The first 
step is the pivot search.  Pivot search requires a search through the current column of the 
matrix being operated on for an entry which meets specific criteria.  This entry is selected 
as the pivot element and the current pivot row and the row containing this element are 
swapped in the matrix.  After the pivot search has completed, the next step is to update 
the remaining portion of the matrix.  Since the matrix is sparse and a sparse data storage 
method will be used, this requires determining the fill-in entries that must be created as 
well as updating the pre-existing entries in the matrix.  After the sub-matrix update has 
completed, the process repeats until the LU decomposition is complete. 
In order to accommodate the ability to perform partial pivoting, a row-wise storage 
method was selected.  This allows the hardware to keep track of the permuted matrix 
indices by using a pair of tables, without having to physically swap rows in memory or 
61 
update the stored index values.  The use of a compressed matrix representation presents a 
challenge, since the pivot search requires a column-wise search of the matrix.  To address 
this issue, two representations of the matrix are used.  The sparse matrix column indices 
and values are stored in a compressed row fashion.  An additional representation 
(colmap), maintained by the pivot hardware, stores the row indices as an adjacency list to 
aid in the pivot search.  This reduces the pivot search read and compare requirements 
from O(N2) entries to O(NNZ in L+U) entries.  By storing both a column wise and row 
wise representation of the sparse matrix, the hardware in essence maintains a symbolic 
directed graph representation of the sparse matrix non-zero structure. 
In order to be able to maximize computational parallelism through the use of pipelined 
functional units, the data must be stored in such a way that the retrieval of data can be 
performed with sufficient throughput to keep the computational pipeline busy and not 
idle waiting for data.  The consumption of data in the inner-most loop of a right-looking 
sparse LU algorithm can follow any particular pattern since the computations themselves 
are independent.  The use of a compressed storage method however introduces data 
dependencies when indexing particular entries.  Unlike a dense storage format where any 
element can be accessed by simply using a row and column index, a compressed storage 
method does not have a 1:1 mapping between indices and elements.  Use of a compressed 
storage method does allow burst memory operations since the data is stored contiguously.  
The determination of the computations necessary for a particular element of a sub-matrix 
update requires a comparison between the indices of the vector being updated with the 
corresponding indices in the pivot row or column.  In the case of row-wise data storage as 
in the prototype hardware the stored column indices are merged with the pivot row’s 
62 
column indices.  To reduce duplicated work and in order to maintain a high throughput, 
the indices are stored in sorted order which makes the row comparison operation a merge 
sort, which can be completed in linear time.  This allows a quick row wise determination 
of all of the operations necessary to update a row of the sub-matrix. 
 
4.3 Analysis of LU Decomposition 
 
In order to quantify the features of a right-looking sparse LU decomposition, empirical 
study is necessary.  Sparse matrices, even those within a particular application such as 
power flow, do not share identical non-zero patterns.  Since the computations performed 
during sparse LU decomposition are dependent on the non-zero structure of the matrix, 
empirical testing was performed in order to facilitate the design of application specific 
hardware.  Using the information gathered, such as data access and computation patterns, 
memory requirements, and amount of fine-grained parallelism, an architecture and 
prototype of the sparse LU decomposition hardware was created.  A summary of the key 
features of the benchmark power system matrices used is provided in Table 3.  In this 
table, sparsity is defined as the number of non-zeros divided by the total possible number 
of non-zeros in the matrix. 
 
63 
Table 3.  Benchmark Power System Jacobian Matrix Summary 
 Matrix Size Number of Non-Zeros Sparsity % 
1648 Bus 2,982 21,196 0.24% 
7917 Bus 14,508 105,522 0.050% 
10279 Bus 19,285 134,621 0.036% 
26982 Bus 50,092 351,200 0.014% 
 
 
 
Matrix ordering techniques alter the non-zero structure of the sparse matrix prior to LU 
decomposition.  Figure 19 illustrates the non-zero structure of the 1648 bus power system 
Jacobian matrix after ordering but prior to LU decomposition. 
 
0 1000 2000
0
1000
2000
nz = 21196
AMD
0 1000 2000
0
1000
2000
nz = 21196
COLAMD
0 1000 2000
0
1000
2000
nz = 21196
NESDIS
0 1000 2000
0
1000
2000
nz = 21196
SYMRCM
 
Figure 19.  Ordered 1648 Bus Jacobian Matrices Before LU Decomposition 
 
64 
 
 
These matrix ordering techniques also affect the non-zero pattern of the sparse matrix 
during LU decomposition as well as the non-zero pattern of the resulting lower and upper 
triangular factors.  Figure 20 illustrates the resulting lower + upper triangular matrices for 
the ordered 1648 bus Jacobian matrices after LU decomposition has been performed. 
 
0 1000 2000
0
1000
2000
nz = 44958
AMD
0 1000 2000
0
1000
2000
nz = 82208
COLAMD
0 1000 2000
0
1000
2000
nz = 48423
NESDIS
0 1000 2000
0
1000
2000
nz = 238835
SYMRCM
 
Figure 20.  L+U Matrix for Ordered 1648 Bus Jacobian Matrices 
 
 
 
The goal of these orderings, as previously mentioned, is to reduce the number of fill-in 
elements generated during sparse LU decomposition.  This also has the effect of reducing 
65 
the number of computations required and the amount of data storage necessary to 
perform the sparse LU.  Several matrix orderings were tested using their Matlab 
implementations and Matlab’s built-in sparse LU decomposition routines.  The basic 
Matlab sparse LU decomposition routine performs a left-looking sparse LU 
decomposition with partial pivoting.  The resulting number of non-zeros (NNZ) in the 
lower and upper triangular factors is comparable to the results of the sparse LU hardware 
since it does not take into account fill-in reduction during LU decomposition. 
Various minimum degree orderings (amd, symamd, colamd), a nested dissection hybrid 
ordering (nesdis) from CHOLMOD (Davis and Hager 2005), and the Reverse Cuthill 
McKee (rcm) ordering were used to order the matrix prior to LU decomposition.  The LU 
algorithm used is the default Matlab 7.3 implementation, a left-looking partial-pivoting 
sparse LU decomposition algorithm.  A comparison of these orderings, applied three 
different ways, is summarized in Table 4, Table 5, and Table 6.  Table 4 reports the NNZ 
in the lower and upper triangular triangular factors (L + U) after LU decomposition using 
the generated orderings applied column-wise to the sparse matrix.  Table 5 reports the 
same results using the orderings applied symmetrically, i.e. to the rows and columns, of 
the sparse matrix.  Table 6 reports the results of a symmetrical ordering with diagonal 
pivoting.  A diagonal pivoting LU decomposition does not perform pivot search if there 
is a non-zero on the diagonal of the matrix.  This has the effect of preserving the original 
symmetrical structure of the matrix which would be disturbed by partial pivoting, at the 
cost of a potential reduction in numerical accuracy. 
 
66 
Table 4.  NNZ in L + U with Column Ordering 
 1648 Bus 7917 Bus 10279 Bus 26829 Bus 
Amd 45,267 237,298 308,275 871,750 
symamd 45,543 236,269 296,732 872,741 
colamd 82,208 468,181 453,257 1,426,040 
Nesdis 48,704 253,871 325,754 961,806 
symrcm 239,691 1,944,634 1,700,365 8,463,847 
 
 
Table 5.  NNZ in L + U with Symmetrical Ordering 
 1648 Bus 7917 Bus 10279 Bus 26829 Bus 
Amd 44,958 235,399 293,000 832,180 
symamd 45,302 234,848 284,447 845,298 
colamd 81,959 466,651 440,530 1,400,699 
Nesdis 48,423 251,866 308,343 920,152 
symrcm 238,835 1,940,587 1,655,633 8,307,255 
 
 
Table 6.  NNZ in L + U with Symmetrical Ordering and Diagonal Pivoting 
 1648 Bus 7917 Bus 10279 Bus 26829 Bus 
Amd 44,607 232,123 252,714 720,104 
symamd 44,841 233,558 250,489 730,022 
colamd 81,669 464,324 429,077 1,363,385 
Nesdis 48,027 250,153 274,302 786,380 
symrcm 236,177 1,930,246 1,638,852 8,112,044 
 
 
 
The results indicate that the symmetrical approximate minimum degree algorithms 
(amd and symamd) generate the best orderings for reducing fill-in during LU 
decomposition when combined with a symmetrical ordering and diagonal pivoting.  
67 
Restricting the pivot to the diagonal shows increased benefit as the matrices get larger, 
however this may be due to the structure of these matrices.  This result is consistent with 
previous studies performed on smaller power system matrices by Wendel that report that 
the minimum degree-based ordering methods provide the best orderings for sparse LU 
decomposition of power system matrices.  Historically the Tinney ordering method is 
analogous to a symmetrical minimum degree ordering with diagonal pivoting.  For the 
purposes of the hardware design, the amd ordering method was used since it is slightly 
faster than symamd, and provides comparable results. 
The number of non-zeroes in the lower and upper triangular factors impacts the 
minimum buffer size necessary to store the results of the factorization.  The amount of 
storage required is proportional to the number of bits used to store the indices and values 
in the matrix and the colmap.  For example the NNZ in the L+U factors of the 1648 Bus 
Jacobian Matrix using the amd ordering applied symmetrically is 45,267.  For each non-
zero there will be a colmap entry and an entry in the row-wise matrix storage.  Each 
colmap entry contains an index, while the matrix representation contains an index and a 
floating point value.  Using 32 bits to represent the indices and single precision 32-bit 
values for the matrix non-zero entries requires 45,267 * 96 bits of data storage minimum 
to complete the sparse LU decomposition.  This amount of data exceeds the embedded 
memory resources available on today’s FPGAs, so an external storage device with a high 
data density is required. 
Table 7 summarizes the number of floating point operations for sparse LU 
decomposition of the benchmark matrices using the AMD ordering algorithm.  A floating 
point division (FDIV) is required for every element in the pivot column, not including the 
68 
pivot element.  A floating point multiply (FMUL) is required for every pivot column 
entry and pivot row entry pair.   If the result of this multiplication updates an entry in the 
matrix that already exists, then a floating point addition (FADD) is required to update 
that entry.  The number of fill-in entries is therefore the difference between the number of 
FMULs and FADDs.   The number of floating point operations required to reduce the 
sub-matrix clearly dominates the total number of floating point operations. 
 
Table 7.  Floating Point Operation Counts 
 FDIV FMUL FADD Total FLOPs
1648 Bus 20,682 268,086 244,285 533,233 
7917 Bus 109,066 1,803,467 1,673,159 3,585,692 
10279 Bus 140,658 2,054,276 1,894,533 4,089,467 
26829 Bus 387,788 9,631,332 9,149,126 19,168,246 
 
 
 
Table 8 summarizes the number of matrix entries that would be rejected, due to being 
previously eliminated, during the construction of the pivot column.  The rejection of 
entries for pivot search is required, since the column representation of the matrix 
(colmap) contains all non-zero element row indices for the sparse matrix, not just the 
indices remaining in the un-eliminated portion of the matrix.  This allows a simpler 
implementation of the column mapping.  Search is not required to remove entries which 
have been eliminated and the order of access for update can be random, which allows the 
entries to be stored in any order, eliminating the need to maintain a sorted order of only 
those entries remaining in the matrix.  The overhead for this simplified operation is the 
number of entries that are rejected. 
69 
 
Table 8.  Number of Entries Rejected for Pivot Column 
 1648 Bus 7917 Bus 10279 Bus 26829 Bus 
Pivot Rejects 21,153 112,256 134,421 395,526 
 
 
 
Since a combination of external and on-FPGA memory will be used, the potential 
benefit of a cache was investigated.  Similar to the high speed caches found on 
microprocessors, a data cache on the FPGA could be used to reduce the latency of 
memory operations as and provide higher levels of bandwidth than external memory 
devices would be able to support.  The amount of data re-use helps to determine the 
potential efficiency of a cache in the sparse LU hardware design.  Figure 21 is a 
histogram plot of the ratio of matching row indices to the total pivot column size between 
successive sub-matrix updates for the 1648 Bus matrix.  This graph is representative for 
all of the other benchmark systems tested.  It shows that despite the fairly random access 
patterns of sparse LU decomposition, on average roughly 70% of rows are re-used 
between successive sub-matrix updates.  The median is slightly higher at 80%.  This 
indicates that a row data cache would be beneficial to the performance of the hardware 
design and would be able to take advantage of row-wise temporal data locality. 
 
70 
RATIO
1.00
.94
.88
.81
.75
.69
.63
.56
.50
.44
.38
.31
.25
.19
.13
.06
0.00
Fr
eq
ue
nc
y
1600
1400
1200
1000
800
600
400
200
0
 
Figure 21.  Ratio of Matching Row Indices for Successive Updates to Total Pivot Column Size 
 
 
 
The size of the pivot row and the pivot column during sparse LU decomposition was 
gathered to determine the size of buffers to be used in the hardware design.  Sizing the 
buffers to accommodate the maximum size encountered is sufficient to insure that data 
overflowing buffers would not be an issue in the hardware design.  Figure 22 is a 
histogram plot of the range of pivot row sizes encountered for the sparse LU 
decomposition of the 1648 bus power system matrix.  The Y-axis corresponds to the 
number of occurrences for a particular ratio plotted on the X-axis.  It also illustrates that 
the typical size of a pivot row is much smaller than the maximum encountered during 
sparse LU decomposition.  The shape of the graph for the 1648 bus power system matrix 
is similar to the results for the other benchmark matrices used, with the exception that the 
range increases as the size of the matrix increases.  The pivot column size histogram also 
71 
shares a similar shape and is illustrated in Figure 23.  The pivot column size reported also 
includes the pivot element, while the pivot row size does not. 
 
ROW SIZE
45.0
42.5
40.0
37.5
35.0
32.5
30.0
27.5
25.0
22.5
20.0
17.5
15.0
12.5
10.0
7.5
5.0
2.5
0.0
Fr
eq
ue
nc
y
2000
1000
0
  
Figure 22.  Pivot Row Size for Sub-Matrix Update 
 
 
72 
COLSZ
45.0
42.5
40.0
37.5
35.0
32.5
30.0
27.5
25.0
22.5
20.0
17.5
15.0
12.5
10.0
7.5
5.0
2.5
COLSZ
Fr
eq
ue
nc
y
2000
1000
0
 
Figure 23.  Pivot Column Size for Sub-Matrix Update 
 
 
 
A data summary of the maximum, median, and mean pivot row and pivot column sizes 
are provided in Table 9 and Table 10.  The pivot column values include one for the pivot 
element, but the pivot row does not include the pivot entry.  The median and mean pivot 
row/column size does not appear to scale significantly with the size of the matrix unlike 
the maximum size encountered during decomposition.  The amount of buffer space 
required to store an entire set of indices and values for the worst case row encountered in 
the largest benchmark system is 181 * 64 bits = 11 kbits, which is comparable to the size 
of the embedded memory blocks found on today’s FPGAs.  Today’s FPGAs also contain 
hundreds of such embedded memory blocks, which allows the instantiation of multiple 
buffers. 
 
73 
Table 9.  Pivot Row Size Statistics 
 Maximum Median Mean 
1648 Bus 44 5 7.1 
7917 Bus 90 5 7.7 
10279 Bus 115 5 7 
26829 Bus 181 5 7.9 
 
 
Table 10.  Pivot Column Size Statistics 
 Maximum Median Mean 
1648 Bus 45 6 8 
7917 Bus 71 6 8.5 
10279 Bus 100 6 8.3 
26829 Bus 163 6 8.7 
 
 
 
 An additional consideration for the design of hardware to address power flow 
computation is transferring the sparse matrix data necessary for the computation to the 
hardware and transferring the result from the hardware.  Significant data transfer can 
bottleneck a hardware accelerator if the ratio of computation to data transfer time is not 
favorable.  Transfer of the original matrix needs to be completed prior to the start of LU 
decomposition and is an additional cost incurred by using a hardware accelerator.  
Transfer of the lower and upper triangular result can be hidden behind the computation.  
Figure 24 illustrates the amount of time required to transfer the original matrix to the 
FPGA for three different matrix sizes at three different bandwidths.  A 32-bit, 33 MHz 
PCI interface can provide one gigabit per second of peak bandwidth.  A 64-bit 125 MHz 
PCI-X interface can provide eight gigabits per second of peak bandwidth.  A 
74 
HyperTransport interface operating at 200 MHz DDR x8 bits can provide 3.2 gigabits per 
second of peak bandwidth.  Figure 25 illustrates the time to send the lower and upper 
triangular results back to the host from the FPGA using the same three types of 
connections. 
 
 
0.0%
1.0%
2.0%
3.0%
4.0%
5.0%
6.0%
7.0%
8.0%
9.0%
10.0%
32-bit/33 MHz PCI 125 MHz PCI-X HT200
1648 Bus 7917 Bus 10278 Bus
 
Figure 24.  Jacobian Matrix Data Transfer Time Relative to Pentium 4 Solve Time 
 
75 
0.0%
1.0%
2.0%
3.0%
4.0%
5.0%
6.0%
7.0%
8.0%
9.0%
10.0%
32-bit/33 MHz PCI 125 MHz PCI-X HT200
1648 Bus 7917 Bus 10278 Bus
 
Figure 25.  LU Result Data Transfer Time Relative to Pentium 4 Solve Time 
 
 
 
4.4 Top Level Design 
 
LU Hardware
SWAP
PIVOT
COLMAP
FILTER
DIVIDE RESULT
MER_MEM
MERGE
CACHE
CONTROL
Pivot Logic Sub-matrix Update Logic
 
Figure 26.  High Level Block Diagram of Sparse LU Hardware 
 
 
 
76 
The sparse LU hardware implements a right-looking sparse LU decomposition 
algorithm with partial pivoting.  Compared to a software linear solver, the design 
resembles a right-looking frontal sparse linear solver.  Figure 26 illustrates the main 
components of the sparse LU hardware.  The four major components of the hardware are 
the control logic, the cache logic, the pivot logic, and the sub-matrix update logic.  The 
control unit monitors the operation of the other logic units and provides signals to ensure 
synchronization and correct operation of the LU hardware.  The cache unit handles all 
data requests for the pivot logic and the sub-matrix update logic.  The last two units in the 
LU hardware, pivot logic and sub-matrix update logic, perform the two serial steps in the 
sparse LU decomposition algorithm, analogous to assembling the sparse matrix front and 
updating the partially factored sparse matrix using the front.  Floating point arithmetic for 
the calculations is supported by fully pipelined IEEE single precision floating point units. 
 
4.5 Floating Point Units 
 
The sparse LU decomposition algorithm requires three computations: subtraction, 
multiplication, and division.  Support for these computations in the sparse LU hardware 
is provided by IEEE-754 single precision floating point computation units (1985).  The 
target for the design of these units is fully pipelined operation, even for division, and the 
ability to operate at 200 MHz on Stratix FPGAs.  The choice of 200 MHz follows a 
general rule of thumb that states in general 60% to 70% of the maximum frequency of 
operation of a FPGA can be achieved.  All of the units perform IEEE round to nearest 
77 
even rounding, support sub-normal numbers, and special numbers such as infinity (INF) 
and not-a-number (NaN). 
The format for IEEE-754 single precision binary numbers is illustrated in Figure 27.  A 
single bit is used to encode the sign, eight bits encode a biased exponent, and 23 bits are 
used to encode a normalized mantissa.  The encoded exponent is biased by +127 which 
results in a range of values from -126 to +127.  The mantissa is normalized with an 
implied leading 1, resulting in a 24-bit binary encoded significand.  Special values are 
defined for a zero exponent or an exponent of 255.  A zero exponent with a zero mantissa 
represents a value of plus or minus zero.  A zero exponent with a non-zero mantissa 
represents a de-normalized or sub-normal number.  De-norms have a mantissa that is not 
normalized, and allow the representation of very small numbers albeit at reduced 
precision.  Numbers with an exponent of 255 and a zero mantissa are used to represent 
infinity which can result from a divide by zero.  Numbers with an exponent of 255 and a 
non-zero mantissa represents not-a-number (NaN) which can result from a zero divide by 
zero. 
 
Sign Exponent Mantissa
1         8                          23
Value Represented = (-1)Sign x 2Exponent-127 x (1.Mantissa)   
Figure 27.  IEEE-754 Single Precision 32-bit Format 
 
 
 
78 
 The determination of where the computation should be split for a pipeline stage was 
performed empirically.  Each stage was targeted to operate at slightly over 200 MHz 
individually.  This frequency was chosen because it is the maximum frequency of 
operation for 32-bit wide binary addition on Altera Stratix FPGAs.  A summary of the 
features of the floating point units (FPUs) is provided in Table 11.  The resource 
utilization of the FPUs is summarized by the number of logic elements consumed, the 
DSP blocks represent the utilization of embedded multipliers, and memory bits represent 
the utilization of embedded memory blocks.  The Fmax or maximum frequency of 
operation represents the reported maximum frequency of operation for these three units 
with the corresponding number of pipeline stages for these units. 
 
Table 11.  Compilation Results for Floating Point Units on Altera EP1S25F780C5 FPGA 
 Logic 
Elements 
DSP Blocks Memory Bits Fmax 
(MHz) 
Pipeline 
Stages 
FADD 1,774 0 0 227 8 
FMUL 862 8 66 207 10 
FDIV 1,447 34 3,085 219 24 
 
 
 
Where necessary to the implementations these three units all share the same left and 
right shift algorithm and the same leading zero detection algorithm. The shift left and 
shift right algorithm used is a log2 based shifter coded in HDL.  At each level the shift 
amount is a power of two (1, 2, 4, 8, 16).  Five levels are sufficient to implement the 
maximum shift possible for a single precision mantissa.  One of the benefits to this kind 
of implementation is easier pipelining, since the shift is performed at regular levels.  Also 
79 
since the description is in HDL, the shifter is portable between different vendor’s FPGAs.  
The leading zero detection algorithm is a HDL implementation of (Oklobdzija 1994; 
Bruguera and Lang 1999).  A summary of basic floating point design in microprocessors 
can be found in (Patterson, Hennessy et al. 2003). 
 
4.5.1 Floating point addition 
 
The floating point addition unit, illustrated in Figure 28, performs single precision 
floating point addition using a two path algorithm.  A straight forward implementation of 
floating point addition requires the following steps: alignment shift, addition, leading-one 
detection, normalization shift.  In the alignment shift stage, the operand with the smaller 
exponent is right shifted to align the decimal points of the two addends.  After the addend 
decimal points have been aligned they are added in the addition step.  Leading-one 
detection of the sum produced is required to re-normalize the result in the event of 
cancellation due to effective subtraction.  The result of the leading one detection and the 
addition is used to produce a normalized sum.  (Patterson, Hennessy et al. 2003) 
One optimization for latency reduction in floating point addition is the implementation 
of what is commonly referred to as a two path addition algorithm (Quach and Flynn 
1990).  In a two path adder there are two paths for addends to follow.  A “near” path is 
dedicated to operands with exponents differing by no more than one, and a “far” path is 
dedicated to operands with exponents that differ by more than one.  This is done to 
remove a full-width shift from the computation.  The near path does not require a full 
width shift to align the mantissas prior to addition, and the far path does not require a full 
80 
width shift to normalize the mantissas after addition.  This reduces the overall latency of 
the adder at the cost of additional logic resources.  To reduce the latency of the leading 
zero detection in the case of cancellation, leading zero prediction can be performed in 
parallel with addition (Hokenek and Montoye 1990; Quach and Flynn 1991).  The use of 
compound adders can result in additional latency reduction (Brugera and Lang 1998). 
 
dataa
signa expoa manta
datab
signb XOR
prefixexpob mantb
IEEE Special
a_NZI
expoa manta expob mantb
Output Exception
RESULT
predict
+/-
+/-
near_s1near_e1 near_m1
xor
shift
- swap
shift
+
ediff far_m1bfar_s far_m1a stickyfar_e1
LOD
shift
near_m2
-
near_s2near_e2 far_m2far_e2
round +
select
+/-
b_NZI
 
Figure 28.  Floating Point Addition Unit 
 
 
 
4.5.2 Floating point multiplication 
 
81 
The floating point multiplication unit is illustrated in Figure 29.  The design takes 
advantage of the high speed embedded multiplier blocks found in today’s FPGAs.  These 
blocks are pipeline-able for increased frequency of operation.  They can also be cascaded 
to accommodate multiplication of operands with bit widths larger than the size of the 
multipliers.  The exponent path operates in parallel with the mantissa path, and a separate 
path handles IEEE special numbers and exceptions. 
Compared to floating point addition and division, floating point multiplication is 
straight forward to implement.  To support de-normalized numbers, or sub-normals, a 
leading one detector and a full width left shifter is required pre-mantissa multiplication 
and a full width right shifter is necessary post-mantissa multiplication.  The mantissa 
multiplication logic uses the embedded multipliers with pipeline registers for high speed 
multiplication of the mantissas. 
 
82 
dataa
signa expoa manta
LOD
shift
-
expoan mantan
datab
signb expob mantb
LOD
shift
-
expobn mantbn
-
+
x
result lowmultexpoab
xor IEEE Special
a_NZI b_NZI
expoa manta expob mantb
shift
expo resultn sticky2
OR_red-
round
+
+
Output Exception
RESULT
127
0
 
Figure 29.  Floating Point Multiplication Unit 
 
 
 
4.5.3 Floating point division 
 
There are many algorithms which can be used to perform binary division (Obermann 
and Flynn 1997).  The floating point division unit, illustrated in Figure 30, uses a 
Newton-Raphson algorithm to perform the required mantissa division (Flynn 1970).  
Division by functional multiplication, by algorithms such as Goldschmidt or Newton-
Raphson, can take advantage of the high speed multiplier blocks found on today’s 
83 
FPGAs.  The hardware also utilizes the FPGA’s embedded memory blocks to implement 
the lookup tables used to generate the initial guess for the Newton-Raphson method.   
Division can be re-written as the product of the dividend and the reciprocal of the 
divisor, 
b
ab
aQ 1×==  
 
where Q is the quotient, a is the dividend, and b is the divisor.  Using this formulation for 
division, the goal is to compute the reciprocal of the divisor.  The Newton-Raphson 
method applied to solving for the reciprocal of the divisor iterates using the following 
formula, 
( )vvv XbXX ×−×=+ 21  
 
where Xv is the reciprocal from the v’th iteration.  The formula illustrates that each 
iteration of the algorithm requires two serial multiplies and one subtraction. 
The Newton-Raphson method converges quadratically to the desired result.  Ignoring 
rounding errors due to truncation, for binary numbers this is equivalent to a doubling of 
the number of bits of accuracy in the result at each step, e.g. beginning with a single bit 
of accuracy, 64 bits of accuracy results after six iterations. 
1 –> 2 –> 4 –> 8 –> 16 –> 32 –> 64. 
 
It is easy to see that a more accurate initial approximation will reduce the number of 
iterations to compute a reciprocal with the desired number of bits of accuracy.  Instead of 
using a direct approximation of the reciprocal via a lookup table indexed with the top bits 
84 
of the divisor, the starting approximation implementation uses a higher order modified 
linear approximation (Ito, Takagi et al. 1995; Ito, Takagi et al. 1997).  The result is a 
more accurate initial approximation using the same size lookup table at the cost of an 
additional multiplier implemented using the embedded multipliers on the FPGA.   
One of the drawbacks of division by functional multiplication is rounding.  IEEE round 
to nearest even rounding requires the remainder to perform accurate rounding.  Division 
by multiplication however does not produce the remainder as a result of the calculation.  
Instead of explicitly forming the remainder, the algorithm used to implement IEEE round 
to nearest rounding is the one proposed in (Oberman and Flynn 1996).  This method 
produces IEEE round to nearest even rounded results with reduced resources and latency 
compared to explicitly calculating the remainder and using those results for rounding. 
 
85 
dataa
signa expoa manta
datab
signbexpob mantb
IEEE Special
a_NZI
expoa manta expob mantb
Output Exception
RESULT
b_NZI
LOD
shift
-
expoan mantan
LOD
shift
-
expobn mantbn
xor
expoab
expo
-
+
127
0
+
-
x
multB multA
xor
dir quot
trunc- +2
guess
shift
round
1
FSM
 
Figure 30.  Floating Point Division Unit 
 
 
 
4.6 Cache 
 
Similar to high-end microprocessors, today’s high-end FPGAs contain roughly 1 to 2 
MB of on-chip memory.  Unlike microprocessors however, where the on-chip memory is 
concentrated into a few data caches, on FPGAs the on-chip memory is scattered 
throughout the device organized as regular embedded blocks.  Analysis of the memory 
requirements of the matrices encountered in power flow analysis shows that the memory 
footprint of the larger power system matrices, even stored in a sparse matrix compressed 
86 
format, will not fit entirely within the available embedded memory on today’s FPGAs.  
To address this issue the use of external high density memory devices, such as dynamic 
random access memory (DRAM), used to store the necessary data is required.  This is 
much like the case for high performance microprocessor based computers.  DRAM is 
commonly found on prototyping FPGA boards and FPGA vendors provide information 
and/or designs that allow their FPGAs to interface with these external devices.  Using 
external memory to store the matrix frees up the embedded memory for use in the 
implementation of many small buffers, look-up tables, as well as a data cache. 
The use of a memory hierarchy consisting of one or more levels of cache has been used 
in the design of high performance processors in order to address the growing disparity 
between memory performance and the performance of high speed logic, while 
simultaneously addressing the need for a large amount of dynamic data storage 
commonly referred to as “the memory wall” (Patterson, Hennessy et al. 2003).  High 
speed synchronous dynamic random access memory (DRAM) offers high data density 
and bandwidth, but requires several cycles to access the memory array and also requires 
data to be refreshed periodically.  Synchronous random access memory (SRAM) on the 
other hand is not as dense but does not require periodic refresh, can be multi-ported, and 
can provide a much lower access latency, in part due to being incorporated on the same 
die as high speed logic.  The compromise, in order to maximize performance and storage 
space, is the utilization of both in a hierarchical fashion. 
The use of a cache in the FPGA based sparse LU hardware design is to reduce idle 
cycles where computations on valid data could occur.  Memory operations can affect 
computation efficiency if the latency or bandwidth of reads results in idle computation 
87 
units or if computation units must stall to prevent buffer overflow and data loss.  In the 
sparse LU decomposition algorithm chosen, for any given sub-matrix reduction the 
number of memory reads required is roughly equal to the number of memory writes 
required for the reduction.  These reads and writes can be serviced simultaneously for a 
single row or for multiple rows, given an appropriate memory system design.  The sparse 
LU hardware design takes advantage of the dual ported embedded memory blocks found 
on today’s FPGAs to implement an application specific memory hierarchy. 
Data stored in the cache and in the external data storage device use a compressed row 
storage format to allow burst-able and scalable read bandwidth to the sub-matrix update 
units.  For increased read and write bandwidth the data can be interleaved across multiple 
memory banks.  The cache design used is single level and takes advantage of the 
embedded FPGA memory blocks for cache data storage and tag arrays.  Since the 
embedded memory blocks on most FPGAs do not have content addressable memory 
(CAM) functionality, a suitable alternative design is necessary. 
In order to implement CAM functionality for the tag arrays, our design followed the 
method illustrated in (Xilinx 2002).  To implement the CAM read/write ability on an 
FPGA without using programmable logic elements the method uses one-hot encoding of 
the data pattern and stores it using embedded memory blocks.  To read data, the desired 
search pattern is used as the input address to the embedded memory blocks and the 
resulting pattern is decoded to an address in the case of a match.  To write data, the 
pattern is encoded and the appropriate bit is set or cleared in the embedded memory 
blocks.  To increase throughput and frequency of operation these operations can be 
pipelined.  
88 
The cache policy is write-back with read miss allocation and a modified First-In-First-
Out (FIFO) replacement policy to reduce write bandwidth requirements to the external 
DRAM memory.  The replacement policy first selects rows which have been previously 
eliminated.  In the event that none are available, a FIFO policy is used.  The cache is fully 
associative and stores entire matrix rows to allow high speed constant burst read/write 
operations.  The cache tag array and all requests to the cache are by row number with or 
without a full row sized burst.  The choice of a fully associative cache is to avoid data 
thrashing, where the same rows are constantly being read and then replaced, which may 
occur with a set associative design due to the fairly random row read patterns that occur 
during sparse LU decomposition.  Cache read requests are pipelined to reduce latency, 
and are either single word reads in the case of pivot search or variable length full row 
bursts in the case of sub-matrix updating.  Read requests are handled in the order 
received, and in the case of multiple merge units, row misses from one unit are non-
blocking relative to the other unit.  This is accomplished by using miss status holding 
registers (MSHR) that keep track of pending memory operations (Scheurich and Dubois 
1988; Kroft 1998).  Cache write requests are also pipelined and do not share the same 
request port or tag array as read requests to allow fully parallel reads and writes.  A 
victim buffer is used to store replaced rows that are pending write-back so hits can 
proceed in parallel with the write-back to main memory (Jouppi 1998).  Additional tag 
data is used to mark rows that have been eliminated, indicating those rows do not need 
write-back when replaced.  In the case of misses, data is forwarded directly to the 
requesting sub-matrix update unit, and logic insures that space exists for the eventual 
89 
write to the row that will occur.  Additional logic guarantees that no rows in-process will 
be replaced which insures that all write requests will be a cache hit. 
 
Size Table
RAM
Cache
ctrl
Cache
mshr
Victim
Buffer 1
Data Store
RAM
Victim
Buffer 0
Sin to
Pout
Pin to
Sout
Sin to
Pout
Write
Buffer
Write
Buffer
Pin to
Sout Output 0
Output 1
Write
Data 0
Write
Data 1
Read
Data 0
Read
Data 1
To SDRAM
Cache
Request
 
Figure 31.  Top Level Cache Block Diagram 
 
 
 
The block diagram in Figure 31 illustrates the connectivity of the cache design with 
two read and write data ports.  Requests are initially sent to the main cache controller, 
labeled cache ctrl in the diagram, and will be explored in detail below.  Misses are sent to 
the miss handler, labeled cache mshr in the diagram.  This unit will also be explored in 
greater detail below.  Hits are issued by the cache controller directly to the data store, 
which outputs to the appropriate cache port.  FIFO buffers, serial-to-parallel buffers, and 
parallel-to-serial buffers are used to pack and unpack the data as necessary. 
 
90 
 
Tag
Store RAM
Size Table
RAM
Issue
(miss)
Read
Tag_array
Write
Tag_array
Burst
Counter 0
Read
Queue 0
Read Queue
Select
Tag Status
Registers
Write Req
Buffers
Tag
Control
Used
Counter
Replace
Logic
Read
Queue 1
Burst
Counter 1
Read
Control 0
Read
Control 1
Issue
(hit)Tag Reg
Read RequestWrite RequestTo Data Store
To 
Data 
Store
 
Figure 32.  Cache Controller 
 
 
 
All cache requests are sent to the cache controller which is illustrated in Figure 32.   
Multiple read hits are buffered and queued for each read port.  This insures maximum 
utilization of the cache data store bandwidth.  Miss resolution for write requests is not 
necessary since all writes are guaranteed to be cache hits.  Instead the buffers are used to 
store multiple write requests pending results from the write tag array.  The tag_array 
contains the logic necessary to determine if a request for a row is a hit or a miss and is 
detailed below.  The unit labeled tag reg, are the tag registers and can buffer several 
requests while the tag arrays are being accessed, this unit is also detailed below.  
Additional details of the operation of the cache controller are provided below. 
 
91 
Tag_reg
Issue
Control
To
Cache ctrl
Cache
Request
State
Registers
Request
Registers
From 
Cache ctrl
Tag_array
Encoder
To
Tag_reg
Cache
Request
DecoderTAGRAM
 
Figure 33.  Cache Tag Registers and Tag Array 
 
 
 
Separate tag arrays, illustrated in Figure 33, one for each read port and one for each 
write port, determine if a request is a hit or a miss and is used by the hit/miss dispatcher.  
The cache tag arrays are fully pipelined and capable of accepting row numbers at a rate of 
one per clock cycle.  A dedicated write port on each of the tag registers allows tag 
updates to proceed in parallel with tag queries.  Since cache writes have a dedicated port 
to the data store and are guaranteed to hit 100% by design, the write tags are only used to 
determine the write address to the data storage and will never initiate a tag update.  After 
the tags determine the address, the write requests proceed immediately to the cache data 
store.  Lookup tables stored in embedded memory are used to keep track of the row sizes 
of the matrix.  Any necessary updates to the cache row size table are handled in parallel 
with write operations.  Read requests do not modify the lookup table, but are more 
complex and require the support of additional logic. 
92 
Read requests are either single word, in the case of pivot search, or full row burst, in 
the case of sub-matrix update.  The cache read request dispatch logic insures cache 
consistent operation for hits and misses by stalling requests when necessary and 
invalidating hits on pending request to replaced rows.  Row replacement only occurs 
during sub-matrix update, i.e. in the case of full row burst requests.  Multiple pending 
read requests are possible in the controller to reduce the latency of tag requests.  The 
request buffer has a signal to indicate when it is full and can’t store any additional 
requests to prevent buffer overflow.  Cache hits are issued directly to read queues which 
store hit requests and generate burst-able read addresses to the cache data storage.  Cache 
misses generate a tag update, a cache read in the case of a write-back, and a miss request 
to the cache MSHR. 
 
93 
Receive
Control
Request
Control
Miss
Registers
Receive
Counter
Request
Counter
Victim
Counter
Output
Registers
Victim
Data
Miss
Control
Miss
Request
From 
Victim 
Buffer
To SDRAM
From SDRAM
To 
Output
Avalon Master
Output
Avalon Master
Input
 
Figure 34.  Cache MSHR 
 
 
 
The cache MSHR logic unit, illustrated in Figure 34, handles miss resolution separately 
from the hit resolution inside of the cache controller.  Multiple misses are possible and 
can be handled by the logic unit which keeps track of transactions in various states of 
completion.  Separate read request, data receive, victim buffer write control, counters, 
and other buffers allow data parallel operation of this unit.    Since the request and receive 
control are able to operate in parallel, requests can be made for a row while the data is 
being received for a different row, reducing latency for multiple misses.  Misses are full-
row or single word and take advantage of the ability of the external DRAM memory 
devices to provide high speed bursts of contiguous data for reads and writes.  In the case 
of a miss with no replacement, there is no data written back to the external memory.  For 
misses that require a write-back, the row write occurs immediately after the last data 
94 
word for the read miss has been requested.  A separate victim buffer, which stores the 
data pending write-back, exists for each read port in the cache.  Miss data is forwarded 
directly to the requesting logic and is not written to the cache.  Since the algorithm used 
replaces the read data right after computation, writing the miss data to the cache is not 
necessary for consistency.  The MSRH buffers for a pending miss request are cleared 
once the request, receive, and write-back operations are completed. 
 
Sin_to_Pout
Parallel
Data
Serial
Data
Output
Control
Data
Registers
Pin_to_Sout
Serial
Data
Parallel
Data
Output
Control
Data
Registers
 
Figure 35.  Cache Serial/Parallel Conversion Units 
 
 
 
Data storage bandwidth is scaled proportionally to the number of ports by increasing 
the size of the reads and write ports to the data store.  Ports are serviced round-robin to 
guarantee equal and fair bandwidth sharing.  Parallel to serial and serial to parallel 
conversion logic, illustrated in Figure 35, are used in the read/write ports to accommodate 
the increased data storage width. 
 
4.7 Pivot Logic 
 
95 
Pivot search for sparse LU decomposition requires identifying rows which have a non-
zero entry in the current pivot column.  Since the cache stores matrix data in a row-wise 
compressed format, accessing this data without having to search through all rows 
remaining in the matrix requires a column-wise storage of row indices.  The use of an 
additional separate bank of memory to store this representation alleviates this issue.  The 
logic which controls the reading and writing of this bank of memory is called the colmap. 
The logic to perform the pivot search for the sparse LU hardware consists of three 
units, refered to as colmap, swap, and pivot.  The three hardware units that comprise the 
pivot logic are able to operate in parallel.  The pivot selection algorithm used in the 
hardware design is row partial pivoting based solely on numerical criteria and does not 
perform any analysis for potential fill-in reduction.  The pivot logic performs a search, 
element by element, of the current column for the LU decomposition.  The highest 
magnitude element found is selected as the pivot element.  In the hardware design row 
data is not swapped through data reads and writes, instead the reordering is maintained by 
updating a set of pivot tables that contain the current row numbering. 
 
96 
Write
Request
Buffer
Read
ControlReceive
Burst
Counter
Write
Arbitration
Write
Control
Write
Buffers
Column
Size
BRAM
Read
Request
Registers
Read Request Write Request
Request
Burst
Counter
Avalon
Master
From Avalon
SDRAM Controller
To Avalon
SDRAM Controller
Output
 
Figure 36.  Colmap Unit 
 
 
 
The colmap unit, illustrated in Figure 36, has two functions.  The first is to perform 
updates to the column-wise matrix representation resulting from fill-in during the sub-
matrix update.   The second function is a burst read of the column-wise matrix 
representation to form the pivot column during pivot search.  The two operations are 
handled exclusively since sub-matrix update and pivot search are performed serially and 
not simultaneously.  Writes to the colmap are only necessary when fill-in occurs.  These 
writes are buffered and are written to memory un-sorted in order by request.  The colmap 
data stores all entries in the matrix, including those that have already been eliminated.  
The rows that have been eliminated are not relevant to the current update and will be 
rejected by subsequent pivot search logic. 
 
97 
Physical
To
Virtual
Row
Index
BRAM
Virtual
To
Physical
Row
Index
BRAM
Swap
Register A
Swap
Register B
Swap
Control
Swap
Register A
Swap
Register B
Swap
ControlFrom Pivot To Pivot
From Control
From Cache
From Pivot
 
Figure 37.  Swap Unit 
 
 
 
The swap unit, illustrated in Figure 37, maintains a mapping of the pivoting operations 
that have occurred by using two lookup tables.  This is used only during pivot search to 
reject candidate rows from the colmap which have already been eliminated.  Rows which 
are not rejected are sent to the cache read queue as single word read requests.  The two 
tables store the physical to virtual and virtual to physical row mappings.  The physical 
row number is the initial index assigned to a given row.  The virtual row number is the 
current in process row number for a given row.  The values in these tables are modified 
only as a result of pivoting. 
 
98 
Read
Buffer Reject Compare
Control
Index
Register
Value
Register
From Colmap
Pivot Done
To Cache From Swap
From Control
Pivot Index
Pivot Value
 
Figure 38.  Pivot Unit 
 
 
 
The pivot unit, illustrated in Figure 38, performs a floating point magnitude 
comparison for the pivot search and is also only active during pivot search.  The pivot 
unit is fully pipelined and is able to accept, reject, and compare new operands at a rate of 
one per clock cycle.  This is the only unit that references virtual row numbers for the 
matrix.  The virtual row number is only used to reject those rows which have already 
been eliminated.  For all entries that have not been rejected, the current maximum 
magnitude entry’s physical row index and its associated value are registered as the 
comparison occurs.  In parallel a separate set of buffers stores all of the indices and 
values read from memory.  Once an exhaustive search of the pivot column is finished, the 
pivot unit indicates that the current pivot search is complete so the swap unit can update 
the row mappings and the sub-matrix update can begin.  The pivot column element 
recorded in the index and value registers is the pivot entry, and the indices and values 
stored in the separate buffer comprise the pivot column. 
 
4.8 Sub-matrix Update Logic 
 
99 
The sub-matrix update unit has two main computations, the normalization of the pivot 
column by the pivot element, followed by a reduction of the remaining sub-matrix by the 
product of the pivot row and the pivot column.    Within the sub-matrix update logic there 
are two types of units, data flow control units and computational units.  The sub-matrix 
update unit uses the results of the pivot search: the elements of the pre-normalized pivot 
column and the pivot element.  These operands are sufficient to compute the values 
necessary to reduce the sub-matrix.  In order to determine whether or not these values are 
used to update an existing value or create a fill-in, a comparison between the indices of 
the pivot row and the updated row is required.  The loading of the row indices and values 
is handled by the data management units.   
 
FILTER
MER_MEM
Control
Issue
Logic
Issue
CounterControl
Row Index
Buffer
Port
Buffer
Result
Buffer
Pivot Done
Pivot Index
Lower Indices
Read Port
Register
Row Index
Register
Lower Result
To Cache
To Cache
From Merge
 
Figure 39.  Sub-Matrix Update Filter and Merge_Memory Units 
 
 
 
The memory read requests required to complete the sub-matrix reduction are controlled 
by two units called filter and merge memory, illustrated in Figure 39.  The filter unit and 
100 
the mer_mem unit also control the flow of rows for the sub-matrix update.  Valid 
operands are sent by the filter unit to the divide unit, illustrated in Figure 40, which 
performs floating point division of the pivot column values by the pivot element value in 
single instruction multiple data (SIMD) fashion.  If the filter unit detects the pivot 
element in the pivot column it skips that entry and proceeds to the next entry in the 
buffer.  The first row requested by the mer_mem unit is always the pivot row.  This 
request is marked as the pivot row which notifies the cache to broadcast the returned data 
to all of the sub-matrix row update units.  After the pivot row is requested, the other 
entries in the pivot column buffer, after passing through the filter unit, are requested in 
first-in-first-out order.  In the case of multiple sub-matrix update units, the mer_mem unit 
also handles round-robin row assignments to distribute the row updates to each unit, and 
marks where the requested data should be sent.  The mer_mem unit also performs flow 
control by regulating the maximum number of rows in flight in the merge unit(s) to 
prevent data buffers from overflowing.   
 
101 
LU_FMUL
Buffer
Buffer
Buffer
FMULFlowControl
From
Cache
To Select
From
FDIV
LU_FDIV
Buffer
FDIV
To Merge
From
Pivot
Port
Number
 
Figure 40.  LU Floating Point Multiply and Divide Units 
 
 
 
The merge unit performs the sub-matrix reduction for the remaining matrix in a row-
by-row fashion.  Each element of the sub-matrix that needs to be reduced can be 
computed in parallel so the use of fully pipelined floating point units and logic allows 
increased performance by taking advantage of this data parallelism.  One such computing 
paradigm that can take advantage of this parallelism is called stream computation.  
Stream computation involves sending data to be operated on through a common 
computation kernel which performs the necessary sequence of computations in order to 
provide the desired result.  Stream computing is also used to describe general purpose 
graphics processor unit (GPGPU) based high performance computing.  The difference 
102 
here is that the FPGA hardware is the computation kernel.  Another benefit of stream 
computing is hiding the latency of future memory operations behind the computation of 
previously fetched data.  With enough data computation the computational units can be 
fully utilized. 
Once the necessary pivot row data has been loaded into data buffers, the computational 
units can accept the streaming data input in the form of a vector of values to be 
normalized or a row vector to be updated.  This part of the sparse LU decomposition 
algorithm is where all of the floating point computations occur and it is also the most 
time consuming portion of the sparse LU decomposition.  All of the logic designed is 
fully pipelined and capable of accepting and generating operands at a rate of one per 
clock cycle for maximum throughput.  Computations are performed on a row-by-row 
basis, with the possibility of multiple rows occupying the sub-matrix update logic, to 
fully take advantage of the pipelined floating point units.   
 
103 
Control
LU_FMUL
FADDTEST SELECT
Offset
Counter
From
Cache
To Cache
To Cache
To Colmap
To 
LU_Control  
Figure 41.  Sub-Matrix Update Merge Unit 
 
 
 
The merge unit, illustrated in Figure 41, performs three tasks which can all be 
performed in parallel.  The first is calculating the product of the pivot row and an element 
of the pivot column.   The second is a comparison of the pivot row indices to the sub-
matrix row indices to determine the non-zero structure of the reduced row.  The third task 
is the subtraction for sub-matrix reduction which is performed by the floating point 
addition unit.   
The product of the elements in the pivot row and an element of the pivot column are 
computed by the multiplication unit, illustrated in Figure 40.  The computation is handled 
by the floating point multiplication unit.  This unit is supported by several buffers to 
allow maximum throughput.  Circular buffers are used to recycle the pivot row to 
maximize data re-use to reduce cache requests.  An output buffer is used to queue results 
104 
for low latency access to operands.  When the buffer is near full, a signal prevents the 
issuing of additional multiplications to insure the buffer never overflows. 
 
Multiply Indices
Row Indices
Register Y
Register B
Register A
Register X
Compare
Compare OP
Row Values
Index
Compare
Circular
Buffer
Update
Buffer
Row
Register
Done
Register
Flags
Control/
Read Flags
Merge
Logic
Row Number
Buffer
Row
Value
Cache EOLs To
Select
Buffer
 
Figure 42.  Sub-Matrix Update Test Unit 
 
 
 
The test unit, illustrated in Figure 42, performs the index comparison for the row 
merge.  Row indices are maintained in sorted order to allow fast comparison and 
operation determination.  The three possible operations are element copy, element fill-in, 
or element update.  A copy is a pre-existing value in the sub-matrix row that is not 
required to be updated.  An update is a pre-existing value that needs to be reduced by the 
product of an element of the pivot row and an element of the pivot column.  Finally a fill-
in is a non pre-existing value that will become non-zero as a result of zero being reduced 
by the product of an element of the pivot row and an element of the pivot column.  A fill-
105 
in also reqires a signal be sent to the colmap in order to update the column-wise matrix 
representation. 
 
Full
Empty
Used
Counter
Read
Pointer
Write
Pointer
Block
RAM
Read
Write
Data In
Data Out
Empty
Full
 
Figure 43.  First-In-First-Out Buffer 
 
 
 
To increase performance, the test unit uses high speed first-in-first-out (FIFO) buffers, 
illustrated in Figure 43, implemented using embedded memory on the FPGA.  A circular 
buffer containing the pivot row indices is used to take advantage of data re-use and 
reduces cache memory requests.  The read signal to these buffers is de-coupled from the 
comparison logic to increase the operating frequency of the test unit.  This requires 
additional registers and additional comparison logic, but does not impact the ability of the 
unit to generate operands at a rate of one per clock cycle provided valid input data is 
available. 
 
106 
Issue
Control
From
TEST
To FADD
From
FMUL
A
B
Fill
Read Commands
Copy
FADD
 
Figure 44.  Sub-Matrix Update Select Unit 
 
 
 
 The select unit, illustrated in Figure 44, issues the necessary operands to the floating 
point addition unit to calculate the subtraction.  Once the appropriate operation has been 
determined for a particular update element by the test unit and the product which 
determines the value of the update has been computed by the multiplication unit, the 
select unit can issue a subtraction to the floating point addition unit.  Results of the 
floating point addition unit are sent immediately to the cache as write requests.  The write 
offset for these results is calculated by an offset counter.  The last element written to a 
row is used to update the row size table in the cache. 
Additional parallelism can be exploited by increasing the bandwidth and number of 
request ports to the cache and instantiating multiple merge units to allow multiple row 
reductions in parallel.  Since the rows for a given sub-matrix update can be operated on 
independently, multiple merge units take advantage of this data and computational 
107 
parallelism.  The floating point division unit is not duplicated since it is able to provide a 
valid result on every clock cycle.  Once the sub-matrix update has completed, the 
hardware signals the control unit so the pivot logic can begin the search for the next pivot 
element. 
 
4.9 Hardware Debugging 
 
Debugging the hardware design on FPGA required the design and insertion of 
additional logic to monitor and record data values during operation.  A special debug 
mode of operation was added to the overall sparse LU control logic which stalled the 
sparse LU hardware’s operation after each sub-matrix update operation was completed.  
A register maintained the current state of the control unit and was used to detect hardware 
deadlocks.  Several banks of FIFO buffers were instantiated along with the sparse LU 
hardware design to record data during the operation of the sparse LU hardware.  Writes to 
these FIFOs were triggered by events such as cache requests, valid computation results, 
and valid cache data output.  The status register was polled until the hardware reached the 
debug state during which the data stored in the FIFOs was flushed.  The data recorded by 
these FIFOs was compared to expected values based on a concurrent software simulation 
to verify that the hardware was operating correctly.  The use of these integrated hardware 
debugging buffers allowed monitoring of the internal operation of the sparse LU 
hardware without the use of external probes.  The status register also reported 
information such as clock cycle count which is useful for performance evaluation. 
 
108 
CHAPTER 5.  BENCHMARK RESULTS 
 
5.1 Introduction 
 
In this section, the results of the hardware prototype are presented.  An explanation of 
the benchmark system that is used to evaluate the relative performance of the hardware is 
detailed.  Synthesis results for the hardware prototype implementation are detailed.  A 
software performance model is used to project the performance of the hardware design 
for systems that exceed the hardware resources of the prototype board.  The accuracy of 
this software is verified using the hardware prototype on systems which fit within the 
constraints of the prototype board used.  A presentation of the amount level of fine-
grained parallelism and high-level parallelism available in the solution of power system 
matrices is presented.  The performance of the hardware design on power system 
matrices is used to calculate the relative amount of speedup that can be obtained through 
the application of this hardware to the computation of Newton power flow relative to the 
benchmark system. 
 
5.2 Benchmark System 
 
In order to provide a reference for the performance of the sparse LU decomposition 
hardware, several state of the art linear solver packages were tested for their performance 
using a system based on a general purpose microprocessor.  The benchmark system used 
was an Intel Pentium 4 2.6 GHz personal computer running Mandrake Linux 9.2 as the 
operating system.  All linear solver packages were compiled using GCC 3.3.1 (flags –O3 
109 
–fPIC) and utilize the ATLAS (Automatically Tuned Linear Algebra Software) C BLAS 
libraries (Whaley and Petitet 2005) for the dense matrix subroutines that these packages 
require. Three state of the art solvers were tested as packaged using all defaults, 
UMFPACK v4.1 (Davis and Duff 1997), SuperLU v3.0 (Demmel, Eisenstat et al. 1999), 
and WSMP v4.04.30 (Gupta 2000).  The solve time reported by these linear solver 
packages on power system Jacobian matrices is summarized in Table 12. 
 
Table 12.  Solve Time in Seconds for Sparse Linear Solver Packages 
 UMFPACK SuperLU WSMP 
1648Bus 0.07 0.10 0.28 
7917Bus 0.37 0.67 1.78 
10278Bus 0.47 0.66 2.48 
26829Bus 1.39 2.57 7.43 
 
 
 
For the LU decomposition of our benchmark matrices, UMFPACK provided the best 
performance on all of the tested systems, so it was chosen to be the reference for 
comparisons to the performance of the sparse LU hardware.  All references to the 
performance of the benchmark system use the CPU time reported by UMFPACK for 
numerical solve, and not the wall clock time.  It should also be noted that the pre-LU 
decomposition ordering used by UMFPACK is AMD, the same ordering chosen for the 
sparse LU hardware performance tests. 
 
5.3 Hardware Prototype 
 
110 
The Sparse LU hardware was implemented using SBS Technologies Tsunami FPGA 
computing platform interfaced with an Intel Xeon processor based personal computer 
system.  The Tsunami is a 64-bit/66 MHz PCI board containing an Altera 1S25 FPGA, 
1MB ZBT SRAM, 256MB SDRAM, and 4 FPGA expansion sites.  Each FPGA 
expansion site can house a Tsunami expansion module which contains an Altera 1S25 
FPGA, 2MB ZBT SRAM, and 192MB SDRAM.    The Tsunami PCI board operates at a 
default 50 MHz clock rate for all FPGA and memory devices and is generated by an on 
board oscillator. 
Sparse LU hardware implementations of single and dual merge units were verified 
using a single Tsunami board expansion module.  Both are capable of performing LU 
decomposition for matrices in size up to our 1648 Bus benchmark system.  This is due to 
limitations on the available embedded memory blocks on the FPGA for the prototype 
design.  Implementations for greater than two parallel merge units are also not possible 
due to insufficient logic elements on the Tsunami board’s Altera 1S25 FPGAs.  The 
designs were tested against software operating on the host personal computer which was 
verified using Matlab sparse LU solve routines.  The software computes the LU 
decomposition in an identical fashion to the hardware and compares the lower triangular 
and upper triangular indices and single precision values to the hardware results.  The 
floating point values were compared exactly, insuring bit identical results between the 
FPGA floating point computations and the single precision floating point values 
computed on the host personal computer. 
 
111 
Debug 0 Slave
Debug 1 Slave
Result to X-BUSLUHW Control
Cache Master
Cache Slave
Colmap Master
Colmap Slave 64MB SDRAM
64MB SDRAM
64MB SDRAM
Tsunami X-BUS
Avalon SDRAM
Controller
Avalon SDRAM
Controller
Avalon DMA0
Read Master
Avalon
Block RAM
PLX Bridge
Memory Master
PLX Bridge
Register Master
PLX PCI BRIDGE (to PC)
Avalon DMA0
Write Master
Avalon DMA0
Control Slave
Stratix 1S25 FPGA
LU Hardware
 
Figure 45.  Tsunami Sparse LU Hardware Prototype 
 
 
 
Figure 45 is a block diagram of the components in the sparse LU hardware prototype 
design.  The prototype is a fully synchronous design which runs all of the logic on the 
FPGA and the external memory devices run at the same clock frequency.  Altera’s 
Avalon bus fabric, which implements a generalized point to point bus with byte 
addressing and arbitration features, was used to connect the hardware to the external 
memory devices as well as the host PC.  The arbitration features of the Avalon Bus allow 
the connected slave peripherals to receive commands from multiple master devices.  This 
is necessary since the memory contents need to be initialized by the host PC prior to 
calculation and then accessed by the sparse LU hardware during calculation.   
The interface between the host PC and the Tsunami board is via a PLX PCI bridge 
chip.  This chip connects the host’s PCI bus to the PLX Bridge Avalon bus interface 
112 
components instantiated on the FPGAs using a memory-mapped architecture.  The FPGA 
interface to external memory utilizes Altera’s Avalon SDRAM controller modules.   The 
sparse LU hardware contains two Avalon bus master devices that can initiate memory 
requests to the SDRAM controllers.  The Avalon SDRAM controller module is capable 
of providing one data word per clock cycle in the case of linear burst memory reads.  The 
sparse LU hardware lower and upper triangular results are recorded by the neighboring 
FPGA which is connected by the Tsunami X-Bus.  The Tsunami X-Bus is a set of 59 
wires that connect neighboring FPGAs on the Tsunami PCI board. 
Operation of the sparse LU hardware prototype is via a C program that utilizes the 
Tsunami device driver libraries.  These libraries contain function calls that allow users to 
program board specific devices such as the frequency of operation as well as access the 
Tsunami memory-mapped device address space through single word register and multi-
word direct memory access (DMA) reads and writes.  This allows the user program to 
access the hardware devices instantiated on the FPGA(s).  The operation of the prototype 
is summarized in Figure 46.  The program also includes debugging, error detection, and 
performance information. 
 
113 
1. Open Tsunami Device Handle
2. Set Board Frequency
3. Program FPGAs
a. Onboard: Base config for PLX PCI bridge
b. Module 1: LU Hardware
c. Module 2: Result Storage
4. Open Tsunami DMA channel
5. Run LU HW
a. Load data from file
b. Initialize hardware (DMA/Register Write)
i. Write colmap column size to colmap size table on FPGA
ii. Write sdram row size to sdram size table on FPGA
iii. Write colmap data to colmap data storage
iv. Write sdram data to sdram data storage
v. Verify colmap data and sdram data
c. Send LU HW matrix size and start signal
d. Poll LU HW status register for done flag
e. Run software based LU
f. Verify LU HW results using software based LU
g. Read clock cycle timer from hardware and report
6. Close DMA channel
7. Close Tsunami Device Handle
 
Figure 46.  Sparse LU Hardware Prototype Operation 
 
 
 
The current prototype design, using Altera Quartus II synthesis, reports  ~120MHz 
Fmax post-fit for both the single and dual merge unit versions of the sparse LU hardware 
targeted to an Altera Stratix (-5 speed grade) FPGA.  Post-fit results for Altera Stratix II 
(-3 speed grade) FPGA report ~170 MHz Fmax.  Preliminary timing model post-fit 
results for Altera Stratix III (-3 speed grade) report ~200 MHz Fmax.  The FPGA 
resource utilization for both designs on a Stratix FPGA is summarized in Figure 47.  
These figures include resources used for debugging logic as well as other components 
specific to the Tsunami PCI board. 
 
114 
Single Merge Dual Merge
Logic Cells Memory Bits DSP Elem Logic Cells Memory Bits DSP Elem
25660 1944576 80 25660 1944576 80
Used 17468 1540912 56 24233 1678128 64
68% 79% 70% 94% 86% 80%
LU System 14048 1506864 56 20794 1644080 64
col_filter 247 5888 0 343 6144 0
lu_cache 3911 1226800 0 5268 1270960 0
lu_colmap 336 46080 0 505 50432 0
lu_control 149 0 0 215 98 0
lu_debug 924 27520 0 887 24960 0
lu_divide 2819 2048 48 2868 2048 48
lu_merge 4292 59520 8 4255 59520 8
lu_pivot 539 7936 0 535 7936 0
lu_result 548 16384 0 557 16384 0
lu_swap 188 114688 0 188 114688 0
mer_mem 44 0 0 38 0 0
Other 3420 34048 0 3253 34048 0
sdram_0 868 0 0 820 0 0
sdram_2 644 0 0 597 0 0
zbt_sram_0 271 0 0 270 0 0
onchip_memory_0 2 32768 0 2 32786 0
dma_0 370 256 0 369 256 0
plx9656_avalon_bridge 511 1024 0 487 1024 0  
Figure 47.  Sparse LU Hardware Prototype Resource Utilization 
 
 
 
The majority of the increase in FPGA resource utilization, which totals roughly 40%, 
when adding an additional merge unit is consumed by the additional merge unit and by 
the cache unit.  The addition of a merge unit requires additional cache logic to service the 
extra read and write ports as well as to perform parallel-to-serial and serial-to-parallel 
data handling.  Each merge unit consumes roughly 4,300 logic cells of which roughly 
2,500 are used to implement the floating point multiplier and adder.  Each floating point 
multiplier also uses eight DSP elements which contain the embedded multipliers.  
Additional embedded memory (59,520 bits) for the merge unit buffers is also consumed.  
The increased resources required for scaling the number of merge units is easily 
115 
accommodated by today’s FPGAs which contain over 100k logic cells, thousands of 
kilobits of embedded memory, and hundreds of embedded multipliers. 
Part of the hardware prototype design includes performance monitoring.  After LU 
decomposition has been completed, the hardware prototype registers the total number of 
clock cycles required to compute the solution using a hardware counter.  The counter is 
controlled by the status registers of the sparse LU hardware control unit and increments 
on every clock cycle that the sparse LU hardware is not idle.  The resulting number of 
clock cycles and the frequency of operation for the system can be used to calculate the 
solve time with greater accuracy than running a performance timer on the host system.  
The number of clock cycles for the hardware prototype to compute the LU decomposition 
of several power system Jacobian matrices is summarized in Table 13 and in Table 14.  
The number of clock cycles required to compute the LU solution varies from run to run 
due to the variable access time of the external SDRAM memories.  The performance 
improvement that is obtained by adding an additional merge unit is roughly 30% for the 
largest system tested on the hardware prototype.  The latencies of the floating point units 
in the computational pipeline can bottleneck the performance if the sub-matrix update 
does not contain enough operations to keep all of the units busy, which is illustrated by 
the minimal gains obtained for the smaller power system Jacobian matrices.  The increase 
in merge units also does not impact the number of cycles required for pivot search. 
 
116 
Table 13.  Number of Clock Cycles to Perform Sparse LU Decomposition with Two Merge Units 
 Run #1 Run #2 Run #3 Average 
6 Bus 742 730 735 736 
14 Bus 2,468 2,477 2,465 2,470 
30 Bus 6,424 6,403 6,423 6,417 
57 Bus 13,444 13,460 13,460 13,455 
118 bus 20,759 20,765 20,780 20,768 
1648 Bus 628,333 628,381 628,483 628,399 
 
 
Table 14.  Number of Clock Cycles to Perform Sparse LU Decomposition with One Merge Unit 
 Run #1 Run #2 Run #3 Average 
6 Bus 751 746 760 752 
14 Bus 2,690 2,662 2,666 2,673 
30 Bus 7,079 7,081 7,089 7,083 
57 Bus 14,861 14,867 14,870 14,866 
118 bus 21,984 21,993 21,978 21,985 
1648 Bus 808,945 808,742 808,733 808,807 
 
 
 
The efficiency of the hardware design can be measured by counting the number of 
floating point operations performed per clock cycle or by counting the number of merge 
unit operations performed per clock cycle.  Figure 48 illustrates the efficiency of the 
hardware design using these two measures.  As the size of the systems increases the 
additional computations increases the efficiency of the sparse LU hardware design by 
reducing the impact of the latency of the computational units.  This is also due to the 
amount of cycles spent in pivot search relative to the amount of time spent performing 
the sub-matrix update.  The number of operations to perform sparse LU decomposition 
117 
increases at a higher rate than the number of non-zeroes in the sparse matrix which 
results in more time being spent performing computations rather than performing pivot 
search. 
 
Hardware Efficiency
# of sparse matrix operations per cycle
# of floating point operations per cycle
0.00
0.20
0.40
0.60
0.80
1.00
6 Bus 14
Bus
30
Bus
57
Bus
118
Bus
1648
Bus
OPs/Cycle
FLOPs/Cycle
 
Figure 48.  Sparse LU Hardware Computation Efficiency 
 
 
 
The performance of the hardware prototype design relative to the Pentium 4 based 
benchmark system is governed by the clock speed at which the hardware operates.  The 
performance of the hardware prototype will scale linearly with clock rate since it is a 
fully synchronous design.  There is nothing that precludes multiple clock domains within 
the design or other more complex performance optimizations.  Figure 49 plots the 
achieved number of floating point operations per second and the relative speedup the 
hardware prototype would provide for computing the LU solution of the 1648 Bus 
Jacobian power system matrix relative to the performance of the Pentium 4 based 
benchmark system.  The left axis corresponds to the solid line which represents the FLOP 
rate that the hardware achieves. The right axis corresponds to the dashed line which is the 
118 
corresponding hardware speedup.  The range of clock frequencies plotted covers a range 
of realistic operating frequencies for the current design. 
 
LU Hardware Performance Dual Merge Units
0.00
50.00
100.00
150.00
200.00
250.00
50 100 150 200 250
LU Hardware Clock Frequency
M
FL
O
Ps
0.00
5.00
10.00
15.00
Sp
ee
du
p 
vs
. P
4
MFLOPs
Speedup
 
Figure 49.  Frequency Scaling of Sparse LU Hardware Prototype with Dual Merge Units 
 
 
 
5.4 Performance Modeling 
 
A parameterized software model was created to accurately simulate the performance of 
the sparse LU hardware design.  A software model of the hardware design can be used to 
quickly determine the impact of changing the design without having to design and debug 
additional hardware.  The software model can also be used to simulate the performance 
of the hardware for power system matrices that exceed the resources of the hardware 
prototype.  Figure 50 plots the relative performance reported by the software model 
compared to the actual performance reported by the hardware prototype.  The results 
indicate, for the systems tested, the software model is accurate to within 4% of the actual 
performance.  The parameters in the software model include SDRAM latency, cache 
latency, floating point unit latency, cache size, and number of merge units. 
119 
 
0.00
0.20
0.40
0.60
0.80
1.00
1.20
1.40
14 Bus 30 Bus 57 Bus 118 Bus 1648 Bus
N
or
m
al
iz
ed
 H
ar
dw
ar
e 
Pe
rfo
rm
an
ce
Model
Single
Dual
 
Figure 50.  Sparse LU Hardware Prototype Performance Normalized to Software Model 
 
 
 
The performance of the sparse LU hardware design on the larger benchmark power 
system matrices, which are unable to be run on the prototype due to resource limitations, 
can be projected using the software model.  The scaling of the sparse LU hardware design 
with greater than two merge units can also be projected.  These projections are illustrated 
in Figure 51 for the 1648 Bus, 7917 Bus, and 10278 Bus power system Jacobian 
matrices.   
 
120 
0
5
10
15
1 2 4 8 16
# of Merge Units
(c)
S
pe
ed
up
150 MHz 200 MHz 250 MHz
 
0
5
10
15
1 2 4 8 16
# of Merge Units
(b)
S
pe
ed
up
150 MHz 200 MHz 250 MHz
 
0
5
10
15
1 2 4 8 16
# of Merge Units
(a)
S
pe
ed
up
150 MHz 200 MHz 250 MHz
 
Figure 51.  Projected Multiple Merge Unit Scaling (a) 1648 Bus (b) 7917 Bus (c) 10278 Bus 
 
 
 
The scaling of the hardware is limited by two factors.  The first is the amount of work 
available to be distributed amongst the multiple merge units, i.e. the number of rows in 
121 
the sub-matrix update.  Assuming an infinite number of merge units the limitation 
becomes the latency of the functional units and the size of the largest row to be updated.  
The second limitation, illustrated in Figure 52, shows that the amount of time spent 
during pivot search becomes more significant as the number of merge units increases 
relative to the amount of time spent computing the sub-matrix update. 
 
Single Merge Unit Breakdown
Pivot
21%
Update
79%
Dual Merge Unit Breakdown
Pivot
32%
Update
68%
4x Merge Unit Breakdown
Pivot
41%
Update
59%
8x Merge Unit Breakdown
Pivot
47%Update
53%
 
Figure 52.  Relative Time Spent During LU Decomposition 
 
 
 
Performing the LU decomposition without partial pivoting is possible for load flow 
computation since the Jacobian matrix is guaranteed to have a large non-zero on the 
diagonal due to the way the matrix is constructed from the Y-bus matrix.  This is simply 
an application of Tinney’s optimally ordered factorization (Tinney and Walker 1967).  
122 
This allows a reduction in the hardware solve time by removing the time spent by the 
hardware performing pivot search.  This optimization can only be used with a 
symmetrical LU decomposition pre-ordering if the ordering is performed separately from 
the decomposition, otherwise a non-zero diagonal entry is not necessarily guaranteed.  
Figure 53 illustrates the projected speedup of forcing diagonal pivoting on the LU 
hardware for one to eight merge units for three of the benchmark power system matrices 
relative to LU hardware performing partial pivoting.  By forcing diagonal pivoting, the 
hardware no longer has to perform pivot search which becomes significant as the number 
of merge units increases.  Another benefit to forcing diagonal pivoting is a maintaining of 
the symmetry in the power system matrix, which in the case of the 10278 bus system also 
significantly reduces the fill-in.  It is also possible to eliminate the colmap for a matrix 
that is symmetrical if diagonal pivoting is also used since the row and column structures 
will be identical. 
 
0
0.5
1
1.5
2
1p 2p 4p 8p
Number of merge units
S
pe
ed
up
1648 Bus 7917 Bus 10278 Bus
 
Figure 53.  Diagonal Pivoting Speedup vs. Number of Merge Units by Power System 
 
123 
 
 
Another aspect of the hardware design which is of importance is the scaling of the 
hardware’s performance with respect to matrix size.  Figure 54 shows the CPU time 
reported by UMFPACK for matrix ordering, LU decomposition, and forward/backward 
solve.  Also included is the solve time for the lu hardware with one merge unit.  This 
graph illustrates that the lu hardware scales similarly to the ordering time.  All of the 
components plotted display a fairly linear scaling with respect to matrix size.   
 
0.0000
0.0500
0.1000
0.1500
0.2000
0.2500
0.3000
0.3500
0.4000
0 10000 20000 30000 40000 50000 60000
Matrix Size
Ti
m
e 
(s
)
Ordering lu solve luhw
 
Figure 54.  Scaling of UMFPACK and LU Hardware with Respect to Matrix Size 
 
 
 
Figure 55 shows the cache hit rate as a function of the number of rows in the cache for 
various power system matrices.  The results indicate that beyond a certain point, 
increasing the size of the cache does not have a significant impact on the cache row hit 
rate regardless of size.  As the size of the power system grows, and consequently the size 
124 
of the sparse matrix, the minimum number of rows to reach the optimal point grows 
slightly. 
 
50
55
60
65
70
75
80
85
90
95
100
0 50 100 150
Number of Rows
C
ac
he
 H
it 
R
at
e
1648 Bus 7917 Bus 10278 Bus
 
Figure 55.  Cache Row Hit Rate vs. Number of Rows in the Cache by Power System Matrix 
 
 
 
Figure 56 shows that the number of clock cycles required to perform the LU 
decomposition using a single merge unit LU hardware implementation scales linearly 
with the number of FLOPs that is required to compute the solution.  This empirical data 
suggests that the hardware implementation can perform sparse LU decomposition in time 
proportional to FLOPs. 
 
125 
FLOPs vs. Clock Cycles y = 0.7359x
R2 = 0.994
0
5,000,000
10,000,000
15,000,000
20,000,000
25,000,000
0 5E+06 1E+07 2E+07 2E+07 3E+07 3E+07
Clock Cycles
FL
O
P
s
 
Figure 56.  Flops vs. Number of Clock Cycles for LU Decomposition 
 
 
 
5.5 Parallelism 
 
The amount of computational parallelism during sparse LU decomposition is affected 
by the ordering method used prior to decomposition as well as the method used to 
perform the decomposition.  Parallelism in computation typically falls into two 
categories, fine-grained and coarse-grained.  Fine-grained parallelism is exploited by 
arranging computations so that they can be computed in SIMD or vector fashion.  
Coarse-grained parallelism divides the whole problem into several non-dependent sub-
problems that can be distributed to multiple processors and then re-assembled to form the 
overall solution.  Use of one form of parallelism does not preclude the use of another, 
however use of coarse-grained parallelism can affect the amount of fine-grained 
parallelism. 
126 
Extraction of fine-grained parallelism involves tailoring the method used to perform 
the sparse LU decomposition, e.g. right looking uni-frontal, to allow parallel 
computation.  In the case of the sparse LU hardware design, fine-grained parallelism is 
exploited through the use of fully-pipelined streaming multiple merge units for the 
computation of the sub-matrix update.  Pipelined computational units are used to take 
advantage of the element-to-element parallelism within each row.  Multiple merge units, 
each assigned to a sub-matrix update row, takes advantage of the row-to-row parallelism 
within each sub-matrix update.  The limitation in the amount of parallelism that can be 
extracted this way is the number of rows per sub-matrix update and the pipeline length.  
Figure 57 plots the performance scaling achieved by using multiple merge units.  The 
performance begins to tail off significantly between four and eight merge units, which is 
roughly the limit that was predicted by the initial profiling study performed on the 
benchmark power system matrices.  The profiling results from the study of the pivot 
column size for a given sub-matrix update reported the mean and median size both fall 
within this range.  Scaling the number of merge units beyond eight results in diminished 
returns as a majority of the sub-matrix updates are unable to take advantage of the 
additional computational resources which results in idle computational units. 
 
127 
1
1.1
1.2
1.3
1.4
1.5
1.6
1 4 7 10 13 16
# Merge Units
Sp
ee
du
p
 
Figure 57.  Sparse LU Hardware Projected Speedup Relative to Single Merge Unit 
 
 
 
Coarse-grained parallelism is not specifically built into the prototype sparse LU 
hardware design unlike fine-grained parallelism.  The use of AMD for the sparse matrix 
ordering prior to LU decomposition does not result in a sparse matrix that is particularly 
amenable to exploitation through coarse-grained parallelism since the only goal of this 
ordering is reduction of fill-in.  Instead a different matrix ordering strategy should be 
used which results in a sparse matrix with a non-zero pattern that is easier to decompose 
into larger tasks without a significant increase in fill-in relative to AMD orderings which 
would offset the performance gained through increased computational parallelism.  
Maximizing the performance obtained through coarse-grained parallelism requires 
equally distributing the computational work load among the multiple processors and 
keeping all processors busy computing instead of idling. 
128 
A sparse matrix with a BBD structure is amenable to coarse-grained parallel 
decomposition and can be solved using several sparse LU hardware units, with each unit 
assigned to compute the LU decomposition of one of the independent diagonal blocks of 
non-zeros in the matrix.  The contributions of these computations to the final diagonal 
block, due to the coupling equations, would need to be computed before the 
decomposition of the final block can occur.  A sparse matrix with a nested BBD structure 
with each level having two diagonal blocks can be efficiently decomposed by using a tree 
structure.  An illustration of this computation tree for a two-level recursive BBD structure 
is provided in Figure 58 and Figure 59.   
 
129 
A1
A2
A1
A2
B1A2A1
A3
A4
A3
A4
B3A4A3
A1
A1 C1
A2
A2
B1
B1
A3
A4
B3
B3A4A3
 
Figure 58.  Two Level Recursively Partitioned BBD Sparse Matrix 
 
 
 
130 
A1 A1
A1
A2 A2
A2B1 B2
A1
B1
A1 B1 C1
A2
B2
A2 B2 C2
A3 A3
A3
A3
A3
A4 A4
A4 B4
A4
B4
A4 B4 C4
B12 B12
B12 C12
B3 B3
B3 C3
B34 B34
B34 C34
C12
34
+ +
+
 
Figure 59.  Tree Breakdown of Recursively Partitioned BBD Sparse Matrix 
 
 
 
At the top level of the tree illustrated in Figure 59 for the matrix illustrated in Figure 58 
there are four diagonal blocks.  With N levels of sub-division there will be 2N diagonal 
blocks and N edge separators for each block.  In the illustration, the letters A, B, and C 
refer to the level associated with the block.  The number refers to the processing unit 
responsible for that block.  Clear blocks indicate a block that is initially empty.  
Factorization of the diagonal block A1 results in modification of blocks labeled B1 and 
C1 due to the edge separators associated with block A1.  After decomposition of the 
level-A diagonal blocks is complete, the contributions to the edge separator blocks are 
summed for the next level of the tree.  In the next level, diagonal blocks B12 and B34 are 
decomposed and their contributions to the final block C1234 are summed.  At the top of 
the tree, block C1234 is decomposed and the final solution is complete. 
131 
Augmentation of the sparse LU hardware design to accommodate coarse grained 
parallel decomposition of a nested BBD structure matrix is straight forward.  Distribution 
of the matrix elements involves only the distribution of data to different hardware units.  
The summation of contributions from the neighboring hardware units can be 
accomplished by using the sub-matrix update units already a part of each unit.  The row 
merge operation for a sub-matrix update is the same as what is required to merge the 
contribution of the neighboring blocks to the existing data.  So the summation of the 
contributions can be handled by existing logic within each sub-matrix update unit. 
 
0.00
5.00
10.00
15.00
20.00
25.00
30.00
35.00
40.00
2 4 8 16
# of Diagonal Blocks
S
pe
ed
up 1648
7917
10279
 
Figure 60.  Coarse-grained Parallel Decomposition Speedup for Sparse LU Hardware 
 
 
 
  The projected speedup from coarse-grained parallel sparse LU decomposition of a 
nested BBD structured sparse power system Jacobian matrix, relative to a single merge 
unit sparse LU hardware based decomposition with partial pivoting, is illustrated in 
Figure 60.  The BBD structure was generated by using METIS to recursively dissect the 
sparse matrix, followed by a constrained AMD ordering algorithm (Davis, Gilbert et al. 
132 
2004) to order the independent partitions of the dissected matrix for fill-in reduction.  
This figure plots the speedup for four benchmark power system matrices and indicates 
that a speedup of over 20x is possible for these power systems.  The plot also indicates 
that as the size of the sparse matrix increases, so does the speedup obtained from scaling 
the number of sparse LU hardware units.   
The increased speed obtained through fine-grained and coarse-grained parallel sparse 
LU decomposition is not mutually exclusive and the benefits of these two techniques can 
be combined.  The combination of fine and coarse-grained parallelism would involve 
multiple sparse LU hardware solver units, each with multiple merge units. 
  
0.00
20.00
40.00
60.00
80.00
100.00
120.00
140.00
2 4 8 16
# of Diagonal Blocks
S
pe
ed
up 1648
7917
10279
 
Figure 61.  Speedup of Combined Coarse-grain and Fine-grain Parallel LU Decomposition 
 
 
 
The combination of these two techniques can results in impressive speedup, almost two 
orders of magnitude, as illustrated by Figure 61.  This figure plots the speedup relative to 
the Pentium 4 benchmark system for the LU decomposition of a nested BBD structured 
sparse matrix by number of sparse LU hardware units, with each unit containing eight 
133 
row-merge units.  The speedup can be further enhanced by incorporating diagonal 
pivoting, since the BBD and CAM ordering is symmetrical. 
 
5.6 Application 
 
The goal of the sparse LU decomposition hardware was to provide a means to speed up 
power flow computation relative to a uni-processor desktop personal computer system.  
The prototype results demonstrated that the LU decomposition of an ordered Jacobian 
power system matrix can be sped up through the use of an application specific sparse LU 
hardware design.  A flow-chart illustrating application of the sparse LU decomposition 
hardware as a co-processor designed to accelerate LU decomposition for a Pentium 4 
based Newton power flow computation is illustrated in Figure 62, an explanation of the 
flow chart and breakdown is below. 
Using the FPGA-based sparse LU hardware as a co-processor to off load the 
computation of the LU decomposition of the Jacobian matrix requires the transfer of the 
Jacobian matrix prior to computation, as well as transfer of the Lower and Upper 
triangular factors back to the host for continued processing.  The transfer of the solution 
from FPGA to the host processor can be overlapped with the actual LU solve.  The 
forward substitution computation can also be overlapped as the data is received from the 
accelerator.  The symbolic analysis of the Jacobian matrix, i.e. the pre-LU decomposition 
ordering step, only needs to be performed once. 
The performance of this solution relative to a Pentium 4 only solution is illustrated by 
the pie chart in Figure 62.  The pie chart breakdown is color coded to match the flow 
134 
chart.  This comparison is for the computation of a single Newton power flow solution, 
and assumes the only difference between the Pentium 4 only solver and the co-processor 
assisted solver is in the LU decomposition portion of the breakdown.  The same matrix 
ordering (AMD) is used prior to LU decomposition for the two cases.  A 125 MHz PCI-X 
interconnect is used to compute the data transfer time. 
 
1. Problem Formulation &
Symbolic Analysis
2. Jacobian Construction
3a. Transfer Matrix to LU HW
4. Solve Backward Substitution
Data Input
5. Does Solution
Converge ?
Load Flow Solution
YES
Host Processor
(Pentium 4)
FPGA based
LU Hardware
Solve Forward
Substitution
3b. LU in 
HW
LU HW + PCI-X
Symbolic
3.2%
LU HW
9.3%
Back Sub. 
Solve
4.4%
Transfer
1.4%
Jacobian
0.0%
 
Figure 62.  Breakdown of Newton Power Flow Computation Using FPGA Based Accelerator 
 
 
 
The performance breakdown indicates that transfer of data to and from the accelerator, 
using an interconnect such as Hypertransport, is not a bottleneck to performance given 
the one to one co-processor setup.  The overlap of the forward substitution calculation 
with the sparse LU hardware decomposition results in a small savings.  A total speedup 
135 
of 5-6x is capable of being realized by the order of magnitude improvement obtained by 
using the sparse LU hardware to accelerate the computation of Newton power flow.  
Amdahl’s law, i.e. the law of diminishing returns, indicates that additional performance 
in the LU solver beyond 20x would result in the bottleneck of the computation shifting to 
the forward and backward substitution routines in this co-processor setup.  A potential 
solution could be additional hardware on the FPGA designed to perform the sparse 
forward and backward substitution. 
In the previous comparison, the time to order the sparse matrix using the AMD 
algorithm is smaller than the time required for computation of the LU decomposition 
because the ordering for the sparse matrix need only be performed once for a given 
Jacobian matrix, whereas the repeated solution of an updated Jacobian matrix is required 
for a single Newton power flow computation.  Compared to ordering using AMD, 
generating the BBD structured power system Jacobian matrix using a combination of 
METIS and CAMD takes over 4x the same amount of time.  This significant increase in 
ordering time more than offsets the greater than 10x speedup in LU decomposition time 
obtained through the combination of fine-grained parallelism and coarse-grained 
parallelism for the computation of a single Newton power flow solution.  For a single 
power flow computation, even though the BBD method provides better performance for 
the LU decomposition, the BBD ordering + multiple LU hardware setup does not provide 
better performance than the AMD ordering + single LU hardware solution, as illustrated 
in Figure 63 and Figure 64. 
 
136 
1648 Bus Power Flow Computation
0.0000
0.0050
0.0100
0.0150
0.0200
0.0250
0.0300
0.0350
0.0400
0.0450
Total First Iteration
Se
co
nd
s
BackSub
LU HW
HT
Symbolic
Jacobian
 
Figure 63.  Power Flow Computation without BBD 
 
 
 
1648 Bus Power Flow Computation
0.0000
0.0100
0.0200
0.0300
0.0400
0.0500
0.0600
0.0700
Total First Iteration
Se
co
nd
s
BackSub
LU HW
HT
Symbolic
Jacobian
 
Figure 64.  Power Flow Computation with BBD 
 
 
 
For contingency analysis, where the repeated solution of a perturbed power system is 
required, the time to order the matrix is amortized by the cost of thousands of LU 
decompositions.  In this case the performance relative to the benchmark system of the 
single LU hardware accelerator will approach 7x, and the speedup of the BBD LU 
137 
hardware accelerator will approach 10x.  The speedup obtained by the BBD LU solver in 
this case is bottlenecked by the time to perform forward and backward substitution. 
 
Power Flow Speedup for Contingency Analysis
0
10
20
30
40
50
60
70
1648 Bus 7917 Bus 10279 Bus
Sp
ee
du
p
 
Figure 65.  Speedup in Power Flow without Host Bottlenecks 
 
 
 
If the time to perform forward and backward substitution could be reduce by an order 
of magnitude then a speedup of over 50x, as indicated in Figure 65, could be obtained. 
138 
CHAPTER 6.  CONCLUSIONS AND FUTURE WORK 
 
6.1 Summary of Contributions 
 
This research is the first study and implementation of an application specific direct 
sparse LU decomposition hardware design based on a fine-grained empirical study.  This 
work presents the following results: 
• A methodology which can be used to identify and evaluate the potential of 
algorithms which could benefit from FPGA based hardware acceleration. 
• An empirical analysis of power system Jacobian matrix sparse LU 
decomposition at the fine-grained operation level. 
• A prototype implementation of an FPGA based sparse LU solver. 
• An empirical study of parallelism in power system Jacobian matrix sparse LU 
decomposition, fine-grained and coarse-grained using the sparse LU hardware. 
• Application of the sparse LU hardware design to power flow computation as a 
hardware accelerator. 
 
6.2 Future Work 
 
The conclusion to this particular work would be the incorporation of this sparse LU 
hardware solver into a commercial power flow computational package.  Within power 
system computations however, the application of FPGA hardware to sparse LU 
decomposition is just one area that could benefit from additional computational ability.   
139 
There are many other computations necessary for the operation of power systems such as 
state estimation which may also benefit from application specific hardware design to 
provide additional performance. 
Outside of power systems, the positive results shown by this research suggests that 
there may be other applications that could benefit from the application of this sparse LU 
hardware design.  A broad feasibility and benchmarking study of sparse matrices from 
varying fields and from different applications could prove useful in finding additional 
applications for this particular hardware design.  The reconfigurable nature of FPGAs 
allows modifications to be made to the hardware design which would allow the 
application of this hardware to other target applications if the original design isn’t 
suitable. 
Beyond sparse LU decomposition and power system computations there are a vast 
number of computationally intensive problems that could benefit from the use of an 
FPGA based accelerator.  The methodology used in the design of this sparse LU 
hardware can be used to identify these computations and evaluate the potential of a 
FPGA accelerator. 
 
 
140 
LIST OF REFERENCES 
 
 
(1985). "IEEE standard for binary floating-point arithmetic." ANSI/IEEE Std 754-1985. 
  
Acton, F. S. (1970). Numerical methods that work. New York,, Harper & Row. 
  
Altera (2006). Stratix III, Device Handbook. 
  
Alves, A. B., E. N. Asada, et al. (1999). "Critical evaluation of direct and iterative 
methods for solving Ax=b systems in power flow calculations and contingency analysis." 
Power Systems, IEEE Transactions on 14(2): 702-708. 
  
Aykanat, C., A. Pinar, et al. (2002). Permuting sparse rectangular matrices into block-
diagonal form, Bilkent University. 
  
Balu, N., T. Bertram, et al. (1992). "On-line power system security analysis." 
Proceedings of the IEEE 80(2): 262-282. 
  
Barrett, R., M. Berry, et al. (1994). Templates for the Solution of Linear Systems: 
Building Blocks for Iterative Method., SIAM. 
  
Bergen, A. R. (1986). Power systems analysis. Englewood Cliffs, N.J., Prentice-Hall. 
  
Brown, H. E. (1985). Solution of large networks by matrix methods. New York, Wiley. 
  
Brugera, J. and T. Lang (1998). Rounding in Floating Point Addition using a Compound 
Adder, University of Santiago de Compostela. 
  
Bruguera, J. D. and T. Lang (1999). "Leading-one prediction with concurrent position 
correction." Computers, IEEE Transactions on 48(10): 1083-1097. 
  
Caliga, D. and D. P. Barker (2001). Delivering Acceleration: The Potential for Increased 
HPC Application Performance Using Reconfigurable Logic. 
  
Chan, K. W. (2001). "Parallel algorithms for direct solution of large sparse power system 
matrix equations." Generation, Transmission and Distribution, IEE Proceedings- 148(6): 
615-622. 
  
Chan, K. W., R. W. Dunn, et al. (1995). "Efficient heuristic partitioning algorithm for 
parallel processing of large power systems network equations." Generation, Transmission 
and Distribution, IEE Proceedings- 142(6): 625-630. 
  
141 
Chassin, D. P., P. R. Armstrong, et al. (2006). Gauss-Seidel accelerated: implementing 
flow solvers on field programmable gate arrays. 
  
Cuthill, E. and J. McKee (1969). Reducing the bandwidth of sparse symmetric matrices. 
Proceedings of the 1969 24th national conference, ACM Press. 
  
Dag, H. and A. Semlyen (2003). "A new preconditioned conjugate gradient power flow." 
Power Systems, IEEE Transactions on 18(4): 1248-1255. 
  
Davis, T. A. (2006). Direct Methods for Sparse Linear Systems (Fundamentals of 
Algorithms 2), Society for Industrial and Applied Mathematics. 
  
Davis, T. A. and I. S. Duff (1997). "An Unsymmetric-Pattern Multifrontal Method for 
Sparse LU Factorization." SIAM J. Matrix Anal. Appl. 18(1): 140-158. 
  
Davis, T. A., J. R. Gilbert, et al. (2004). "A column approximate minimum degree 
ordering algorithm." ACM Trans. Math. Softw. 30(3): 353-376. 
  
Davis, T. A. and W. W. Hager (2005). "Row Modifications of a Sparse Cholesky 
Factorization." SIAM J. Matrix Anal. Appl. 26(3): 621-639. 
  
de Leon, F. and A. Sernlyen (2002). "Iterative solvers in the Newton power flow 
problem: preconditioners, inexact solutions and partial Jacobian updates." Generation, 
Transmission and Distribution, IEE Proceedings- 149(4): 479-484. 
  
deLorimier, M., Andr, et al. (2005). Floating-point sparse matrix-vector multiply for 
FPGAs. Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-
programmable gate arrays. Monterey, California, USA, ACM Press. 
  
Demmel, J. W., S. C. Eisenstat, et al. (1999). "A Supernodal Approach to Sparse Partial 
Pivoting." SIAM J. Matrix Anal. Appl. 20(3): 720-755. 
  
Duff, I. S. (2000). "The impact of high-performance computing in the solution of linear 
systems: trends and problems." J. Comput. Appl. Math. 123(1-2): 515-530. 
  
Duff, I. S., A. M. Erisman, et al. (1989). Direct methods for sparse matrices, Clarendon 
Press. 
  
Erisman, A. M. (1977). "Decomposition and Sparsity with Application to Distributed 
Computing." EPRI(EL566-SR): 183-208. 
  
Feng, T. and A. J. Flueck (2002). A message-passing distributed-memory parallel power 
flow algorithm. IEEE Power Engineering Society Winter Meeting. 
  
Flynn, M. J. (1970). "On Division by Functional Iteration." Computers, IEEE 
Transactions on C-19(8): 702-706. 
142 
  
Foertsch, J., J. Johnson, et al. (2005). Jacobi load flow accelerator using FPGA. 
Proceedings of the 37th Annual North American Power Symposium. 
  
George, A. and J. Liu (1981). Computer Solution of Large Sparse Positive Definite 
Systems. Englewood Cliffs, Prentice-Hall. 
  
George, A. and W. H. Liu (1989). "The evolution of the minimum degree ordering 
algorithm." SIAM Rev. 31(1): 1-19. 
  
Gilbert, J. R., C. Moler, et al. (1992). "Sparse matrices in matlab: design and 
implementation." SIAM J. Matrix Anal. Appl. 13(1): 333-356. 
  
Gilbert, J. R. and T. Peierls (1986). Sparse Partial Pivoting in Time Proportional to 
Arithmetic Operations, Cornell University. 
  
Gokul, G., L. Zhuo, et al. (2004). Analysis of high-performance floating-point arithmetic 
on FPGAs. Proceedings of the 18th International Parallel and Distributed Processing 
Symposium. . 
  
Govindu, G., S. Choi, et al. (2004). A high-performance and energy-efficient architecture 
for floating-point based LU decomposition on FPGAs. Proceedings of the 18th 
International Parallel and Distributed Processing Symposium. 
  
Gupta, A. (2000). WSMP: Watson Sparse Matrix Package. Yorktown Heights, IBM T.J. 
Watson Research Center. 
  
Heydt, G. T. (1986). Computer analysis methods for power systems. New York 
London, Macmillan Pub. Co. ; 
Collier Macmillan. 
  
Hokenek, E. and R. K. Montoye (1990). "Leading-zero anticipator (LZA) in the IBM 
RISC System/6000 floating-point execution unit." IBM J. Res. Dev. 34(1): 71-77. 
  
Hu, Y. F. and J. A. Scott (2003). Ordering Techniques for Singly Bordered Block 
Diagonal Forms for Unsymmetric Parallel Sparse Direct Solvers, Rutherford Appleton 
Laboratory. 
  
Ito, M., N. Takagi, et al. (1995). Efficient Initial Approximation and Fast Converging 
Methods for Division and Square Root. Proceedings of the 12th Symposium on 
Computer Arithmetic, IEEE Computer Society. 
  
Ito, M., N. Takagi, et al. (1997). "Efficient initial approximation for multiplicative 
division and square root by a multiplication with operand modification." Computers, 
IEEE Transactions on 46(4): 495-498. 
  
143 
Jian, L., R. Tessier, et al. (2003). Floating point unit generation and evaluation for 
FPGAs. 11th Annual IEEE Symposium on Field-Programmable Custom Computing 
Machines. 
  
Jouppi, N. P. (1998). Improving direct-mapped cache performance by the addition of a 
small fully-associative cache prefetch buffers. 25 years of the international symposia on 
Computer architecture (selected papers). Barcelona, Spain, ACM Press. 
  
Jun Qiang, W. and A. Bose (1995). "Parallel solution of large sparse matrix equations 
and parallel power flow." Power Systems, IEEE Transactions on 10(3): 1343-1349. 
  
Kang, S.-M. and Y. Leblebici (2003). CMOS digital integrated circuits : analysis and 
design. Boston, McGraw-Hill. 
  
Karypis, G. and V. Kumar (1995). A Fast and High Quality Multilevel Scheme for 
Partitioning Irregular Graphs. Minneapolis, University of Minnesota. 
  
Kernighan, B. W. and S. Lin (1970). "An efficient heuristic procedure for partitioning 
graphs." Bell System Technical Journal 49: 291-307. 
  
Keyhani, A., A. Abur, et al. (1989). "Evaluation of power flow techniques for personal 
computers." Power Systems, IEEE Transactions on 4(2): 817-826. 
  
Khaira, M. S., G. L. Miller, et al. (1992). Nested Dissection: A Survey, Carnegie Mellon 
University. 
  
Kincaid, D. R., J. R. Respess, et al. (1982). "Algorithm 586: ITPACK 2C: A FORTRAN 
Package for Solving Large Sparse Linear Systems by Adaptive Accelerated Iterative 
Methods." ACM Trans. Math. Softw. 8(3): 302-322. 
  
Koester, D. P., S. Ranka, et al. (1993). Parallel block-diagonal-bordered sparse linear 
solvers for electrical power system applications. Proceedings of the Scalable Parallel 
Libraries Conference. 
  
Kroft, D. (1998). Lockup-free instruction fetch/prefetch cache organization. 25 years of 
the international symposia on Computer architecture (selected papers). Barcelona, Spain, 
ACM Press. 
  
Lau, K., D. J. Tylavsky, et al. (1991). "Coarse grain scheduling in parallel triangular 
factorization and solution of power system matrices." Power Systems, IEEE Transactions 
on 6(2): 708-714. 
  
Ligon, W. B., III, S. McMillan, et al. (1998). A re-evaluation of the practicality of 
floating-point operations on FPGAs. Proceedings of the IEEE Symposium on FPGAs for 
Custom Computing Machines. 
  
144 
Lucas, R. F., T. Blank, et al. (1987). "A Parallel Solution Method for Large Sparse 
Systems of Equations." Computer-Aided Design of Integrated Circuits and Systems, 
IEEE Transactions on 6(6): 981-991. 
  
Markowitz, H. M. (1957). "The Elimination Form of the Inverse and Its Application to 
Linear Programming." Management Science 3(3): 255-269. 
  
Montagna, M., G. P. Granelli, et al. (1996). "Levelwise algorithms for vector processing 
of sparse power system matrices." Power Systems, IEEE Transactions on 11(1): 239-245. 
  
Morris, G. R. and V. K. Prasanna (2005). An FPGA-Based Floating-Point Jacobi Iterative 
Solver. Proceedings of the 8th International Symposium on Parallel 
Architectures,Algorithms and Networks, IEEE Computer Society. 
  
Ness, J. E. V. (1959). "Iteration methods for digital load flow studies." Trans. AIEE 78: 
583-588. 
  
Ness, J. E. V. and J. H. Griffin (1961). "Elimination methods for load flow studies." 
Trans. AIEE 80: 299-304. 
  
Oberman, S. F. and M. J. Flynn (1996). Fast IEEE Rounding for Division by Functional 
Iteration, Stanford University. 
  
Obermann, S. F. and M. J. Flynn (1997). "Division algorithms and implementations." 
Computers, IEEE Transactions on 46(8): 833-854. 
  
Ogbuobiri, E. C., W. F. Tinney, et al. (1970). "Sparsity-Directed Decomposition for 
Gaussian Elimination on Matrices." IEEE Transactions on Power Apparatus and Systems 
PAS-89(1): 141-150. 
  
Oklobdzija, V. G. (1994). "An algorithmic and novel design of a leading zero detector 
circuit: comparison with logic synthesis." Very Large Scale Integration (VLSI) Systems, 
IEEE Transactions on 2(1): 124-128. 
  
Oppe, T. C., W. Joubert, et al. (1988). NSPCG's user's guide: A package for solving large 
linear systems by various iterative methods. Austin, University of Texas. 
  
Pai, M. A. and H. Dag (1997). Iterative solver techniques in large scale power system 
computation. Proceedings of the 36th IEEE Conference on Decision and Control. 
  
Patterson, D. A., J. L. Hennessy, et al. (2003). Computer architecture : a quantitative 
approach. San Francisco, Morgan Kaufmann Publishers. 
  
Press, W. H., S. A. Teukolsky, et al. (1992). Numerical Recipes in C: The Art of 
Scientific Computing, Cambridge University Press. 
  
145 
Quach, N. and M. J. Flynn (1991). Leading One Prediction --- Implementation, 
Generalization, and Application, Stanford University. 
  
Quach, N. T. and M. J. Flynn (1990). An improved algorithm for high-speed floating-
point addition, Stanford University. 
  
Rose, D. J. (1970). Symmetric elimination on sparse positive definite systems and 
potential flow network problem, Harvard University. PhD. 
  
Saad, Y. (1990). SPARSKIT: A Basic Tool Kit for Sparse Matrix Computation, 
University of Illinois at Urbana Champaign. 
  
Saad, Y. (2003). Iterative methods for sparse linear systems. Philadelphia, SIAM. 
  
Sato, N. and W. F. Tinney (1963). "Techniques for Exploiting the Sparsity of the 
Network Admittance Matrix." IEEE Transactions on Power Apparatus and Systems 
82(69): 944-950. 
  
Scheurich, C. and M. Dubois (1988). The design of a lockup-free cache for high-
performance multiprocessors. Proceedings Supercomputing. 
  
Semlyen, A. (1996). "Fundamental concepts of a Krylov subspace power flow 
methodology." Power Systems, IEEE Transactions on 11(3): 1528-1537. 
  
Shirazi, N., A. Walters, et al. (1995). Quantitative analysis of floating point arithmetic on 
FPGA based custom computing machines. Proceedings of the IEEE Symposium on 
FPGAs for Custom Computing Machines. 
  
Shyan-Lung, L. and J. E. Van Ness (1994). "Parallel solution of sparse algebraic 
equations." Power Systems, IEEE Transactions on 9(2): 743-749. 
  
Smith, M., J. Vetter, et al. (2005). Scientific Computing Beyond CPUs: FPGA 
implementations of common scientific kernels. MAPLD. 
  
Stott, B. (1972). "Decoupled Newton Load Flow." IEEE Transactions on Power 
Apparatus and Systems PAS-91(5): 1955-1959. 
  
Stott, B. (1974). "Review of load-flow calculation methods." Proceedings of the IEEE 
62(7): 916-929. 
  
Stott, B. and O. Alsac (1974). "Fast Decoupled Load Flow." IEEE Transactions on Power 
Apparatus and Systems PAS-93(3): 859-869. 
  
Stott, B., O. Alsac, et al. (1987). "Security analysis and optimization." Proceedings of the 
IEEE 75(12): 1623-1644. 
  
146 
Synopsys (2006) Predictable Success.  Volume,  DOI:  
  
Thirumalai, B. and G. Krishnan (2005). "Advantages Abound for a Conversion-Free, 
Low-Cost Path to Volume Production." Chip Design Magazine(July 2005). 
  
Tinney, W. F. and C. E. Hart (1967). "Power Flow Solution by Newton's Method." IEEE 
Transactions on Power Apparatus and Systems PAS-86(11): 1449-1460. 
  
Tinney, W. F. and J. W. Walker (1967). "Direct solutions of sparse network equations by 
optimally ordered triangular factorization." Proceedings of the IEEE 55(11): 1801-1809. 
  
Underwood, K. (2004). FPGAs vs. CPUs: trends in peak floating-point performance. 
Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field 
programmable gate arrays. Monterey, California, USA, ACM Press. 
  
Wallach, Y. (1986). Calculations & programs for power system networks, Prentice-Hall, 
Inc. 
  
Ward, J. B. and H. W. Hale (1956). "Digital computer solution of power flow problems." 
Trans. AIEE (Power App. Syst.) 75: 398-404. 
  
Wendel, H. (1987). "Optimal pivot strategies for load-flow calculation." Electrical Power 
& Energy systems 9(2). 
  
Whaley, R. C. and A. Petitet (2005). "Minimizing development and maintenance costs in 
supporting persistently optimized BLAS." Softw. Pract. Exper. 35(2): 101-121. 
  
Xiaofang, W. and S. G. Ziavras (2003). Parallel direct solution of linear equations on 
FPGA-based machines. Proceedings of the International Parallel and Distributed 
Processing Symposium. 
  
Xilinx (2002) Using Virtex-II Block RAM for High Performance Read/Write CAMs. 
XAPP260 Volume,  DOI:  
  
Xilinx (2007). Virtex-5 Users Guide. 
  
Yannakakis, M. (1981). "Computing the Minimum Fill-In is NP-Complete." SIAM 
Journal on Algebraic and Discrete Methods 2(1): 77-79. 
  
Zecevic, A. I. and D. D. Siljak (1994). "Balanced decompositions of sparse systems for 
multilevel parallel processing." Circuits and Systems I: Fundamental Theory and 
Applications, IEEE Transactions on [see also Circuits and Systems I: Regular Papers, 
IEEE Transactions on] 41(3): 220-233. 
  
147 
Zhuo, L. and V. K. Prasanna (2005). Sparse Matrix-Vector multiplication on FPGAs. 
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-
programmable gate arrays. Monterey, California, USA, ACM Press. 
  
 
 
148 
Vita 
 
 
Petya Vachranukunkiet is a research assistant in the Electrical and Computer 
Engineering department at Drexel University.  He received a B.S. in Microelectronic 
Engineering from Rochester Institute of Technology in 1999 and a M.S. in Electrical 
Engineering from Drexel University in 2002.  He is a member of the Institute of 
Electrical and Electronics Engineers. 
 
 
 
