Development and optimization of a combinational multigrid algorithm for large scale circuit simulation on massively parallel architectures by Γαρυφάλλου, Δημήτριος Κ.
  
ΠΑΝΕΠΙΣΤΗΜΙΟ ΘΕΣΣΑΛΙΑΣ 
ΠΟΛΥΤΕΧΝΙΚΗ ΣΧΟΛΗ 
ΤΜΗΜΑ ΗΛΕΚΤΡΟΛΟΓΩΝ 
ΜΗΧΑΝΙΚΩΝ ΚΑΙ 
ΜΗΧΑΝΙΚΩΝ ΥΠΟΛΟΓΙΣΤΩΝ 
 
 
 
 
Υλοποίηση και βελτιστοποίηση συνδυαστικού γραφοθεωρητικού 
αλγορίθμου για προσομοίωση πολύ μεγάλης κλίμακας γραμμικών 
κυκλωμάτων σε μαζικά παράλληλες αρχιτεκτονικές 
 
 
Development and Optimization of a combinatorial multigrid 
algorithm for large scale circuit simulation  
on massively parallel architectures  
 
Μεταπτυχιακή Διατριβή 
 
Δημήτριος Κ. Γαρυφάλλου 
 
 
  Επιβλέποντες Καθηγητές:    Νέστωρ Ευμορφόπουλος 
     Επίκουρος Καθηγητής 
 
     Γεώργιος Σταμούλης 
     Καθηγητής 
 
     Παναγιώτα Τσομπανοπούλου 
     Επίκουρος Καθηγήτρια 
 
       
 
Βόλος, Οκτώβριος 2015 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 
ΠΑΝΕΠΙΣΤΗΜΙΟ ΘΕΣΣΑΛΙΑΣ 
ΠΟΛΥΤΕΧΝΙΚΗ ΣΧΟΛΗ 
ΤΜΗΜΑ ΗΛΕΚΤΡΟΛΟΓΩΝ 
ΜΗΧΑΝΙΚΩΝ  
ΚΑΙ ΜΗΧΑΝΙΚΩΝ ΥΠΟΛΟΓΙΣΤΩΝ 
 
 
 
Υλοποίηση και βελτιστοποίηση συνδυαστικού γραφοθεωρητικού 
αλγορίθμου για προσομοίωση πολύ μεγάλης κλίμακας γραμμικών 
κυκλωμάτων σε μαζικά παράλληλες αρχιτεκτονικές 
 
 
Μεταπτυχιακή Διατριβή 
 
 
Δημήτριος Κ. Γαρυφάλλου 
 
 
   Επιβλέποντες :    Νέστωρ Ευμορφόπουλος 
                Επίκουρος Καθηγητής 
 
    Γεώργιος Σταμούλης 
    Καθηγητής 
 
    Παναγιώτα Τσομπανοπούλου 
        Επίκουρος Καθηγήτρια 
 
 
Εγκρίθηκε από την τριμελή εξεταστική επιτροπή την 7η Οκτωβρίου 2015 
 
............................  ............................      ............................ 
Ν. Ευμορφόπουλος  Γ. Σταμούλης  Π. Τσομπανοπούλου 
Επίκουρος Καθηγητής Καθηγητής   Επίκουρος Καθηγήτρια 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
  
 
Μεταπτυχιακή Διατριβή για την απόκτηση του Μεταπτυχιακού διπλώματος 
Ειδίκευσης «Επιστήμη και Τεχνολογία Υπολογιστών, Τηλεπικοινωνιών και 
Δικτύων», στα πλαίσια του Προγράμματος Μεταπτυχιακών Σπουδών του 
Τμήματος Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών του 
Πανεπιστημίου Θεσσαλίας. 
 
 
........................................ 
 
Δημήτριος Γαρυφάλλου 
Διπλωματούχος Ηλεκτρολόγος Μηχανικός και Μηχανικός Υπολογιστών 
Πανεπιστημίου Θεσσαλίας  
 
 
 
 
 
 
 
Copyright © Dimitrios Garyfallou, 2015 
Με επιφύλαξη παντός δικαιώματος. All rights reserved. 
 
Απαγορεύεται η αντιγραφή, αποθήκευση και διανομή της παρούσας μεταπτυχιακής 
διατριβής, εξ ολοκλήρου ή τμήματος αυτής, για εμπορικό σκοπό. Επιτρέπεται η 
ανατύπωση, αποθήκευση και διανομή για σκοπό μη κερδοσκοπικό, εκπαιδευτικής 
ή ερευνητικής φύσης, υπό την προϋπόθεση να αναφέρεται η πηγή προέλευσης και 
να διατηρείται το παρόν μήνυμα. Ερωτήματα που αφορούν τη χρήση της διατριβής 
για κερδοσκοπικό σκοπό πρέπει να απευθύνονται προς τον συγγραφέα. 
 
 
 
 
 
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
  
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 i 
 
 
 
 
 
 
 
To my family and my friends 
Στην οικογένειά μου και στους φίλους μου  
 
 
 
 
 
 
 
  
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
  
 
 
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 iii 
 
Ευχαριστίες  
 
Με την περάτωση της παρούσας μεταπτυχιακής διατριβής, θα ήθελα να ευχαριστήσω 
θερμά τους επιβλέποντές μου κ. Νέστορα Ευμορφόπουλο, κ. Γεώργιο Σταμούλη και κα 
Παναγιώτα Τσομπανοπούλου για την εμπιστοσύνη που επέδειξαν στο πρόσωπό μου με 
την ανάθεση του συγκεκριμένου θέματος, την άριστη συνεργασία και την συνεχή 
καθοδήγηση, η οποία διευκόλυνε την εκπόνηση της μεταπτυχιακής διατριβής μου. 
Επίσης, θα ήθελα να ευχαριστήσω τους φίλους και συνεργάτες μου του Εργαστηρίου Ε5 
για την υποστήριξη και την δημιουργία ενός ευχάριστου και δημιουργικού κλίματος και 
ιδιαίτερα τον  Dr. Κωνσταντή Νταλούκα για τις εύστοχες υποδείξεις και την συνεχή 
στήριξή του.  
Τέλος, οφείλω ένα μεγάλο ευχαριστώ στην οικογένειά μου και στους φίλους μου για την 
αμέριστη υποστήριξη και την ανεκτίμητη βοήθεια που μου παρείχαν τόσο κατά την 
διάρκεια των σπουδών μου όσο και κατά την εκπόνηση της μεταπτυχιακής μου διατριβής. 
Δημήτριος Γαρυφάλλου 
         Βόλος, 2015 
 
 
 
 
 
 
 
 
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 iv 
 
Contents 
 
List of Tables .................................................................................................................... vi 
List of Figures .................................................................................................................. vii 
List of Acronyms ............................................................................................................. viii 
Περίληψη ............................................................................................................................ x 
Abstract ............................................................................................................................. xi 
1  Introduction ................................................................................................................... 1 
1.1 Motivation ................................................................................................................. 1 
1.2 Thesis Contribution ................................................................................................. 1 
1.3 Outline ...................................................................................................................... 1 
2  Linear System Solution Methods ................................................................................. 3 
2.1 Introduction .............................................................................................................. 3 
2.1.1 Sparsity Overview .............................................................................................. 3 
2.2 Overview of the Methods ........................................................................................ 4 
2.3 Stationary Methods .................................................................................................. 5 
2.3.1 The Jacobi Method ........................................................................................... 5 
2.3.2 The Gauss-Seidel Method ................................................................................ 6 
2.3.3 The Successive Overrelaxation Method (SOR) .............................................. 7 
2.4 Nonstationary Methods .......................................................................................... 8 
2.4.1 Generalized Minimal Residual (GMRES) ...................................................... 8 
2.4.2 Conjugate Gradient (CG) ................................................................................. 9 
2.4.3 BiConjugate Gradient (BiCG) ....................................................................... 10 
2.5 Computational Aspects of the Methods ............................................................... 11 
2.6 Multigrid Methods ................................................................................................. 12 
3  Introduction to Preconditioners ................................................................................. 13 
3.1 Introduction ............................................................................................................ 13 
3.2 Jacobi Preconditioner ............................................................................................ 14 
3.3 SSOR Preconditioner ............................................................................................. 14 
3.4 Incomplete Factorization Preconditioners .......................................................... 14 
4  A Multigrid-Like SDD solver ...................................................................................... 17 
4.1 Support Theory for Preconditioning .................................................................... 17 
4.1.1 Electric Networks as Graphs – Support Basics ............................................ 17 
4.1.2 Steiner Preconditioners ................................................................................... 18 
4.1.3 Predicting the performance of solvers ........................................................... 19 
4.2 The Combinatorial Multigrid Solver .................................................................... 20 
4.2.1 Related work on SDD solvers ......................................................................... 20 
4.2.2 SDD linear systems as graphs ....................................................................... 21 
4.2.3 A graph decomposition algorithm ................................................................. 22 
4.2.4 The Multigrid algorithm ................................................................................ 22 
5  GPU Architecture and the CUDA Programming Model.......................................... 25
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
Contents  v 
 
5.1 Introduction ............................................................................................................ 25 
5.2 Hardware Implementation ................................................................................... 26 
5.2.1 SIMT Architecture .......................................................................................... 26 
5.3 Device Memory Model .......................................................................................... 27 
5.3.1 Global Memory ................................................................................................ 27 
5.3.2 Local Memory ................................................................................................. 28 
5.3.3 Shared Memory ............................................................................................... 28 
5.3.4 Constant Memory ........................................................................................... 28 
5.3.5 Texture Memory ............................................................................................. 29 
5.4 The CUDA Programming Model ......................................................................... 29 
5.5 NVIDIA® TESLA™ C2075 .................................................................................. 30 
5.5.1 Engine Specifications ..................................................................................... 31 
5.5.2 Memory Specifications ................................................................................... 31 
6  Improving CMG solver performance ......................................................................... 33 
6.1 Introduction ............................................................................................................ 33 
6.2 System Specifications and Benchmarks .............................................................. 33 
6.3 Implementation and Results................................................................................. 34 
7  Conclusion ................................................................................................................... 39 
7.1 Future Work ........................................................................................................... 39 
References ........................................................................................................................ 41 
 
 
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 vi 
 
List of Tables 
 
 
2.1  Summary of Operations for Iteration i: ”𝑎/𝑏”  means ”𝛼” multiplications with the 
matrix and ”𝑏”with its transpose, and storage requirements for the methods in iteration 
i: n denotes the order of the matrix. ................................................................................. 11 
 
6.1  Test platform specifications ....................................................................................... 33 
6.2  IBM Power Grid Benchmarks for DC Analysis ............................................................ 33 
6.3  Matrix size and non-zero elements of the MNA arrays ............................................. 34 
6.4  ISPD 2005/2006 Placement Benchmarks ................................................................... 34 
6.5  CMG preconditioner-solve step runtime speedup...................................................... 37 
6.6  PCG runtime speedup ................................................................................................. 37 
 
 
 
 
 
 
 
 
 
 
 
 
  
 
               
 
 
 
 
 
 
 
 
 
 
 
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 
 
vii 
 
List of Figures  
 
2.1  The Jacobi Method ....................................................................................................... 6 
2.2  The Gauss-Seidel Method ............................................................................................. 7 
2.3  The SOR Method .......................................................................................................... 8 
2.4  The Preconditioned GMRES(m) Method ...................................................................... 9 
2.5  The Preconditioned Conjugate Gradient Method ...................................................... 10 
2.6  The Preconditioned BiConjugate Gradient Method ................................................... 11 
 
4.1  A graph and its spanning tree - obtained by deleting the dashed edges .................. 18 
4.2  A graph and its Steiner preconditioner. ..................................................................... 19 
4.3  A bad clustering. ........................................................................................................ 20 
4.4  Decompose Graph Algorithm ..................................................................................... 22 
4.5  Two-level Combinatorial Multigrid ............................................................................ 23 
4.6  Full Combinatorial Multigrid ...................................................................................... 23 
 
5.1  How GPU Acceleration Works .................................................................................... 25 
5.2  CPU vs GPU Architecture ............................................................................................ 26 
5.3  Memory Hierarchy ..................................................................................................... 27 
5.4  2D Grid of Thread Blocks ............................................................................................ 30 
5.5  Fermi SM .................................................................................................................... 31 
  
6.1  CMG recursive flow .................................................................................................... 35 
6.2  CMG iterative flow ..................................................................................................... 36 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 
 
viii 
 
List of Acronyms 
 
AMG  Algebraic Multigrid 
BiCG       BiConjugate Gradient 
CCS       Compressed Column Storage 
CG       Conjugate Gradient 
CMG       Combinatorial Multigrid 
COO   Coordinate 
CPU       Central Processing Unit 
CSR      Compressed Row Storage 
CUDA      Compute Unified Device Architecture 
DC  Direct Current 
DIA  Diagonal Format 
DRAM      Dynamic Random-Access Memory 
ECC       Error Checking & Correction 
EDA   Electronic Design Automation 
ELL  ELLPACK 
FLOPs      Floating-Point Operations Per second 
GCC  GNU Compiler Collection 
GMG  Geometric Multigrid 
GMRES    Generalized Minimal Residual 
GPGPU     General Purpose Graphics Processing Unit 
GPU        Graphics Processing Unit 
HPC        High Performance Computing 
HYB  Hybrid 
ICC  Intel C++ Compiler 
IBM  International Business Machines Corporation  
IC  Integrated Circuit   
MG       Multigrid 
MNA  Modified Nodal Analysis 
OP  Operating System 
NVCC  Nvidia CUDA Compiler 
PKT  Packet 
PCG  Preconditioned Conjugate Gradient 
SDD       Symmetric Diagonally Dominant 
segSpMV Segmented Sparse Matrix Vector Multiplication 
SIMD   Single Instruction, Multiple Data 
SIMT   Single-Instruction Multiple-Thread
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
ix                                                                                                           List of Acronyms 
 
 
 
SM  Streaming Multiprocessor  
SpMV  Sparse Matrix Vector Multiplication 
SOR  Successive Overrelaxation 
SSOR   Symmetric Successive Overrelaxation 
SPD  Symmetric Positive Definite 
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 
 
x 
 
Περίληψη 
  
Η επίλυση γραμμικών συστημάτων της μορφής Ax = b, για συμμετρικούς 
πίνακες με κυρίαρχη διαγώνιο αποτελεί πρόβλημα θεμελιώδους θεωρητικής 
σημασίας καθώς επίσης χρησιμοποιείται σε αμέτρητες εφαρμογές στην 
αριθμητική ανάλυση, τη μηχανική και τις επιστήμες. Ο πρώτος αξιόπιστα 
αποδοτικός επιλυτής τέτοιων συστημάτων για γενικές και αυθαίρετα 
σταθμισμένες τοπολογίες, προτάθηκε μόλις πριν λίγα χρόνια και. Ο επιλυτής 
αυτός στηρίζεται στις αρχές της θεωρίας γράφων και επιτυγχάνει εξαιρετικά 
αποτελέσματα ενώ παράλληλα παρέχει ισχυρές εγγυήσεις για την ταχύτητα 
σύγκλισης.  
 
Σκοπός μας είναι η επιτάχυνση της απόδοσης του συγκεκριμένου επιλυτή 
για συστήματα τα οποία εμφανίζονται στην προσομοίωση κυκλωμάτων πολύ 
μεγάλης κλίμακας. Οι πίνακες που εμφανίζονται σε αυτά τα μεγάλα 
συστήματα έχουν αραιή δομή με αποτέλεσμα οι μέθοδοι για τον αποδοτικό 
χειρισμό τους να είναι συχνά κρίσιμες για την επίδοση πολλών εφαρμογών 
συμπεριλαμβανομένης και της προσομοίωσης κυκλωμάτων. Ο 
συγκεκριμένος επιλυτής είναι βασισμένος σε μια από τις επαναληπτικές 
μεθόδους επίλυσης, η επιτάχυνσή των οποίων παραμένει πρόκληση για την 
επιστημονική κοινότητα. Στην παρούσα διατριβή μελετάμε τις επιπτώσεις 
υλοποίησης του γραφοθεωρητικού επιλυτή CMG σε μαζικά παράλληλες 
αρχιτεκτονικές (GPUs). 
 
Λέξεις Κλειδιά: 
Γραμμικά συστήματα, Προρυθμιστές, Μέθοδοι επίλυσης, Κάρτα Γραφικών, 
Προγραμματισμός Υψηλών Επιδόσεων 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 
 
xi 
 
Abstract 
 
The solution of linear systems in the form Ax = b, on symmetric diagonal 
dominant matrices (SDDs) is a problem of fundamental theoretical 
importance but also one with a myriad of applications in numerical 
mathematics, engineering and science. The first reliably efficient SDD 
solver for general and arbitrary weighted topologies was first proposed in 
recent years. The solver is based on support theory principles and it 
achieves state of the art empirical results while providing robust 
guarantees on the speed of convergence.  
 
Our purpose is to accelerate the performance of this solver for systems 
that occur in very large scale circuit simulation. Matrices that arise in those 
very large systems are sparse matrices, and as a result, methods for 
efficiently manipulating them are often crucial to the performance of 
many applications including circuit simulation. This solver is based on one 
of the iterative solution methods, which have proven to be of particular 
importance in computational science. This master thesis studies the 
implications of a CMG implementation on massively parallel architectures 
(GPUs). 
 
Keywords: 
Linear Systems, Preconditioners, Solution Methods, Graphics Processing Unit, 
High Performance Computing  
 
 
 
 
 
 
 
 
 
 
 
 
  
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
  
 
1 
 
Chapter 1 
Introduction 
 
1.1 Motivation 
Circuit simulation is a technique where a computer software is used to simulate the behavior 
of an electronic circuit or system, using mathematical models. New designs can be tested, 
evaluated and diagnosed without actually constructing the circuit or device. It is used across a 
wide spectrum of applications, ranging from integrated circuits(IC) and microelectronics to 
electrical power distribution networks and power electronics. Simulating a circuit’s behavior 
before actually building it can greatly improve design efficiency by making faulty designs 
known as such, and providing insight into the behavior of electronics circuit designs. In 
particular, for integrated circuits, the tooling is expensive, breadboards are impractical, and 
probing the behavior of internal signals is extremely difficult. Therefore almost all integrated 
circuits design relies heavily on simulation. 
  
1.2 Thesis Contribution      
The core of circuit simulation is based on the solution of linear systems in the form 𝐴𝑥 = 𝑏 
.Those systems arise after the Modified Nodal Analysis. In Electrical Engineering, Modified 
Nodal Analysis or MNA is an extension of nodal analysis which not only determines the 
circuit's node voltages (as in classical nodal analysis), but also some branch currents [1]. Several 
algorithms are based on solving such sort of linear systems. The contribution of this thesis is 
the acceleration of the CMG solver for SDD systems that arise in circuit simulation. 
 
Starting from a C implementation [2] of the algorithm, we ported the most time consuming 
part on a GPU with a view to improve the performance of the solution phase of the CMG 
solver. The current implementation is also based on the work have been done in my diploma 
thesis where we implemented a GPU kernel called segSpMV to accelerate CMG solver [3] [4]. 
The evaluation results showed that using our GPU implementation we can achieve CMG and 
PCG execution time speedups up to 10x and 5.7x over the sequential versions respectively. 
 
For the implementation we have used the Compute Unified Device Architecture (CUDA) [5], 
which is an open-source programming and interfacing tool provided by NVIDIA. The GPU 
device used for the benchmarking is the NVIDIA® TESLA™ C2075 with 448 CUDA cores. 
 
1.3 Outline 
 
In chapter 2 we give background material on the existing solution methods of linear systems. 
We begin with a review of what sparsity means and we describe the most useful sparse matrix 
storage. Then, we refer to stationary, nonstationary and multigrid methods.                            
 
In chapter 3 we review some basic notions of preconditioner matrices. We discuss about the 
importance of the preconditioning, how it is used and how it improves the convergence of the 
methods. 
 
Chapter 4 provides some background material on support theory for graphs and describes the 
Steiner preconditioner. In the second section we give some background material on solvers 
and we present CMG.  
 
In chapter 5 we present the GPU architecture and the CUDA programming model. Also, we 
mention the basic specifications of the NVIDIA® TESLA™ C2075.
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
1.  INTRODUCTION  
———————————————————————————————————— 
2 
 
In chapter 6 we present our attempt to improve the performance of CMG. Firstly, we make a 
brief introduction, we describe our workstation and we present our benchmarks. In the last 
section, we describe our implementation and we present the experimental evaluation results. 
 
Finally, chapter 7 concludes the thesis and gives some future directions.   
 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 3 
 
Chapter 2 
Linear System Solution Methods 
 
2.1 Introduction 
 
There are two broad categories of methods for solving linear equations of the form 𝐴𝑥 = 𝑏 
when A is large and sparse: direct and iterative. While for some techniques such as direct 
solvers, we only provide brief descriptions, for iterative solvers, we go into more depth to 
describe the algorithms, since they are of interest to us here. 
 
A direct method for solving the system of equations 𝐴𝑥 = 𝑏 is any method that produces the 
solution x after a finite number of operations. An example of a direct method is using Gaussian 
elimination to factor A into matrices L and U where L is lower triangular and U is upper 
triangular, then solving the triangular systems by forward and back substitution. Direct 
methods are typically preferred for dense linear systems. The problem with direct methods for 
sparse systems is that the amount of computational effort and storage required can be 
prohibitive [6] . 
 
An alternative to direct methods of solution are iterative methods, which involve the 
construction of a sequence {𝑥(𝑖)} of approximations to the solution 𝑥, for which 𝑥(𝑖)→𝑥. 
Iterative methods for solving general, large sparse linear systems have been gaining popularity 
in many areas of scientific computing. Until recently, direct solution methods were often 
preferred to iterative methods in real applications because of their robustness and predictable 
behavior. However, a number of efficient iterative solvers were discovered and the increased 
need for solving very large linear systems triggered a noticeable and rapid shift toward iterative 
techniques in many applications [7] 
 
In this master thesis we are interested only in iterative methods on sparse matrices. But before 
we analyze some of the most well-known, let’s see what the term sparse refers to. 
 
2.1.1 Sparsity Overview 
 
Consider the solution of linear systems of the form  
 
𝐴𝑥 = 𝑏,           (2.1) 
 
where A is an 𝑛x 𝑛 matrix, and both 𝑥 and 𝑏 are 𝑛x1 vectors. Of special interest is the case 
where A is large and sparse. The term sparse above refers to the relative number of non-zeros 
in the matrix A. An 𝑛x 𝑛 matrix A is considered to be sparse if A has only O(𝑛) non-zero 
entries. In this case, the majority of the entries in the matrix are zeros, which do not have to 
be explicitly stored. An 𝑛x 𝑛 dense matrix has Ω(𝑛2) non-zeros. There are many ways of 
storing a sparse matrix. Whichever method is chosen, some form of compact data is required 
that avoids storing the numerically zero entries in the matrix. It needs to be simple and flexible 
so that it can be used in a wide range of matrix operations. This need is met by the primary 
data structure in CSparse1, a compressed-column matrix [8]. Some basic operations that 
operate on this data structure are matrix-vector multiplication, matrix-matrix multiplication, 
matrix addition, and transpose. 
 
The simplest sparse matrix data structure is a list of the nonzero entries in arbitrary order. The 
list consists of two integer arrays 𝑖 and 𝑗 and one real array x of length equal to the number of 
entries in the matrix.  
 
———————————— 
1CSparse is a C library which implements a number of direct methods for sparse linear systems. 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
2.  LINEAR SYSTEM SOLUTION METHODS  
———————————————————————————————————— 
4 
 
For example, the matrix [9] 
 
A = [
4.5 0 3.2 0
3.1 2.9 0 0.9
0
3.5
1.7
0.4
3.0 0
0 1.0
]                             (2.2) 
 
is presented in zero-based triplet form below. A zero-based data structure for an m-by-n 
matrix contains row and column indices in the range 0 to m-1 and n-1, respectively. 
 
 
 𝑖 = {2,  1,  3,  0,  1,  3,  3,  1,  0,  2}  
 𝑗 = {2,  0,  3,  2,  1,  0,  1,  3,  0,  2} 
                                              𝑥 = {2,  1,  3,  0,  1,  3,  3,  1,  0,  2} 
 
 
The triplet form is simple to create but difficult to use in most sparse matrix algorithms. The 
compressed-column storage (CCS) is more useful and is used in almost all functions in 
CSparse. An m-by-n sparse matrix that can contain up to 𝑛𝑧𝑚𝑎𝑥 entries is represented with 
an integer array 𝑝 of length 𝑛 + 1, an integer array 𝑖 of length 𝑛𝑧𝑚𝑎𝑥, and a real array 𝑥 of 
length 𝑛𝑧𝑚𝑎𝑥. Row indices of entries in column 𝑗 are stored in 𝑖[𝑝[𝑗]] through 
𝑖[𝑝[𝑗 + 1] − 1], and the corresponding numerical values are stored in the same locations in 
𝑥. The first entry 𝑝[0] is always zero, and 𝑝[𝑛] ≤ 𝑛𝑧𝑚𝑎𝑥 is the number of actual entries in 
the matrix. The example matrix (2.2) is represented as 
 
𝑝 = {   0,    3,    6,   8,    10} 
𝑖  = {   0,    1,    3,   1,     2,    3,    0,    2,     1,   3} 
𝑥 = {4.5,  3.1, 3.5, 2.9, 1.7,  0.4, 3.2, 3.0,  0.9, 1.0} 
 
One of the goals of dealing with sparse matrices is to make efficient use of the sparsity in order 
to minimize storage throughout the computations, as well as to minimize the required number 
of operations. Sparse linear systems are often solved using different computational techniques 
than those employed to solve dense systems. 
 
There are many sparse matrices formats such as DIA, ELL, CSR, HYB, PKT and COO for 
both structure and unstructured matrices [28]. The Combinatorial Multigrid Solver is based 
on Compressed Column Storage (CCS) and as it is designed for symmetric matrices, we focus 
on the CSR format. It is easy to see that CSR is equal to CCS for symmetric matrices with the 
difference that we use row major storing-access. 
 
2.2 Overview of the Methods 
 
Below are short descriptions of each of the methods to be discussed, along with brief notes 
on the classification of the methods in terms of the class of matrices for which they are most 
appropriate. In later sections of this chapter more detailed descriptions of these methods are 
given [10]. 
 
 
• Stationary Methods 
 
– Jacobi. 
The Jacobi method is based on solving for every variable locally with respect 
to the other variables; one iteration of the method corresponds to solving for 
every variable once. The resulting method is easy to understand and 
implement, but convergence is slow. 
 
– Gauss-Seidel 
The Gauss-Seidel method is like the Jacobi method, except that it uses updated 
values as soon as they are available. In general, if the Jacobi method converges, 
the Gauss-Seidel method will converge faster than the Jacobi method, though 
still relatively slowly.
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
    2.2  Overview of the Methods 
———————————————————————————————————— 
5 
 
 
– SOR 
Successive Overrelaxation (SOR) can be derived from the Gauss-Seidel 
method by introducing an extrapolation parameter ω. For the optimal choice 
of ω, SOR may converge faster than Gauss-Seidel by an order of magnitude. 
 
 
• Nonstationary Methods 
 
– Conjugate Gradient (CG). 
The conjugate gradient method derives its name from the fact that it generates 
a sequence of conjugate (or orthogonal) vectors. These vectors are the 
residuals of the iterations. They are also the gradients of a quadratic functional, 
the minimization of which is equivalent to solving the linear system. Conjugate 
gradient (CG) is an extremely effective method when the coefficient matrix is 
symmetric positive definite (SPD), since storage for only a limited number of 
vectors is required. 
 
– Generalized Minimal Residual (GMRES). 
The Generalized Minimal Residual method computes a sequence of 
orthogonal vectors, and combines these through a least-squares solve and 
update. However, it requires storing the whole sequence, so that a large 
amount of storage is needed. For this reason, restarted versions of this method 
are used. In restarted versions, computation and storage costs are limited by 
specifying a fixed number of vectors to be generated. This method is useful 
for general nonsymmetric matrices. 
 
– BiConjugate Gradient (BiCG). 
The biconjugate gradient (BiCG) method generates two CG-like sequences of 
vectors, one based on a system with the original coefficient matrix A, and one 
on AT. Instead of orthogonalizing each sequence, they are made mutually 
orthogonal, or “bi-orthogonal”. This method, like CG, uses limited storage. It 
is useful when the matrix is nonsymmetric and nonsingular; however, 
convergence may be irregular, and there is a possibility that the method will 
break down. BiCG requires a multiplication with the coefficient matrix and 
with its transpose at each iteration. 
 
 
2.3 Stationary Methods 
 
Iterative methods that can be expressed in the simple form 
  
           𝑥(𝑘) = 𝐵𝑥(𝑘−1) + 𝑐,                         (2.3) 
 
(where neither 𝐵 nor 𝑐 depend upon the iteration count 𝑘) are called stationary iterative 
methods. In this section, we present the three main stationary iterative methods: the Jacobi 
method, the Gauss-Seidel method and the Successive Overrelaxation (SOR) method. 
 
2.3.1 The Jacobi Method 
 
The Jacobi method is easily derived by examining each of the n equations in the linear system 
𝐴𝑥 = 𝑏 in isolation. If in the 𝑖𝑡ℎ  equation 
 
∑ 𝛼𝑖,𝑗
𝑛
𝑗=1 𝑥𝑗 = 𝑏𝑖, 
 
 
we solve for the value of x𝑖 while assuming the other entries of x remain fixed, we obtain 
 
𝑥𝑖 = (𝑏𝑖 − ∑ 𝛼𝑖,𝑗𝑗≠𝑖 𝑥𝑗) 𝛼𝑖,𝑖⁄  .                    (2.4)
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
2.  LINEAR SYSTEM SOLUTION METHODS     
———————————————————————————————————— 
6 
 
 
This suggests an iterative method defined by 
 
     𝑥𝑖
(𝑘) = (𝑏𝑖 − ∑ 𝛼𝑖,𝑗𝑗≠𝑖 𝑥𝑗
(𝑘−1)) 𝛼𝑖,𝑖⁄                (2.5) 
 
which is the Jacobi method. Note that the order in which the equations are examined is 
irrelevant, since the Jacobi method treats them independently. For this reason, the Jacobi 
method is also known as the method of simultaneous displacements, since the updates 
could in principle be done simultaneously.  
 
In matrix terms, the definition of the Jacobi method in (2.3) can be expressed as 
  
 𝑥(𝑘) = 𝐷−1(𝐿 + 𝑈)𝑥(𝑘−1) + 𝐷−1𝑏,                (2.6) 
 
where the matrices D, −L and −U represent the diagonal, the strictly lower-triangular, and the 
strictly upper-triangular parts of A, respectively. 
 
The pseudocode for the Jacobi method is given in below in Figure 2.1. Note that an auxiliary 
storage vector, 𝑥 is used in the algorithm. It is not possible to update the vector 𝑥 in place, 
since values from 𝑥(𝑘−1)are needed throughout the computation of 𝑥(𝑘) 
 
 
 
Figure 2.1: The Jacobi Method 
  
 
2.3.2 The Gauss-Seidel Method 
 
Consider again the linear equations (2.2). If we proceed as with the Jacobi Method, but now 
assume that the equations are examined one at a time in sequence, and the previously 
computed results are used as they are available, we obtain the Gauss-Seidel method 
pseudocode in Figure 2.2. 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
   2.3.2.  The Gauss-Seidel Method  
———————————————————————————————————— 
   7  
 
 
 
Figure 2.2: The Gauss-Seidel Method 
 
   𝑥𝑖
(𝑘) = (𝑏𝑖 − ∑ 𝛼𝑖,𝑗𝑖>𝑗 𝑥𝑗
(𝑘) − ∑ 𝛼𝑖,𝑗𝑗>𝑖 𝑥𝑗
(𝑘−1)) 𝛼𝑖,𝑖⁄           (2.7) 
 
Two important facts about the Gauss-Seidel method should be noted. First, the computations 
in (2.5) appear to be serial. Since each component of the new iterate depends upon all 
previously computed components, the updates cannot be done simultaneously as in the Jacobi 
method. Second, the new iterate 𝑥
(𝑘)
depends upon the order in which the equations are 
examined. The Gauss-Seidel method is sometimes called the method of successive 
displacements to indicate the dependence of the iterates on the ordering. If this ordering is 
changed, the components of the new iterate (and not just their order) will also change. 
 
These two points are important because if A is sparse, the dependency of each component of 
the new iterate on previous components is not absolute. The presence of zeros in the matrix 
may remove the influence of some of the previous components. Using a judicious ordering of 
the equations, it may be possible to reduce such dependence, thus restoring the ability to make 
updates to groups of components in parallel. However, reordering the equations can affect 
the rate at which the Gauss-Seidel method converges. A poor choice of ordering can degrade 
the rate of convergence; a good choice can enhance the rate of convergence. 
 
In matrix terms, the definition of the Gauss-Seidel method in (2.5) can be expressed as 
 
𝑥(𝑘) = (𝐷 − 𝐿)−1(𝑈𝑥(𝑘−1) + 𝑏)                  (2.8) 
 
As before D, −L and −U represent the diagonal, lower-triangular, and upper-triangular parts 
of A, respectively. 
 
2.3.3 The Successive Overrelaxation Method (SOR) 
 
The Successive Overrelaxation Method, or SOR, is devised by applying extrapolation to the 
Gauss-Seidel method. This extrapolation takes the form of a weighted average between the 
previous iterate and the computed Gauss-Seidel iterate successively for each component: 
 
𝑥𝑖
(𝑘) = 𝜔𝑥𝑖
(𝑘)
+ (1 − 𝜔)𝑥𝑖
(𝑘−1)
. 
 
(where 𝑥𝑖 denotes a Gauss-Seidel iterate, and 𝜔 is the extrapolation factor). The idea is to 
choose a value for ω that will accelerate the rate of convergence of the iterates to the solution.
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
2.  LINEAR SYSTEM SOLUTION METHODS     
———————————————————————————————————— 
   8  
 
In matrix terms, the successive overrelaxation (SOR) algorithm can be written as follows: 
 
𝑥(𝑘) = (𝐷 − 𝜔𝐿)−1(𝜔𝑈 + (1 − 𝜔)𝐷)𝑥(𝑘−1) + 𝜔(𝐷 − 𝜔𝐿)−1𝑏.             (2.9) 
 
The pseudocode for the SOR algorithm is given above above in Figure 2.3. 
 
 
Figure 2.3: The SOR Method 
 
 
2.4 Nonstationary Methods 
 
Nonstationary methods differ from stationary methods in that the computations involve 
information that changes at each iteration. Typically, constants are computed by taking inner 
products of residuals or other vectors arising from the iterative method. 
 
2.4.1 Generalized Minimal Residual (GMRES) 
 
The GMRES method generates a sequence of orthogonal vectors, but in the absence of 
symmetry this can no longer be done with short recurrences; instead, all previously computed 
vectors in the orthogonal sequence have to be retained. For this reason are used restarted 
versions of the method. GMRES algorithm has the property that residual norm ∥ 𝑏 − 𝐴𝑥𝑖 ∥ 
can be computed without the iterate having been formed. Thus, the expensive action of 
forming the iterate can be postponed until the residual norm is deemed small enough. The 
GMRES iterates are constructed as: 
 
𝑥𝑖 = 𝑥0 + 𝑦1𝑢
1+. . . +𝑦𝑖𝑢
𝑖 ,        (2.10) 
 
The GMRES method retains orthogonality of the residuals by using long recurrences, at the 
cost of a larger storage demand. 
 
The pseudocode for the restarted GMRES algorithm with preconditioner M is given in 
Figure 2.4.
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
                                                           2.4.1.  Generalized Minimal Residual (GMRES)   
———————————————————————————————————— 
   9  
 
 
 
Figure 2.4: The Preconditioned GMRES(m) Method 
 
2.4.2 Conjugate Gradient (CG) 
 
The Conjugate Gradient method is an effective method for symmetric positive definite 
systems. It is the oldest and best known of the nonstationary methods discussed here. The 
method proceeds by generating vector sequences of iterates (i.e., successive approximations 
to the  solution), residuals corresponding to the iterates, and search directions used in 
updating the iterates and residuals. Although the length of these sequences can become large, 
only a small number of vectors needs to be kept in memory. In every iteration of the method, 
two inner products are performed in order to compute update scalars that are defined to make 
the sequences satisfy certain orthogonality conditions. On a symmetric positive definite linear 
system these conditions imply that the distance to the true solution is minimized in some 
norm. 
 
The pseudocode for the Preconditioned Conjugate Gradient (PCG) Method is given below in 
Figure 2.5. It uses a preconditioner M; for M = I one obtains the unpreconditioned version 
of the Conjugate Gradient Algorithm. 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 
2.  LINEAR SYSTEM SOLUTION METSHODS     
———————————————————————————————————— 
   10  
 
 
 
Figure 2.5: The Preconditioned Conjugate Gradient Method 
 
 
 
2.4.3 BiConjugate Gradient (BiCG) 
 
The Conjugate Gradient method is not suitable for nonsymmetric systems because the residual 
vectors cannot be made orthogonal with short recurrences. The GMRES method retains 
orthogonality of the residuals by using long recurrences, at the cost of a larger storage demand. 
The BiConjugate Gradient method takes another approach, replacing the orthogonal sequence 
of residuals by two mutually orthogonal sequences, at the price of no longer providing a 
minimization. 
 
The update relations for residuals in the Conjugate Gradient method are augmented in the 
BiConjugate Gradient method by relations that are similar but based on 𝐴𝑇 instead of A. The 
pseudocode for the Preconditioned BiConjugate Gradient Method with preconditioner M is 
given in the top of the next page in Figure 2.6.
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
                                                                                   2.4.3 BiConjugate Gradient (BiCG) 
———————————————————————————————————— 
   11  
 
 
 
Figure 2.6: The Preconditioned BiConjugate Gradient Method 
 
2.5 Computational Aspects of the Methods 
 
Efficient solution of a linear system includes the selection of the proper choice of iterative 
method. However, to obtain good performance, consideration must also be given to the 
computational kernels of the method and how efficient they can be executed on the target 
architecture. The performance of direct methods, is largely that of the factorization of the 
matrix. However, this lower efficiency of execution does not imply anything about the total 
solution time for a given system. Furthermore, iterative methods are usually simpler to 
implement than direct methods, and since no full factorization has to be stored, they can 
handle much larger systems than direct methods. Table 2.1 lists the type of operations 
performed per iteration and the storage required for each method (without preconditioning). 
 
 
Method Inner 
Product 
SAXPY Matrix-Vector 
Product 
Precond 
Solve 
Storage 
Requirements 
JACOBI   1α  Matrix + 3n 
Gauss 
Seidel 
 1 1α   
SOR  1 1α  Matrix + 2n 
GMRES i+1 i+1 1 1 Matrix + (i+5)n 
CG 2 3 1 1 Matrix + 6n 
BiCG 5 5 1/1 1/1 Matrix + 10n 
Table 2.1: Summary of Operations for Iteration i: ”𝑎/𝑏”  means ”𝛼” multiplications with 
the matrix and ”𝑏”with its transpose, and storage requirements for the methods in 
iteration i: n denotes the order of the matrix.
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 
2.  LINEAR SYSTEM SOLUTION METSHODS     
———————————————————————————————————— 
   12  
 
 
2.6 Multigrid Methods 
 
Before closing this chapter we would like to discuss about the multigrid (MG) methods. MG 
methods in numerical analysis is defined as a group of algorithms for solving differential 
equations using a hierarchy of discretizations. They are an example of a class of techniques 
called multiresolution methods, very useful in problems exhibiting multiple scales of behavior. 
For example, many basic relaxation methods exhibit different rates of convergence for short- 
and long-wavelength components, suggesting these different scales be treated differently, as 
in a Fourier analysis approach to multigrid. MG methods can be used as solvers as well as 
preconditioners. 
 
The main idea of MG is to accelerate the convergence of a basic iterative method by global 
correction from time to time, accomplished by solving a coarse problem2. This principle is 
similar to interpolation between coarser and finer grids. The typical application for multigrid 
is in the numerical solution of elliptic partial differential equations in two or more dimensions. 
 
Multigrid can be applied in combination with any of the common discretization 
techniques.MG methods are among the fastest solution techniques known today. In contrast 
to other methods, multigrid methods are general in that they can treat arbitrary regions and 
boundary conditions. They do not depend on the separability of the equations or other special 
properties of the equation. 
 
 
———————————— 
2 Coarse problem is an auxiliary system of equations used in an iterative method for the 
solution of a given larger system of equations. It is basically a version of the same problem at 
a lower resolution, retaining its essential characteristics, but with fewer variables. 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 13 
 
Chapter 3 
Introduction to Preconditioners 
 
3.1 Introduction 
 
In chapter 2 we discussed about many iterative methods. The convergence rate of iterative 
methods depends on spectral properties of the coefficient matrix. Hence one may attempt to 
transform the linear system into one that is equivalent in the sense that it has the same solution, 
but that has more favorable spectral properties. A preconditioner  is a matrix that effects 
such a transformation. For SPD systems, the rate of convergence of the conjugate gradient 
method depends on the distribution of the eigenvalues of A. The purpose of preconditioning 
is that the transformed matrix in question will have a smaller spectral condition number, 
and/or eigenvalues clustered around 1. For nonsymmetric problems the situation is more 
complicated, and the eigen-values may not describe the convergence of nonsymmetric matrix 
iterations like GMRES. On parallel machines there is a further tradeoff between the efficacy 
of a preconditioner in the classical sense, and its parallel efficiency. Many of the traditional 
preconditioners have a large sequential component. 
 
If M is a nonsingular matrix that approximates A, then the linear system (3.1) has the same 
solution as (2.1) but must be significantly easier to solve. 
 
𝑀−1𝐴𝑥 = 𝑀−1𝑏 ,                      (3.1) 
 
     𝐴𝑀−1𝑦 = 𝑏, 𝑥 = 𝑀−1𝑦            (3.2) 
 
          𝑀1
−1𝐴𝑀2
−1𝑦 = 𝑀1
−1𝑏, 𝑥 = 𝑀2
−1𝑦            (3.3) 
 
The system (3.1) is preconditioned from the left, (3.2) is preconditioned from the right. At 
(3.3) is performed split preconditioning where the preconditioner is 𝑀 = 𝑀1𝑀2. 
 
Iterative algorithms such as the Conjugate Gradient method, converge to a solution using only 
matrix-vector products with A. It is well known that iterative algorithms suffer from slow 
convergence properties when the condition number of A, κ(A), which is defined as the ration 
of the largest over the minimum eigenvalue of A, is large. What preconditioned iterative 
methods attempt to do is to remedy the problem by changing the linear system to 𝑀−1𝐴𝑥 =
𝑀−1𝑏. In this case, the algorithms use matrix-vector products with A, and solve linear systems 
of the form 𝑀𝑦 = 𝑧. So now the speed of convergence depends on the condition number  
𝜅(𝐴,𝑀). 
 
The condition number is defined as: 
 
𝜅(𝐴,𝑀) = 𝑚𝑎𝑥𝑥
𝑋𝑇𝐴𝑥
𝑋𝑇𝑀𝑥
∙ 𝑚𝑎𝑥𝑥
𝑋𝑇𝑀𝑥
𝑋𝑇𝐴𝑥
.           (3.4) 
 
where 𝑥 is taken to be outside the null space of A. There are two contradictory goals one has 
to deal in constructing a preconditioner M: (i) The linear systems in M must be easier than 
those in A to solve, (ii) The condition number must be small so it will minimize the number 
of iterations. 
 
Historically, preconditioners were natural parts of the matrix A. We analyze some of the most 
well-known preconditioners below. 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 
3.  INTRODUCTION TO PRECONDITIONERS     
———————————————————————————————————— 
14 
 
3.2 Jacobi Preconditioner 
 
The simplest preconditioner consists of just the diagonal of the matrix 
 
    𝑚𝑖,𝑗 = {
𝛼𝑖,𝑖,
0,
𝑖𝑓 𝑖 = 𝑗
𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
 
 
This is known as the (point) Jacobi preconditioner. 
 
For the model problem, 𝜅(𝐵−1𝐴) = 𝑂(𝑛) = 𝜅(𝐴), so the asymptotic rate of convergence is 
not improved with diagonal scaling. B in this case does not need to be factored. The storage 
required for the preconditioner is 𝑂(𝑛) since it is a sparse matrix. And, the preconditioner 
system is very easy to solve, since it simply requires dividing each vector entry by the 
corresponding diagonal entry of B. 
 
Even through the asymptotic rate of convergence is not improved, diagonal scaling can 
sometimes make the difference between convergence and non-convergence for an ill-
conditioned matrix 𝐴. Moreover, diagonal scaling generally achieves some reduction in the 
number of iterations, and is so cheap to apply that it might as well be done. 
 
 
3.3 SSOR Preconditioner 
 
Another example of a preconditioner is the SSOR preconditioner which like the Jacobi 
preconditioner, can be easily derived from the coefficient matrix without any work. 
 
Assume we have a symmetric matrix A. If this matrix is decomposed as 
 
𝐴 = 𝐷 + 𝐿 + 𝐿𝑇 
 
in its diagonal, lower, and upper triangular part, the SSOR matrix is defined as 
 
𝑀 = (𝐷 + 𝐿)𝐷−1(𝐷 + 𝐿)𝑇 
 
or, parametrized by ω 
 
   𝑀(𝜔) =
1
2−𝜔
(
1
𝜔
𝐷 + 𝐿) (
1
𝜔
𝐷)
−1
(
1
𝜔
𝐷 + 𝐿)
𝑇
. 
 
The SSOR matrix is given in factored form, so this preconditioner shares many properties of 
other factorization-based methods. For example, its suitability for vector processors or parallel 
architectures depends strongly on the ordering of the variables. 
 
 
3.4 Incomplete Factorization Preconditioners 
 
A broad class of preconditioners is based on incomplete factorizations of the coefficient 
matrix. We call a factorization incomplete if during the factorization process certain fill 
elements, nonzero elements in the factorization in positions where the original matrix had a 
zero, have been ignored. Such a preconditioner is then given in factored form 𝑀 + 𝐿𝑈 with L 
lower and U upper triangular. The efficacy of the preconditioner depends on how well 𝑀−1 
approximates A−1 . 
 
When a sparse matrix is factored by Gaussian elimination, fill-in usually takes place. In that 
case, sparsity-preserving pivoting techniques can be used to reduce it. The triangular factors L 
and U of the coefficient matrix A are considerably less sparse than A. 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 
                                                              3.4.  Incomplete Factorization Preconditioners 
———————————————————————————————————— 
15 
 
Sparse direct methods are not considered viable for solving very large linear systems due to 
time and space limitations, however, by discarding part of the fill-in in the course of the 
factorization process, simple but powerful preconditioners can be obtained in the form M = 
LU where L and U are the incomplete (approximate) LU factors.  
 
Summarizing, it can be said that existing solutions to the problem for incomplete factorization 
preconditioners for general SPD matrices follow one of two cases: simple inexpensive fixes 
that result in low quality preconditioners in terms of convergence rating, or sophisticated, 
expensive strategies that produce high quality preconditioners. 
 
 
 
 
 
 
 
 
 
 
 
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
  
 
 
 
 
 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 
 
17 
 
Chapter 4 
A Multigrid-Like SDD solver 
    
In this chapter we give some background material on support theory of preconditioning 
and we describe the CMG, the solver that we studied and tried to optimize. The CMG 
solver was proposed by I. Koutis and Gary Miller and is characterized by the form of the 
preconditioner [11] [12]. The first implementation was in MATLAB [13] and later it 
transformed into C code. The basis of our implementation is the C code of the CMG solver 
[2] [3]. 
 
 
4.1 Support Theory for Preconditioning 
 
The main goal of the support theory is to provide techniques to bound the generalized 
eigenvalues and condition number for a matrix pencil (A, B) where B  is  a preconditioner for 
A. In this section we review fragments of support theory that are relevant to the design of the 
CMG. We refer the reader to [14] for an extensive explosion of support theory. 
 
4.1.1 Electric Networks as Graphs – Support Basics 
 
The cornerstone of combinatorial preconditioners is the following intuitive yet paradigm-
shifting idea explicitly proposed by Vaidya [15]: A preconditioner for the Laplacian of a 
graph A should be the Laplacian of a simpler graph B, derived in a principled fashion 
from A. 
 
There is a fairly well known analogy between graph Laplacians and resistive networks [16]. If 
G is seen as an electrical network with the resistance between nodes 𝑖 and 𝑗 being 1 𝑤𝑖,𝑗⁄  , 
then in the equation 𝐴𝑣 = 𝑖, if vis the vector of voltages at the node, 𝑖 is the vector of currents. 
Also, the quadratic form 𝑣𝑇𝐴𝑣 = ∑ 𝑤𝑖,𝑗𝑖,𝑗 (𝑣𝑖 − 𝑣𝑗)
2
 expresses the power dissipation on 
G, given the node voltages 𝑣. In view of this, the construction of a good preconditioner B 
amounts to the construction of a simpler resistive network (for example by deleting some 
resistances) with an energy profile close to that of 𝐴. 
 
The support of 𝐴 by 𝐵, defined as 𝜎(𝐴 𝐵⁄ ) = 𝑚𝑎𝑥𝑣
𝑣𝑇𝐴𝑣
𝑣𝑇𝐵𝑣
  is the number of copies of 𝐵 that 
are needed to support the power dissipation in 𝐴, for all settings of voltages. The principal 
reason behind the introduction of the notion of support, is to express its local nature, captured 
by the Splitting Lemma. 
 
Lemma 4.1 (Splitting Lemma) If 𝐴 = ∑ 𝐴𝑖
𝑚
𝑖=1  and 𝐵 = ∑ 𝐵𝑖
𝑚
𝑖=1  where 𝐴𝑖, 𝐵𝑖 are 
Laplacians, then 𝜎(𝐴, 𝐵) ≤ 𝑚𝑎𝑥𝑖𝜎(𝐴𝑖, 𝐵𝑖) 
 
The Splitting Lemma allows us to bound the support of A by B, by splitting the power 
dissipation in A into small local pieces, and “supporting” them by also local pieces in B. 
 
For example, in his work Vaidya proposed to take B as the maximal weight spanning tree of 
A. Then, it is easy to show that 𝜎(𝐵, 𝐴) ≤ 1, intuitively because more resistances always 
dissipate more power. In order to bound 𝜎(𝐴, 𝐵), the basic idea to let the 𝐴𝑖be edges on A(the 
ones not existing in B), and let 𝐵𝑖 be the unique path in the tree that connects the two end-
points of 𝐴𝑖 . Then one can bound separately each 𝜎(𝐴𝑖, 𝐵𝑖). In fact, it can be shown that any 
edge in A that doesn’t exist in B, can be supported only by the path 𝐵𝑖.
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 
4.  A MULTIGRID-LIKE SDD SOLVER     
———————————————————————————————————— 
18 
 
As an example, consider the example in Figure 4.1 of the two (dashed) edges A1, A2 and their 
two paths in the spanning tree (solid) that share one edge “e”. 
 
In this example, the dilation of the mapping is equal to 3, i.e. the length of the longest of two 
paths. Also, as “e” is uses two times, we say that the congestion of the mapping is equal to 2. 
A core Lemma in Support Theory [14] [17] is that the support can be upper bounded by the 
product congestion ∗ dilation. 
 
 
Figure 4.1: A graph and its spanning tree - obtained by deleting the dashed edges 
  
 
4.1.2 Steiner Preconditioners 
 
Steiner preconditioners, introduced in [18] and extended [19] introduce external nodes into 
preconditioners. The proposed preconditioner is based on a partitioning of the n vertices in 
V  into m vertex disjoint clusters 𝑉𝑖 For each 𝑉𝑖, the preconditioner contains a star graph 
𝑆𝑖with leaves corresponding to the vertices in 𝑉𝑖 rooted at a vertex 𝑟𝑖. The roots 𝑟𝑖 are 
connected and form the quotient graph Q. This general setting is illustrated in Figure 4.2 at 
next page. 
 
Let D′ be the total degree of the leaves in the Steiner preconditioner S. Let the restriction R 
be an 𝑛 ×𝑚 matrix, where 𝑅(𝑖, 𝑗) = 1 if vertex 𝑖 is in cluster 𝑗 and 0 otherwise. Then, the 
Laplacian of S has 𝑛 +𝑚 vertices, and the algebraic form 
  
𝑆 = (
𝐷′ −𝐷′𝑅
−𝑅𝑇𝐷′ 𝑄 + 𝑅𝑇𝐷′𝑅
).         (4.1) 
 
A troublesome feature of the Steiner preconditioner S is the extra number of 
dimensions/vertices. Gremban and Miller [18] proposed that every time a system of the form 
𝐵𝑧 = 𝑦 is solved in a usual preconditioned method, the system 
 
𝑆 (
𝑧
𝑧′
) = (
𝑦
0
) 
  
should be solved instead, for a set of don't care variables 𝑧′.  They also showed that the 
operation is equivalent to preconditioning with the dense matrix 
  
𝐵 = 𝐷′ − 𝑉(𝑄 + 𝐷𝑄)
−1
𝑉𝑇          (4.2) 
 
where 𝑉 = 𝐷′𝑅 and 𝐷𝑄 = 𝑅
𝑇𝐷′𝑅. The matrix B is called the Schur complement of S with 
respect to the elimination of the roots 𝑟𝑖. It is a well-known fact that B is also a Laplacian. 
 
The analysis of the support 𝜎(𝐴 𝑆⁄ ), is identical to that for the case of subgraph 
preconditioners. For example, going back to Figure 4.2 the edge (v1, v4) can only be 
supported by the path (v1, r1, v4), and the edge (v4, v7) only by the path (v4, r1, r2, v7). 
Similarly we can see the mappings from edges in A to paths in S for every edge in A. In the 
example, the dilation of the mapping is 3, and it can be seen that to minimize the congestion 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
     4.1.2.  Steiner Preconditioners 
———————————————————————————————————— 
19 
 
on every edge of S (i.e. make it equal to 1), we need to take D′ = D, where D are the total 
degrees of the nodes in A, and 𝑤(𝑟1, 𝑟2) = 𝑤(𝑣3, 𝑣5) + 𝑤(𝑣4, 𝑣7). More generally, for two 
roots 𝑟𝑖,𝑟𝑗 we should have 
 
𝑤(𝑟𝑖, 𝑟𝑗) = ∑ 𝑤𝑖,𝑗𝑖′∈𝑉𝑖,𝑗′∈𝑉𝑗 . 
 
Under this construction, the algebraic form of the quotient Q can be seen to be 𝑄 = 𝑅𝑇𝐴𝑅. 
 
In [19] it was shown that the support 𝜎(𝑆 𝐴⁄ ) reduces to bounding the support 𝜎(𝑆𝑖, 𝐴[𝑉𝑖]), 
for all 𝑖, where 𝐴[𝑉𝑖] denotes the graph induced in A by the vertices 𝑉𝑖. The key behind 
bounding 𝜎(𝑆𝑖, 𝐴[𝑉𝑖]) is called conductance. Let us give the definition of conductance. 
 
Definition 4.1 The conductance 𝛷(𝛢) of a graph 𝐴 =  (𝑉, 𝐸, 𝑤) is defined as 
 
𝛷(𝛢) = 𝑚𝑖𝑛𝑆⊆𝑉
𝑤(𝑆, 𝑉 − 𝑆)
𝑚𝑖𝑛 (𝑤(𝑆), 𝑤(𝑉 − 𝑆))
 
 
where w(S, V − S) denotes the total weight connecting the sets S and V − S, and where w(S) 
denotes the total weight incident to the vertices in S. 
 
The main result of [19] is captured by the following Theorem. 
 
Theorem 4.1 The support 𝜎(𝑆 𝐴⁄ ) is bounded by a constant c independent from n, if and 
only if for all 𝑖 the conductance of the graph 𝐴0[𝑉𝑖] induced by the nodes in 𝑉𝑖 augmented by 
the edges leaving 𝑉𝑖  is bounded by a constant c′. 
 
 
 
Figure 4.2: A graph and its Steiner preconditioner. 
 
4.1.3 Predicting the performance of solvers  
 
Theorem 4.1 doesn’t give a way to pick clusters, but it does provide a way to avoid bad 
clustering. In recent work [20], Grady proposed a multigrid method where the construction 
of the “coarse” grid follows exactly the construction of the quotient graph in the previous 
section. Specifically, Grady’s algorithm proposes a clustering such that every cluster contains 
exactly one pre-specified ‘coarse’ nodes. It then defines the restriction matrix R and he lets the 
coarse grid be Q = RTAR, identically to the construction of the previous Section. The 
algorithm is iterated to construct a hierarchy of grids. The question then is whether the 
proposed clustering provides the guarantees that by Theorem 4.1 are necessary for the 
construction of a good Steiner preconditioner. The following Figure, is the Figure 2 of [20], 
with a choice of weights that force the depicted clustering.
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
4.  A MULTIGRID-LIKE SDD SOLVER      
———————————————————————————————————— 
20 
 
 
Figure 4.3: A bad clustering. 
 
Every cluster in Figure 4.3 contains exactly one black/coarse node. The problem with the 
clustering is that the top left cluster, has a very low conductance when M ≫ 1. In general, in 
order to satisfy the requirement of the previous Theorem, there are cases where the clustering 
has to contain clusters with no coarse nodes in them. As we will discuss in later the behavior 
of the multigrid algorithm proposed in [20] is closely related to the quality of the Steiner 
preconditioner induced by the clustering. This implies that the multigrid of [20] can suffer bad 
convergence. 
 
The canonical clustering in Grady’s algorithm is very suitable for GPU implementations, when 
other solvers may be less suitable. This gives to it an advantage on this type of hardware. Even 
in the presence of a number of relatively bad clusters, it can be faster relative to a solver that 
uses better clusters. However the advantage is lost when the computed clusters cross a negative 
threshold in quality, a threshold that depends on several hardware-dependent factors. The 
value of Support Theory is evident in this case. Grady’s algorithm can be instrumented with a 
very fast routine that measures the quality of the formed clusters and predicts its performance, 
and reverts to another solver when needed. One can also imagine hybrid clustering algorithms 
where the majority clusters are formed using the algorithm [20] and the ‘sensitive’ parts of the 
system are treated separately. 
 
4.2 The Combinatorial Multigrid Solver 
 
The present chapter describes the Combinatorial Multigrid Solver (CMG).At the beginning, we 
give a short review of multigrid solvers and then we describe the basic components of CMG. 
 
4.2.1 Related work on SDD solvers 
 
Multigrid was originally conceived as a method to solve linear systems that are generated by 
the discretization of the Laplace (Poisson) equation over relatively nice domains [21]. The 
underlying geometry of the domain leads to a hierarchy of grids 𝐴 = 𝐴0, . . . , 𝐴𝑑 that look 
similar at different levels of detail; the picture that the word multigrid often invokes to mind 
is that of a tower of 2D grids, with sizes  2𝑑−𝑖 × 2𝑑−𝑖 for 𝑖 = 0, . . . , d. Its provably 
asymptotically optimal behavior for certain classes of problems soon lead to an effort, known 
as Algebraic Multigrid (AMG), to generalize its principles to arbitrary matrices. In contrast to 
classical Geometric Multigrid (GMG) where the hierarchy of grids is generated by the 
discretization process, AMG constructs the hierarchy of “coarse” grids/matrices based only 
on the algebraic information contained in the matrix. Various flavors of AMG, based on 
different heuristic coarsening strategies, have been proposed in the literature. AMG has been 
proven successful in solving more problems than GMG, though sometimes at the expense of 
robustness, a by-product of the limited theoretical understanding. 
 
A solver with provable properties for arbitrary SDD matrices, perhaps the “holy grail” of the 
multigrid community, was discovered only recently. The path to it was Support Theory [14], a 
set of mathematical tools developed for the study of combinatorial subgraph preconditioners, 
originally introduced by Vaidya [15] [22].It has been at the heart of the seminal work of 
Spielman and Teng [23] who proved that SDD systems can be solved in nearly-linear time. 
Koutis and Miller [24] proved that SDD matrices with planar connection topologies (e.g. 4-
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
                      
4.2.1.  Related work on SDD solvers 
———————————————————————————————————— 
21 
 
connectivity in the image plane) can be solved asymptotically optimally, in 𝑂(𝑛) time for n-
dimensional matrices. The complexity of the Spielman and Teng solver was recently 
significantly improved by Koutis, Miller and Peng [25] [26], who described an O(𝑚𝑙𝑜𝑔𝑛) 
algorithm for the solution of general SDD systems with m non-zero entries. 
 
It is fair to say that these theoretically described solvers are still impractical due to the large 
hidden constants, and the complicated nature of the underlying algorithms. Combinatorial 
Multigrid (CMG) [11] is a variant of multigrid that reconciles theory with practice. Similarly to 
AMG, CMG builds a hierarchy of matrices/graphs. The essential difference from AMG is that 
the hierarchy is constructed by viewing the matrix as a graph, and using the discrete geometry 
of the graph, for example notions like graph separators and expansion. It is, in a way, a hybrid 
of GMG and AMG, or a discrete-geometric MG. The re-introduction of geometry into the 
problem allows us to prove sufficient and necessary conditions for the construction of a good 
hierarchy and claim strong convergence guarantees for symmetric diagonally dominant (SDD) 
matrices based on recent progress in Steiner preconditioning [18] [19] [22]. 
 
4.2.2 SDD linear systems as graphs 
 
In this subsection we discuss how SDD linear systems can be viewed entirely as graphs. 
Combinatorial preconditioning advocates a principled approach to the solution of linear 
systems. The core of CMG and all other solvers designed in the context of combinatorial 
preconditioning is in fact a solver for a special class of matrices, graph Laplacians. The 
Laplacian A of a graph 𝐺 =  (𝑉, 𝐸, 𝑤) with positive weights, is defined by: 
 
𝐴𝑖,𝑗 = 𝐴𝑗,𝑖 = −𝑤𝑖,𝑗 ∧ 𝐴𝑖,𝑖 = −∑𝐴𝑖,𝑗
𝑖≠𝑗
 
 
More general systems are solved via light-weight transformations to Laplacians. Consider for 
example the case where the matrix A has a number of positive off-diagonal entries, and the 
property  𝐴𝑖,𝑖 = ∑ |𝐴𝑖, 𝑗|𝑖≠𝑗 . Positive off-diagonal entries have been a source of confusion for 
AMG solvers, and various heuristics have been proposed. Instead, CMG uses a reduction 
known as double-cover [18]. Let A = 𝐴𝑝 + 𝐴𝑛 + 𝐷, where D is the diagonal of 𝐴 and 𝐴𝑝 is 
the matrix consisting only of the positive off-diagonal entries of 𝐴. It is easy to verify that 
  
 
𝐴𝑥 = 𝑏 ⟺ (
𝐷 + 𝐴𝑛 −𝐴𝑝
  −𝐴𝑝 𝐷 + 𝐴𝑛
) (
   𝑥
−𝑥
 ) = (
   𝑏
−𝑏
  ) 
 
In this way, the original system is reduced to a Laplacian system, while at most doubling the 
size. In practice it is possible to exploit the obvious symmetries of the new system, to solve it 
with an even smaller space and time overhead. 
  
Matrices of the form 𝐴 + 𝐷𝑒 , where A is a Laplacian and 𝐷𝑒 is a positive diagonal matrix have 
also been addressed in various ways by different AMG implementations. In CMG, we again 
reduce the system to a Laplacian. If 𝑑𝑒 is the vector of the diagonal elements of D, we have 
 
𝐴𝑥 = 𝑏 ⟺
(
 
 
 𝐴 + 𝐷𝑒       0               −𝑑𝑒
0
−𝑑𝑒
𝑇
  𝐴 + 𝐷𝑒       −𝑑𝑒
      −𝑑𝑒
𝑇       ∑𝑑𝑒(𝑖)
𝑖
 
)
 
 
(
   𝑥
−𝑥
   0
 ) = (
   𝑏
−𝑏
   0
 ) 
 
Again it’s possible to implement the reduction in a way that exploits the symmetry of the new 
system, and with a small space and time overhead work only implicitly with the new system. 
 
A symmetric matrix A is called diagonally dominant (SDD), if  𝐴𝑖,𝑖 ≥ ∑ |𝐴𝑖, 𝑗|𝑖≠𝑗 .  The two 
reductions above can reduce any SDD linear system to a Laplacian system. Symmetric positive 
definite matrices (SPD) with non-positive off-diagonals are known as M-matrices. It is well 
known that if A is an M-matrix, there is a positive diagonal matrix D such that A = DLD  where 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
4.  A MULTIGRID-LIKE SDD SOLVER                       
———————————————————————————————————— 
22 
 
L is a Laplacian. Assuming D  is known, an M -system can also be reduced to a Laplacian 
system via a simple change of variables. In many application D is given, or it can be recovered 
with some additional work [23]. 
 
There is a one-to-one correspondence between Laplacians and graphs, so we will be often 
using the terms interchangeably. 
 
4.2.3 A graph decomposition algorithm 
 
The crucial step for the construction of a good Steiner preconditioner is the computation of 
a group decomposition that satisfies, as best as possible, the requirements of Theorem 4.1. 
Before the presentation of the Decompose-Graph algorithm, that extends the ideas of [19], 
we need to introduce a couple of definitions. Let 𝑣𝑜𝑙𝐺(𝑣) denote the total weight incident to 
node 𝑣 in graph G. The 𝑤𝑒𝑖𝑔ℎ𝑡𝑒𝑑 𝑑𝑒𝑔𝑟𝑒𝑒 of a vertex v is defined as the ratio 
 
𝑤𝑑(𝑣) =
𝑣𝑜𝑙(𝑣)
𝑚𝑎𝑥𝑢∈𝑁(𝑣)𝑤(𝑢, 𝑣)
 
 
The average weighted degree of the graph is defines as 
 
 𝑎𝑤𝑑(𝐺) = (
1
𝑛
)∑ 𝑤𝑑(𝑣)𝑦∈𝑁 . 
 
 
Figure 4.4:  Decompose Graph Algorithm 
 
 
It is not very difficult to prove that the algorithm Decompose-Graph presented in Figure 
4.4 produces a partitioning where the conductance of each cluster depends only on 𝑎𝑤𝑑(𝐴) 
and the constant 𝜅. In fairly general sparse topologies that allow high degree nodes, 𝑎𝑤𝑑(𝐴) 
is constant and the number of clusters m returned by the algorithm is such that 𝑛 𝑚⁄ > 2 
(and in practice larger than 3 or 4).  
 
4.2.4 The Multigrid algorithm 
 
In this subsection we outline the intuition behind Steiner preconditioners and multigrid. 
Details and proofs can be found in [22]. Algebraically, any of the classic preconditioned 
iterative methods, such as the Jacobi and Gauss-Seidel iteration, is nothing but a matrix S, 
which gets applied implicitly to the current error vector 𝑒, to produce a new error vector 𝑒′ =
𝑆𝑒. For example, in the Jacobi iteration we have 𝑆 = (𝐼 − 𝐷−1𝐴). This has the effect that it 
reduces effectively only part of the error in a given iterate, namely the components that lie in 
the low eigenspaces of S (usually referred to as high frequencies of A). The main idea behind 
a two-level multigrid is that the current smooth residual error 𝑟 = 𝑏 − 𝐴𝑥, can be used to 
calculate a correction 𝑅𝑇𝑄−1𝑅𝑟, where Q is a smaller graph and R is an m×n restriction   
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
                      
4.2.4.  The Multigrid algorithm 
———————————————————————————————————— 
23 
 
operator. The correction is then added to the iterate 𝑥. The hope here is that for smooth 
residuals, the low-rank matrix 𝑅𝑇𝑄−1𝑅𝑟 is a good approximation of 𝐴−1. Algebraically, this 
correction is the application of the operator 𝑇 = (𝐼 − 𝑅𝑇𝑄−1𝑅𝐴) to the error vector e. The 
choice of Q is most often not independent from that of R, as the Galerkin condition is 
employed: 
𝑄 = 𝑅𝐴𝑅𝑇 
 
The Galerkin condition ensures that T is a projection operator with respect to the A-inner 
product. Two level convergence proofs are then based on bounds on the angle between the 
subspace Null(P) and the high frequency subspace of S. 
 
At a high level, the key idea behind CMG is that the provably small condition number 𝜅(𝐴, 𝐵) 
where B is given in expression 4.2, is equal to the condition number 𝜅(?̂?, ?̂?) where ?̂? =
𝐷−1 2⁄ 𝐴𝐷−1 2⁄  and  ?̂? = 𝐷−1 2⁄ 𝐵𝐷−1 2⁄ . This in turn implies a bound on the angle between 
the low frequency of ?̂? and the high frequency of ?̂? [19]. The latter subspace is Null(𝑅𝑇𝐷1 2⁄ ). 
This fact suggests to choose 𝑅𝑇𝐷1 2⁄  as the projection operator while performing relaxation 
with (𝐼 − ?̂?) on the system ?̂?𝑦 = 𝐷−1 2⁄ 𝑏, with 𝑦 = 𝐷1 2⁄ 𝑥. Combining everything, we get 
the following two-level algorithm in Figure 4.5. 
 
 
Figure 4.5: Two-level Combinatorial Multigrid 
 
The two-level algorithm can naturally be extended into a full multigrid algorithm, by 
recursively calling the algorithm when the solution to the system with Q is requested. This 
produces a hierarchy of graphs 𝐴 = 𝐴0, . . . , 𝐴𝑑 . The full multigrid algorithm we use, after 
simplifications in the algebra of the two-level scheme is as follows in Figure 4.6. 
 
 
Figure 4.6: Full Combinatorial Multigrid 
  
 
If 𝑛𝑛𝑧(𝐴)  denotes the number of non-zero entries in matrix 𝐴, we pick 
 
𝑡𝑖 = 𝑚ax { ⌈
𝑛𝑛𝑧(𝐴𝑖)
𝑛𝑛𝑧(𝐴𝑖+1)
− 1⌉,1}
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
4.  A MULTIGRID-LIKE SDD SOLVER                       
———————————————————————————————————— 
24 
 
This choice for the number of recursive calls, combined with the fast geometric decrease of 
the matrix sizes, targets a geometric decrease in the total work per level, while optimizing the 
condition number. 
 
As we can see at the above Figure 4.6, the operation of sparse matrix-vector multiplication 
(SpMV) occurs in steps 3, 7 and 11 of the CMG algorithm. Those multiplications are the 
bottleneck in CMG solver and my diploma thesis [3] was only focused on solving those 
bottlenecks accelerating the time required for SpMV operations. The full Combinatorial 
Multigrid is executed in PCG method every time we have to solve 𝑀𝑧𝑖−1  = 𝑟𝑖−1  in 
preconditioner solve step. In this thesis we accelerate both the PCG and CMG solver by 
porting them on an Nvidia GPU. Details are given in Chapter 6. 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 25 
 
Chapter 5  
GPU Architecture and the CUDA     
Programming Model 
 
5.1 Introduction 
 
General-Purpose Graphics Processing Unit (GPGPU) Computing only became practical and 
popular after ca. 2001, with the advent of both programmable shaders and floating point 
support on graphics processors. GPGPU computing is the use of a GPU together with a CPU 
to accelerate scientific, analytics, engineering, consumer, and enterprise applications.  
 
GPU-accelerated computing offers unprecedented application performance by offloading 
compute-intensive portions of the application to the GPU, while the remainder of the code 
still runs on the CPU as illustrated by Figure 5.1. From a user's perspective, applications 
simply run significantly faster. 
 
 
 
Figure 5.1: How GPU Acceleration Works 
  
A simple way to understand the difference between a CPU and GPU is to compare how they 
process tasks. A CPU consists of a few cores optimized for sequential serial processing while 
a GPU has a massively parallel architecture consisting of thousands of smaller, more efficient 
cores designed for handling multiple tasks simultaneously as shown by Figure 5.2 at the top 
of following page. 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
5.  GPU ARCHITECTURE AND THE CUDA PROGRAMMING MODEL                
———————————————————————————————————— 
26 
 
 
Figure 5.2: CPU vs GPU Architecture 
 
 
5.2 Hardware Implementation 
 
The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming 
Multiprocessors (SMs). When a program on the host CPU invokes a kernel grid, the blocks of 
the grid are enumerated and distributed to multiprocessors with available execution capacity. 
The threads of a thread block execute concurrently on one multiprocessor, and multiple thread 
blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new 
blocks are launched on the vacated multiprocessors. 
A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a 
large amount of threads, it employs a unique architecture called SIMT (Single-Instruction, 
Multiple-Thread). The instructions are pipelined to leverage instruction-level parallelism 
within a single thread, as well as thread-level parallelism extensively through simultaneous 
hardware multithreading as detailed in Hardware Multithreading. Unlike CPU cores they are 
issued in order however and there is no branch prediction and no speculative execution. 
 
5.2.1 SIMT Architecture  
The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel 
threads called warps. Individual threads composing a warp start together at the same program 
address, but they have their own instruction address counter and register state and are 
therefore free to branch and execute independently. The term warp originates from weaving, 
the first parallel thread technology. A half-warp is either the first or second half of a warp. 
A quarter-warp is either the first, second, third, or fourth quarter of a warp. 
When a multiprocessor is given one or more thread blocks to execute, it partitions them into 
warps and each warp gets scheduled by a warp scheduler for execution. The way a block is 
partitioned into warps is always the same; each warp contains threads of consecutive, 
increasing thread IDs with the first warp containing thread 0. Thread hierarchy, which 
describes how thread IDs relate to thread indices in the block, is described in a later section. 
A warp executes one common instruction at a time, so full efficiency is realized when all 32 
threads of a warp agree on their execution path. If threads of a warp diverge via a data-
dependent conditional branch, the warp serially executes each branch path taken, disabling 
threads that are not on that path, and when all paths complete, the threads converge back to 
the same execution path. Branch divergence occurs only within a warp; different warps execute 
independently regardless of whether they are executing common or disjoint code paths.
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
              5.2.1.  SIMT Architecture               
———————————————————————————————————— 
27 
 
The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector 
organizations in that a single instruction controls multiple processing elements. A key 
difference is that SIMD vector organizations expose the SIMD width to the software, whereas 
SIMT instructions specify the execution and branching behavior of a single thread. In contrast 
with SIMD vector machines, SIMT enables programmers to write thread-level parallel code 
for independent, scalar threads, as well as data-parallel code for coordinated threads. For the 
purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, 
substantial performance improvements can be realized by taking care that the code seldom 
requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in 
traditional code: Cache line size can be safely ignored when designing for correctness but must 
be considered in the code structure when designing for peak performance. Vector 
architectures, on the other hand, require the software to coalesce loads into vectors and 
manage divergence manually.  
 
5.3 Device Memory Model 
Threads may access data from multiple memory spaces during their execution as illustrated 
by Figure 5.3. Each thread has private local memory. Each thread block has shared memory 
visible to all threads of the block and with the same lifetime as the block. All threads have 
access to the same global memory. 
There are also two additional read-only memory spaces accessible by all threads: the constant 
and texture memory spaces. The global, constant, and texture memory spaces are optimized 
for different memory usages. Those memory spaces are persistent across kernel launches by 
the same application. Texture memory also offers different addressing modes, as well as data 
filtering, for some specific data formats. 
An instruction that accesses addressable memory (i.e., global, local, shared, constant, or texture 
memory) might need to be re-issued multiple times depending on the distribution of the 
memory addresses across the threads within the warp. How the distribution affects the 
instruction throughput this way is specific to each type of memory and described in the 
following sections. For example, for global memory, as a general rule, the more scattered the 
addresses are, the more reduced the throughput is.                                   
 
 
 
Figure 5.3: Memory Hierarchy 
5.3.1 Global Memory 
 
Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-
byte memory transactions. These memory transactions must be naturally aligned: Only the 32-
, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first 
address is a multiple of their size) can be read or written by memory transactions. 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
5.  GPU ARCHITECTURE AND THE CUDA PROGRAMMING MODEL                
———————————————————————————————————— 
28 
 
When a warp executes an instruction that accesses global memory, it coalesces the memory 
accesses of the threads within the warp into one or more of these memory transactions 
depending on the size of the word accessed by each thread and the distribution of the memory 
addresses across the threads. In general, the more transactions are necessary, the more unused 
words are transferred in addition to the words accessed by the threads, reducing the instruction 
throughput accordingly. For example, if a 32-byte memory transaction is generated for each 
thread’s 4-byte access, throughput is divided by 8. 
 
How many transactions are necessary and how much throughput is ultimately affected varies 
with the compute capability of the device. For devices of compute capability 1.1, the 
requirements on the distribution of the addresses across the threads to get any coalescing at 
all are very strict. For devices of compute capability 2.x, like the Tesla C2075 we use, and 
higher, the memory transactions are cached, so data locality is exploited to reduce impact on 
throughput. 
 
To maximize global memory throughput, it is therefore important to maximize coalescing by: 
 
 Following the most optimal access patterns based on the Compute Capability of the 
device being used 
 Using data types that meet the size and alignment requirement  
 Padding data in some cases, for example, when accessing a two-dimensional array  
 
5.3.2 Local Memory 
 
Local memory accesses only occur for some automatic variables. Automatic variables that the 
compiler is likely to place in local memory are: 
 Arrays for which it cannot determine that they are indexed with constant quantities 
 Large structures or arrays that would consume too much register space 
 Any variable if the kernel uses more registers than available (this is also known as 
register spilling) 
 
The local memory space resides in device memory, so local memory accesses have same high 
latency and low bandwidth as global memory accesses. Local memory is however organized 
such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are 
therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., 
same index in an array variable, same member in a structure variable). 
 
5.3.3 Shared Memory 
 
Because it is on-chip, shared memory has much higher bandwidth and much lower latency 
than local or global memory. 
To achieve high bandwidth, shared memory is divided into equally-sized memory modules, 
called banks, which can be accessed simultaneously. Any memory read or write request made 
of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, 
yielding an overall bandwidth that is n times as high as the bandwidth of a single module. 
However, if two addresses of a memory request fall in the same memory bank, there is a bank 
conflict and the access has to be serialized. The hardware splits a memory request with bank 
conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a 
factor equal to the number of separate memory requests. If the number of separate memory 
requests is n, the initial memory request is said to cause n-way bank conflicts. 
5.3.4 Constant Memory 
 
The constant memory space resides in device memory and is cached in the constant cache. A 
constant memory fetch costs one memory read from the device memory only on a cache miss, 
otherwise it costs one read from the constant cache. The memory bandwidth is best utilized 
when all instructions that are executed in parallel access the same address of the constant 
memory.
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
                                                                                                         5.3.4.  Constant Memory               
———————————————————————————————————— 
29 
 
5.3.5 Texture Memory 
The texture memory space reside in device memory and is cached in texture cache, so a texture 
fetch costs one memory read from device memory only on a cache miss, otherwise it just costs 
one read from texture cache. The texture cache is optimized for 2D spatial locality, so threads 
of the same warp that read texture addresses that are close together in 2D will achieve best 
performance. Also, it is designed for streaming fetches with a constant latency; a cache hit 
reduces DRAM bandwidth demand but not fetch latency. 
Reading device memory through texture fetching present some benefits that can make it an 
advantageous alternative to reading device memory from global or constant memory: 
 If the memory reads do not follow the access patterns that global or constant memory 
reads must follow to get good performance, higher bandwidth can be achieved 
providing that there is locality in the texture fetches 
 Addressing calculations are performed outside the kernel by dedicated units 
 Packed data may be broadcast to separate variables in a single operation 
 8-bit and 16-bit integer input data may be optionally converted to 32 bit floating-point 
values in the range [0.0, 1.0] or [-1.0, 1.0] 
 
5.4 The CUDA Programming Model 
 
This chapter introduces the main concepts behind the CUDA programming model by 
outlining how they are exposed in C. 
 
CUDA stands for Compute Unified Device Architecture. It is a parallel programming 
paradigm released in 2007 by NVIDIA. It is used to develop software for graphics processors 
and is used to develop a variety of general purpose applications for GPUs that are highly 
parallel in nature and run on hundreds of GPU’s processor cores.  
 
CUDA C extends C by allowing the programmer to define C functions, called kernels, that, 
when called, are executed N times in parallel by N different CUDA threads, as opposed to 
only once like regular C functions. A kernel is defined using the global declaration specifier 
and the number of CUDA threads that execute that kernel for a given kernel call is specified 
using a new <<< … >>> execution configuration syntax (see C Language Extensions). Each 
thread that executes the kernel is given a unique thread ID that is accessible within the kernel 
through the built-in threadIdx variable. 
 
For convenience, threadIdx is a 3-component vector, so that threads can be identified using a 
one-dimensional, two-dimensional, or three-dimensional thread index, forming a one 
dimensional, two-dimensional, or three-dimensional thread block. This provides a natural way 
to invoke computation across the elements in a domain such as a vector, matrix, or volume. 
The index of a thread and its thread ID relate to each other in a straightforward way: For a 
one-dimensional block, they are the same; for a two-dimensional block of size (𝐷𝑥, 𝐷𝑦), the 
thread ID of a thread of index (𝑥, 𝑦) is (𝑥 + 𝑦 ∗ 𝐷𝑥); for a three-dimensional block of size 
(𝐷𝑥, 𝐷𝑦, 𝐷𝑧), the thread ID of a thread of index (𝑥, 𝑦, 𝑧) is (𝑥 + 𝑦 ∗ 𝐷𝑥 + 𝑧 ∗ 𝐷𝑥 ∗ 𝐷𝑦). 
 
There is a limit to the number of threads per block, since all threads of a block are expected 
to reside on the same processor core and must share the limited memory resources of that 
core. On current GPUs, a thread block may contain up to 1024 threads. However, a kernel 
can be executed by multiple equally-shaped thread blocks, so that the total number of threads 
is equal to the number of threads per block times the number of blocks. Blocks are organized 
into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks as 
illustrated by Figure 5.4 given in next page. 
 
The number of thread blocks in a grid is usually dictated by the size of the data being processed 
or the number of processors in the system, which it can greatly exceed. The number of threads 
per block and the number of blocks per grid specified in the <<< ... >>> syntax can be of
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
5.  GPU ARCHITECTURE AND THE CUDA PROGRAMMING MODEL                
———————————————————————————————————— 
30 
 
type int or dim3. Two-dimensional blocks or grids can be specified as in the example above. 
Each block within the grid can be identified by a one-dimensional, two-dimensional, or three-
dimensional index accessible within the kernel through the built-in blockIdx variable. The 
dimension of the thread block is accessible within the kernel through the built-in blockDim 
variable. 
 
A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice. 
The grid is created with enough blocks to have one thread per matrix element as before. For 
simplicity, this example assumes that the number of threads per grid in each dimension is 
evenly divisible by the number of threads per block in that dimension, although that need not 
be the case. 
 
Thread blocks are required to execute independently: It must be possible to execute them in 
any order, in parallel or in series. This independence requirement allows thread blocks to be 
scheduled in any order across any number of cores, enabling programmers to write code that 
scales with the number of cores. 
 
Threads within a block can cooperate by sharing data through some shared memory and by 
synchronizing their execution to coordinate memory accesses. More precisely, one can specify 
synchronization points in the kernel by calling the syncthreads() intrinsic function; 
syncthreads() acts as a barrier at which all threads in the block must wait before any is 
allowed to proceed.  
 
For efficient cooperation, the shared memory is expected to be a low-latency memory near 
each processor core (much like an L1 cache) and syncthreads() is expected to be lightweight. 
 
 
 
Figure 5.4: 2D Grid of Thread Blocks 
 
5.5 NVIDIA® TESLA™ C2075 
 
Based on the NVIDIA Fermi architecture, the TESLA™ C2075 computing processor has 
been engineered from the ground up for High Performance Computing, as is capable of 
reaching 1.03 TFLOPs and 515 GFLOPs peak performance for single and double precision 
floating-point operations respectively. This was a good reason for us to choose this GPU card 
for our implementation because our algorithms are based on double precision arithmetic 
operations. 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
  5.5. NVIDIA® TESLA™ C2075              
———————————————————————————————————— 
31 
 
5.5.1 Engine Specifications 
 
The TESLA™ C2075 has 14 Multiprocessors with 32 CUDA cores each, which means 448 
CUDA cores with 1.15 GHz clock rate per core. It has compute capability 2.0. The warp size 
is 32 and it can support up to 1536 threads per multiprocessor and 1024 threads per block. 
The maximum sizes of each dimension of a block and grid are 1024 x 1024 x 64 and 65535 x 
65535 x 65535 respectively. Also it can support concurrent copy and kernel execution.  
 
5.5.2 Memory Specifications 
 
The total amount of global memory for this device is 6.144GB and with ECC support enabled 
the user available memory is 5.376GB. The total amount of constant memory is 65KB. Each 
block has available 49KB of shared memory and 32768 registers. The memory clock rate is 
1.57GHz. TESLA™ C2075 has available caching. The on-chip memory per multiprocessor is 
used for both L1 and shared memory, and how much of it is dedicated to L1 versus shared 
memory is configurable for each kernel call. Additionally, is has a unified L2 cache for all of 
the processor cores of 786KB.  The maximum texture dimension size for 1D is (65536), for 
2D is (65536, 65535) and for 3D is (2048, 2048, 2048).   
 
Figure 5.5 shows the architecture of Fermi Streaming Multiprocessor. TESLA™ C2075 is 
consists of 14 such SMs. 
 
 
               
Figure 5.5: Fermi SM 
 
  
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 33 
 
Chapter 6  
Improving CMG solver performance 
 
6.1 Introduction 
 
The solve phase of CMG algorithm is an extension of the preconditioned conjugate gradient 
method (PCG), described in subsection 2.4.2. Therefore, its core is the solution of the 
preconditioner solve step  𝑀𝑧𝑖−1 = 𝑟𝑖−1, where M is the Steiner preconditioner. The PCG 
iterations until the solution is satisfactory, in combination with the multigrid nature of CMG, 
lead to many matrix-vector multiplications which are the bottleneck of CMG solver. This 
bottleneck appeared also at the results of the performance profiling we done using Intel® 
VTuneTM Amplifier. In this section we present our effort and experimental results on 
accelerating the CMG solver using an Nvidia GPU. 
 
6.2 System Specifications and Benchmarks 
 
Hardware and software specifications of our system are included in Table 6.1. The host code 
is been compiled with Intel ICC compiler for better performance. For the compilation of our 
implementation which includes CUDA source code, we use the NVCC compiler which makes 
use of the GCC compiler to compile the host code together with the device code. 
 
CPU 6 core Intel(R) Xeon(R) E5645 @2.40GHz 
GPU Nvidia® Tesla™ C2075 
MEMORY 24GB DDR3 
OS Kubuntu 14.10 ( Linux 3.16.0-31 ) 
CUDA CUDA 7.0 
INTEL COMPILER ICC 15.0.2 
GNU COMPILER GCC 4.9.1 
NVIDIA COMPILER NVCC 6.0 
Table 6.1: Test platform specifications 
 
All the power grid benchmarks presented in this section are drawn from real designs, and vary 
over a reasonable range of size and difficulty. Those netlists are generated in Spice format. For 
more information for such Benchmarks we refer the reader to [27] [28]. Table 6.2 shows the  
IBM Power Grid Benchmarks we used for the DC Analysis. 
 
Benchmark #i #n #r #s #v #l 
ibmpg1 10.774 30.638 30.027 14.208 14.308 2 
ibmpg2 37.926 127.238 208.325 1.298 330 5 
ibmpg3 201.054 851.584 1.401.572 461 955 5 
ibmpg4 276.976 953.583 1.560.645 11.682 962 6 
ibmpgnew2 357.930 1.461.039 1.422.830 929.722 930.216 NA 
Table 6.2: IBM Power Grid Benchmarks for DC Analysis
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
             6.2.  System Specifications and Benchmarks               
———————————————————————————————————— 
34 
 
 
         i for current source 
         n for nodes (total number, does not take shorts into account) 
         r for resistors (include shorts) 
         s for shorts (zero value resistors and voltage sources) 
         v for voltage sources (include shorts) 
         l for metal layers 
 
For the MNA analysis of IBM netlists we used a software we had already implemented. This 
software parses the netlist file and creates the corresponding sparse MNA array “A” and right-
hand side vector “ 𝑏”, which will be used later for solving the system 𝐴𝑥 = 𝑏. Table 6.3 shows 
the dimensions and the number of non-zero elements of the MNA arrays corresponding to 
each IBM netlist.  
 
Benchmark Dimensions Non-zeros 
ibmpg1 44.943 × 44.943 147.315 
ibmpg2 127.565 × 127.565 544.545 
ibmpg3 852.536 × 852.536 3.656.107 
ibmpg4 954.542 × 954.542 4.058.866 
ibmpgnew2 1.461.993 × 1.461.993 6.167.130 
Table 6.3: Matrix size and non-zero elements of the MNA arrays 
 
Additionally, our implementation is evaluated on real designs included on a benchmark suite 
for the ISPD Placement contest of 2005/2006 [29]. Table 6.4 presents their dimensions and 
number of non-zero elements. 
 
Benchmark Dimensions Non-zeros 
ad1 210.904 × 210.904 2.112.590 
ad3 450.927 × 450.927 4.191.415 
bb2 534.782 × 534.782 4.407.059 
bb4 2.169.183 × 2.169.183 19.437.167 
nb6 1.248.150 × 1.248.150 11.591.932 
nb7 2.481.272 × 2.481.272  21.370.078 
Table 6.4:  ISPD 2005/2006 Placement Benchmarks 
 
6.3 Implementation and Results 
 
The running time of the CMG setup phase is negligible comparing to the actual MG iteration, 
so in this master thesis we decided to implement and optimize the second one. The reader can 
find more details in [12]. 
 
As is clearly shown in Figure 4.6, the full CMG solve phase algorithm is recursive. Although 
the recursion is supported since Fermi architectures, it is not efficiently implemented yet. Also, 
there is a limit to the recursion levels supported. Therefore, before we ported the entire MG 
iteration-algorithm on the GPU, we had transformed it into an iterative algorithm to avoid 
those constraints and have an improved performance. Figure 6.1 presents the recursive CMG
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
6.  IMPROVING CMG SOLVER PERFORMANCE               
———————————————————————————————————— 
35 
 
implementation describing all the including operations for each hierarchy level. The first call 
to CMG is for level 0 of the hierarchy (𝑥 = 𝐶𝑀𝐺(𝐴0,  𝑏0)), where 𝑥,  𝑏0 are identical to 𝑧, 𝑟 
of the preconditioner-solve step 𝑀𝑧 = 𝑟 respectively and 𝐴0 is the matrix corresponding to 
the coarser hierarchy level. Each hierarchy level calls the lower(child) level for 𝑀𝑎𝑥𝐼𝑡𝑙𝑒𝑣𝑒𝑙 
times until it returns its result to the upper(parent) level which adds a correction to its solution 
vector. The last level’s system is solved using Cholesky factorization because it is dense and 
small enough. 
 
 
 
START
Is the last level YES
xlevel  = ldl_solve(A_chollevel , blevel)
(Cholesky Factorization)
END
(Return xlevel)
NO
xlevel  = 0
iter = 1
iter == 1xlevel -= invDlevel * (Alevel * xlevel  - blevel) YES xlevel = invDlevel * blevel
blevel+1 = RT * (blevel - Alevel * xlevel)
z = CMG(Alevel+1, blevel+1)
xlevel += xlevel + RT  * z
xlevel  -= invDlevel *(Alevel * xlevel - blevel)
iter ++
iter > MaxItlevel
END
(Return xlevel)
YES
NO
NO
 
Figure 6.1: CMG recursive flow 
 
 
In Figure 6.2 we can observe the CMG flow after unrolling the recursion and transforming 
the algorithm into an iterative one. For this implementation, we have added some extra control 
flow to determine if the current level has already called once the lower level (if the recursion 
for this level has started), so we can choose between the corresponding operations. Every time 
a hierarchy level has to return its result to the upper level (return from the recursion), we 
reinitialize the level’s information and we compute the upper level’s solution vector or we 
return the solution to PCG when CMG algorithm is terminated. Also, we have to mention 
that the Cholesky factorization for the finest hierarchy level is implemented more efficiently 
on the CPU side because the system is small and dense.
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
              6.3.  Implementation and Results 
———————————————————————————————————— 
36 
 
 
START
level = 0
Is the last level YES
xlevel = ldl_solve(A_chollevel, blevel)
(Cholesky factorization on CPU)
NO
xlevel  = 0
1st recursive
 call
NOxlevel -= invDlevel * (Alevel * xlevel - blevel) YES
recursionStartedlevel = 1
xlevel = invDlevel * blevel
blevel+1 = RT * (blevel - Alevel * xlevel)
itersLeftlevel --
level ++
recursionStartedlevel
NO
level == 0itersLeftlevel != 0
YES
recursionStartedlevel = 1
itersLeftlevel = MaxItlevel
(Reinitialize level’s information)
END
(Return x0)
YES
NO
xlevel-1 += RT * xlevel-1
xlevel-1 -=  invDlevel-1 * (Alevel-1 * xlevel-1 - blevel-1)
(Update upper level’s solution vector)
level --
YES
NO
 
Figure 6.2: CMG iterative flow 
 
Our implementation ported the entire CG iterative method (Figure 2.5) and the 
preconditioner solve step (Figure 4.6) on the GPU. This eliminates the need for additional 
memory transfers between the host and the GPU and reduces the communication overhead, 
provided that the GPU has sufficient memory to accommodate the algorithm’s working set 
(especially the system matrix A and the preconditioner hierarchy in sparse form). The only 
part of our algorithm that is implemented on the CPU side is the preconditioner’s 
construction. Once the preconditioner hierarchy has been created, the host side is responsible 
to allocate and copy the system matrix A, the right-hand side vector 𝑏, the initial guess 𝑥(0), 
the preconditioner hierarchy and some other helper vectors required by the PCG and the 
CMG algorithms, on the GPU. After some iterations and when the result is satisfactory, the 
solution 𝑥 of the linear system 𝐴𝑥 = 𝑏 is copied back to the CPU side. 
 
We have taken advantage of Intel Math Kernel Library (MKL) [30] for implementing the CPU 
version of the CMG algorithm and CUDA cuSPARSE and cuBLAS libraries [31] [32] for the 
mapping on the GPU. We note that both MKL and CUDA libraries contain implementations 
of BLAS-1, BLAS-2 and BLAS-3 kernels for handling vector operations and sparse matrices, 
all especially optimized for execution on multi-core and GPU architectures respectively. Also, 
it was necessary to implement some simple and optimized GPU kernels that are not part of 
the cuBLAS library. Those operations are vector-vector elementwise multiplication, projection 
operation and projection transpose operation. For those kernels we decided to use linear grids 
and blocks with 256 threads because resulted to the best performance. 
 
Furthermore, we have to mention that in our GPU implementation, although the system 
matrix A is symmetric, we decided to store the full matrix instead of storing the lower/upper 
part. We choose this implementation because the symmetric property does not show up any 
performance gain and cuSPARSE SPMV method is much slower. On the other hand, the 
CMG CPU algorithm stores only the upper part of A, because our experiments showed up a
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
6.  IMPROVING CMG SOLVER PERFORMANCE               
———————————————————————————————————— 
37 
 
slightly better SPMV performance in comparison with the implementation storing the full 
matrix. 
 
The experimental evaluation was made on the workstation described in Table 6.1, using the 
optimization flags that resulted to the lowest execution time on CPU and GPU 
implementation respectively. We have employed double-precision arithmetic for the 
benchmarks, while the iterative solvers were terminated when the solution residual was below 
10−6. 
 
Table 6.5 presents CMG preconditioner-solve step execution time speedups achieved using 
our implementation for the IBM and ISPD benchmarks. We can observe that the GPU MG 
algorithm is up to 10x faster than its CPU implementation. In Table 6.6 we give the runtime 
speedup of the PCG iterative method including the CMG preconditioner-solve step. The 
performance speedup for the hole CMG solve phase is up to 5.71x, and if we exclude the time 
required for the GPU memory copies, we can achieve up to 7.35x execution time speedup. 
 
 
Benchmark 
CMG 
Preconditioner 
Solve Step on 
CPU (sec) 
CMG 
Preconditioner 
Solve Step on 
GPU(sec) 
 
Speedup 
ibmpg1 3,79 1,82 2,08x 
ibmpg2 0,58 0,21 2,76x 
ibmpg3 5,98 1,23 4,86x 
ibmpg4 3,78 0,74 5,10x 
ibmpgnew2 9,60 1,91 5,02x 
ad1 0,42 0,09 4,66x 
ad3 0,99 0,12 8,25x 
bb2 1,06 0,13 8,15x 
bb4 11,49 1,15 10,00x 
nb6 3,41 0,36 9,47x 
nb7 7,55 0,77 9,81x 
Table 6.5: CMG preconditioner-solve step runtime speedup 
 
Benchmark PCG on   
CPU (sec) 
PCG on 
GPU(sec) 
Speedup Speedup( w/o 
memory copies ) 
ibmpg1 4,80 2,40 1,66x 2,00x 
ibmpg2 0,73 0,28 0,96x 2,60x 
ibmpg3 7,52 1,66 3,47x 4,53x 
ibmpg4 4,74 1,04 3,08x 4,56x 
ibmpgnew2 11,61 3,46 2,90x 3,36x 
ad1 0,56 0,10 0,93x 5,60x 
ad3 1,30 0,21 1,97x 6,20x 
bb2 1,40 0,23 1,91x 6,08x 
bb4 14,85 2,02 5,71x 7,35x 
nb6 4,50 0,62 3,81x 7,25x 
nb7 9,82 1,34 5,17x 7,33x 
Table 6.6: PCG runtime speedup 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
  
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 39 
 
Chapter 7  
Conclusion 
 
In conclusion, given the key role of circuit simulation in the design process, there has been a 
significant interest in accelerating the heart of simulation which is the solution of a very large 
system. This master thesis reports our efforts to accelerate the performance of a linear system 
multigrid-like solver called CMG using modern Nvidia GPUs. 
 
Specifically, we focused on the solution phase of CMG and the acceleration of PCG iterative 
method and the full CMG algorithm. We have shown maximum speedups of 5.71x and 10x 
for PCG and CMG preconditioning respectively.  
 
7.1 Future Work 
 
In the future, we plan to extend the research presented in this master thesis towards the 
following directions: 
 
 The SpMV operations including in CMG algorithm are so many and are still the 
bottleneck of our solver. However, this implementation can be extended by using an 
improved SpMV CUDA kernel like segSpMV [4] [3] or LightSpMV [33] which are 
proposed to be superior to cuSPARSE and would give a further acceleration for the 
CMG solver.  
 
 Mapping onto heterogeneous systems: Owing to the diverse nature of parallel 
computing architectures (such as multi-core processors, GPUs, or even Field 
Programmable Gate Arrays (FPGAs)), it is very appealing to have heterogeneous 
systems that comprise dissimilar processors, each one incorporating its own parallel 
capabilities. We plan to investigate efficient ways for mapping the CMG solver on 
heterogeneous architectures using a variety of programming models and languages. 
 
 
 
 
 
 
 
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
  
 
 
  
 
 
 
 
  
 
 
 
 
 
 
 
 
 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
  
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 41 
 
References 
 
[1]  F. N. Najim, Circuit Simulation, Wiley,IEEE, 2010.  
[2]  D. Ntioudis, Development and Optimization of a combinatorial multigrid algorithm 
for large scale circuit simulation, Diploma Thesis, University of Thessaly, 2013.  
[3]  D. Garyfallou, Simulation of large-scale circuits with Steiner node preconditioners 
on parallel architectures, Diploma Thesis, University of Thessaly, 2014.  
[4]  K. He, S. X.-D. Tan, E. Tlelo-Cuautle, H. Wang and H. Tang, A new segmentation-
based GPU-accelerated sparse matrix-vector multiplication, IEEE 57th 
International Midwest Symposium on Circuits and Systems (MWSCAS), 2014.  
[5]  NVIDIA CUDA C Programming Guide.  
https://docs.nvidia.com/cuda/cuda-c-programming-guide/ 
[6]  C. M. U. C. S. D. a. K. Gremban, Combinatorial Preconditioners for Sparse, 
Symmetric, Diagonally Dominant Linear Systems. Research paper. School of 
Computer Science, Carnegie Mellon University, 1996.  
[7]  Y. Saad, Iterative Methods for Sparse Linear Systems. Society for Industrial and 
Applied Mathematics, 2003.  
[8]  T. Davis, CSPARSE: a concise sparse matrix package.  
[9]  T. Davis, Direct Methods for Sparse Linear Systems.Fundamentals of 
Algorithms.Society for Industrial and Applied Mathematics, 2006.  
[10]  R. Barrett, Templates for the Solution of Linear Systems: Building Blocks for 
Iterative Methods. Miscellaneous Titles in Applied Mathematics Series No 43. 
Society for Industrial and Applied Mathematics, 1994.  
[11]  I. Koutis and G. Miler, The Combinatorial multigrid solver, in : Conference 
Talk,March, 2009.  
[12]  I. Koutis, G. L. Miller and D. Tolliver, Combinatorial Preconditioners and 
Multilevel Solvers for problems in computer vision and image processing. 
Computer Vision and Image Understanding, 115(12):1638–1646, 2011.  
[13]  I. Koutis, Matlab implementation of the combinatorial multigrid algorithm  
http://www.cs.cmu.edu/~jkoutis/cmg.html 
[14]  E. G. Boman and B. Hendrickson, Support theory for preconditioning,SIAM J. 
Matrix Anal. Appl. 25 (3) (2003) 694-717.  
[15]  P. M. Vaidya, Solving linear equations with symmetric diagonally dominant 
matrices by constructing good preconditioners. A talk based on this manuscript was 
presented at the IMA Workshop on Graph Theory and Sparse Matrix Computation, 
October 1991.  
[16]  P. G. Doyle and J. L. Snell, Peter G. Doyle and J. Laurie Snell. Random Walks and 
Electric Networks, January 2000.  
[17]  M. Bern, J. R. Gilbert, B. Hendrickson, N. Nguyen and S. Toledo, Support-graph 
preconditioners. SIAM J. Matrix Anal. Appl., 27:930–951, 2005.  
[18]  K. Gremban, Combinatorial Preconditioners for Sparse, Symmetric, Diagonally 
Dominant Linear Systems. PhD thesis, Carnegie Mellon University, Pittsburgh, 
October 1996. CMU CS Tech Report CMU-CS-96-123.  
[19]  I. Koutis and G. L. Miller, Graph partitioning into isolated, high conductance 
clusters: theory, computation and applications to preconditioning. In Proceedings of 
the twentieth annual symposium on Parallelism in algorithms and architectures, 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
 42 
 
SPAA ’08, pages 137–145, New York, NY, USA, 2008. ACM.  
[20]  L. Grady, A lattice-preserving multigrid method for solving the inhomogeneous 
poisson equaequations used in image analysis. ECCV, 5303:252–264, 2008.  
[21]  U. Trottenberg, A. Schuller and C. Oosterlee, Multigrid. Academic Press, 1st 
edition, 2000.  
[22]  I. Koutis, Combinatorial and algebraic tools for optimal multilevel algorithms.PhD 
thesis, Carnegie Mellon University, Pittsburgh, May 2007. CMU CS Tech Report 
CMU-CS-07-131, 2007.  
[23]  D. A. Spielman and S. I. Daitch, Faster approximate lossy generalized flow via 
interior point algorithms. In Proceedings of the 40th Annual ACM Symposium on 
Theory of Computing, May 2008.  
[24]  I. Koutis and G. L. Miller, A linear work, O(n1/6) time parallel algorithm for 
solving planar Laplacians. In Proceedings of the eighteenth annual ACM-SIAM 
symposium on Discrete algorithms, SODA ’07, pages 1002–1011, Philadelphia, 
PA, USA, Society for Industrial and Applied Mathematics, 2007.  
[25]  I. Koutis and G. L. Miller, Approaching optimality for solving, August 2010.  
[26]  I. Koutis, G. L. Miller και R. Peng, Solving sdd linear systems in time 
O(mlognlog(1/ε)), April 2011.  
[27]  S. Nassif, Power Grid Analysis Benchmarks, Asia and South Pacific Design 
Automation Conference, pp 376-381, 2008.  
[28]  IBM Power Grid Benchmarks. 
http://dropzone.tamu.edu/~pli/PGBench/ 
[29]  ISPD 2005/2006 Placement Benchmarks.  
http://archive.sigda.org/ispd2005/contest.htm 
[30]  Intel Math Kernel Library.  
https://software.intel.com/en-us/intel-mkl 
[31]  NVIDIA cuBLAS Library. 
http://docs.nvidia.com/cuda/cublas/ 
[32]  NVIDIA cuSPARSE library. 
http://docs.nvidia.com/cuda/cusparse/ 
[33]  LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-
enabled GPUs, 2015 IEEE 26th International Conference on Application-specific 
Systems, Architectures and Processors (ASAP), 2015.  
[34]  A. Joshi, Topics in optimization and sparse linear systems. PhD thesis, Champaign, 
IL, USA, 1997. UMI Order No. GAX97-17289.  
[35]  D. A. Spielman and S.-H. Teng, Nearly-Linear Time Algorithms for 
Preconditioning and Solving Symmetric, Diagonally Dominant Linear Systems, 
May 2007.  
[36]  Y. Deng, B. Wang and S. Mu, Taming Irregular EDA Applications on GPUs. 
Computer-Aided Design - Digest of Technical Papers, 2009. ICCAD 2009. 
IEEE/ACM International Conference on, pp. 539–546, 2009.  
[37]  N. Bell and M. Garland, Efficient sparse matrix-vector multiplication on CUDA, 
NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec. 2008.  
 
 
 
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 13:09:04 EET - 137.108.70.7
