Singular value computations on the AP1000 array computer by Czezowski, Adam
SINGULAR VALUE COMPUTATIONS 
ON THE 
APlOOO ARRAY COMPUTER 
A THESIS 
SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE 
OF THE AUSTRALIAN NATIONAL UNIVERSITY 
By 
Adam Czezowski 
April, 1995 

1, 
• 
Statement 
I hereby declare that this submission is my own work and that, to the best of my knowledge 
and belief, it contains no material previously published or written by another person nor 
material which to a substantial extent has been accepted for the award of any other 
degree or diploma of a university or other institute of higher learning, except where due 
acknowledgement is made in the text of the thesis. 
f;U~ 
(/ Adam Czezowski 
lll 
( 

Preface 
The topic of the thesis presented herein is computations of singular value decomposition 
on the APlOOO array computer. This topic is part of the joint ANU-Fujitsu research 
project on 'Linear Algebra Research on the APlOOO'. 
This research was undertaken as part-time study at the Computer Science Department 
of the Australian National University from April 1991 to April 1995. The supervision for 
this work was provided by Dr Peter Strazdins; and Professor Richard Brent acted as an 
advisor. 
Aim of this research work 
The primary drive to pursuit this research was the possibility of working on the experimen-
tal, at the time, pa allel computer APlOOO from Fujitsu. Thanks to Dr Peter Strazdins 
who proposed the subject of this research, this interest gained essence and provided me 
with the opportunity to learn about numerical linear algebra. 
The main aim of this research is to provide thorough investigations of porting a singular 
value decomposition algorithm from a single to a multi-processor environment. We would 
like to present the key factors which help to make such a move in the best possible manner. 
We wish to accomplish this task by analysing the performance of the SVD algorithm on 
each level of the parallel memory hierarchy. We also will address all those issues which 
may contribute to the degradation of performance of the algorithm at a particular level 
of memory hierarchy. 
The key issues which we will try to resolve are choice of the algorithm to implement, 
suitability of the APlOOO architecture for the implementation of the chosen algorithm, 
reduction of the floating point operations, best match between the implementation and 
the network topology, communication, cache performance, and register usage. 
V 
Ii 
Publications arising from this study 
Progress of this research was presented at the joint AND-Fujitsu Workshops and the 
Parallel Computing Workshops held in Japan, on an annual basis (91-94) and published 
in the respective proceedings [9, 3, 11]. The preliminary results were also presented at the 
5th Australian Transputer and OCCAM User Group Conference TAPA-92, in Melbourne, 
in November 1992, and were published in the conference proceedings [10]. 
The final results of this thesis were delivered in a poster session at the International 
Conference and Exhibition on High-Performance Computing and Networking HPCN-1994, 
held in Munich , April 1994 and were included in the conference proceedings [8]. 
Also due course of this research two seminars were delivered at the Computer Science 
Department of the Australian National University. 
Reading guide 
Chapter 1 presents the theory of singular value decomposition with its rich historical back-
ground, various SVD serial algorithms, and some scientific and engineering applications 
of the SVD. 
Chapter 2 discusses choices of the suitable algorithm and architecture, introduces the 
APlOOO architecture, and presents the basic block implementation of the Hestenes SVD 
algorithm on the APlOOO. 
Chapter 3 introduces the improvements of the floating point performance by reduc-
ing the number of floating point operations. This chapter also addresses the issues of 
measurement and comparison of the performance of different versions of the algorithm 
implemented. 
Chapter 4 demonstrates the optimisations on all levels of parallel memory hierarchy: 
other processors' memory (communication), cache and register levels. It introduces the 
reuse factors as a general measure of the performance on all parallel memory hierarchy 
levels. 
Chapter 5 addresses the issues of convergence and the accuracy of various implemen-
tations of the SVD algorithm. 
Chapter 6 draws conclusions from this research and proposes new research paths on 
the subject of SVD which emerged during this work. 
Vl 
... 
Acknowledgements 
Firstly, I am very grateful and wish to thank most of all Dr Peter Strazdins for his excellent 
supervision in the course of my study, especially for his high degree of professionalism, 
teaching quality, and his understanding of a part-time student. Also I would like to 
acknowledge his contribution to the most of the material presented in Chapter 3 and 4 of 
this thesis. I also appreciate and wish to thank my advisor Professor Richard Brent for 
the 'light in the darkness of errors propagation' and his support throughout the course of 
my research. 
I especially thank Professor Steve Redman from the Department of Neurophysiology 
of the John Curtin School of Medical Research for his encouragement to undertake this 
study. 
I am also very grateful to my wife, Loan, for her love, support and patience shown 
during this degree. 
Finally, I am much obliged to the Department of Computer Science for the generous 
financial assistance which enabled me to present the results of my research overseas and 
to the departmental staff, programmers and system administrators, for keeping both kaffa 
and APlOOO alive most of the time. 
Vll 

.... 
Abstract 
The increasing popularity of singular value decomposition algorithms, used as a tool in 
many areas of science and engineering, demands a rapid development of their fast and 
reliable implementations. No longer are those implementations bounded to the single 
processor environment since more and more parallel computers are available on the mar-
ket. This situation requires that often software need to be re-implemented on those new 
parallel architectures efficiently. 
In this thesis we show, on the example of a singular value decomposition algorithm, 
how this task of changing the working environments can be accomplished with non-trivial 
gains in performance. We show several optimisation techniques and their impact on 
the algorithm performance on all parallel memory hierarchy levels (register, cache, main 
memory and external processor memory levels). The central principle in all of the opti-
misations presented herein is to increase the number of columns (column-segments) being 
held in each level of the memory hierarchy and therefore increase the data reuse factors. In 
the optimisations for the parallel memory hierarchy the techniques used are, rectangular 
processor configuration, partitioning, and four-column rotation. 
The rectangular processor configuration technique is where the data were mapped onto 
a rectangular network of processors instead of a linear one. This technique improves the 
communication and cache performance such that on average, we reduced the execution 
time by a factor of 2 and, in the case of long column-segments, by a factor of 5. 
The partitioning technique involves rearranging data and the order of computations 
in the cells. This technique increases the cache hit ratio for large matrices. For the 
relatively modest improvements in the cache performance of 2 to 5%, we achieved a 
significant reduction in the execution times of 10 to 20%. 
The four-column rotation technique improves the performance by a better register 
reuse. For the cases of large number of columns stored per processor, this technique gave 
ix 
2 to 10% improvement in execution time over the 'classic', two column rotation. 
Apart from the optimisations on the memory hierarchy levels, several floating point 
optimisations are presented on the algorithm itself which can be applied in any archi-
tecture. The main ideas behind those optimisations are the reduction of the number of 
floating point instructions executed in a unit of time and the balance of the floating point 
operations. This was accomplished by reshaping the relevant parts of the code to use the 
APlOOO processors architecture (SPARC) to its full potential. 
After combining all of the optimisations, we achieved a sustained 60% reduction of 
the execution time which corresponds to the 2.5 fold reduction. In the cases where long 
columns of the input matrix were used, we achieved nearly 5 fold reduction in execution 
time without adversely affecting the accuracy of the singular values and maintaining the 
quadratic convergence of the algorithm. 
The algorithm was implemented on the Fujitsu's APlOOO Array Multiprocessor, but 
all optimisations described can be easily applied to any MIMD architecture with a mesh 
or hypercube topology, and all but one can be applied to register-cache uniprocessors also. 
Despite many changes in the structure of the algorithm we found that the convergence 
was not adversely affected and the accuracy of the orthogonalisation was no worse than 
for the uniprocessor implementation of the noted SVD algorithm. 
X 
u 
Contents 
Statement 
Preface 
Acknowledgements 
Abstract 
1 Introduction 
1.1 Singular value decomposition - theoretical approach 
1.1.1 Orthogonality ..... . . 
1.1.2 Vector and matrix norms 
1.1.3 SVD theorem . .. . . . . 
1.2 Historical background of the SVD . 
1.2.1 The roots of the SVD . 
1.2.2 Bilinear form and SVD 
1.2.3 Integral form and SVD . 
1.3 Uniprocessor algorithms for SVD computations 
1.3.1 Jacobi methods . . 
1.3.2 
1.3.3 
1.3.4 
Hestenes algorithm 
QR based algorithm 
Implicit zero-shift QR based algorithm . 
1.4 Scientific and engineering applications of the SVD 
1.4.1 Least Squares problems 
1.4.2 
1.4.3 
Image analysis 
Cryptoanalysis 
XI 
iii 
V 
vii 
ix 
1 
2 
2 
3 
3 
5 
5 
6 
8 
10 
10 
13 
16 
20 
22 
22 
24 
27 
"' 
2 Parallel Implementation of the SVD 33 
2.1 Introduction . 33 
2.2 General approaches to the linear algebra parallel, high performance com-
puting 34 
2.3 Which algorithm? . 35 
2.4 Which architecture? 37 
2.5 Parallel memory hierarchy 39 
2.6 APlOOO architecture 41 
- 2.6.1 APlOOO - hardware environment 41 
-
2.6.2 APlOOO - software environment 45 
2.6.3 APlOOO - performance 46 
2.7 Hestenes algorithm implementation 48 
2.7.1 Systolic array implementation 49 
2.7.2 Block implementation 53 
2.8 Conclusions 55 
3 Improvements of the Floating Point Performance 57 
3.1 Introduction . 57 
3.2 Measuring and comparing performance . 58 
3.3 Scaled columns 61 
3.3.1 Numerical results . 62 
3.4 Reduced inner product computations 65 
3.4.1 Numerical results . 66 
3.5 Assembly language optimisations 67 
3.5.1 Rotation procedures 67 
3.5.2 Inner product computation procedures 68 
3.5.3 Numerical results . 70 
3.6 Conclusions 70 
-
4 Optimisations for the Memory Hierarchy 73 
4.1 Introduction . 73 
4.1.1 General performance measures - reuse factors 74 
4.1.2 Measuring and comparing performance . 74 
Xll 
.. 
' 
4.2 Optimisations on other processors memory level . 75 
4.2.1 Rectangular cells configuration 75 
4.2.2 Numerical results 77 
4.3 Optimisations on cache level . 80 
4.3.1 Partitioning 80 
4.3.2 Numerical results 82 
4.4 Optimisations on register level 84 
4.4.1 Four-column rotation 84 
4.4.2 Register reuse . 86 
4.4.3 Effect on the convergence 86 
4.4.4 Numerical results 87 
4.5 Conclusions 89 
5 Convergence 93 
5.1 General error propagation issues 93 
5.2 General convergence rate issues 96 
5.3 Numerical results 97 
5.3.1 Initial data distribution 97 
5.3.2 Accuracy of the methods implemented 101 
5.4 Numerical results for convergence rate 102 
5.5 Conclusions 102 
6 Conclusions 105 
xiii 
i. 

' ' 
List of Tables 
2.1 Completion times for global functions (in µs). . . . . . . . . . . . . . . 47 
3.1 Results for the basic, unimproved Hestenes algorithm implementation. 60 
3.2 Results for the scaled implementation of the Hestenes algorithm. . . 63 
3.3 Results for the a-update implementation of the Hestenes algorithm. 67 
3.4 Scaled columns implementation after assembly language modifications. 69 
3.5 The reduced inner product implementation after assembly language modi-
fications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 
3.6 Comparison of efficiencies for the original and improved algorithms. 72 
4.1 Relative performance improvements for the rectangular configuration. 78 
4.2 Communication times in X and Y directions for the rectangular configuration. 78 
4.3 The best results for the rectangular configuration. . . . . . 79 
4.4 Cache performance comparison for the partitioning method. 82 
4.5 The best results for the rectangular configuration with partitioning. 83 
4.6 Register reuse in Q2 and Q4 schemes . . . . . . . . . . . . . . . . . . 86 
4.7 The best results for the four-column implementation without partitioning. 87 
4.8 The best results for the four-column implementation with partitioning. . 88 
4.9 The overall best results for Hestenes SVD algorithm implementation. . . 90 
5.1 Floating point parameters for the IEEE compliant computer and APlOOO. 94 
5.2 Test results for the basic block implementation. . 98 
5.3 Test results for the rectangular implementation. . 99 
5.4 Test results for the four-column implementation. 100 
5.5 Off(A)/N after each sweep for the rectangular implementation. 101 
5.6 Off(A)/N after each sweep for the four-column implementation. 102 
xv 

List of Figures 
1.1 Single processor, Hestenes algorithm implementation ........ . 
1.2 Selected singular value outer product images of Mandrill Baboon .. 
1.3 Selected partial sums of the SVD of Mandrill Baboon images. 
1.4 Linear SVD edge sharpening of Mandrill Baboon image. . . 
1.5 Nonlinear SVD edge sharpening of Mandrill Baboon image. 
1.6 Frequent letters in the Gettysburg Address. . ....... . 
2.1 The average memory access time as a function of block size. 
2.2 APlOOO - photograph. . . . . . 
2.3 APlOOO - system configuration. 
2.4 APlOOO - cell configuration ... 
2.5 APlOOO - pingpong performance. 
2.6 Binary tree and y ...sum performance. 
2.7 Example of inter-processor connections. 
2.8 Systolic array data flow for SVD algorithm. 
2.9 Communication in systolic array implementation. 
2.10 Execution time for the systolic implementation. 
2.11 Data flow in block implementation ...... . 
2.12 Execution time for the block implementation. 
2.13 Efficiency for systolic and block implementation. 
2.14 Execution time for the single processor run .... 
14 
24 
26 
27 
28 
31 
41 
42 
43 
44 
47 
48 
49 
49 
50 
51 
52 
53 
54 
55 
4.1 Distribution of columns for rectangular cells configuration. . 76 
4.2 Main loop for the rectangular configuration and the partitioning method. 81 
xvn 
II 
Chapter 1 
Introduction 
The word algebra, in its original Arabic form (al dzabr, al gabr), was used for the first 
time by Muhammed ibn Musa al Chwarizmi in the ninth century (30]. This word was used 
to describe the operation of transposition of the unknowns from one side of the equation 
to the other. 
Until the end of the nineteenth century, algebra was primarily used for solving equa-
tions. Later on, it developed and transformed into a fully mature, axiomatic science. 
This modern and 'new' algebra chooses as a subject, abstract rules performed on sets of 
any nature. Soon after, new algebraic ideas and concepts, thanks to their universality, 
became an indispensable tool for the whole of mathematics as well as theoretical physics, 
theoretical chemistry and other scientific disciplines. 
Today, most of the pre-nineteenth century algebra is known as linear algebra and 
among its many topics, it deals with general linear systems, matrix analysis, orthogonali-
sation, eigenvalue problem and last, but certainly not lea.st, singular value decomposition. 
Although theoretical linear algebra is well developed, its practical, numerical approach 
still has a lot of room for improvement. We all want that algorithms of linear algebra 
will be faster and more accurate, portable and scalable. New parallel computer architec-
tures leave us with the challenge of porting efficiently existing linear algebra software and 
developing new one. 
In this thesis we show how rich and fruitful can be the task of moving a linear algebra 
algorithm from single processor environment to multiprocessor one. We present the key 
factors which help to realize such a move in the best possible manner. 
To present this transition we chose the singular value decomposition because of its 
1 
ii 
2 CHAPTER 1. INTRODUCTION 
mathematical beauty and numerous applications. Another reason, a very practical one, 
was to evaluate the suitability of Fujitsu's APlOOO Array Processor for numerical algebra 
applications. Since most previous work on the linear algebra on the APlOOO [5] involved 
non-iterative computations, the SVD was a natural choice to be implemented as the 
iterative algorithm. 
In this chapter we describe the subject of singular value decomposition, its very in-
teresting historical background, various uniprocessor SVD algorithms and finally, some 
examples of SVD applications. 
1.1 Singular value decomposition - theoretical approach 
First, we would like to establish some definitions which will be widely used throughout this 
thesis. We introduce definitions of: vector orthogonality, norms, matrix rank, orthogonal 
transformations, eigenvalues and eigenvectors. We pattern our definitions after Golub's 
and Van Loan's book 'Matrix Computation' [21]. 
1.1.1 Orthogonality 
Orthogonality is a key factor in SVD computations, therefore we will define this term 
firstly. 
Definition 1.1 Two vectors Xi and x; in mm are orthogonal if xf x; = O whenever i =/= j. 
Definition 1.2 Two vectors Xi and x; in mm are orthonormal if xf x; = Oij where Oij is 
the Kronecker symbol. 
Definition 1.3 Let X = {xi, ... , xp} be a set of vectors in mm . We say that vectors in 
X are orthogonal if all the pairs of vectors in X are orthogonal. 
Definition 1.4 Let X ={xi, ... , xp} be a set of vectors in mm. We say that vectors in 
X are orthonormal if all the pairs of vectors in X are orthonormal. 
For example in the Euclidean space the set of the unity vectors parallel to the axis of the 
Carthesian coordinate system is orthonormal. 
Definition 1.5 A matrix A E mmxm is orthogonal if AT A= I where I is identity matrix. 
In other words, the column-vectors of A, [iii, ... , iiml, are orthogonal. 
Very often orthonormal matrices are called unitary. 
I 
I 
I 
I 
1, 
1.1. SINGULAR VALUE DECOMPOSITION - THEORETICAL APPROACH 3 
1.1.2 Vector and matrix norms 
The norms are the class of the functions which introduce measure into the vector spaces. 
A function f which projects element x of the vector space V into field n can be called a 
norm on V if f satisfies the following properties: 
f(x)?:_O and f(x)=O<=}x=O 
f(ax) = lalf(x) where a En 
f(x + y) :S f(x) + f(y) 
Definition 1.6 A vector norm llxll on IRn belongs to the class of the p-norms if 
(1.1) 
(1.2) 
The vector norm that is mostly used in this thesis is llill2 = (lx112+· · ·+lxnl2)t = (x7'x)t. 
Similarly we define two most frequently used matrix norms. 
Definition 1. 7 A matrix norm IIAJJ on mmxn belongs to the class of the p-norms if 
def II Ail Ip IIAII = IIAIIP = sup II-II where p ?: 1. 
x/0 X p 
(1.3) 
If p = 2 the above norm is called 2-norm or Euclidean norm. 
Definition 1.8 A matrix norm IIAII on mmxn can be called the Frobenius norm if 
m n 
IIAII ~f IIAIIF =, t1 ~ laiil 2 • (1.4) 
There is a very important class of transformations called orthogonal transformations. 
The main property of these transformations is that they preserve 2-norm and the ma-
trix Frobenius norm. An example of the orthogonal transformation in JR 2 is rotation 
matrix Q2: 
( 
cos </> sin </> ) Q2= . 
- sin </> cos </> 
(1.5) 
1.1.3 SVD theorem 
Below is the description of the singular value decomposition theorem [21, page 71] and 
some of its applications in establishing matrix properties. 
. 
4 CHAPTER 1. INTRODUCTION 
Theorem 1.1.1 (SVD) For rmy real m-by-n matrix A there are orthogonal matrices 
U = [u1, ... 1 Um] and V =[vi, ... 1 Vn] such that 
UTAV=diag(cr1 1 ••• ,crp) where p=min{m,n} and cr1zcr2z···crnzO. 
The CTi are called singular values1 of A and the vectors Ui and Vi are the ith left and ith 
right singular vector respectively. 
Let us define the concepts of subspace, range, null space and rank which are associated 
with an m-by-n matrix A. 
Definition 1.9 The set of all linear combinations of the vectors v1, ... , Vn E mn is called 
subspace and referred to as the span{ v'1, ... , v'n}. 
Definition 1.10 The range of matrix A, range(A), we call set of vectors y E mm such 
that for some X E mn y = Ax. 
Definition 1.11 The null space of matrix A, null(A), we call set of vectors x E mn such 
that Ax= 0. 
Definition 1.12 The rank of the matrix A, rank(A), is the maximal number of its inde-
pendent rows or columns. 
Having these definitions we can show how well SVD can describe properties of a given 
matrix A. If the singular values of the matrix A are cr1z ... CTr > CTr+l = ... = CTp = 0 
where p = min{m, n} then 
rank(A) = r 
null(A) = span{vr+i, ... ,vn} (1.6) 
range(A) = span{u1, ... , ur} 
Moreover, the 2-norm and the Frobenius norm can be nicely expressed in terms of the 
SVD: 
IIAIIJ;. = crf + ... + cr; (1.7) 
1 In the past also known as the principal values. 
h 
I 
., 
.I 
I 
1.2. HISTORICAL BACKGROUND OF THE SVD 5 
1.2 Historical background of the SVD 
There are two reasons why we dedicate this section to the history of the SVD. Firstly, we 
would like to show that the development of the SVD was a long process and many people, 
to whom we would like to pay tribute, contributed a lot to it . Secondly, we believe it is 
much more appropriate to study SVD being aware of its background, as it is said that the 
frame gives a picture dignity. 
The beginning of many of new ideas in mathematics have often their multiple sources 
and if so, it is usually difficult, if not impossible, to clearly establish the people behind 
their discoveries. Although the early history of SVD is not that blurred, there are more 
than five people who can be credited for its creation and development. It is really up to 
us how we interpret the historical facts and in this thesis we are convinced by and follow 
the research of G. W . Stewart [42]. 
1.2.1 The roots of the SVD 
In his paper [42], G. W. Stewart starts the history of SVD by giving an intriguing ob-
servation that most of the classical matrix decompositions were known before widespread 
use of matrices. The latter were expressed in form of determinants, linear systems of 
equations, and especially bilinear and quadratic forms. There is no doubt that the K. F. 
Gauss is the founder of this area. 
In 1829 A.L. Cauchy described the properties of eigenvalues and eigenvectors of a 
symmetric system. Seventeen years later C.G. Jacobi [28] published his famous 2 algorithm 
for diagonalising a symmetric matrix and later, after his death in 1851, his paper on LU 
decomposition was published. 
In 1868 K. Weierstrass, in his 'Theory of the bilinear and quadratic Forms ' , came up 
with the canonical forms for pairs of bilinear functions which today should be identified 
as the generalized eigenvalue problem. 
Now, we are ready to describe two different ways of introducing SVD: the first through 
the bilinear form which is the continuation of the line presented above and the second 
through the integral form which is the domain of integral equations, developed in the first 
decade of our century. 
2 The Jacobi algorithm will be described in Section 1.3 of this Chapter. 
6 CHAPTER 1. INTRODUCTION 
1.2.2 Bilinear form and SVD 
Three names can be mentioned if we talk about an algebraic way of introducing SVD. 
There are Eugenio Beltrami (1835-1899), Camille Jordan (1838-1921) and James Joseph 
Sylvester (1814-1897). According to G.W. Stewart, Beltrami was the first to introduce 
SVD into algebra by publishing his concept in 'Giornale di Matematiche ad Uso degli 
Studenti Delle Universita' in 1873. The purpose of his paper was to introduce bilinear 
forms to the students of Italian universities. Beltrami makes substitutions for i = U[ and 
iJ = Vij (matrices U and V are required to be orthogonal) in the following bilinear form 
( matrix A is real and of order n) 
f(x, Y) = xT AiJ, (1.8) 
so that after 
f ( x, iJ) = f' Sij, (1.9) 
where 
S= UTAV. (1.10) 
If we assume that Sis diagonal (S = E = diag(a1, ... , an)) then by using the fact 
that matrix U and V are orthogonal we can establish the following relations: 
UT A = EVT and AV = UE. (1.11) 
After multiplying the first relation of Equations 1.11 by AT from the right hand side and 
the second relation from the left hand side Beltrami obtains the following equations: 
(1.12) 
From the above we can see that Cli are the roots of the following characteristic polynomials: 
det(AAT - a 2 I)= 0 and det(AT A - a 2 I)= 0. (1.13) 
In order to show that singular values ai are distinct and nonzero Beltrami argues that the 
last two equations are identical because they are polynomials of degree n that have the 
same values at a= Cli and the common value det2(A) at a= 0. Moreover, using the fact 
that O < ii# AiJF = #(AAT)i, Beltrami shows that the a'f are positive. 
At the end of his paper, Beltrami gives the recipe for diagonalising matrix A: (1) find 
roots of the det(AAT - a2I) = 0, (2) determine U from the UT(AAT) = E2UT and (3) 
get V from the uT A = EVT. 
I 
' 
1  
11 
I 
I 
1.2. HISTORICAL BACKGROUND OF THE SVD 7 
This is the sketch of Betrami's path which led him to the singular value decomposition 
for a real, square, nonsingular matrix having distinct singular values. 
In Stewart's paper, Camille Jordan is given the status of a co-discoverer of the singular 
value decomposition. From his paper published a year after Beltrami, we clearly see that 
the work is independent. 
In his paper, Jordan seeks the maximum and minimum of the form represented by 
P=xAy (1.14) 
subject to IJxllF = IIY11F = 1. The maximum is determined by the equation 
(1.15) 
which must be satisfied for all dx and dy that satisfy 
(1.16) 
According to Stewart, the following argument of Jordan is not quite clear and he proposes 
that for some constants u and T we must have 
(1.17) 
Since dx and dy are independent we can write two equations: 
A - - d -TA -T y = ux an x = Ty . (1.18) 
From the first of the above equations, we see that the maximum is 
(1.19) 
and from the second, the maximum is 
( -TA)~ -T-x Y; = TY y = T, (1.20) 
so that u = T. Now, Jordan states that a is determined by the vanishing of the determi-
nant 
D= 
-al A 
AT -al 
Then he shows that this determinant contains only even powers of a. 
(1.21) 
... 
I 
8 CHAPTER 1. INTRODUCTION 
Let a 1 be a root of the equation D = 0, and let i = u and y = v be the solution of 
the 1.18 equations, where lliZIIF = llvllF = 1. Also, let 
U = (iZU) and V = (vV) (1.22) 
be orthogonal, and let 
i = (Jf and y = v§. (1.23) 
Following these substitutions, Jordan lets 
-T ... -
P = x Ay. (1.24) 
After substitutions, P attains its maximum and at the maximum we have 
... - - -T A -T Ay = a1x and x A= a1y , (1.25) 
which implies that 
(1.26) 
Thus with 6 = f 1 and 7/1 = ff 1, P assumes the form 
(1.27) 
where P1 is independent of 6 and 7/1· After applying reduction to Pi, Jordan comes to 
the canonical form 
(1.28) 
Jordan's approach of using a partial solution of the problem is to reduce it to the 
smaller size. By doing so, he avoids the degeneracies that complicate Beltrami's approach. 
Joseph James Sylvester wrote two papers and a footnote on the subject of the singular 
value decomposition. All three were published in 1889. The footnote of his paper, in 
which he describes an iterative algorithm for reducing a quadratic form to diagonal form, 
contains suggestion that the analogous iteration can be used to diagonalise a bilinear form. 
Sylvester's algorithm described in his later paper resembles Beltrami's one. 
1.2.3 Integral form and SVD 
During the first two decades of our century the topic of integral equations played a sig-
nificant role in mathematics. In 1907, Erhard Schmidt, well known from Gram-Schmidt 
11 
I: 
1.2. HISTORICAL BACKGROUND OF THE SVD 9 
orthogonalisation method, introduced the infinite dimensional analogue of SVD and for-
mulated his approximation theorem. He showed that if two functions g and h are contin-
uous and A(s, t) is continuous and unsymmetric kernel on [a, b] x [a, b] then 
lblb 1 lb b a a A(s, t)g(s)h(t)dsdt =~Ai a g(s)ui(s)ds 1 h(t)vi(t)dt, 
t 
(1.29) 
where 
u(s) =Alb A(s, t)v(t)dt and v(t) =Alb A(s, t)u(s)ds (1.30) 
is a pair of adjoint eigenfunctions corresponding to the eigenvalue A. Schmidt says the 
above expression 'corresponds to the canonical decomposition of a bilinear form'. 
In his second contribution, the approximation theorem, Schmidt solves the problem 
of finding the best approximation to matrix A of the form 
k 
A~ Lxtif[ (1.31) 
i=l 
in the sense that 
k 
I/A - L XiYfll =min. (1.32) 
i=l 
He starts by pointing out that if 
(1.33) 
then 
k 
IIA - AkllF = IIAIIF - Lu;. (1.34) 
i=l 
Schmidt shows that if for an arbitrary Xi and Yi the below inequality is fulfilled 
k k 
IIA - L XiWII ~ IIAIIF - Lu;, (1.35) 
i=l i=l 
then Ak is the desired approximation. 
An important application of the approximation theorem is the determination of the 
rank of a matrix in the presence of an error. If A is of rank k and A = A + E, then the 
last n - k singular values of A satisfy 
(1.36) 
so that the defect in rank of A will be manifested in the size of its trailing singular 
values.The above inequality is actually a perturbation theorem for the zero singular values 
of a matrix. 
I 
.. -
10 CHAPTER 1. INTRODUCTION 
Herman Weyl's contribution to the theory of the SVD was to develop a general per-
turbation theory and use it to give an elegant proof of Schmidt's approximation theorem. 
In 1913 Autonne extended the decomposition to complex matrices. Eckart (1936) and 
Young (1939) extended it to rectangular matrices and rediscovered Schmidt's approxima-
tion theorem. 
According to G. W. Stewart, the term 'singular value' most likely came from the 
literature on integral equations. It appeared first in Schmidt's paper but in our modern 
meaning, it was used in 1937 in the publication by Smithes. 
1.3 Uniprocessor algorithms for SVD computations 
There are two main stream methods for computing the singular values. The first one is 
based on plane rotations (Jacobi rotations)3 [28] and was introduced by Kobetliantz [29] 
in 1955 and developed further by Hestenes [26], Forsythe and Henrici [15]. The second 
one is based on the QR algorithm and was suggested by Kublanovskaya [31] in 1961. 
The computational algorithm based on the QR method was introduced into numerical 
analysis by Golub and Kahan [19] in 1965. In 1970 Golub and Reinsch [20] published 
the Algol SVD procedure which has been the basis for the Eispack 2 and Linpack SVD 
subroutines. In 1990, Demmel and Kahan [12] proposed an interesting alternative to the 
SVD algorithm. 
1.3.1 Jacobi methods 
Firstly we would like to present Jacobi diagonalising scheme as it is the core of all non 
QR-based SVD algorithms. The basic principle behind Jacobi methods is to perform a 
sequence of orthogonal updates At- QT AQ after which, each new matrix A is "closer" to 
the diagonal than its predecessor. These methods are iterative and usually there is some 
convergence criteria associated with them. 
In 1846, Jacobi described his idea of reducing a matrix to the diagonal form [28]4 • 
He achieves that by systematic reduction of the quantity 
n n 
of f(A) = LLafj (1.37) 
~ i=l j=l j-::/:.i 
3 Jacobi rotations are no different from Givens rotations (21]. 
4 Jacobi's original paper is one of the earliest references found in the numerical analysis literature (21] . 
11 
1111 
ii 
ii 
1.3. UNIPROCESSOR ALGORITHMS FOR SVD COMPUTATIONS 
To reduce of f(A) he applies rotations of the form 
1 
0 
J(p , q, e) = 
0 
0 
0 
cos(e) 
- sin(8) 
0 
p 
0 
sin(e) 
cos(e) 
0 
q 
0 
0 p 
0 q 
1 
11 
(1.38) 
Jacobi procedure requires choosing index pair (p, q) such that 1::; p < q:::; n , computing 
e such that 
[ 
bpp bpq l [ cos(8) sin(8) l T [ aPP apq l [ cos(8) sin(8) l 
bqp bqq = sin(e) cos(e) aqp aqq sin(e) cos(e) (1.39) 
is diagonal, and overwriting A with B = JT AJ where J = J(p, q, 8). Since the Frobenius 
norm is preserved by Jacobi transformation, we can write 
of f(B) 2 = of f(A) 2 - 2a~. (1.40) 
The above equation shows what it means that A moves closer to diagonal with each Jacobi 
step. 
The sin(8) and cos(8) are computed according to the following 2-by-2 symmetric 
Schur decomposition. To diagonalise 1.39 is to set 
0 = bpq = apq(cos2(8) - sin2(8)) + (app - aqq) cos(e) sin(8). (1.41) 
If apq = 0 then we set (sin(8), cos(8)) = (0 , 1) . Otherwise we define 
a -a ( = qq PP and t = sin(8)/ cos(e) 
2apq (1.42) 
and t = tan(8) solves the following quadratic equation 
t2 + 2(t - 1 = 0. (1.43) 
The two roots of the above equation are represented by 
sign(() 
t = 1(1 + J1 + (2. (1.44) 
Ii 
12 CHAPTER 1. INTRODUCTION 
The sin(0) and cos(0) can be computed from the formulae 
cos(e) = ~ andsin(e) = tcos(e). 
1 + t2 
(1.45) 
Choosing the smaller of the two roots of the equation 1.44 ensures that 101 ~ 7r /4 and 
minimises the difference between B and A in the Frobenius norm sense. 
The above introductory consideration leads us to the classical Jacobi algorithm which 
overwrites a given, symmetric, n-by-n matrix A with matrix D = VT AV where V is 
orthogonal and of f(D) ~ € II A IIF· The c is a tolerance and c > 0. In this classical Jacobi 
algorithm, to maximise the reduction of the quantity of f(A), we choose the pair (p, q) so 
that a~ is maximal. Since lapq l is the largest off-diagonal entry, it follows that 
(1.46) 
where N = n(n -1)/2 (see note below)5 represents all possible choices of pairs (p, q) such 
that p =/- q. From equation 1.39 we get 
off(b)2 ~ (1- ~ )k off(A)2. (1.47) 
By induction, after k Jacobi updates, we obtain 
(1.48) 
which means that the classical Jacobi procedure converges at a linear rate. However, 
Schonhage (1964) and van Kempen (1966) [21] showed that the asymptotic convergence 
of the above algorithm is quadratic. 
The original Jacobi method is inefficient in choosing the optimal pair (p, q). This 
searching process involves O(n2) operations while the Jacobi updates involve only O(n) 
operations. To solve this problem we can step cyclically through all the pairs in row-by-
row fashion. This is called cyclic-by-row Jacobi algorithm. For example, if n= 5 we cycle 
as follows: 
(p, q) = (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), ( 4, 5), (1, 2), ... (1.49) 
The cyclic-by-row Jacobi algorithm also converges quadratically and is considerably faster 
than Jacobi's original algorithm. Another type of Jacobi algorithm is the threshold Jacobi, 
6 It is customary to refer to N Jacobi updates as a sweep(21] . 
I • 
II 
[, 
: 
II 
1.3. UNIPROCESSOR ALGORITHMS FOR SVD COMPUTATIONS 13 
where we skip the annihilation of apq if its modulus is less than some small, sweep-
dependent parameter. We can use Jacobi diagonalisation method to solve the symmetric 
eigenvalue problem which can in principle be used to solve the SVD problem. For example, 
we may compute the eigenvalue decomposition 
VT (AT A)V = diag(af, ... , a~), 
where V = [v1, ... , vn] is orthogonal and O"i satisfies 
0-1 2 · · · 2 O"r > O"r+l = · · · = O"n = 0, 
with r = rank(A). We next calculate the vectors 
Avi 
Ui=- (i=l, ... ,r), 
O"i 
(1.50) 
(1.51) 
(1.52) 
and then determine the vectors Ur+l, ... , Um so that the matrix U = [u1, ... , um] is 
orthogonal. The factorisation UT AV= diag(o-1, ... , an) gives the SVD of A. The above 
shows that theoretically we can compute an SVD of A by an eigenvalue decomposition of 
AT A. Practically this is not a good idea because the explicit formation of AT A can lead to 
well-known numerical difficulty of ill-conditioning if matrix A is nearly dependent. This 
problem was addressed and solved by Hestenes who applied the Jacobi method implicitly. 
1.3.2 Hestenes algorithm 
In 1958 Magnus Hestenes published in his paper the method for obtaining singular values 
via applying Jacobi rotations on one side6 SVD of the matrix A. The Hestenes algorithm 
proceeds as follows. 
We want to generate an orthogonal matrix V such that the transformed matrix AV= 
W has orthogonal columns. Having the matrix W, we normalise the length of each non-
null column to unity i.e. 
W(:,i) 
O"i = I w( :,i) I, u( :,i) = ----;,;- and w = UE . (1.53) 
That is the way we obtain the matrices U and E. An SVD of the matrix A is given by: 
AV= W-rAV = UE-rA = UEVT. (1.54) 
6 This is referred in the literature as 'one-sided' SVD method. 
14 CHAPTER 1. INTRODUCTION 
flag=l 
for k=l:50 \* Beginning of the kth sweep *\ 
if flag=l 
for i=O:n-1 
for j=i+l:n 
flag=O 
(k) _ A(k) A(k) 
O'. ·. - ( ") ( ") ii : ,i :,i 
(k) _ A(k) A(k) 
a .. - ( .) ( .) 33 :,3 :,3 
(k) _ A(k) A(k) 
O'. ·. - ( ") ( ") t3 :,t :,3 
(le) 
if aii ~ € 
a\~l+a(~) 
n JJ 
flag=l 
(le) (le) 
c(k) - ajj -aii 
<:, - (le) 2a .. 
'3 
t(k) - sign(e(le)) 
- 1e<1e>1+.J1+e<1ei2 
cos e(k) = l j sin e(k) = t(k) cos e(k) 
-/1+t<1ei2 
A (k-:f"I) = rotate (A (k) A (k)_ sin e(k) cos e(k)) (:,t) (:,i)' (:,3)' ' 
A(k~I) = rotate (A(k~ A(k)_ sine(k) cose(k)) (:,3) (:,i)' (:,3)' ' 
v;(k.+I) = rotate (V:(k) v:(k~ sin e(k) cos e(k)) (:,t) (:,i)' (:,3)' ' 
V,(k_+I) = rotate (V,(k) V,(k~ sin 9(k) COS 9(k)) {:,i) (:,i)' (:,3)' ' 
end 1f 
end for 
end for 
end if 
end for 
function: A(r+l) = rotate (A(r) A(r) sine(r) cose(r)) (:,p) (:,p)' (:,q)' ' 
X = A(r) cose(r) - A(r) sin e(r) ( :,p) ( :,q) 
y = A(r) sin e(r) + A(r) cose(r) (:,p) (:,q) 
A(r+l) = X (:,p) 
A (r+l) -(:,) - y 
en;I rotate 
Figure 1.1: Single processor, Hestenes algorithm implementation. 
I 
I 
i 
I 
1.3. UNIPROCESSOR ALGORITHMS FOR SVD COMPUTATIONS 15 
To compute the matrix W, Hestenes uses plane rotations represented by matrix Q2 , 
such that when applied to the matrix A it will make columns (A(:,i), A( :, j)) orthogonal. 
Q2 is a plane rotation matrix. At the kth step we have: 
(1.55) 
where: 
km= 3n(n2- 1), Sis the number of sweeps, n is the column dimension of matrix A, 
A(o) = A and W = A(k.n). 
The post-multiplication by Q~k) affects only columns A(:,i) and A(:,j) so 
A(k+l) = Q~k) A(k) reduces to: 
( 
A (k~l) ) ( cos e(~) - sine(~) ) ( A (k~ ) (:,i) - tJ tJ (:,i) 
A (k~l) - sine(~) cos e(~) A (k)_ 
( :,J) tJ tJ ( :,J) 
(1.56) 
We choose rotation angle efj), such a way that the two new columns are orthogonal. 
After Rutishauser [38], we compute dot products: 
(1.57) 
We set e(~) = 0 if a/~) = O·, otherwise we compute: tJ tJ 
(1.58) 
Similarly we update the matrix V: 
(1.59) 
where: 
v(o) = In and V = v(k.n) . (1.60) 
Figure 1.1 presents Hestenes algorithm implementation for uniprocessors in its general 
form. 
II 
-. 
16 CHAPTER 1. INTRODUCTION 
1.3.3 QR based algorithm 
Definition 1.3.1 The QR factorisation of an m-by-n matrix A is given by 
A=QR (1.61) 
where Q E IRmxm is orthogonal and RE IRmxn is upper triangular and m 2'.: n. 
The Golub's and Kahan's algorithm for computing the SVD of a real matrix A= UEVT 
has two phases: 
1. Compute orthogonal matrices P and L such that J = pT AL is in bidiagonal form, 
i.e. has nonzero entries only on its diagonal and first superdiagonal. It is achieved 
by using Householder bidiagonalisation [21, pages 236-238]. 
2. Compute orthogonal matrices G and H such that E = GT J H is diagonal and 
nonnegative. The diagonal entries ai of E are the singular values of J and A. The 
columns of U = PG are the left singular vectors, and the columns of V = LH are 
the right singular vectors. In [19] Golub and Kahan give two methods to accomplish 
this step. They are: 
• Form a 2n-by-2n matrix 
(1.62) 
and compute its eigenvalues. The singular values ai of matrix J are related 
to eigenvalues of i, which are just +ai and -ai. To do that form tridiagonal, 
symmetric matrix T = f(J) and use Wilkinson's bisection method [45] to find 
the eigenvalues of T. 
• Form the tridiagonal matrix JT J. Since JT J is a tridiagonal, hermitian matrix 
there exists a diagonal matrix b. such that matrix K = b.(JT J)b.T is a real, 
symmetric, positive and also tridiagonal. The eigenvalues of matrix K can be 
found by using the Sturm sequence algorithm. 
In his 1968 paper [18], Golub proposed another method which is based on QR algo-
rithm of Francis (16] and was widely adopted in the previously mentioned Eispack 2 and 
Linpack SVD subroutines. The method proceeds as follows . We want to compute SVD 
of matrix A, so 
A= UEVT. (1.63) 
I 
I 
1.3. UNIPROCESSOR ALGORITHMS FOR SVD COMPUTATIONS 17 
We use Householder bidiagonalisation to compute matrix J such that 
J = pT AL. (1.64) 
In order to find singular values of matrix J, we construct SVD of J in the usual form of 
(1.65) 
The singular values of matrix A and J are the same7. Thus, we can write 
EJ = ar J H = ar pr A LH = E . 
'--v--' '-v-' 
(1.66) 
UT V 
Let us write matrix J explicitly. 
I I 
a1 /31 0 0 
0 a2 /32 0 
0 0 0 f3n-1 
(1.67) 
0 0 0 an 
0 }(m- n) X n. \ ) 
In [19] Golub showed that the eigenvalues of the following symmetric and tridiagonal 
matrix K are ± singular values of J . So the task now is to find eigenvalues of matrix K 
of the form 
I \ 
0 
(1.68) 
0 
2n X 2n. 
This task can be accomplished by using Francis 's QR algorithm which proceeds as follows. 
Let K = K 0 • We compute the factorisation Ko = QoRo where Q'{; Qo = I and Ro is 
an upper triangular matrix. After multiplying the matrices in reverse order we get 
(1.69) 
7 Matrix J was constructed from A by orthogonal transformation. 
18 CHAPTER 1. INTRODUCTION 
Now if we treat K1 ,in the sam~ way as the matrix Ko we will obtain sequence of matrices 
defined as follows: 
(1.70) 
We can accelerate the convergence of the QR method by choosing right shift parameter 
s(i): 
(1.71) 
Since the eigenvalues of K always occur in pairs, we will compute the QR decompo-
sition of 
(1.72) 
so that 
Q(i) R(i) = K2 (i) _ S2 (i) J . (1.73) 
In his paper [16], Francis showed that it is not necessary to compute (1.73) explicitly but 
it is possible to perform the shift implicitly. Let N(i)(:, 1) = Q(i)(:, 1) and NT (i)N(~) = I. 
Then the Francis theorem states that if 
ii) T(i+l) is a tridiagonal, 
iii) K(i) is non-singular, 
iv) the subdiagonal elements of T(i+l) are positive, 
it follows that T(i+l) = K(i+l). Now diagonalise matrix K using the quite simple approach 
presented in the next page. 
' 
! 
I 
It 
• 
1.3. UNIPROCESSOR ALGORITHMS FOR SVD COMPUTATIONS 19 
Let us define matrix z~i) 
(p) (p + 1) (p + 2) 
I 
1 
0 
1 
cos ()p 0 sin ()P (p) 
z(i) = p 0 1 0 (p+ 1) (1.74) 
sin ()P 0 -cos ()p (p+ 2) 
1 
0 
1 2n x 2n. 
The cos 81 is chosen so that 
zI i) ( K2 ( iJ - s2 ( iJ I) (:, 1) = o . (1.75) 
Due to the Francis theorem we have 
K (i+i) - T(i+iJ - z(i) z(i)Kz(i) z(i) 
- - 2n-2 · · · 1 1 · · · 2n-2 · (1.76) 
The product of all the orthogonal transformations gives the singular values and yields the 
matrix of orthogonal eigenvectors of K. 
The Golub and Reinsch [20] variation differs from the above algorithm only in the 
way matrix K is transformed to the diagonal. The transformation 1.76 in the Golub and 
Reinsch algorithm looks as follows: 
K (i+lJ = sT (iJsT (il 5T (ilJr.(ilr.(iJ r(.:J n n-1 · · · 2 2 3 • · · n , (1.77) 
where the matrices T(i) are chosen so that the sequence JT (i)J(i) converges to a diagonal 
matrix while the matrices s(i) are chosen so that all J(i) are of the bidiagonal form. So, 
both algorithms differ only in the technique of deriving 5(i) and T(i) . 
Let us summarize this standard algorithm. We apply Francis QR algorithm implicitly 
to JT J. The algorithm computes a sequence of J,: of bidiagonal matrices starting from 
J0 = J. This sequence is obtained by computing from Bi a shift s 2 , which is the smallest 
1, 
I! 
20 CHAPTER 1. INTRODUCTION 
eigenvalue of the bottom 2-by-2 block of J; Ji, Then the algorithm does an implicit QR 
factorisation of the shifted matrix J; Ji - s2 I= QR where Q is orthogonal and R upper 
triangular, from which it computes a bidiagonal Ji+l such /f+1 Ji+l = RQ + s2 I. With i, 
Ji converges to a diagonal matrix with the singular values on the diagonal. 
In (21, page 239] we can find that Golub-Reinsch algorithm requires 4mn2 - 4n3 /3 
flops if only singular values of A are desired and 4m2n + 8mn2 + 9n3 if whole SVD of 
matrix A is needed. 
1.3.4 Implicit zero-shift QR based algorithm 
In the 'standard' (Golub-Reinsch) algorithm, the error of the computed singular values CTi 
of matrix A is no bigger that p(n) ·E · IIAll2 ((21, pages 246-248]), where IIAII = cr1, € is the 
machine precision, and p(n) is a slowly growing function of the dimension n of A. This 
means that the singular values of order of IIAll2 are computed to high relative accuracy. 
However, small singular values can not be generally computed with high relative accuracy. 
Demmel and Kahan in their paper (12] proposed an algorithm which can compute the 
singular values for bidiagonal matrices to the same relative precision as the individual 
matrix entries. Using paraphrase, if all the matrix entries are known to high relative 
accuracy, all the singular values can be computed to high relative accuracy independent 
of their magnitudes. Let us define the relative perturbation of value a. 
Definition 1.3.2 We say 8a is a relative perturbation of a of size at most 1J if l8al s; 7Jlal. 
If A is a matrix then IAI and l8AI denote the matrices of absolute entries of A and 8A. 
The following is the central theorem, upon which Demmel's and Kahan's algorithm is 
based. 
Theorem [12, page 875] 1.3.1 Let J be an n-by-n symmetric tridiagonal matrix with 
zero diagonal and off-diagonal entries b1 , ... , bn-l· Suppose J +8J is identical to J except 
for one off-diagonal entry, which changes to abi from bi, a=/- 0. Let a= max(lal, la-11), 
Let Ai be the eigenvalues of J sorted into decreasing order, and let >.~ be the eigenvalues 
of J + 8J similarly sorted. Then 
>. . I 
~<>. -< a>. · . a _ i _ i (1.78) 
.. 
• 
= 
1.3. UNIPROCESSOR ALGORITHMS FOR SVD COMPUTATIONS 21 
The proof of the above theorem is given in [12, page 876]. 
Next, Demmel and Kahan use the fact that the eigenvalues of J of the form 
(1.79) 
are the± singular values of J and some zeros. (See Section 1.3.3). From the last theorem 
1.3.1 and the above fact they obtained following corollary. 
Corollary [12, page 877] 1.3.1 Let J be an n-by-n bidiagonal matrix and suppose <>]ii+ 
Jii = a2i-1Jii, <>Ji,i+l + Ji,i+l = a2Ji,i+1, O'.j -=p 0. Let a = rrt11 max(IO'.il, lai1 J). Let 
u1 2:: ... 2:: Un be the singular values of J, and let u~ 2:'. ... 2:: u~ be the singular values of 
J + JJ. Then 
(1.80) 
Thus, if for example l-71:::; lail:::; 1+71, then no singular value can change by more than 
a factor of a = (1 - 71) 1- 2n. The above corollary may be contrasted with the following 
classical perturbation bound for singular values of Wielandt-Hoffman. 
Theorem [21, page 429] 1.3.2 Let u1 2:: ... 2:: Un be the singular values of A, and let 
u~ 2:: .. . 2:: u: be the singular ~alues of A+ JA. Then Ju; - uil:::; ll<>AJJ. 
The Demmel's and Kahan's algorithm corresponds to the 'standard' when the shifts= 0 
and therefore is called zero-shift QR algorithm. Their final algorithm is a hybrid of 
the 'standard' QR and implicit zero-shift QR. Standard QR is used when the condition 
number of matrix J, which is defined as the ratio of the largest to the smallest singular 
values, is small. The roundoff errors make small perturbations in the smallest singular 
values of J which, in this case, can be accepted. If the condition number is large, the 
implicit zero-shift QR is used instead. 
The most inner loop of the zero-shift QR algorithm has only four multiplications beside 
two calls to ROT (rotation) procedures. This is in contrast to the standard algorithm 
which in addition to two calls to ROT has twelve multiplications and four additions. Also, 
Demmel's and Kahan's algorithm is more parallelizable than 'standard' one. 8 
Apart from computing singular values to full machine precision, this alternative algo-
rithm is generally faster than the 'standard' one, and ranges from 2.7 times faster to 1.6 
8 The parallel futures of this and the other algorithms are described in the next chapter . 
I 
I 
22 CHAPTER 1. INTRODUCTION 
times slower counting reduction to bidiagonal form (7.7 times faster to 3.4 times slower 
not counting reduction to bidiagonal form) . 
1.4 Scientific and engineering applications of the SVD 
The singular value decomposition not only is important as an algebraic concept which 
serves linear algebra on its own but also has numerous applications in science and engi-
neering. In recent years, the SVD has become a fundamental tool for the formulation of 
new concepts in signal processing such as angles between subspaces, oriented signal-to-
signal ratio, canonical correlation analysis and for the reliable computation of the solutions 
to problems such as total linear least squares or source separation by subspace methods. 
The development of new techniques used for image enhancement and image compression 
has been also based on SVD. 
1.4.1 Least Squares problems 
In the linear algebra, the SVD plays an important role in a number of least squares 
problems. Let us give some examples which will illustrate that. 
El. Let Un be the set of all n x n orthogonal matrices. For an arbitrary n X n real 
matrix A, determine Q E Un such that 
IIA - QIIF :s; IIA - XIIF for any X E Un,. (1.81) 
The solution for this problem comes from Fan and Hoffman [14]. They showed that if 
A= U:EVT, then Q = uvT. (1.82) 
E2. An important generalization of the above problem occurs in factor analysis. For 
arbitrary n X n real matrices A and B, determine Q E Un such that 
IIA - BQIIF :s; IIA - BXIIF for any X E Un,. (1.83) 
It has been shown by Green [22] that if 
BT A= U:EVT, then Q = uvT. (1.84) 
E3. Let M~~n be set of all m X n matrices of rank k. Assume A E M~~n - Determine 
B E M~~n (k :s; r) such that 
IIA-BIIF < IIA - XIIF for all X E M~t , (1.85) 
I' 
.. 
1.4. SCIENTIFIC AND ENGINEERING APPLICATIONS OF THE SVD 23 
In 1936, Eckart and Young [13] showed that if 
A= UEVT, then B = UO.kVT (1.86) 
where 
' 0"1 0 . .. 0 
0 0"2 ... 0 
nk = ... . . . . . . . . . (1.87) 
0 0 . .. O"k 
0 
It is worth to observe that 
(1.88) 
E4. An n X m matrix X is said to be the pseudo-inverse of an m x n matrix A if X sat-
isfies the following four properties: AXA= A, XAX = X, (AXf = AX, (XAf = XA. 
We denote the pseudo-inverse by A+ . We wish to determine A+ numerically. In [36] 
Penrose shows that A+ can always be determined and is unique. One can verify that A+ 
is given by 
(1.89) 
where 
I 
' ...1.. 0 0 ... 
0-1 
0 _!__ 0 
0-2 
.. . 
A= . . . . . . ... . . . (1.90) 
0 0 ...1.. .. . 
U}c 
0 nxm. 
In the 70's a number of the algorithms have been proposed for computing the pseudo-
inverse of a matrix. These algorithms usually depend upon a knowledge of the rank 
of the matrix or upon some suitably chosen parameter. For example, if one uses the 
above method to compute the pseudo-inverse, then after having computed the singular 
value decomposition numerically it is necessary to determine which of the singular values 
are zero by testing against some tolerance. The above examples show only a few SVD 
applications in linear algebra . 
24 CHAPTER 1. INTRODUCTION 
Figure 1.2: Selected singular value outer product images of Mandrill Baboon. 
(i) Original (ii) (i11v'f) (iii) (i15vg) (iv) (i131vl1) 
1.4.2 Image analysis 
The example below was taken from a paper of H.C. Andrews and C.L. Patterson 
published in the IEEE Transactions on Acoustics, Speech, and Signal Processing [2]. 
Through the development of the theory of linear filtering, SVD has its permanent 
place in signal processing along with such a fertile tool as FFT. The methods used by 
Andrews and Patterson are simple extensions of the theory of linear filtering. We already 
know that any matrix [G] can be represented in a space defined by orthonormal matrices 
[U] and [V] as 
[G] = [U][A]1f2[Vf. (1.91) 
From SVD of matrix [G] we have that if 
(1.92) 
and 
(1.93) 
I 
1.4. SCIENTIFIC AND ENGINEERING APPLICATIONS OF THE SVD 25 
where the terms iii and Vi are the column-vectors of [U] and [V], respectively, then 
ifI' 1 
-Y 
[G] = [ii1ii2, .. . , iln][A]1l2 V2 (1.94) 
if1' 
n 
where 
-;.. 1/2 1 0 0 0 0 0 0 0 
0 0 0 -112 ,\1 
[A]1/2 = + + .... (1.95) 
0 0 0 0 
From the above, the expansion of the matrix [G] can be represented in the vector outer 
product notation as 
R 
[GJ = I: x:12aii!f. (1.96) 
i=l 
The upper limit of the sum R is the rank of matrix [G] . If [G] is nonsingular, R = n. This 
outer product notation can be viewed as a summation of R matrices of rank one, each 
weighted by the square root of the respective singular value -;..f2• After this introduction 
Andrews and Patterson merge both SVD and the matrix representation of the image. Any 
image can be defined as an array of the pixels where each value of the array represents 
optical intensity of the pixel and is usually a positive number. Let us refer to this matrix 
as matrix [G]. From SVD, an image represented by [G] may be decomposed into a sum of 
rank one matrices (images) which can be called eigenimages (ii/if{). Figure 1.2 illustrates 
the technique on a 128 X 128 image known as the 'Mandrill baboon'. The figure presents, 
absolute-value versions of the first, sixth, and thirty-first outer product matrices obtained 
from the decomposition of the original. 
The fact that a computer provides the singular values which are not zero but are quite 
small (i.e. 10-11) can be used to reduce storage space for the image. If an approximate 
representation of the matrix [G] is formed by truncation, then 
K 
[GK]= L XJ 12uivT (1.97) 
i=l 
26 CHAPTER 1. INTRODUCTION 
(i) 
Figure 1.3: Selected partial sums of the SVD of Mandrill Baboon 
images. (i) Original (ii) G6 (iii) Gn (iv) G31 
and the mean-square truncation error becomes 
R 
JJ[G] - [GK]JJ 2 = ~ Ai, 
i=K 
(1.98) 
If we choose K < R which still provides a good representation of [G], the storage require-
ments drop from n2 to K(2n+ 1) computer words . The truncation error is minimised by 
the monotonic ordering of the singular values. In Figure 1.3, the selected partial sums 
of SVD are presented. The image enhancement can be provided by the introduction of 
various weight functions applied to the eigenimages. The motivation for such a procedure 
might be the fact that the basic images perfectly match [G], hence enhancing certain 
eigenimages will emphasize specific structure inherent to [G] alone. Andrews and Patter-
son present two particular filtering methods. The first method is a linear weighting where 
the linear function is applied with various slope coefficients. 
(1.99) 
Figure 1.4 presents the effect of the linear edge sharpening applied to the image of 'Man-
drill Baboon' for different slope coefficients. The second method is a nonlinear weighting 
1.4. SCIENTIFIC AND ENGINEERING APPLICATIONS OF THE SVD 
Figure 1.4: Linear SVD edge sharpening. (i) Original (ii) Slope = 1.0 
(iii) Slope = 3.0 (iv) Slope = 1000 
such that 
27 
(1.100) 
This method is used to enhance the higher zero-crossing (frequency) aspects of the image 
and can be used as both 'low-pass' (a > 1) and 'high-pass' (a < 1) nonlinear filters. 
In Figure 1.5, the nonlinear edge sharpening method is presented for various weights 
a. We presented here only examples of the image enhancement techniques which are 
simple extensions of the theory of linear filtering. In their paper, Andrews and Patterson 
continued by describing more advanced techniques of image restoration also using SVD. 
1.4.3 Cryptoanalysis 
This research was done by C. Moler and D. Morrison from the department of computer 
science of the University of New Mexico and is described in (35]. 
Cryptoanalysis is the task of decoding cryptograms - coded messages. Before describ-
ing the usefulness of SVD in this very often difficult task we have to present the subject 
of cryptonalysis in more detail. 
We all know that texts written in modern languages consist of sentences which are 
28 CHAPTER 1. INTRODUCTION 
Figure 1.5: Nonlinear SVD edge sharpening. (i) Original a = 1.0 
(ii) a = 2.0 (iii) a = 0.5 (iv) a = 0.0 
built from words. Morphologically, words are built from letters but phonetically, words 
are built from vowels and consonants. Many of these languages have the property that 
vowels are frequently followed by consonants, and consonants are very often followed by 
vowels. We say a text is a 'vowel-follows-consonant' text (vfc) if the proportion of vowels 
following vowels is less than the proportion of vowels following consonants, that is if: 
number of vowel - vowel pairs number of consonant - vowel pairs 
____ .c__ ____ _;;_~< . 
number of vowels number of consonants 
(1.101) 
Various languages have different 'vfc ratios'. The English and Russian languages have 
high vfc ratio. German and Spanish have to be classified as nearly non-vfc languages. 
The SVD is used in the so-called cryptoanalyst's problem. One way of coding messages 
is to use simple substitution where individual letters of the message have been replaced 
by a permutation of themselves. One of the procedures which may lead to breaking 
the code is to count the occurrences of individual letters in the message and compare the 
frequencies with the known frequencies in a typical uncoded text. However, this technique 
. 
• 
:--::::::- ---= - - - - .. 
1.4. SCIENTIFIC AND ENGINEERING APPLICATIONS OF THE SVD 29 
is applicable only to a fairly long messages . Another technique used is partitioning of the 
alphabet of the cryptogram into two subsets, representing the vowels and the consonants. 
Such a partition must satisfy the above vfc rule if we expect it to be plausible. This 
technique is called cryptoanalyst's problem. 
If n denotes the number of letters in the alphabet then for any text, the diagram 
frequency matrix is the nxn array A with elements aii denoting the number of occurrences 
of the ith letter followed by the jth letter. Blanks and punctuation are ignored and the 
first letter of the text is assumed to follow the last letter. In general, the matrix is not 
symmetric, but for each i 
L aii = L a;i = Ji - the number of occurrences of the ith letter. (1.102) 
i i 
Let us define two column vectors, v and c as: 
Vi 1 if the i th letter is a vowel, 0 otherwise, 
Ci 1 if the i th letter is a consonant, 0 otherwise. 
Now, the vfc rule can be expressed in terms of A, v and c 
ifI' Av t1' Av 
vTA(v+c) < c'I'A(v+c)· (1.103) 
After cross-multiplying and cancelling the common term, we get 
(1.104) 
In algebra terms, the cryptoanalyst's problem can be formulated as: Given the matrix A, 
find a partitioning v and c so that the above inequality holds. 
As we saw in the previous example, the matrix A can be expressed as a sum of rank 
one matrices of the form 
(1.105) 
If we take rank one approximation of matrix A, i.e. A ~ A1 = u1 xii/[, the first left 
and right singular vectors tend to be approximately equal and they reflect the frequencies 
of the letters in the text. 
In order to take into account the correlation between pairs of letters in the text we 
have to use rank two approximations of the form 
(1.106) 
1· 
1: 
1, 
II 
I I 
,, 
30 CHAPTER 1. INTRODUCTION 
In their pa.per, Moler a.nd Morrison proposed the following partition of the alpha.bet: 
{ 1 if Xi2 > 0 a.nd Yi2 < 0 v · ~ 0 otherwise, 
{ 1 if Xi2 < 0 and Yi2 > 0 Ci 0 otherwise, 
{ 1 if sign ( Xi2) = sign (Yi2) ni 0 otherwise. 
(1.107) 
The la.st category - the 'neuter' letters - are the ones that ca.n not be classified a.s either 
vowels or consonants. 
The solution for the cryptoa.na.lyst's problem using SVD is presented below in the form 
of theorem [35]. 
Theorem 1.4.1 Let A= A 2 be a nonnegative rank 2 matrix expressed by {1.106} and let 
vectors v and c be defined by {1.107}, then the vfc rule defined by {1.104} is satisfied. 
In Figure 1.6, we present a.n example from Moler's and Morrison's pa.per obtained for 
Lincoln's Gettysburg Address. The coordinates of all the letters are obtained from the 
second right a.nd left singular vectors of the matrix representing Gettysburg Address. The 
sign patterns identify vowels a.nd consonants. 
Moler a.nd Morrison conclude that "the second singular vectors in the singular value 
decomposition of the diagram frequency matrix provide the cryptoa.na.lyst with a helpful 
and surprisingly reliable wa.y to classify the letters in a. cryptogram a.s vowels or consonants, 
if the encoding algorithm is a. simple substitution or k-alpha.betic substitution, with k 
known, a.nd if the text is vfc text." 
1.4. SCIENTIFIC AND ENGINEERING APPLICATIONS OF THE SVD 31 
E 
H 
A I 
0 
QK B w u Yp PGM 
L C 
s V 
D 
R 
N T 
Figure 1.6: Frequent letters in the Gettysburg Address. 
I 
.1 
I 
I 
Chapter 2 
Parallel Implementation of the 
SVD 
2.1 Introduction 
In this chapter we will show the process of moving one of the SVD algorithms, presented 
in the previous chapter, from a single processor environment to a multiprocessor one. 
To make such a move in the best possible manner we need to evaluate the suitability of 
each SVD algorithm for the parallel environment as well as to address the issue of the 
best architecture choice. Since the first parallel computer emerged in early seventies (the 
Illiac IV from Carnegie - Mellon University), the situation in the parallel computers field 
has changed dramatically. Nowadays many different parallel architectures are available 
with different network topologies. Because we have to make a choice in both algorithm 
and architecture, we expect that some sort of compromise will need to be made. Apart 
from the choice of architecture, we need to address the issue of the best fitted network 
topology, since the fast and reliable communication will play a significant role in SVD 
parallel implementation. 
Another part of this chapter will cover the area of parallel memory hierarchy. This 
is a very important issue, because during this research we established that, to achieve 
significant performance gains in parallel implementation of the SVD algorithm we need 
a careful examination of the implementation on all of the levels of the parallel memory 
hierarchy. The in-depth investigation of the performance of the chosen algorithm on each 
level of the parallel memory hierarchy will be presented in the next chapters. 
33 
- -
II 
II 
11 
11 
: 
I 
·--
34 CHAPTER 2. PARALLEL IMPLEMENTATION OF THE SVD I 
Also we will present a description of the APlOOO hardware and software environment 
but only to the extent which is relevant to the subject of this thesis. The APlOOO perfor-
mance will also be addressed. 
This chapter will as well cover the basic SVD algorithm parallel implementations 
realised in the early stages of this research project. The first implementation - the systolic 
array, was based on the work of Brent and Luk [4] . The second - the block implementation, 
is the one we use throughout this thesis and should be viewed as a starting point for further 
performance improvements. 
2.2 General approaches to the linear algebra parallel, high 
performance computing 
In this section we would like to mention some of the techniques which are used in high 
performance linear algebra computations. Many of these techniques are used in the rela-
tively simple linear algebra computations of the BLAS (Basic Linear Algebra Subroutines), 
which form the computational core of the extensive linear algebra library LAPACK [1] . 
Those techniques are general enough to be considered as likely useful optimisations 
for SVD computations. Without getting in too much detail the optimisations techniques 
used in above mentioned tools are: 
• Scaled columns which balances floating point operations. The most known algorithm 
where this technique is used is Fast Givens Rotations. 
• Reduced inner product computations [37] . This technique can be used also in classic 
Jacobi eigenvalue algorithm. 
• Matrix factorisation and reduction routines (as in LAPACK), using memory hierar-
chy optimizations such as re-organizing the computation so that can be expressed 
in 'blocks' of 'level 3' sub-computations, where the size of 'block' reflects the size of 
the level ( eg. register or cache) of the memory hierarchy. 
• Matching processors configuration shape to the shape of output matrix used partic-
ularly in matrix multiply. This technique can provide better cache utilisation and 
can sometimes effectively reduce communication load. 
2.3. WHICH ALGORITHM? 35 
A more thorough discussion of such optimisations used in matrix factorisation and 
BLAS, as implemented on the APlOOO can be found in [5]. We expect that some of the 
techniques mentioned will be used by us later and therefore will be described in detail in 
the appropriate chapters. 
2.3 Which algorithm? 
In this section we would like to establish which SVD algorithm presented in previous 
chapter will be the most suitable for the parallel environment. The choice of the algorithm 
is easier since we have only two candidates which differ in the way they compute SVD. 
These two mainstream algorithms are QR based and Jacobi's rotation based. The QR 
based algorithms of Golub - Kahan and Golub - Reinsch1 outperform Hestenes algorithm 
in the single processor environment. In regard to the fl.op count, two sweeps of Jacobi 
based method are roughly equivalent to a complete QR reduction to a diagonal form 
with accumulation of transformations. This is the reason why the QR based methods 
superseded the one-sided orthogonalisation method given by Hestenes in 1958. To answer 
the question of this section we need to analyse the computational steps of each algorithm 
and see how well they fit into the parallel environment. Before we do that let us review 
some definitions and the basic law of porting code from single processor computer to multi 
- processor systems. 
Definition 2.3.1 The speedup is the ratio of two program execution times. If r1 is the 
execution time of a program running on one processor, and rp the execution time running 
on P processors, then the speedup is: 
S=~. 
Tp 
(2.1) 
Speedup is usually discussed as a function of the number of processors, but it is also a 
function of problem size. 
Definition 2.3.2 The efficiency is a measure of hardware use, equal to the ratio of 
speedup achieved on P processors to P. If the time taken by a program to execute on 
one processor is r 1 , and the time on P processors is rp, the efficiency 7J is: 
(2.2) 
1 In this chapter we call both algorithms Golub - Kahan - Reinsch SVD algorithm. 
II 
11 
II 
36 CHAPTER 2. PARALLEL IMPLEMENTATION OF THE SVD 
This basic law of parallel processing is called Amdahl's Law and was formulated by 
Gene Amdahl in 1967. This law states that if a, 0 ~ a ~ l, is the proportion of a 
calculation that can be parallelised, and 1- a is the proportion that is done in serial, then 
the speedup that can be achieved with P processors is: 
and if P ~ oo then 
p 
S= ' 
a+P(l-a) 
1 S<--. 1- a 
(2.3) 
(2.4) 
Thus, for example, no matter how many processors are used, if 5 percent of a calculation 
must be done serially, the maximum speedup possible is 20. Amdahl's Law can be for-
mulated differently if one assumes that a is a function of problem size n . In such a case 
Amdahl's Law can be stated that, for fixed size problems parallel computer can only give 
bounded speedup. 
Now, having in mind Amdahl's Law, we can say that the fact that QR based methods 
have lower flop count relative to Jacobi's methods should not be the only factor which has 
to be taken under consideration in our decision making process. The Amdal's Law states 
clearly, that we should look how much of the code of both methods can be efficiently 
parallelised. In order to do so let us examine the computational steps of each algorithm 
from parallelisation view point. 
Let us look at the possibility of porting QR based methods onto a parallel computer. 
From the first chapter, we know that Golub - Kahan - Reinsch algorithm has two distinc-
tive steps: (1) Householder bidiagonalisation and (2) computing eigenvalues of tridiagonal, 
symmetric matrix using the QR algorithm. Although the first step, Householder bidiag-
onalisation, is inherently parallel, the effective vector length decreases at each step (21, 
page 237], causing inefficiencies. Generally at each step, number of processors involved in 
Householder reduction has to be different, thus causing problems in load balancing2 • The 
parallelisation of the second step is much more difficult. The effort of writing parallel QR 
algorithm for symmetric tridiagonal matrices was undertaken by Sameh and Kuck in 1977 
(39]. Their algorithm is capable of processing tridiagonal matrix of size n in O(log2 n) 
steps with O(n) processors per iteration and no square roots. This results in a speedup 
of O(n/log2 n) and efficiency of 0(1/log2 n). The algorithm fits very well only to the 
2 Inefficiency in load balancing can be reduced by cyclic data distribution. 
I 
I 
I 
2.4. WHICH ARCHITECTURE? 37 
tree topology which is fairly uncommon and otherwise, has to be mapped onto an other 
topology, which may cause inefficiencies in communication. Also, as the authors suggest, 
their method may be numerically unstable. Apart from that, any QR based parallel algo-
rithm requires expensive communication between processing elements due to the uneven 
partition of the matrix. Therefore, we can conclude that QR based algorithms are rather 
difficult to port onto the parallel environment. 
Now, let us look at the possibility of porting Jacobi's rotation based methods onto 
a parallel computer. In this type of algorithm, we have to perform Jacobi rotations on 
all pairs of columns of the matrix per sweep. This can be done independently with any 
arbitrary data distribution scheme. Also the generation of all the pairs can be performed 
in any sequence3 • Because the Jacobi like methods for SVD ~omputations are inherently 
parallel, according to the Amdahl's Law, we can expect a quite considerable speedup 
achieved. Also, since we have freedom of data distribution pattern, we can adjust the 
amount of communication for the best computation to communication ratio. The fact 
that the Jacobi's rotation based methods are much simpler than QR based should also 
contribute to a better and more flexible parallel implementation. Another argument for 
the Jacobi algorithm is that it requires only nearest neighbour communication. 
We think that the above arguments are convincing enough to choose Jacobi's rotation 
based algorithms for porting onto the parallel environment. For the reasons that were 
described in the previous chapter (see end of Section 1.3.1), we choose to implement 
Hestenes SVD algorithm. 
2.4 Which architecture? 
Another question that needs to be answered is which architecture is the best for the 
particular algorithm. To address the issue of architecture choice, let us consider only 
three but most widely used parallel computer models: systolic arrays, SIMD and MIMD. 
Before we describe how well or how badly Hestenes algorithm fits any of these three 
architectures it will be worth-while to think what, in terms of hardware is required to 
realise a parallel version of Hestenes algorithm. 
Firstly, after analysing the algorithm we see that the input array can be divided 
into blocks and each block can be processed in a separate cell (PE) which has its own 
3 For discussion of mathematical equivalence of various parallel Jacobi orderings see [33] . 
11 
~-
38 CHAPTER 2. PARALLEL IMPLEMENTATION OF THE SVD 
memory. Secondly, the colurr.ns of each block can be exchanged with the columns of the 
block stored in the adjacent cell in some cyclic fashion. Thus, there is no need for a 
long-range communication but an efficient nearest neighbour communication will fit the 
communication requirements of the algorithm perfectly. Thirdly, since we can not expect 
all of the columns in each processor to converge (become orthogonal) at the same time 
we will prefer rather the asynchronous way of computations. Fourthly, we would like to 
be able to maximise the computation to communication ratio . Therefore the architecture 
which allows only a fixed block size is undesirable. Keeping the above requirements in 
mind, we will choose an architecture to fit best the Hestenes algorithm. Naturally, it is 
highly desirable that the algorithm will be implemented onto a commercially available 
and widely used parallel computer. 
Let us start from the systolic arrays. A systolic array is a regular array of simple 
processing elements with a nearest-neighbour interconnection pattern. The concept of 
systolic arrays was developed by H.T. Kung in 1978 (32]. The main purpose of the design 
of systolic array was to create a simple, regular, cost-effective and well-balanced (balanced 
computation with 1/0) architecture. For the real time processing, systolic arrays offer the 
best combination of characteristics for utilising VLSI/VHSIC technology (41]. In (4], 
R.P. Brent and F.T. Luk clearly show that systolic array is a very good architecture for 
implementing Hestenes algorithm. The main draw-back of that implementation is that 
the number of columns n of the input matrix A which can be processed is hardware 
dependent. Also systolic arrays can not be categorised as a widely used parallel computer 
even though it is a very good architecture for real time signal processing where SVD is 
used extensively (6]. 
The Single Instruction, Multiple Data computers (SIMD), can execute the same pro-
gram in the fully synchronous fashion on the multiple sets of data. This type of archi-
tecture meets nearly all of our requirements. Unfortunately the third requirement is not 
fulfilled. Lack of asynchronous mode in SIMD will cause inefficiences and unnecessary 
delays in block processing which will be deepen as the algorithm is closer to convergence. 
The Multiple Instruction, Multiple Data computers (MIMD), can execute multiple 
instances of a program in both synchronous or asynchronous fashion on the multiple sets 
of data. Clearly all of our requirements are fulfilled by this parallel computer model. The 
synchronisation is needed only once per sweep just before we exchange blocks between 
2.5. PARALLEL MEMORY HIERARCHY 39 
the cells and other than that we can leave execution of cell code in asynchronous mode. 
All of the above led us to choose the APlOOO parallel computer as the hardware plat-
form for this research. The APlOOO is a highly parallel MIMD computer designed by the 
Fujitsu Laboratories Ltd. It has an orthogonally-connected, torus mesh topology which 
suits perfectly the nearest neighbour communication required by Hestenes algorithm. An 
interesting discussion on why meshes are particularly good choice for massively parallel 
architectures in general is given in [43]. 
Another reason for using the APlOOO was to evaluate its suitability for numerical 
algebra applications as a part of collaborative research project between The Australian 
National University and Fujitsu Laboratories Ltd. 
2.5 Parallel memory hierarchy 
The concept of the memory hierarchy is as old as von Neumann's idea of constructing the 
first computer. Let us quote: 
"Ideally one would desire an indefinitely large memory capacity such that 
any particular ... word would be immediately available .... We are ... forced to 
recognise the possibility of constructing a hierarchy of memories, each of which 
has greater capacity than the preceding but which is less quickly accessible." 
A. W. Burks, H. H. Goldstine, and J. von 
Neumann, Preliminary Discussion of the 
Logical Design of an Electronic Comput-
ing Instrument {1946). 
A memory hierarchy is a natural reaction to temporal and spatial locality as well as 
to the technology. The principle of the locality and the guideline that smaller hardware is 
faster yields the concept of a hierarchy based on different speeds and sizes. Since slower 
memory is cheaper, a memory hierarchy is organised into several levels - each smaller, 
faster and more expensive per byte than the level below. 
Let us define some general terms which are applicable to all memory hierarchies. 
A memory hierarchy normally consists of many levels, but it is managed between two 
adjacent levels at a time. The upper level - the one closer to the processor - is smaller and 
I 
I 
40 CHAPTER 2. PARALLEL IMPLEMENTATION OF THE SVD 
faster than the lower level. The minimum unit of information that can be either present 
or not present in the two-level hierarchy is called block. The size of a block may be either 
fixed or variable. If it is fixed, the memory size is a multiple of that block size. 
Success or failure of an access to the upper level is designated as a hit or a miss: A 
hit is a memory access found in the upper level, while a miss means it is not found in 
that level. Hit rate, or hit ratio is the fraction of memory accesses found in the upper 
level. This is usually represented as a percentage. Miss rate (1 - hit rate) is the fraction 
of memory accesses not found in the upper level. 
Since performance is the major reason for having a memory hierarchy, the speed of 
hits and misses is important. Hit time is the time to access the upper level of the memory 
hierarchy, which includes the time to determine whether the access is a hit or a miss. Miss 
penalty is the time to replace a block in the upper level with the corresponding block from 
the lower level, plus the time to deliver this block to the CPU. The miss penalty is further 
divided into two components: access time - the time to access the first word of a block on 
a miss, transfer time - the additional time to transfer the remaining words in the block. 
The goal of a memory hierarchy is to reduce execution time not miss rate, hence a 
good measure of memory hierarchy performance is the average time to access memory: 
I Average memory access time= Hit time + Miss rate* Miss penalty I 
The relationship between average memory access time and block size is shown in 
Figure 2.1 In this thesis we recognise the following parallel memory hierarchy levels: 
CPU registers, cache memory, main memory and other processors' memory. One can 
argue that other processors memory can be classified as another memory hierarchy level. 
We believe that if we look upon the MIMD architecture as a one, homogenous system with 
many processing elements, each of which having its own memory then our classification 
is substantiated. 
In the chapters to follow we will show and analyse SVD algorithm performance on all 
parallel memory hierarchy levels4 • 
4 From now on, under the term memory hierarchy level, we understand parallel memory hierarchy level. 
2.6. APlOOO ARCHITECTURE 
Average 
access 
time 
Block size 
Figure 2.1: The average memory access time as a function of block size. 
2.6 APlOOO architecture 
41 
In this section we will present the APlOOO architecture and its implementation as well 
as its software environment. Also we will discuss the APlOOO performance figures as 
supplied by the manufacturer. The descript ion of the APlOOO presented herein is partial 
and addresses only relevant issues to this thesis. 
The APlOOO is a highly parallel MIMD computer developed at Fujitsu Laboratories 
Ltd. Fujitsu Laboratories ' first project in parallel computers in 1983 dealt with cellular 
array processors, including CAP-256 developed in 1987. The APlOOO was developed in 
1990 and has an architecture different from that of the CAP-256. The APlOOO system went 
into commercialisation in 1992 as a platform for parallel processing research. Figure 2.2 
shows a photograph of a system with 256 cells. 
2.6.1 APlOOO - hardware environment 
The following design concepts were incorporated in the APlOOO system [27): 
• High throughput, low latency communication network 
42 CHAPTER 2. PARALLEL IMPLEMENTATION OF THE SVD 
Figure 2.2: APlOOO - photograph (256 cells). 
• Fa.st message handling 
• Efficient data distribution and collection 
• Fast barrier synchronisation 
To realise these concepts, Fujitsu Laboratories developed a scheme called structured 
channel routing5 on a 2D torus topology network. Figure 2.3 shows the APlOOO system 
configuration. The processor elements, called cells, are connected by three independent 
communication networks. These are the torus network (T-net) for point-to-point commu-
nication between cells, the broadcast network (B-net) for 1-to-N communication in data 
distribution and collection and the synchronisation network (S-net) for barrier synchroni-
sation. 
The B-net connects all cells and the host computer and is used for broadcasting, data 
distribution and data collection. We use it for distribution of matrix A from the host to 
the cells. The B-net has a data transfer rate of 50 MB/s. 
5 This is a modified version of the wormhole routing. 
2.6. APlOOO ARCHITECTURE 43 
Figure 2.3: APlOOO - system configuration. 
The T-net uses a two dimensional torus topology network. Each port of the T-net has 
a 16-bit data path, a few control signals and a data transfer rate of 25 MB/s. We use it 
for column exchange between cells. 
All cells and the host computer are also connected by the S-net. The S-net has a tree 
topology and is used for barrier synchronisation. We use it for sweeps synchronisation to 
be sure that all of the sweeps are completed before columns are exchanged. A Sun-4/330 
workstation acts as the host computer. The host interface consists of a VME-bus interface, 
B-net interface, and 32 MB of local memory. The maximum APlOOO configuration has 
1024 cells. Figure 2.4 shows the cell hardware configuration. Each cell has an integer 
unit (IU), floating point unit (FPU), a message controller (MSC), a routing controller 
(RTC), B-net interface (BIF) and 16 MB of dynamic memory. The IU, FPU (25 MHz, 
SPARC chip) and 128 KB of cache memory are connected to the MSC. During normal 
operation, the MSC works as a direct-mapped cache memory controller with copy-back 
policy. The line size is 16 bytes. The MSC, BIF, and DRAMC in each cell are connected 
via the LBUS, a 32-bit synchronous bus. Each cell has an external LBUS connector. This 
enables the installation of various hardware options such as a high-speed 1/0 interface, 
disk interface, vector processor and additional memory. The MSC services three types of 
communication functions: line sending, buffer receiving and stride DMA. 
II 
44 CHAPTER 2. PARALLEL IMPLEMENTATION OF THE SVD 
B-net 50 Mbyte/s 
-
I BIF S-net 
--
LBUS 
SPARC 
IU + FPU 
DRAM 
-
I 
controller 
~ ' I MSC ,, II :i 
1, - - I 
DRAM II I: Cache 1, 
16 Mbyte 1, 
II Memory 
-
..___ 
~ j RTC \ _ 25 Mbyte/s 
-
, I -
/ -
Figure 2.4: APlOOO - cell configuration. 
Line sending - The line sending function sends a cache line message in a manner 
similar to a cache flush. When a CPU initiates a line sending command, the MSC checks 
the tag memory to see if the data is in the cache or in main memory. It simultaneously 
checks the output network device status to see if the device can accept the data. If all of 
the devices are ready and there is data in the cache, the data is read and written to the 
Write buffer (HIT). If there is no data in the cache, the DMA to one cache line is invoked 
and the data is read from main memory and written to the device (MISS). The CPU is 
blocked until the status has been checked and data has been moved from the cache to 
Write buffer or DMA is initiated. If the devices are busy, data access exception is reported 
to the CPU and a trap occurs. This trap enables the CPU to retry line sending and avoid 
deadlock. 
Buffer receiving - The buffer receiving corresponds to any asynchronous data trans-
fer request using the ring buffer in main memory. Hardware monitors the ring buffer for 
I 
I 
I 
,, 
I 
2.6. APlOOO ARCHITECTURE 45 
overflow, notifies the CPU of such an error by an interrupt or stops transferring until data 
reading produces a receive area in the buffer. The accessed area is released and returned 
to the MSC as a receive buffer by modifying the register. Buffer receiving accommodates 
three registers: BASE, write pointer (WTP) and read pointer (RDP). The buffer size can 
be between 64 KB and 512 KB. A received message is not written to the cache memory 
directly. This is because it may overwrite cached data which is being used. 
Stride DMA - The stride DMA function assembles the regularly dispersed data items 
from memory into a message and transfers this message to the RTC or BIF. The function 
also regularly stores in memory the messages received from the RTC or BIF. The areas 
of the memory which ought to be transferred are specified using the Addr, Size, Hskip, 
Vskip, Vent and Hcnt parameters. 
2.6.2 APlOOO - software environment 
This section describes a parallel programming model and the software configuration of 
the APlOOO. Parallel programs for the APlOOO are developed in the following sequence: 
• Parallel algorithms are translated into parallel programs based on the APlOOO par-
allel programming interface. 
• Programs are executed, debugged, and evaluated on the workstation using the 
APlOOO parallel software simulator, CASIM. 
• The application is executed, debugged, and evaluated under the APlOOO operating 
system. 
The APlOOO provides a computation model that is focused on task and message com-
munication. A task is a basic unit which can be executed in parallel, can memorise its 
own internal states and can act according to received messages. Tasks cooperate to solve 
problems by exchanging messages. Application programs can be written in C or For-
tran. Parallel processing functions, for example, communication and synchronisation, can 
be represented by the APlOOO parallel library, which supports various types of message 
communication for easy implementation of data parallel algorithms. 
A user program consists of a process on the host computer and tasks in each processor. 
In the host part of the parallel program cells are configured and initialised, tasks and data 
11 
I 
1 
1 
I 
I 
! 
i 
I 
I 
i 
46 CHAPTER 2. PARALLEL IMPLEMENTATION OF THE SVD 
are distributed to the cells. CAREN is a server which initialises the APlOOO hardware, 
sets the execution environment, creates tasks, and controls message transfer. Processes 
such as a user host program, debugger and runtime monitor, communicate with the server 
process using a UNIX pipe. 
2.6.3 APlOOO - performance 
In this section we present performance evaluation of the basic functions using user-level 
libraries. To evaluate performance, the physical configuration of APlOOO was used ( e.g. 
2-D torus). All data presented herein were taken from [40]. The performance of the 
following functions was evaluated: communication latency, global functions and barrier 
synchronisation. 
Communication latency - to determine the communication latency between cells, 
the time taken to send messages of various lengths from a master cell to a slave cell and 
back was measured. This test is known as the pingpong benchmark. During this test, the 
other cells are idle and have no effect on the communicating cells. Half of the time in this 
test is used to evaluate the overall latency, which includes latency in the network and the 
message handling overhead. Figure 2.5 shows the message handling time versus message 
size in bytes. In this figure, Regularl shows the timing of ordinary Direct Memory Access 
(DMA) transfer functions Lasend( ) and crecv( ). XYl and XY24 show the results for 
an XY-transfer distance of 1 and 24, respectively. The differences in the times for the 
various inter-cell distances concur with the hardware specification of 160 ns/unit-distance. 
The XY-transfer setup time is very small because a great deal of overhead is eliminated. 
Total throughput is about 5.5 MB/s. When xy_crecv() (referred to as CXYl) is used, a 
throughput of about 20 MB/sis obtained because xy_crecv( ) eliminates copying at the 
reception. LS1 shows the speed-up that results when message handling is switched from 
ordinary DMA to XY-transfer by specifying the LSEND option at runtime. 
Global functions - Global functions, for example, xy_damax() and xy_dsum( ), are 
used to compute the maximum values and summations of all data for all cells. These 
functions are implemented using binary tree communication. The return values for all 
cells are the same. xy_dsum( ) computes an 8-byte double-precision real summation for 
all data in all cells. xy _damax( ) finds the absolute maximum value in double-precision 
real value of data in all cells. Table 2.1 shows the completion times for global functions. 
2.6. APlOOO ARCHITECTURE 47 
240.0 r---.--,---,------i------,--,---.---,----,--~-~ 
• 
200.0 
160.0 
,...._ 
"' ::I
'-" 
<1.) 120.0 ~ 
80.0 
40.0 
100.0 200.0 300.0 400.0 500.0 
Message size (bytes) 
Figure 2.5: APlOOO - pingpong performance. 
Cells 4x4 8x8 16 X 16 
x_damax/y _damax 30 40 52 
xy_damax 58 78 104 
x_dsum/y _dsum 23 30 39 
xy_dsum 44 59 79 
Table 2.1: Completion times for global functions (in µs). 
Although xy _dsum( ) and xy _damax( ) calculate only one element, some applications 
perform calculation of a vector, for example, summation, and find the maximum value. 
The performance results for vector calculation are shown in Figure 2.6. In this figure, we 
can see that the y _dsum and btree performances intersect at the vector lengths of 3 and 4. 
y _dsum( ) is faster because the message length is limited to 16 bytes (including headers), 
and only one library call is needed. Since the xy ....send( ) function appends a 32-byte 
header to the message, the amount of data to be exchanged when the vector length is 4 
equals that for the y _dsum. The results show that the main contribution to the efficient 
implementation from y _dsum is the incorporation of specially formated messages. 
Barrier synchronisation - Ordinary DMPP uses message passing for synchroni-
sation. This method, however, is very slow and the data transfer and synchronisation 
48 
,....._ 
Cl) 
::s 
'-' 
Cl) 
e 
~ 
600.0 
500.0 
400.0 
300.0 
200.0 
100.0 
CHAPTER 2. PARALLEL IMPLEMENTATION OF THE SVD 
D---{] - - - 0: 
/ 
/ 
/ 
32 y_dsum 
16 y_dsum _,. _,. 
/ 
/ 
/ 
_,.O 
/ 
/ 
_,.O 
_,. _,. O 32 btree 
0/ - - 0- - -
/ -0- - - {]-
/ {]--
- - .-0"- -
_ {] -.,./ 
/ ____;a----.r---
0.0 L._ _ ___,_ _ ___,_ _ --1-----'------'---- ____J-- ----'-- -___L_------'---------' 
0.0 2.0 4.0 6.0 8.0 10.0 
Vector length (bytes) 
Figure 2.6: Binary tree and y ....sum performance. 
messages can be mixed, causing confusion . The APlOOO incorporates a private network 
for synchronisation to solve the above problems. The synchronisation time is 10.6 µs and 
does not vary with the number of cells. For synchronisation with message passing, the 
synchronisation time is t = 33 · log(p) + 6.8 µs, where pis the number of cells. 
The FPU performance is 8.3 MFLOPS/cell single precision and 5.5 MFLOPS/cell 
double precision. 
2. 7 Hestenes algorithm implementation 
In this section we present two implementations of Hestenes SVD algorithm on APlOOO 
which were developed at the early stages of this research project. The first, systolic 
array implementation was inspired by R.P. Brent and F.T. Luk [4]. The second, block 
implementation is the natural extension of the first and should be regarded as a basis for 
all of the performance improvements presented in this thesis. 
.. 
2. 7. HESTENES ALGORITHM IMPLEMENTATION 
OUT_R1 IN_L2 OUT_R2 IN_L9 OUT_R9 IN_L10 I PROCESSOR! 
-
PROCESSOR 
-
_ .1PROC:SSOR PROCESSOR) -~ L1 1 R1 2 ••• L10 10 R10 L2 R2 - L9 R9 r 1- \. 
IN_R1 OUT_L2 IN_R2 OUT_L9 IN_R9 OUT_L10 
Figure 2.7: Example of inter-processor connections for n=20. 
if Lk < Rk then process (Lk, Rk) else process (Rk, Lk)i 
if k = 1 then OUT Rk := Rk 
else if k < n/2 then OUT Rk := Lk; 
if k > 1 then OUT Lk :=Rk; 
wait 
if k < n/2 then Rk := IN Rk else Rk := Lk; 
if k > 1 then Lk :=IN Lk; 
Figure 2.8 : Systolic array data flow for SVD algorithm. 
2.7.1 Systolic array implementation 
49 
The amenability of Hestenes algorithm for a parallel environment was clearly shown in 
the paper of R.P. Brent and F .T. Luk [4]. They proposed systolic arrays as the hardware 
platform for the parallel realisation of Hestenes algorithm. For m-by-n matrix, their 
implementation requires O(mn) time and O(n) processors6 • In the cited paper, they 
show how O(n) linearly-connected processors can generate all column pairs (i,j) in O(n) 
steps. 
Let us see how a systolic array can realise Hestenes algorithm. We assume that n 
is even. (If n is odd we can always add a 'zero' vector). Let us store two columns of 
matrix A per processor . In this case n/2 processors Pi, ... , Pn;2 are used, where only 
neighbouring processors Pk and Pk+l can communicate with each other. Figure 2.7 shows 
the configuration of the processors for n = 20. Processor Pk has two buffers, the left 
Lk where the first of the two column-vector is stored and the right Rk where the second 
is placed. Inter-processor communication is realised through inputs IN Lk, IN Rk and 
outputs OUT Lk and OUT Rk. Initially Lk = 2k - 1 and Rk = 2k. At each time step, 
processor Pk executes the procedure presented in Figure 2.8. In this procedure, 'process' 
6 Compare this with O(mn2 ) for Golub-Kahan-Reinsch algorithm . 
50 CHAPTER 2. PARALLEL IMPLEMENTATION OF THE SVD 
STEP 1 
STEP2 
STEP3 
0 0 0 0 0 0 
8 8 8 8 8 8 
Figure 2.9: Communication in systolic array implementation for n=20. 
means to perform operations on pair ( i, j) as required . Data flow and communication 
for n = 20 of the above program are shown in Figure 2.9. Each possible pair ( i, j) is 
processed exactly once during a cycle of n - 1 consecutive steps. We can see that n/2 
processors can perform a full sweep of the Hes ten es algorithm in n - 1 steps of time O ( m) 
each, so in total time O(mn). The initialisation process (storing column-segments in the 
local processor memory) can also be performed in time O ( mn). 
Suppose that S sweeps are required to orthogonalise all columns to full machine accu-
racy. To compute SVD the total time required for systolic array of n/2 linearly connected 
processors is O(mnS). The serial Hestenes algorithm requires, for the same task, a time 
O(mn2S). 
ln order to realise the systolic array implementation on the APlOOO, we need to create 
two programs: first, the host program which runs on the front-end computer and second, 
the cell program which runs on the processors of APlOOO in parallel. The host program 
is responsible for proper task creation, data fetching and collecting output from the cells. 
The cell program receives data from the host, performs necessary calculations, exchanges 
data with other cells and sends the results to the host. In the host program, we set 
2. 7. HESTENES ALGORITHM IMPLEMENTATION 51 
15.0 r----.----,--------.----.---.-----.----~---
Systolic Array Implementation: 0.0015x-0.167 
10.0 
,...... 
V, 
'-' 
~ 
s 
~ 
5.0 
0.0 ~--~--~~----'---------l.----L-----L----'------' 
0.0 2000.0 4000.0 6000.0 8000.0 
Matrix size (K Bytes) 
Figure 2.10: Execution time per sweep for the systolic implementation. 
up cells in one-dimension and then fetch random data to matrix A and scatter them to 
particular cells. After the orthogonalisation of matrix A in the cells is finished we gather 
data from all the cells. In the host we reconstruct output matrix W and the new matrix 
V which holds accumulated rotations performed on A through all the sweeps. The host 
program receives as well other parameters measured in the cells as execution time, fl.op 
count, communication time and others. 
Our cell program consists of two procedures. Procedure SVD(i,j) performs orthogonal-
isation of two columns. Procedure SEND() performs inter-cell communication, exchanging 
left and right columns between cells . 
Figure 2.10 shows execution time per sweep as a function of size of data. Since the 
number of processors available on the APlOOO was 128 we were limited in the number of 
the columns of the input matrix A to n = 256. We decided that incrementing the number 
of columns by 10 is sufficient enough to check the behaviour of this implementation and 
draw the execution time per one sweep as a function of the initial matrix size. All the 
data were stored as doubles, i.e. 8 bytes per number. Each point in the Figure 2.10 is an 
average of 5 test runs. The standard deviation of these five runs was of order 0.001 and 
52 CHAPTER 2. PARALLEL IMPLEMENTATION OF THE SVD 
P1 P2 P3 P4 P5 
SWEEP1 
SWEEP2 
SWEEP3 
SWEEP7 
SWEEPS 
SWEEP9 
SWEEP 10 
AL AR 
Figure 2.11: Data flow in block implementation for P=5 and n=20. 
therefore it is not present on the mentioned figure. The linearity of the execution time 
O(mn) is evident and the regression correlation factor was 0.9999. 
Although the systolic array implementation is efficient as a hardware implementation 
(e.g. purposely build systolic array engine for SVD computations) it has major drawbacks 
as a MIMD implementation. It restricts the number of columns of the input matrix to the 
doubled number of processing elements available, has low computation to communication 
ratio (per processor in one sweep~ 6), large communication overheads and very low cache 
reuse factor. All of the mentioned deficiencies are due to the small number of columns 
stored per processor. The next implementation, block implementation, diminishes the 
basic restriction of the systolic array implementation. 
2. 7. HESTENES ALGORITHM IMPLEMENTATION 53 
Block Implementation: 0.0016x-0.163 
10.0 
,-._ 
en 
'-' 
Q.) 
s 
~ 
5.0 
2000.0 4000.0 6000.0 8000.0 
Matrix size (K Bytes) 
Figure 2.12: Execution time per sweep for the block implementation. 
2. 7.2 Block implementation 
In order to improve the performance of floating point computations and communication, 
we adopted the systolic algorithm to operate on pairs of blocks of n/2P columns (rather 
than single columns), perform calculations on them and exchange larger messages between 
processors. By doing that we gain much needed flexibility in accommodating columns in 
the particular processors and retain the simple communication pattern of systolic array 
implementation. 
In this implementation, two blocks of columns (rather than two columns) are stored 
in each processor. Let us distinguish these two blocks by calling the first left (AL) and the 
second right (AR)· In the complete sweep, the order of operation is to rotate every column 
from the left block with every column from the right block in each cell, and then exchange 
the 2P-1 times left and right blocks between processors according to Brent-Luk ordering. 
At the last step of the complete sweep, we rotate columns within blocks themselves. This 
is done once per sweep. The number of columns per processor has to be even but that 
is the only restriction for the block implementation. The block implementation scheme is 
presented in Figure 2.11. 
I 
I 
: 
I 
54 CHAPTER 2. PARALLEL IMPLEMENTATION OF THE SVD 
I I I 
0.560 - -
Systolic Array Implementation 
•••••• 
• •••••••••• • • • 4 • • • 0.540 - • 
-
;:,... 
u 
s:: 
• 
<I.) 
·u 
li= 
Ill 0.520 - -Block Implementation 
•••••••• • • • • • • • • • • • • • . ~~ 
• 0.500 -
• 
0.480 I I I 
0.0 2000.0 4000.0 6000.0 8000.0 
Matrix size (K Bytes) 
Figure 2.13: Efficiency for systolic and block implementation. 
To compare both implementations, we simulated a systolic array implementation 
within a block implementation by storing one column per block. The result is shown 
in Figure 2.12. The block implementation is marginally slower. This can be explained by 
the fact that in the block implementation we use generalized, thus less efficient code. 
Also to present the hardware utilisation by these two programs we computed the 
efficiency as defined earlier in this chapter. The results are presented in Figure 2.13. The 
maximum in both functions is probably due to the communication overheads for small 
matrix sizes. The execution times per one sweep for a single processor run are presented 
in Figure 2.14 and they well agree with execution times of 0( mn2) for the serial Hestenes 
algorithm. 
As mentioned at the beginning of this section, the block implementation is our starting 
point, a template for all the performance improvements presented herein. All the time 
measurements and other factors will be compared with this implementation to show the 
degree of performance improvement achieved. More detailed discussion on this subject 
will be presented in the next chapter. 
2.8. CONCLUSIONS 
.,...... 
"' '-' 
Q.) 
.§ 
E-< 
1000.0 ,---.-------,------,----~----,-------,-----r---~ 
800.0 
600.0 
400.0 
200.0 
• 
• 
• • 
• • 
• 
• 
• 
• 
• 
• 
• 
• 
• 
• 
• 
• • • • 0.0 '----4~........,---=--------'-------1....---------1----'-----...L__------'---__J 
0.0 2000.0 4000.0 6000.0 8000.0 
Matrix size (K Bytes) 
Figure 2.14: Execution time per sweep for the single processor run. 
2.8 Conclusions 
55 
In this chapter we emphasised that the process of parallelising linear algebra algorithms, 
SVD algorithms in particular, can not be automated yet and care has to be taken to 
match the algorithm with the architecture if the performance improvements are desired. 
We also showed that, according to the Amdahl's Law, in the process of parallelisation, 
the overall flop count in the algorithm is not that important. For the best performance 
it should not take lead over the parallelisability ( easiness of the code to be as parallel as 
possible) of the algorithm. 
The choice of the APlOOO parallel computer as a hardware platform for this research 
project was accomplished after careful examination of the possible advantages and disad-
vantages of such a choice. Although the APlOOO FPU - the SPARC chip is not a state 
of the art of the todays microprocessor technology, the overall APlOOO architecture has 
a very modern design and provides excellent environment for parallel implementation of 
Hestenes algorithm. 
At the end we showed how to avoid the basic restriction of the systolic array imple-
mentation by storing blocks of columns in each processor and then follow the Brent-Luk 
ordering to cycle through all the possible pairs of columns-segments. 
I 
j 
Chapter 3 
Improvements of the Floating 
Point Performance 
3.1 Introduction 
We start this chapter by explaining what we understand by 'Improvements of the Flaa.ting 
Point Performance'. In the single processor environment, a.ny change of the number of 
floating point operations executed in the unit of time or any change of the order in 
which these operations a.re executed that leads to the better performance of a particular 
numerical algorithm ca.n be regarded a.s a.n improvement of the floating point performance. 
Although this is also true for the multiprocessor environment, in the parallel computer 
we can as well alter geometrical shape or distribution of the data in space which can also 
lead to the improved performance. 
In this chapter we describe the performance improvements of Hestenes algorithm which 
were obtained by decreasing the number of floating point operations executed in the unit 
of time a.nd by reshaping parts of the code to use SPARC architecture to its full potential. 
All these changes decrease the program execution times on each processor and ca.n be 
regarded as improvements on a local, single processor scale. 
We will show how, by the introduction of the sea.ling factors, we balance the multi-
plication to addition ratio in the function rotate (see Figure 1.1) and therefore utilise 
the SPARC ability to execute simultaneously faddd (floating point a.dd double) and fmuld 
(floating point multiply double) instructions. We will also show how we can reduce explicit 
inner product computations by computing them once a.nd then performing only updates. 
57 
,, 
I 
I 
I 
58 CHAPTER 3. IMPROVEMENTS OF THE FLOATING POINT PERFORMANCE 
Also a number of assembly language optimisations will be presented which utilise 
the SPARC architecture more efficiently than the commercial C compiler. All but one 
assembly procedure was written in order to realise the concept of balanced multiplication 
to addition ratio. 
3.2 Measuring and comparing performance 
If we want to know how good the performance of the numerical algorithm is, we have to 
to measure both the execution time and the accuracy of the results. Let us examine in 
this chapter only the first parameter, i.e. the execution time. We leave the discussion of 
the accuracy measurements to Chapter 5. 
In the parallel environment, the execution time measured consists of the time spent 
on performing floating point operations and the time spent on communication. In order 
to minimise the error of floating point performance measurements, we should saturate the 
processors with data, so that the time spent on communication compared with the time 
spent on computations will be negligible. In case of the Hestenes SVD algorithm, on later 
sweeps, the amount of computation is reduced, which may further affect the accuracy of 
the execution time measurements. 
The possible solution to the above difficulty is to run the Hestenes algorithm for one 
sweep only. This results in the reduction of the execution time and solves the problem of 
diminishing amount of computations at later sweeps1 . On the other hand, the disadvan-
tage of one sweep runs is that we can not see if our improvements affect the convergence 
rate. In the worst case, it is possible that any gain in execution time per sweep due to the 
improvement can be lost due to the greater number of sweeps needed for the algorithm 
to converge. Because of the above reason, we choose to show in the result table, the total 
execution time and the number of sweeps that were needed for a particular matrix size to 
converge. 
In this research, we tried to find the balance between what is feasible in terms of time 
availability on the APlOOO and the high computation to communication ratio. The input 
matrix sizes were chosen to accommodate both the square and rectangular shapes and to 
provide the universal test platform for all the improvements presented in this thesis. The 
chosen dimensions of the matrices and the number of processors used also allow the size 
1 The average number of sweeps for the matrices used in this research project is 13. 
1, 
3.2. MEASURING AND COMPARING PERFORMANCE 59 
of the data stored per processor to be represented by an integer. 
We used the APlOOO's dgettime() function to measure the execution time. The stan-
dard deviation of the measured execution time of 50 trials of the test program (:::::: 1.0 s 
duration) was a- = 0.00562 s. With great confidence, we can say that the accuracy of 
the time measurements achieved is ±0.02 s. We always used the same input matrix A in 
order to render the t ime measurements from the input data2• 
Apart from the execution time and number of sweeps a good indicator of the effi-
ciency 3 of the code is the MFLOPS rating. In order to measure the MFLOPS rating for 
the Hestenes algorithm, we compute the number of significant floating point operations 
executed in one full sweep for the program (inner product computations and rotations 
procedures) and divide them by the time in which they were executed. 
We need to resolve the question of comparing the MFLOPS rating between various 
versions of the Hestenes algorithm implementations which, due to the changes introduced, 
have different number of floating point operations. 
The original Hestenes algorithm implementation - block implementation - has, in the 
inner product procedure, 6m flops (floating point operations), and in the rotation proce-
dure, 6m flops form x n matrix A and 6n for n x n matrix V . The total number of flops 
in a full sweep4 is represented by the following equation: 
n(n - l) 
Flops= 2 (12m + 6n) (3.1) 
Let us describe two ways of establishing the MFLOPS rating. The first, true MFLOPS 
rating, is obtained when after reducting the number of flops in both rotation and inner 
product computation procedures , we adjust the corresponding parameters of the Equa-
tion 3.1. The second, relative MFLOPS rating, is good if, despite the flops change in the 
program, we use the Equation 3.1 in its original form. The relative MFLOPS rating can 
be used to compare the efficiency of the code to its original, single processor implemen-
tation, but of course it will be misleading in terms of the true MFLOPS FPU figures. 
We choose to give true MFLOPS rating in our floating point performance measurements. 
This will allow us to monitor real usage of the APlOOO FPU which attains 5.5 MFLOPS 
peak performance for double precision numbers. 
2 The dependence of execution time from input data will be discussed in Chapter 5. 
3 This is not the efficiency as defined in Definition 2.2.2. 
4 The number of rotations a t the later sweeps decreases as algorithm converges. 
I: 
II 
t, 
II 
1, 
II 
t, 
60 CHAPTER 3. IMPROVEMENTS OF THE FLOATING POINT PERFORMANCE 
Matrix Size ( m x n) 128x128 256x256 512x512 2048x128 4096x128 6144x128 
Time (s) 93.2 827 6958 877 1826 2677 
Impr. I(%) 0.0 0.0 0.0 0.0 0.0 0.0 
P=l Impr. II (%) 0.0 0.0 0.0 0.0 0 .0 0.0 
MFLOPS 1.8 1.9 1.9 2.1 1.9 2.0 
Sweeps 12 14 15 10 10 10 
Time (s) 21.3 202 1716 234 468 674 
Impr. I(%) 0.0 0.0 0.0 0.0 0.0 0.0 
P=4 Impr. II (%) 0.0 0.0 0.0 0.0 0.0 0.0 
MFLOPS 2.0 1.8 1.8 1.9 1.8 1.8 
Sweeps 11 12 13 10 9 9 
Time (s) 8.4 51.6 464 65.9 145 205 
Impr. I(%) 0.0 0.0 0.0 0.0 0.0 0.0 
P=16 Impr. II (%) 0.0 0.0 0.0 0.0 0.0 0.0 
MFLOPS 1.5 1.9 1.7 1.8 1.5 1.5 
Sweeps 12 12 13 10 9 9 
Time (s) 9.3 24.0 134 32.9 56.6 91.0 
Impr. I(%) 0.0 0.0 0.0 0.0 0.0 0.0 
P= 64 Impr. II (%) 0.0 0.0 0.0 0.0 0.0 0.0 
MFLOPS 0.3 1.1 1.6 0.9 1.0 1.0 
Sweeps 11 12 13 10 9 10 
Table 3.1: Results for the basic, unimproved Hestenes algorithm implementation. 
Finally, we show, in our result table, Impr. I and Impr. II , two versions of the improve-
ments of the execution time for a particular implementation. They are both expressed as 
a relative difference between the execution times achieved for two different implementa-
tions. Impr. I is a relative improvement between the 'old' and 'new' implementations. It 
indicates the reduction of the execution time on a local (between two consecutive imple-
mentations) scale. 
Told - Tnew 
Impr. I= T 100% 
old 
(3.2) 
Impr. II represents a relative improvement between the basic, block implementation, and 
the 'new', improved version of the Hestenes SVD algorithm. It indicates the reduction of 
the execution time on a global scale. 
Impr. II = Tblock; T improved 100% 
block 
(3.3) 
The results for the unimproved implementation of t he Hestenes algorithm are presented 
in Table 3.1. In the above table, we see that the less columns of the matrix A are stored 
per processor, the worse the performance is. This can be explained by two factors: loss of 
load balance and significant domination of the communication over the computation for 
3.3. SCALED COLUMNS 61 
a small number of columns stored per processor. 
The one-sided Jacobi algorithm is well known from the fact that it maintains the 
nearly perfect (the same number of column stored per processors) load balance across 
the processors. However, as the algorithm nears to the convergence, fewer columns need 
to be rotated. The distribution of the number of columns needed to be rotated at the 
final stages of the algorithm depends on many factors ( data itself, round off errors) and 
therefore can be viewed as a random process. This property of the algorithm has certain 
influence on the execution time. The worst case is when P-1 processors have no rotation 
to perform, while the remaining processor must perform a complete set of rotations. This 
statistical loss of balance has a greater effect on the execution time as the number of 
processors increases . 
. 
3.3 Scaled columns 
Let us remind the fact that in nearly all modern microprocessor designs, floating point 
addition and floating point multiplication can be executed at the same time. This is 
because those microprocessors have independent hardware circuits to implement faddd 
and fmuld instructions. In order to use this feature in coding, we need to group the 
addition and multiplication operations so they will be balanced, or in other words, the 
multiplication to addition ratio has to be one. 
In the scaled columns implementation, we adopted a technique introduced by Gen-
tleman [17] which balances multiplication to addition ratio in the function rotate (see 
Figure 1.1 in Chapter 1.) of the Hestenes implementation. 
Let us remind the plain rotation matrix for the original Hestenes implementation in a 
new form which satisfies the condition of balanced multiplication to addition ratio. 
( 
A(k+l) ) ( 1 (:,tJ = cos e(~l 
A (k~l) iJ tan e(k) 
( :,J) tJ 
- tan e(~) ) ( A(k~ ) tJ (:,i) 
1 A (k). ( :,J) 
(3.4) 
Let us introduce the scaling factors c}k) for the ith column and cy) for the jth column 
of matrix A. The relation between the original, non scaled, and the scaled columns of the 
matrix A is shown in Equation 3.5 below. 
A (k) - c(k) A'(k) A (k). = c(k) A'(k) (:,i) - i (:,i)' (:,J) J (:,1) (3.5) 
62 CHAPTER 3. IMPROVEMENTS OF THE FLOATING POINT PERFORMANCE 
Equation 3.6 defines the scaling factors ct) and c;k). 
c}k+1) = cos ei/k)ct) 
c(k+l) = cos ei3.(k)c(k) J J 
c~0) = c(0) = 1 
i J 
(3.6) 
The scaling factors ct) and ct) hold the accumulated cosines calculated for the ith and 
the jth column during the kth rotation. Initially, the scaling factors c}0) and c;0) are set 
to unity. The fact that l8ijl ::; 1r /4 ensures that I cos 8 ijl ~ 1/../2 and therefore we do not 
need to worry about the underflow. We postmultiply each column of matrix A by ct) and 
ct) after each sweep is completed. The choice of a sweep as a post-multiplication step 
was arbitrary. On the other hand, it is the natural choice, since one sweep represents, from 
the program point of view, a set of complete operations which leads to the elimination of 
each off-diagonal element of matrix A at least once. Below, we show the Hestenes rotation 
performed on scaled columns: 
The scaled columns technique utilises the fact that the vector p-norms are preserved if 
the rotations are performed in Euclidean space. 
3.3.1 Numerical results 
In Table 3.2, we present numerical results for the scaled columns implementation. From 
the results we see that the improvement degrades as we store less columns per processor. 
Let us compare both rotation procedures: for unimproved and for scaled implementations. 
In the assembly language procedure5 , ROTATE_T_C(), (opposite page) representing 
the unimproved rotation procedure, we have five load (ldd) operations, two store (std) 
operations, four multiply (frnuld) operations, one addition (fadd) and one subtraction 
(fsubd). All of these operations are performed in double precision . The main three 
reasons why this code is inefficient are: grouped, costly load/store operations (first two 
loads and further store and load pair), only partly used pipelining (result of subtraction 
required in immediate addition) and unbalanced multiplication to addition ratio. 
6 We limit the presentation of assembly language only to the main loops in each of the procedures. 
3.3. SCALED COLUMNS 63 
Matrix Size ( m x n) 128x128 256x256 512x512 2048x128 4096xl28 6144x128 
Time (s) 86.7 760 6359 818 1720 2515 
Impr. I (%) 7.0 8.0 8.6 6.8 5.8 6.1 
P=l Impr. II (%) 7.0 8.0 8.6 6.8 5.8 6.1 
MFLOPS 1.6 1.6 1.6 1.9 1.7 1.8 
Sweeps 12 14 15 10 10 10 
Time (s) 20.3 189 1578 227 451 651 
Impr. I (%) . 5.0 6.5 8.0 3.3 3.4 3.5 
P=4 Impr. II(%) 5.0 6.5 8.0 3.3 3.4 3.5 
MFLOPS 1.7 1.5 1.6 1.7 1.5 1.6 
Sweeps 11 12 13 10 9 9 
Time (s) 8.8 50.7 443 68.2 152 216 
Impr. I(%) 
-4.4 1.8 4.5 -3.6 -5.0 -5.2 
P=l6 Impr. II (%) 
-4.4 1.8 4.5 -3.6 -5.0 -5.2 
MFLOPS 1.1 1.5 1.4 1.4 1.2 1.2 
Sweeps 12 12 13 10 9 9 
Time (s) 9.5 26.4 148 38.0 65.7 106 
Impr. I(%) -2.1 -9.7 -9.9 -15.5 -16.1 -16.8 
P=64 Impr. II (%) -2.1 -9.7 -9.9 -15.5 -16.1 -16.8 
MFLOPS 0.3 0.8 1.2 0.7 0.7 0.7 
Sweeps 11 12 14 10 9 10 
Table 3.2: Results for the scaled implementation of the Hestenes algorithm. 
ROTATE_T_C(TAN,COS,ALEFT,ARIGHT,LENGTH) LY33: 
double •ALEFT , •ARIGHT,•TAN,•COS; int LENGTH; ldd [Y.06], %f28 
{ ldd [Y.oO] ,%f26 
int l; fmuld %f28,%f26,%f26 
register double L,R,Left,Right; ldd moo] ,%f6 
fsubd %f30,%f26,%f26 
for (l=O;l<LENGTH;l++) fmuld %f30, %f6, i'J:30 
{ faddd %f30,%f28,%f30 
Left= ALEFT[l]; ldd C%o1], %f12 
Right= ARIGHT[l]; fmuld %f26,%f12,%f26 
L = Left - Right••TAN; std %f26, [1.02] 
R = Left••TAN + Right; ldd [Y.01] ,%f16 
ALEFT[l] = L••COS; fmuld %f30,%f16,%f30 
ARIGHT[l] = R••CDS; inc 8,%02 
} std %f30, [1.06] 
} inc 8,%06 
cmp %06,%03 
blu,a LY33 
64 CHAPTER 3. IMPROVEMENTS OF THE FLOATING POINT PERFORMANCE 
Let us see the code for the scaled columns rotation procedure RDTATE_T1_T2(). 
ROTATE_T1_T2(T1,T2,ALEFT,ARIGHT,LENGTH) 
double •ALEFT,•ARIGHT,•T1,•T2; int LENGTH; 
{ 
} 
int 1; 
register double Left,Right; 
for (l=O;l<LENGTH;l++) 
{ 
} 
Left= ALEFT[l]; 
Right= ARIGHT[l]; 
ALEFT[l] = Left - Right••T1; 
ARIGHT[l] = Left••T2 + Right; 
LY44: 
ldd 
ldd 
fmuld 
fsubd 
std 
ldd 
fmuld 
inc 
faddd 
std 
inc 
cmp 
blu,a 
[%05] ,'l.f8 
['l.oO] ,t.fO 
t.f8,t.f0,t.f0 
'l.f30,'l.fO,%fO 
'l.fO, C'l.o2] 
[%01] ,%f6 
'l.f30,'l.f6,'l.f30 
8,%02 
'l.f30,'l.f8,'l.f8 
'l.f8, [%05] 
8,%05 
%o5,'l.o3 
LY44 
In the above procedure, the number of load operations is reduced to three and the 
number of fmuld is reduced to two. Even though we achieved the balance of fmuld and 
faddd operations, the load operations still prevail and pipelining is not used efficiently 
(results of multiplication is needed for subtraction and addition). This can partly explain 
the relatively low (8%) peak performance improvement achieved for the scaled implemen-
tation. 
In Table 3.2, we observe that the less columns of matrix A are stored per processor, the 
worse the performance is. The explanation for this fact was already given in the previous 
section. For those cases, the potential saving from the scaled columns implementation is 
too small to offset the overhead of maintaining the extra data structure. 
The convergence rate was not affected by the introduction of the scaled columns except 
for the case of matrix A 512 by 512 and the number of processors P = 64. This can be 
explained by the way we compute number of sweeps. The number of sweeps shown is the 
rounded average number of sweeps across all the processors and therefore it can fluctuate 
±0.5 of a sweep. The convergence and its dependence from data distribution will be 
discussed in detail in Chapter 5. 
3.4. REDUCED INNER PRODUCT COMPUTATIONS 65 
3.4 Reduced inner product computations 
The other technique which decreases further the number of floating point operations is the 
reduced inner product6 computations. Although similar technique was already described 
in the paper by Romine and Sigmon [37], our idea was established independently. The 
algebra used to establi~h the column-vector norms is different and we do not use pairwise 
decoupling against which we will give an argument later. In the original algorithm, three 
inner products are computed before each rotation of pair of columns. The technique we 
introduce is based on the observation that only one from three inner products is required 
and others can be replaced by an simple update operation. 
To see how this technique works, let us assume we have two column-vectors in the 
Euclidean space A~~l) and A~~}). Let us also assume that three inner products a:ff), a;~) 
and a:J;) are known and they are related to the vectors as shown below: 
(k) (k) (k) 
a .. = A( .) · A( .) tt :,t :,t (k) (k) (k) a . . = A( .) · A( .) JJ : ,3 :,3 (k) (k) (k) a . . = A( .) · A( .) tJ :,i :,3 (3.8) 
After column-vector rotation by a known angle, the new inner products can be expressed 
as: 
0 (k+l) = A (k~l) . A (k+i) 33 (:,3) (:,3) 
The transformation from A(k) to A(k+l) is given in Equation 3.4. After the substitution, 
we get: 
(3.10) 
The a:(~) and a/~) can be computed once at the beginning of each sweep and stored in ii 33 
the array. After rotation of each pair of column-vectors, they can be updated according 
to the formulas above. The respective parts of the array where the updated alphas are 
stored can be sent to other processors, the same way as are the columns from each block. 
Theoretically the a};) can also be updated but after exchanging blocks of columns between 
processors, it has to be recalculated (always new set of columns in the block pair), which 
from point of view of time saving is not worthwhile. Also the value of the eJ;) goes to 
6 Inner product is also known as a dot product . 
66 CHAPTER 3. IMPROVEMENTS OF THE FLOATING POINT PERFORMANCE 
zero as the columns become more orthogonal. This can lead to the accumulation of errors 
in the updated a!J). 
In their paper, Romine and Sigmon introduced a new technique of pairwise decoupling 
which further can potentially reduce the time spent on the inner product computations. 
They observed that if the orthogonality of a column pair known to be orthogonal during 
a given sweep is not destroyed before being encountered during the next sweep, then the 
inner product the a!J) computation in the latter sweep is unnecessary. Therefore, they 
proposed to maintain an extra data structure to keep track on all those inner products 
which did not change after the rotation of two columns. Although this can give the 
potential savings for big matrices in the single processor environment - after analysis of 
costs involved in maintaining and communicating such a data structure - we abandon the 
idea of implementing it in the parallel environment. 
In the reduced inner product computations method also referred by us, as an a-update, 
per one rotation we achieved a reduction in floating point operations from 6m to 2m which 
is a significant result. Because we updated the inner products instead ofrecalculating them 
in every rotation, we may encounter an increase in the number of sweeps needed for the 
algorithm to converge. 
3.4.1 Numerical results 
In Table 3.3, we present the numerical results for the reduced inner product implemen-
tation. The results show a 10% to 20% improvement in performance for a small number 
of processors (large number of columns stored per processors). This improvement will 
degrade as the number of processors increases due to the statistical loss of load balance 
and high communication to computation ratio mentioned in the previous section. 
We observed that there is no increase of number of sweeps which indicates that the 
a-update technique does not accumulate errors to affect the convergence rate of the algo-
rithm. 
Still in the case of this improvement as well as in the case of scaled columns, the basic 
procedures for rotation and inner product computation ( a~;+l) computation) are not very 
efficiently implemented by the SPARC C compiler. The efficient implementation of these 
two procedures are presented in the next section. 
3.5. ASSEMBLY LANGUAGE OPTIMISATIONS 67 
Matrix Size ( m x n) 128xl28 256x256 512x512 2048x128 4096xl28 6144x128 
Time (s) 73.5 618 5143 643 1376 1994 
Impr. I (%) 15.2 18.7 19.1 21.4 20.0 20.7 P=l Impr. II (%) 21.1 25.2 26.1 26.6 24.7 25.5 
MFLOPS 1.3 1.3 1.3 1.4 1.2 1.3 Sweeps 12 13 14 10 10 10 
Time (s) 17.1 161 1359 183 374 534 
Impr. I(%.) 15.8 14.8 13.9 19.4 17.1 18.0 
P=4 Impr. II (%) 19.9 20.2 20.8 22.0 20.0 20.9 
MFLOPS 1.4 1.2 1.3 1.2 1.1 1.2 
Sweeps 11 12 13 10 9 9 
Time (s) 8.0 43.6 395 56.2 132 186 
Impr. I (%) 9.1 14.0 10.8 17.6 13.2 19.9 
P=l6 Impr. II (%) 4.6 15.6 14.8 14.6 9.0 9.4 
MFLOPS 0.9 1.2 1.2 1.0 0.8 0.9 
Sweeps 12 12 14 10 9 9 
Time (s) 11.1 25.2 125 35.5 60.8 97.0 
Impr. I (%) -16.8 4.5 15.5 6.6 7.5 8.5 
P=64 Impr. II (%) -19.1 -5 .0 6.6 -8.1 -7.3 -6.7 
MFLOPS 0.2 0.6 1.0 0.4 0.5 0.5 
Sweeps 11 12 13 10 9 10 
Table 3.3: Results for the reduced mner product implementation of the Hestenes 
algorithm. 
3.5 Assembly language optimisations 
In this section we present the improvements applied not to the algorithm but to the code 
itself, generated by the SPARC C compiler. It can be argued whether such improvements 
should be described herein since they rather belong to the class of the low level code de-
sign. We felt that because we write about the performance improvements of a particular 
numerical algorithm which is executed on a particular computer, we can also include the 
performance improvements on the level of the assembly language since they will indicate 
the performance of the APlOOO itself. Also the balanced multiplication to addition ra-
tio had to be implemented manually since the SPARC C compiler did not optimise the 
rotations procedures very well. 
3.5.1 Rotation procedures 
In this section we present two parts of the assembly language procedures which are used 
to rotate two columns. Below, both parts, left and right, represent only the code of the 
most inner loops from the rotation procedures. On the left hand side, we present once 
68 CHAPTER 3. IMPROVEMENTS OF THE FLOATING POINT PERFORMANCE 
again the original SPARC C translation of the scaled rotation and on the right hand side, 
our implementation of this procedure. The major drawbacks of the original SPARC C 
assembler were described in Section 3.3. The main idea in our implementation of the 
rotation procedure was to interleave the floating point instructions faddd and fmuld with 
load and store instructions so they can be performed in parallel by two independent add 
and multiply units. Also we group faddd and fmuld instructions in such a way that the 
result of each of them is not needed immediately and the SPARC pipeline can be used to 
its full potential. 
LY44: LOOP: 
ldd [%05] , i.£8 
ldd [i.oO] , i.£0 
fmuld i.fB,i.fO,i.fO 
fsubd i.£30, i.£0, Y.fO 
std i.£0, [i.o2] 
ldd [i.o1] , i.£6 
fmuld i.£30, i.£6, Y.£30 
inc 8,i.o2f 
faddd i.£30, i.fB, Y.£8 
std i.£8, [i.oS] 
inc B,i.oS 
cmp i.o5,i.o3 
blu,a LY44 
fmuld TA,B1,A1s 
ldd [a1+n] ,Ai 
subcc n,16,n 
faddd BO,BOs,BOs 
std A Os, [aOs+n] 
fmuld TB,A1,B1s 
ldd [bO+n] ,BO 
subd A1,A1s,A1s 
std BOs, [bOs+n] 
fmuld TA,BO,AOs 
ldd [aO+n],AO 
faddd B1,B1s,B1s
0 
std A1s, [a1s+n] 
fmuld TB,AO,BOs 
ldd [b1+n] ,B1 
fsubd AO,AOs,AOs 
bge LOOP 
do { 
A1s =TA* b[n+1]; 
BOs = b[n] + BOs; 
a[n] = AOs; 
Bis= TB* a[n+1]; 
A1s = a[n+1] - A1s; 
b[n] = BOs; 
AOs =TA* b[n]; 
Bis= b[n+1] + Bis; 
a[n+1] = A1s; 
BOs =TB* b[n]; 
AOs = a[n] - AOs; 
n -= 2; 
} while (n >= O); 
Both procedures were tested on the APlOOO for two vectors of length 1000 which 
were rotated 1000 times. The original scaled procedure had execution time of 2.7 s and 
delivered 1.5 MFLOPS. The assembly language procedure completed the same task in 0.9 
s achieving 4.2 MFLOPS. The impact of the new procedure on the algorithm is presented 
in the numerical results section. 
3.5.2 Inner product computation procedures 
Very similar arguments given in the previous section forced us to write our own inner 
product computation procedure called InnerCalc. On the opposite page we present the 
3.5. ASSEMBLY LANGUAGE OPTIMISATIONS 69 
Matrix Size ( m x n) 128x128 256x256 512x512 2048x128 4096x128 6144x128 
Time (s) 52.1 448 3590 449 969 1431 
Impr. I(%) 29.1 27.5 30.2 30.2 29.6 28.2 
P=l Impr. II (%) 44.1 45.8 48.4 48.8 46.9 46.5 
MFLOPS 2.6 2.7 2.8 3.4 3.1 3.1 
Sweeps 12 14 14 10 10 10 
Time (s) 12.0 116 944 137 273 398 
Impr. I(%) 29.8 28.0 30.5 25.1 27.0 25.5 
P=4 Impr. II(%) 43.7 42.7 45.0 41.7 41.7 41.0 
MFLOPS 2.9 2.5 2.7 2.8 2.6 2.6 
Sweeps 11 12 13 10 9 9 
Time (s) 6.6 32.3 294 45.9 107 157 
Impr. I(%) 17.5 25.9 25.6 18.3 18.9 15.6 
P=l6 Impr. II (%) 22.0 37.4 36.6 30.3 26.2 23.7 
MFLOPS 1.5 2.4 2.3 2.2 1.7 1.7 
Sweeps 12 12 14 10 9 9 
Time (s) 9.4 21.8 100 32.6 55.5 89.8 
Impr. I(%) 15.3 13.5 20.0 8.2 8.7 7.4 
P=64 Impr. II (%) -0.5 9 .5 25.4 1.0 2.0 1.3 
MFLOPS 0.3 1.0 1.8 0.8 0.8 0.9 
Sweeps 11 12 13 10 9 10 
Table 3.4: Results for scaled columns implementation after introduction of assembly lan-
guage modifications. 
comparison between the original SPARC C assembler and ours. The basic technique 
used in the presented routine was to group the floating point operations in order to use 
efficiently the SPARC processor pipelining. The test program revealed that in order 
to compute 1000 times the inner products of two vectors of length 1000, the original 
subroutine required 1.9 s achieving 3.2 MFLOPS and the assembly language procedure 
needed 1.1 s and achieved 5.5 MFLOPS which is the peak performance of the SPARC 
chip for doubles. 
L77326: LOOP: do { 
ldd [%i0] ,%£30 fmuld bn,an,ab ab = a[n] * b[n]; 
fmuld %f30,%f30,%f10 faddd al,aa,al alpha_ii += aa; 
ldd [%17],%£28 fmuld bn,bn,bb bb = b[n] * a[n]; 
faddd %f24,%f10,%f24 ldd [a+n] ,an 
fmuld %f30 , %f28,%f14 faddd ga,ab,ga alpha_ij += ab; 
inc 8,%17 fmuld an,an,aa aa = a[n] * a[n] 
cmp %17,%i1 ldd [b+n] ,bn n-=8; 
inc 8,%i0 faddd be,bb,be alpha_jj += bb; 
faddd %f26,%f14,%f26 bge LOOP } while (n >= O); 
fmuld %£28, %£28, %f 12 
blu L77326 
70 CHAPTER 3. IMPROVEMENTS OF THE FLOATING POINT PERFORMANCE 
(k+l) . . h We also wrote an assembly language procedure for the a ij computations m t e 
reduced inner product implementation. Using a similar test program as above, the unim-
proved version had execution time 1.1 s and floating point performance of 1.8 MFLOPS. 
The new assembly language version's running time was 0.8 sand the MFLOPS rating was 
2.6. 
3.5.3 Numerical results 
Let us see how the performance of the,. Hestenes algorithm implementation is affected by 
the changes in the rotation and inner product computation procedures. The results of 
the scaled columns implementation with new, more efficient rotation and inner product 
procedures are presented in Table 3.4. The peak performance improvement achieved is 
for Impr. I 26% and for the Impr. II 49%. 
The results for both scaled columns and reduced inner product implementations with 
improved assembly language routines are presented in Table 3.5. Certainly, the double 
reduction in execution time was achieved due to the more efficient assembly language 
routines. On the other hand, the balanced multiplication to addition ratio was the primary 
factor which enabled us to write those routines. The improvement in execution time is 
proportional to the number of columns stored in each processor in the case of the scaled 
columns implementation. It is also proportional to the length of the columns in the case of 
the reduced inner product implementation. Both trends are clearly visible in the discussed 
results. 
3.6 Conclusions 
In this chapter we presented the improvements in the floating point performance by using 
three general methods. Firstly, we reduced the number of floating point operations, sec-
ondly we balanced multiplication-addition ratio, and thirdly, we improved the execution 
code generated by the C compiler in order to use various properties of the SPARC chip 
efficiently. In the first case, the changes were applied to the algorithm and were purely 
numerical. In the second, the changes were applied to the execution code. 
In Table 3.3, we see that the maximum improvement (Impr. II) in the execution 
time for the numerical changes was 27% (both scaled columns and reduced inner product 
3.6. CONCLUSIONS 
Matrix Size ( m x n) 128x128 256x256 512x512 2048x128 4096x128 6144x128 
Time (s) 47.6 391 3235 400 861 1266 
lmpr. I(%) 8.6 12.7 9.9 10.9 11.1 11.5 
P=l lmpr. II (%) 48.9 52.6 53.5 54.4 52.8 52.7 
MFLOPS 2.0 2.1 2.2 2.3 2.1 2.1 
Sweeps 12 13 14 10 10 10 
Time (s) 11.1 105 859 122 246 357 
lmpr. I (%) 7.5 9.5 9.0 10.9 9.9 10.3 
P=4 lmpr. II (%) 48 .0 48.0 50.0 47.9 47.4 47.0 
MFLOPS 2.2 1.9 2.1 1.9 1.7 1.7 
Sweeps 11 12 13 10 9 9 
Time (s) 6.2 29 .8 258 41.1 97.1 142 
Impr. I (%) 6.1 7.7 12.2 10.5 9.3 9.6 
P=16 Impr. II (%) 26.1 42.3 44.5 37.5 32.9 30.7 
MFLOPS 1.1 1.8 1.7 1.4 1.1 1.1 
Sweeps 12 12 13 10 9 9 
Time (s) 10.9 20.9 93.1 30.3 52.0 82.2 
Impr. I (%) -16.0 4.1 6.9 7.1 6.3 8.5 
P=64 lmpr. II (%) -17.5 13.1 30.7 7.7 8.1 9.7 
MFLOPS 0.2 0.7 1.3 0.5 0.5 0.5 
Sweeps 11 12 13 10 9 10 
Table 3.5: Results for the reduced inner product implementation after introduction of 
assembly language modifications. 
71 
computations). The performance of the scaled columns method through its balanced 
multiplication to addition ratio relies heavily on the interpretation of this method by the 
compiler. If the compiler does not recognise and use properly the balanced multiply-add 
operations, the performance improvement will be negligible. 
Since in this thesis we describe the ways to speed up the execution of the Hestenes 
algorithm on all levels of the parallel memory hierarchy, it is appropriate to describe also 
improvements on the processor level. In Table 3.5, we see that both improvements in the 
execution code of the earlier mentioned procedures contribute about half of the total peak 
performance improvement (Impr. II) of 54%. The large part of this contribution is due 
to the efficient implementation of the scaled columns on the level of assembly language. 
The peak performance of all the optimisations presented is 3.4 MFLOPS and is 
achieved for the scaled columns with the modified assembly language rotation procedure. 
This result is not sustained after the introduction of reduced inner product implementa-
tion due to the smaller number of floating point operation (reduction by the factor of 3) 
and the only moderately efficient aiJ+l) assembly language procedure (2.6 MFLOPS). 
72 CHAPTER 3. IMPROVEMENTS OF THE FLOATING POINT PERFORMANCE 
Matrix Size ( m x n) 128x128 256x256 512x512 2048x128 4096x128 6144x128 
P=4 Efficiency T/or g. 1.09 1.02 1.01 0.94 0.97 0.99 
P=16 Efficiency T/org. 0.69 1.00 0.94 0.83 0.79 0.82 
P=64 Efficiency T/org . 0.16 0.54 0.81 0.42 0.50 0.46 
P=4 Efficiency T/impr. 1.08 0.92 0.94 0.82 0.88 0.89 
P=16 Efficiency T/impr. 0.49 0.82 0.79 0.61 0.56 0.56 
P=64 Efficiency T/impr. 0.07 0.30 0.54 0.21 0.26 0.25 
P=4 Efficiency T/accum. 2.08 1.95 2.02 1.77 1.83 1.84 
P=16 Efficiency T/accum. 0.94 1.72 1.69 1.32 1.17 1.16 
P=64 Efficiency T/accum . 0.13 0.62 1.17 0.45 0.55 0.51 
Table 3.6: Comparison of efficiencies for the original and improved Hestenes algorithms. 
In the top section of Table 3.6, we present the efficiences computed for the original 
algorithm without the improvements. The efficiences of the improved algorithm which 
includes all improvements described in this chapter are depicted in the middle of the 
table and the accumulated efficiencies at the bottom. The accumulated efficiency was 
computed by using the same equation (Equation 2.2) as for the other two, but with the 
difference that for r1 we used execution time for one processor from the original algorithm 
and for Tp we used the execution time from the improved algorithm. 
Although one can argue that it is not quite correct, by doing so we can get a good 
estimate of the global performance improvement represented by the efficiency factors. We 
observe that the real efficiency for the improved algorithm 1Ji mpr. is worse than for the 
original algorithm 1Jorg. by about 20%. The possible explanation is that the improved 
algorithm does not scale so well as the original one. On the other hand, the accumulated 
efficiency 1Jaccum. which includes execution time improvements shows a considerable gain. 
Chapter 4 
Optimisations for the Memory 
Hierarchy 
4.1 Introduction 
In this chapter, we would like to show new techniques of improving the performance of 
the Hestenes algorithm which are based on other principles than balancing or reducing 
the number of floating point operations. The basic philosophy behind these techniques is 
to maximise t he reuse of data on each ievel of the memory hierarchy without the need for 
reloding them from the lower (slower) memor¥ level. The practical realisation of this idea 
is done by altering the geometry of the data - their distribution among the processors -
and by reorganising the order of computations. There are three major methods presented 
in this chapter, each addressing one1 memory level: 
other processors level ( communication) In the rectangular configuration method, 
we alter the configuration of the processors from linear to rectangular and discuss 
the impact of this change on the performance. 
cache level In the partitioning method, we further divide blocks of columns to sub-
blocks, which can fit within the cache, and also reorganise the order of computations 
to minimise cache misses. 
register level In the four column method, we increase the number of columns that are 
rotated to four at a time in order to improve the data reuse at the register level. 
1 This should be understood as one but not exclusively one memory level. 
73 
11 
74 CHAPTER 4. OPTIMISATIONS FOR THE MEMORY HIERARCHY 
In this chapter, we want to intensify the computation by increasing the number of 
columns (column-segments) being held in each level of the memory hierarchy. 
4.1.1 General performance measures - reuse factors 
In the Hestenes algorithm, the number of floating point operations is proportional to w2 , 
where w is the number of columns stored at a particular level of the memory hierarchy. 
Thus, the central principle in all of the optimisations presented herein is to increase w. 
We expect that similar principles can be used to enhance other linear algebra algorithms 
implemented on the parallel (MIMD) architectures . 
Let us introduce a reuse factor 'R as a general measure of the algorithm performance 
at a particular memory hierarchy level. Generally, 'R is defined as the ratio of the number 
of floating point operations performed on a set of data, kept at a particular level of the 
memory hierarchy, to the number of 'load' operations required to put ( and maintain, in 
the case of the cache level) the data at that level. Let us define three reuse factors used 
in this chapter: 'Rcomm, 'Rcache, 'Rreg• 
Definition 4.1 'Rcomm is the number of floating point operations performed per unit data 
communicated. This reuse factor mirrors the computation to communication ratio. 
Definition 4.2 'Rcache is the ratio of the number of floating point operations performed 
on updating a set of columns of A (that can be all kept in cache} to the number of cache 
misses involved in that update. 
Definition 4.3 'Rreg is the ratio of the number of floating point operations performed on 
a given element of A to the number of loads to floating point register of that element. 
4.1.2 Measuring and comparing performance 
Discussion of methods used to measure the performance of the Hestenes algorithm in 
the APlOOO environment was given in Section 3.2 and its outcomes can also be applied 
here. However, there are a few new issues regarding the cache performance analysis and 
measurement of the communication time which we will discuss below. 
To measure cache performance as a percentage of cache hits, we use the Fujitsu's 
cache performance analyser. In the SVD program, we marked areas (from - to), where 
the analysis of cache performance will take place. The tool is invoked at the command 
' 
' 
' 
' I 
: 
I 
I , 
,,, 
' 
I 
' I 
I 
' ,, 
.M 
4.2. OPTIMISATIONS ON OTHER PROCESSORS MEMORY LEVEL 75 
line. There are four types of access to the cache that can be counted: (1) user and system, 
(2) user only, (3) user data only and ( 4) user instruction only. We used option (2) since 
it gave us the most reliable ( see the discussion below) and consistent cache performance 
measurements. Also the presented results of the cache hit ratio are averaged across all the 
cells. We need to make two remarks on the reliability of the cache performance analyser. 
Although one would think that we should measure the user data only accesses to the cache, 
we noted that the cache performance analyser measures these accesses unreliably and we 
had to switch to the user's data and instructions type of accesses. Also we observed that 
the vertical communication introduces sporadic inconsistencies of cache measurements. 
Although all the care was taken so that the cache performance was measured accurately, 
there are some results which could not be given due to their anomalies (unbelievably low 
cache hit ratios). Those results were denoted as X in the corresponding results table in 
Section 4.3.2. The explanation of this fact is unknown to us and the investigation of it is 
beyond the scope of this thesis. 
We also introduced procedure to measure the communication time. We used the same 
C function dgettime(} as for the execution time measurement in the previous chapter. Two 
communication times were measured, the horizontal, between processors in X direction, 
and the vertical, between processors in Y direction . Since both communications occur 
frequently, the procedures which are involved in the time measurement take considerable 
amounts of time themselves; therefore they distort the execution time measurements. In 
order to avoid the influence of the communication time measurement all the test runs were 
executed twice, firstly, to measure the communication time and secondly, to measure the 
other parameters with communication time measurements disabled. The communication 
time was measured in each processor and then was averaged over the number of processors 
involved in the computations. 
4.2 Optimisations on other processors memory level 
4.2.1 Rectangular cells configuration 
In order to increase the communication reuse factor 'Rcomm = 0( n/ P) for the linear 
configuration, we use a rectangular PyXPx configuration, where Py > 1, PyPx = P 
and Wcomm = n/ Px . Thus, each processor holds a m' X Wcomm submatrix of A, where 
---
r 
I 
11 
76 CHAPTER 4. OPTIMISATIONS FOR THE MEMORY HIERARCHY 
1, 2 3, 4 - 5, 6 7, 8 - 9,1011,12-13,1415,1€-17,1819,2( 
I I I I I 
1, 2 3, 4 - 5, 6 7, 8 ..-- 9,10 11 ,1.- 13,1415,1€-17,1819,2( 
I I I I I 
1, 2 3, 4 - 5, 6 7, 8 ..-- 9,10 11,1:-13,1415,1€-17,1819,2( 
I I I I I 
1, 2 3, 4 - 5, 6 7, 8 - 9,10 11,1:- 13,1415,1€--17,1819,2( 
'"--------- ---------J 
Px 
Figure 4.1: Distribution of columns for rectangular cells configuration. 
m' = m/ Py . In a basic compute-communicate step of the SVD algorithm, there are 
O(w;ommm') arithmetic operations, since each column-segment is rotated with Wcomm/2 
others. The amount of data reuse on the communication level is therefore proportional 
to the number of columns per cell Wcomm· By using the rectangular configuration, the 
data reuse is increased by a factor of Py compared with the linear configuration, since 
now Wcomm = n/Px = nPy/P and the reuse factor is 'R' = O(n/Px)· An example of the 
distribution of the columns for rectangular configuration is presented in Figure 4.1. Also, if 
the size of the submatrix of A per processor is sufficiently small (i.e. M'wcomm ~ Cs where 
Cs is the cache size expressed in 8 bytes units2), the cache reuse ratio 'R~a.che = 'R~omm• 
so the rectangular configuration increases the cache reuse by a factor of Py as well. 
Since in this method each column is divided into Py segments, in order to compute 
correctly the inner product for each pair of columns, we need to introduce the vertical 
communication. The at) and a;;) are computed once per sweep and each processor 
holds only the partial inner product for the stored segment of the ith and jth columns. 
The correct inner product is obtained by summing the partial inner product across the 
vertical processors and then distributing the result of this summation to each processor 
in the Y direction. This is accomplished by using the APlOOO's y _dsum function which 
2 We want to compare a number of double precision numbers to the case size . 
,, 
1, 
', 
I, 
i 
: 
1, 
I 
1: 
' 
II 
Ii 
[ I 
Ii 
4.2. OPTIMISATIONS ON OTHER PROCESSORS MEMORY LEVEL 77 
is costly to use 3 . In the case of computing a};) and a}~), usage of the y _dsum does 
not represent any disadvantages since these are computed only once per sweep. For ai;), 
which is computed every time we rotate a pair of columns, this may introduce a major 
performance degradation if the number of processors in Y direction is large. 
From the above facts , we expect that the best results for the rectangular configuration 
will be obtained for the t est runs where the total communication times for the X and Y 
directions are the smallest . 
Another fact which supports the use of the rectangular configuration is that the com-
munication network of the APlOOO is a toroidal mesh, and the rectangular configuration 
is mapped better onto the APlOOO network topology than the linear configuration. 
4.2.2 Numerical results 
The test runs were performed for the same matrix sizes as in the previous chapter i.e. 
128 X 128, 256 X 256, 512 x 512, 2048 X 128, 4096 x 128, and 6144 X 128. We also maintained 
the same number of processors , i.e. 1, 4, 16 and 64. 
Let us make a digression on the chosen matrix sizes. The ideas presented later in this 
chapter, partitioning and four column rotation, were developed in order to deal with the 
large number of columns stored per processor. Therefore they will perform best if we use 
them for the matrices with the large number of columns per processor. The argument for 
the sizes we chose is the consistency of the data used in this thesis and therefore, it is 
easy to examine the performance improvements for various implementations. 
We used the following rectangular (Y X X) divisions for the processors: 
4 processors: 1 X 4, 2 X 2, and 4 X 1. 
16 processors: 1 x 16, 2 X 8, 4 X 4, 8 X 2, and 16 X 1. 
64 processors: 1 x 64, 2 X 32, 4 X 16, 8 x 8, 16 X 4, 32 X 2, and 64 X 1. 
Firstly, we would like to know which configurations of the processors yield the shortest 
execution times. We established this by executing the program for all the configurations 
and then by comparing the execution times with the best results obtained in the previous 
chapter (scaling and reduced inner product with all the assembly language improvements) 
for the linear processor configuration (See Table 3.5.). The results obtained are presented 
3 For the performance of the y_dsum, see Figure 2.6. 
78 CHAPTER 4. OPTIMISATIONS FOR THE MEMORY HIERARCHY 
Matrix Size ( m x n) 128x128 256x256 512x512 2048xl28 4096x128 6144x128 
P = 1 X 4 0.0 0.0 0.0 0.0 0.0 0.0 
P = 2 X 2 -7.0 -5.3 0.7 16.8 13.3 13.0 
p = 4 X 1 -44.7 -21 .9 -11 .2 22.2 24.4 22.7 
P = 1 X 16 0.0 0.0 0.0 0.0 0.0 0.0 : 
P = 2 X 8 27.4 14.1 8.5 28.3 28 .7 25.4 i 
p =4 X 4 23.9 6.8 7.4 41.8 46.7 42.7 
i p = 8 X 2 -8 .6 -25.4 -9 .7 43.8 54.4 51.4 
p = 16 X 1 -109.0 -113.6 -51.6 33.2 52.8 52.2 I i 
p = 1 X 64 0.0 0.0 0.0 0.0 0.0 0.0 
P = 2 X 32 42.2 35.6 23.9 41.1 35.6 41.3 
P = 4 X 16 65.1 45.5 35.1 61.7 59.5 65.9 i 
P= 8 x 8 74.3 45.4 3.7 71.2 70 .6 73.5 : 
p = 16 X 4 65.4 8.7 -0 .3 72.6 75 .6 11.6 
p = 32 X 2 15.7 -100.8 -112.1 56.5 66.5 73.3 
p = 64 X 1 -90.1 -346.1 -363.3 25.5 49.3 62.4 
I 
11 Table 4.1: Relative performance improvements (Impr. I) for the rectangular configuration : 
against linear configuration. Values typed in bold style denote the best I 
performance improvement, the it alics values denot e the worst cases. .: 
Matrix Size ( m x n) 128x128 256x256 512x512 2048x128 4096x128 6144x128 i 
P = 1 X 4 2.1 0.0 11.5 0.0 57.0 0.0 18.9 0.0 33.0 0.0 48.7 0.0 
i 
p = 2 X 2 0.6 1.2 4.0 5.8 16.2 24.4 6.1 1.6 12.5 2.1 17.5 2.5 
' p = 4 X 1 0.0 3.2 0.0 14.6 0.0 68 .2 3.4 0.0 3.8 0.0 4.3 0.0 
[ p = 1 X 16 3.3 0.0 10.7 0.0 58.4 0.0 18. 7 0.0 37.2 0.0 56.6 0.0 
P = 2 X 8 1.4 0.3 5.2 1.5 28 .8 6.3 8.9 0.4 19.8 0.6 30.1 0.7 : 
P=4 x 4 0.7 0.8 2.5 3.5 14.0 15.1 3.9 0.9 9.0 1.0 13.5 1.2 ,, 
P= 8 x 2 0.2 1.9 0.7 8.5 4.0 36.0 1.0 1.8 2.8 2.2 3.8 2.3 
p = 16 X 1 0.0 4.8 0.0 21.3 0.0 97.1 0.0 4.5 0 .0 4.8 0.0 4.8 
P = 1 X 64 9.8 0.0 13.6 0.0 48.5 0.0 20.6 0.0 34.3 0.0 55.0 0.0 
I 
p = 2 X 32 4.7 0.4 6.9 0.6 26.5 3.5 10.2 0.3 19.1 0.5 27.7 0.5 
P = 4 X 16 2.1 0.6 3.3 1.6 12.2 7.7 5.0 0.5 9.1 0.6 11.9 0.6 
P = 8 x 8 0.9 0.6 1.8 2.2 8.3 10.2 2.4 0.6 4.4 0.6 6.3 0.6 
p = 16 X 4 0.4 1.3 0.8 4.8 3.1 21.8 1.1 1.1 2.0 1.2 2.6 1.1 
P = 32 X 2 0.1 6.0 0.3 25.6 1.0 120.7 0.3 5.7 0.5 5.8 0.8 5.8 
P = 64 X 1 0.0 14.8 0.0 63.6 0.0 274.3 0.0 12.7 0.0 12.8 0.0 12.9 
1, Table 4.2: Communication times in seconds for X and Y directions for the rectangular 
configuration. Values typed in bold style denote the shortest times, the italics 
values denote the longest total communication times. 
------
4.2. OPTIMISATIONS ON OTHER PROCESSORS MEMORY LEVEL 79 
Matrix Size ( m x n) 128x128 256x256 512x512 2048x128 4096x128 6144x128 
Conf. 1 X 1 1 X 1 1 X 1 1 X 1 1 X 1 1 X 1 
1: Time (s) 47.6 391 3235 400 861 1266 
P=l 
lmpr. I(%) 0.0 0.0 0.0 0.0 0.0 0.0 
lmpr. II (%) 48.9 52 .6 53.5 54.4 52.8 52.7 
11 MFLOPS 2.0 2.1 2.2 2.3 2.1 2.1 
1, Sweeps 12 13 14 10 10 10 
I Conf. l x 4 l x4 1, 2x2 4xl 4xl 4xl 
1: Time (s) 11.1 105 853 94.9 186 276 
I! P=4 lmpr. I(%) 0.0 0.0 0.7 22.2 24.4 22.7 
: Impr. II (%) 48.0 48.0 50.3 59.5 60.3 59.1 I 
MFLOPS 2.2 1.9 2.1 2.4 2.4 2.4 
I Sweeps 11 12 13 10 10 10 
. Conf. 2 x 8 2 x 8 2x8 8x2 8x2 16 X 1 
Time (s) 4.5 25.6 236 23.1 44.3 67.9 
P=16 
lmpr. I (%) 27.4 14.1 8.5 43.8 54.4 52.2 
lmpr. II (%) 46.7 50.3 49.2 64.9 69.3 66.9 
MFLOPS 1.4 2.1 1.9 2.4 2.5 2.5 I 
Sweeps 11 12 13 10 10 10 
Conf. 8 x 8 4 X 16 4 X 16 16 X 4 16 X 4 16 X 4 
! Time (s) 2.8 11.4 60.4 8.3 12.7 18.4 
i lmpr. I(%) 74.3 45.5 35.1 72.6 75.6 77.6 
P=64 lmpr. II (%) 69.8 52.5 55.0 74.7 77.6 79.7 
MFLOPS 0.6 1.3 1.9 1.7 2.1 2.2 
Sweeps 11 13 13 10 9 10 
' 
Table 4.3: The best results for the rectangular configuration. 
11 
I' in Table 4.1. Before drawing any conclusions, let us present results for the communication 
1, 
time measurements in the X and Y directions. Table 4.2 shows the communication times 1, 
in seconds for the X direction - first column and in the Y direction - second column. The 
shortest total communication times are typed in bold style. 
I Theoretically, the communication reuse factor 'Rcomm should be the best for the case 
where the number of columns stored per processor is the biggest and there is no horizontal 
communication at all. This is the case where all the columns are held in one processor 
in the X direction and they are evenly distributed as segments in the Y direction. So, 
theoretically, the best results should be obtained for P x 1 configuration providing that 
there is no cost involved in the vertical communication. In practice, from Table 4.2, 
we see that the vertical communication can not be ignored even if there is a need to 
communicate only Py numbers (partial inner products for aJJ)). As the number of the 
vertical processors grows, the vertical communication becomes very inefficient. It exceeds 
I 
rr 
11 
80 CHAPTER 4. OPTIMISATIONS FOR THE MEMORY HIERARCHY 
the horizontal communication for the P x 1 case already for four processors. For example, 
for the worst case of the matrix 512 by 512 and 64 processors in the Y direction to 
sum and distribute 64 partial inner products takes 274.3 s and to communicate in the X 
direction 4096 co-ordinates of the columns held in each processor takes only 48.5. The 
only conclusion we can draw from this is that the y_dsum function, if used often, degrades 
the overall algorithm performance. 
The minimum of the total communication times for the square matrices occurs for the 
square processor configurations. For the rectangular matrices, where the amount of data 
communicated in the X direction is large, the minimum of the total communication t imes 
is shifted towards more vertical configurations, reaching PX 1 configuration for the small 
number of processors. 
In Table 4.3, we present the best results obtained for the rectangular configuration. 
From this table we see that the best results for the rectangular matrices follow, in general, 
the cases of the shortest communication times. Generally, for the sufficiently large number 
of columns stored per processor and sufficiently large amount of rectangular configura-
tions to choose from, the algorithm will perform best at, or near square configuration of 
processors. 
The biggest improvement of 80% was achieved for the case of 128 by 6144 matrix A 
distributed over 64 processors in the configuration 4 x 16. This means that the execution 
time is 5.3 times shorter than for the basic unimproved block implementation. On aver-
age, the performance improvement achieved is around 60% which corresponds to 2.5 fold 
reduction in execution time. 
In the next section, we will present how we can further improve the performance of 
the algorithm for the cases where large number of columns is stored per processor where 
'large' means that the amount of data per cell exceeds the APlOOO cache size of 128 KB. 
4.3 Optimisations on cache level 
4.3.1 Partitioning 
In cases where larger blocks (wcomm > CsPy/m'), where Cs is the cache size expressed in 8 
bytes units, have to be accommodated in the cells, the rectangular processor configuration 
technique must be used together with the partitioning methods [44]. The basic principle of 
I 
i 
' 
i 
I 
i 
I 
I 
I 
I: 
1, 
: 
i 
' 
I 
I 
-~ -
4.3. OPTIMISATIONS ON CACHE LEVEL 81 
for i = 1 : Wcomm 
for j = 1 : Wcomm 
rotate (A( ,,i), A(,,j)) 
end 
end 
fork=l: ~ 
w 
for i = 1 : Wcomm 
for j = ( k - 1 )w + 1 : kw 
rotate (A( ,,i), A(,,j)) 
end 
end 
end 
Figure 4.2: Main loop for the rectangular configuration without (left) and with (right) 
partitioning. 
partitioning is to: (1) further divide the blocks of matrix A into sub-blocks that fit within 
the cache and (2) reorganise the order of computation to minimise the cache misses. 
Realisation of point (1) is as follows. The right block AR is partitioned to smaller sub-
blocks of w = CsPy/M columns (here we can identify w with Wcache) and we rotate each 
column from AL with thew columns from each successive sub-block in AR· In Figure 4.2, 
on the left hand side, we show a piece of code that is used in the rectangular config-
uration without partitioning; on the right, the same part for the partitioning method. 
Also to successfully implement partitioning, we need to reorganise the way we rotate the 
columns of matrix A and V. In the rectangular processor configurations and the previous 
implementations, the rotation of two columns of matrix V ( orthonormal matrix where the 
plane rotations are accumulated) is performed straight after the rotation of two columns 
of matrix A. This causes the cache misses since the columns of A which are already in 
cache need to be replaced with the columns of V. In order to prevent it, we accumulate 
and store all the parameters that are needed for the rotation of columns of matrix V, i.e. 
the tangents and cosines of the rotation angles and the flag which tells if the particular 
two columns need to be rotated, and perform these rotations after all of the rotations 
from one sub-block are completed. 
Let us compare the cache reuse factor when partitioning is used, R~ache> and when it is 
not, 'Rcache· Let us assume that the number of columns in both blocks is Wcomm > C&/M, 
and let the kth sub-block of AR be denoted AR,k· The number of floating point operations 
performed on AR,k is WcommwM' (since we rotate Wcomm columns from block AL with w 
columns in AR,k). The number of cache misses occurring is 4M'wcomm + wM'. The 
first term is due to the loading and reloading of each column in AL into cache and its 
I 
,, 
I 
82 CHAPTER 4. OPTIMISATIONS FOR THE MEMORY HIERARCHY 
Matrix Size ( m x n) 128x128 256x256 512x512 2048x128 4096x128 6144x128 
Configuration 1 X 1 1 X 1 1 X 1 1 X 1 1 X 1 1 X 1 
P=l Cache rect.(%) 92.08 90.34 91.72 97.43 X X 
Cache part.(%) 99.78 94.61 93.59 98.90 X X 
Configuration lx4 lx4 2 x 2 4 x l 4 X 1 4 X 1 
P=4 Cache rect.(%) 97.74 90.34 93.81 94.79 93.41 93.27 
Cache part.(%) 97.87 96.96 97.10 99.66 98.10 98.76 
Configuration 2x8 2x8 2 x 8 8 x2 8x2 16 X 1 
P=16 Cache rect.(%) 95.37 96.67 93.95 99.93 98.29 X 
Cache part.(%) 95.09 97.08 98.84 99.68 96 .64 X 
Configuration 8x8 4 X 16 4 X 16 16 X 4 16 X 4 16 X 4 
P=64 Cache rect.(%) 90.61 93.68 98.13 94.87 99.94 99.00 
Cache part.(%) 90.65 93.48 98.82 94.93 99.15 99.02 
Table 4.4: Cache performance comparison between the rectangular configuration with and 
without partitioning. 
subsequent direct-mapped cache conflict with one column in AR,k, which can occur at 
most twice; and the second is due to the loading AR,k into cache. Hence, the cache reuse 
factor for partitioning will be R~ache ~ w/4. 
In the case when partitioning is not used (this is equivalent to the case w = 1), the 
number of cache misses is WcommM' + M' + (M'wcomm/Cs)M'. The first term is due to 
the loading of each column of AL into cache; the second is due to the loading of AR,k 
into cache; the third is due to M'wcomm/Cs columns in AL having a direct-mapped cache 
conflict with AR,k)· The cache reuse factor in this case is Rcache :S Wcomm/(2+wcomm) ~ 1 
which is w/4 times less than that for the partitioning method. 
4.3.2 Numerical results 
Let us compare the cache hit ratios for the rectangular configuration and the rectangular 
configuration where the partitioning method was used . The results of the cache hit ratios 
measurement are depicted in Table 4.4. We see that for the partitioning method, gener-
ally, the cache hit ratio is better or equal to the cache hit ratio measured for the plain 
rectangular configuration. Also, the results confirm that the cache reuse is proportional 
not to the data size but to the number of columns stored per processor. The relative gains 
of cache hit ratios for matrices with long columns are much smaller than those for the 
square matrices. The smaller the number of columns per processor is, the less significant 
the gains from the partitioning method become and they disappear for the cases where 
P=64. 
I 
; 
' 
I 
I 
' 
--~ 
4.3. OPTIMISATIONS ON CACHE LEVEL 83 
Matrix Size ( m x n) 128x128 256x256 512x512 2048x128 4096x128 6144x128 
,, Conf. 1 X 1 1 X 1 1 X 1 1 X 1 1 X l 1 X 1 
Time (s) 41.5 333 2754 375 830 1243 
P=l 
Impr. I (%) 12.8 14.8 14.9 6.3 3.6 1.8 
Impr. II (%) 55 .5 59.7 60.4 57.2 54.5 53.6 
Ii MFLOPS 2.3 2.5 2.6 2.5 2.1 2.1 
11 Sweeps 12 13 14 10 10 10 
Conf. 1 X 4 l x4 2x2 4xl 4xl 4xl 
Time (s) 11.1 81.4 669 87.3 172 257 
P=4 
Impr. I(%) 0.0 22.5 21.6 8.0 7.5 6.9 
Impr. II (%) 48.0 59.7 61.0 62.7 63.2 61.9 
MFLOPS 2.1 2.3 2.6 2.6 2.6 2.6 
Sweeps 11 12 13 10 10 10 
Conf. 2x8 2 x 8 2x8 8x2 8x2 16 X 1 
Time (s) 4.5 25.7 180 23 .3 44.2 64.1 
Impr. I(%) 0.0 -0.4 23.7 -0.9 0.2 5.6 P=16 Impr. II(%) 46.7 50.2 61.2 64.7 69.4 68.8 
I: MFLOPS 1.4 2.1 2.5 2.4 2.5 2.6 
Sweeps 11 12 13 10 10 10 
; Conf. 8 x 8 4 X 16 4 X 16 16 X 4 16 X 4 16 X 4 
' Time (s) 2.8 11.4 60.5 8.3 12.7 18.5 
i Impr. I (%) 0.0 0.0 -0.2 0.0 0.0 -0.5 P=64 Impr. II (%) 69.8 52.5 54.9 74.7 77.6 79.6 
MFLOPS 0.6 1.3 1.9 1.7 2.1 2.2 
Sweeps 11 13 13 10 9 10 
Table 4.5: The best results for the rectangular configuration with partitioning. 
In the Table 4.5, we present the overall results for the rectangular configuration used 
with partitioning. First of all, from the results obtained we see how important the good 
Ii design of the implementation is on the cache memory level. The significant reduction in 
the execution times of 10 - 20 % (Impr. I) is achieved for a relatively small improvement of 
the cache performance of 2 - 5 % 4. Although one may argue that the design of the parallel 
implementation on this level of memory hierarchy belongs rather to the technicalities of 
good code writing, often the data fl.ow forced in a general, parallel algorithm design is not 
11 
optimal from the cache performance point of view. Partitioning can be permanently used 
since we did not observe any significant performance degradations for the small number 
of columns per processor. Even for the cases where no sub-blocks are created ( data size 
per cell equal cache size or smaller), the separation of the rotation of columns of matrices 
A and Vis beneficial and the performance of the algorithm is improved. These cases are 
4 This is because the cost of loads from the cache is roughly three times less than the the cost of loads 
from the memory. 
: 
' 
1, 
84 CHAPTER 4. OPTIMISATIONS FOR THE MEMORY HIERARCHY 
on and below the diagonal in Table 4.5. 
The number of sweeps also was not affected and the convergence rate remained the 
same as for the plain rectangular processors configuration. That is of no surprise since 
partitioning preserves the order in which the columns are rotated. 
4.4 Optimisations on register level 
4.4.1 Four-column rotation 
The last two optimisations affect the cache and communication levels of the parallel mem-
ory hierarchy. At the register level, the two-column rotation, which is the most intensive 
floating-point operation, causes poor data reuse and in effect, degrades the overall algo-
rithm performance. Data reuse at the register level can be improved by a factor of 3, 
for the rotation angle calculation, and by a factor of 2, for the actual column rotation by 
rotating four columns together. 
Let us explain how the four-column rotation works. Instead of choosing one column 
from the left block and one from the right, we pick the two consecutive columns from the 
left and two from the right blocks. Thus, instead of rotating A(:,i) with A(:,j), we rotate 
in order the following columns: A(:,i) with A(:,j), A(:,i+l) with A(:,j+i), A(:,i) with A(:,j+l) 
and A(: ,i+l) with A(:,j)· 
At the beginning of a sweep, we compute the column inner products A explicitly but 
only for the following rotations: 
(4.1) 
where now O ~ k ~ 5N(~-1). Later, they are implicitly computed as for the two-column 
method. Then we compute the inner products between differing columns5 : 
5 Note that now .,,(,'•) = ~~le) lt,J - 11,4,, . 
11 
: 
l 
' 
I 
' 
' 
l 
I 
I 
' 
i 
I 
I 
1, 
' 
4.4. OPTIMISATIONS ON REGISTER LEVEL 85 
I 
A(k~ ·A(k)_ A(k~ ,A(k)_ ' I (k) (k) 
' 
'Yi,j 'Yi+l,j+l (:,i) (:,3) (:,i+l) (:,3+1) 
r= (k) 
'Yi,j+l 
(k) 
fi+l,j 
A (k~ ·A (k)_ (:,i) (:,J+l) A(k~ ,A(k)_ (:,i+l) (:,3) (4.2) 
(k) 
'Yi,i+l 
(k) 
'Yj,j+l 
A(k~ ,A(k)_ (:,i) (:,i+l) A (k)_ ·A (k)_ (:,3) (:,3+1) 
The additional dot products ,tt1 and ,J1+i are now required to guarantee the correct 
angle computation for the last two rotations A(: ,i) with A(:,j+l) and A(:,i+l) with A(:,j)· 
We have to take into account the fact that by this stage, each of the four columns took 
part in the rotation once, and this rotation requires ,tt and ,J1+i). 
To perform four rotations , we need to compute the rotation angles e~~:)' e~~L+l, 
e~~Jli and e~~L using for each pair Equation 1.58. The rotation angles are primed to 
signify that they differ from those obtained for the two-column rotation, due to the altered 
order of rotations introduced by the four-column method. 
The four-column rotation is expressed in the following equation: 
(4.3) 
where: 
I 
A'(k) ' ( :,i) 
A'(k) 
A'= 
(:,i+l) 
A'(k) 
( :,3) 
(4.4) 
A'(k~ (:,3+1) 
I 
' 1 
-Ti ,j+1Tj+1,i+1 -T·. T i,j+l i,3 
-Tj,iTi+l,i 1 -T.:+1,i - T i+l,i+l 
Q(k) -
4 -
-Ti+1,i+1Ti,i +1 T ·· Tj ,i+l 1 3,i 
(4.5) 
Tj+l,i T j+1,i+1 -T,:,jTj+l ,i 1 
and 
- = 
I 
... 
Ii 
86 CHAPTER 4. OPTIMISATIONS FOR THE MEMORY HIERARCHY 
Inner Product Comp. Four Column Rotation 
8/8 16/16 
'Rreg for Q4 12/4 16/8 
Table 4.6: Register reuse in Q2 and Q4 schemes 
4.4.2 Register reuse 
The main gain in performance in the four-column rotation, despite the introduction of 
more floating point operations ( computing the additional 'Ytt1 and 'YJ1+i), is better 
register reuse. Let us define the register reuse factor 'Rreg as a ratio of the number of 
floating point operations performed on a given element of A to the number of loads to 
floating point register of that element. 
There are two regions of intensive floating point operations in the SVD procedure. The 
first is where we compute -y(k) (cf. Equations 1.57 and 4.2), and the second is the rotation 
of the columns ( cf. Equations 1.55 and 4.3). We compute the register reuse factor for the 
case when we (1) have four columns to rotate in the classic two column rotation way, and 
(2) rotate all using four column rotation scheme. In Table 4.6, we show the register reuse 
factors for both schemes. 
The benefit of using four-column rotation is due to the fact that more data is held on 
the register level hierarchy and therefore, register management can be improved. All of 
the variables needed to rotate four columns can be stored in all 16 SPARC FPU registers 
and no additional register load and store operations are needed. 
Although four-column rotation increases the register reuse, it also introduces extra 
computations, the additional 'Ytt1 and 'YJ1+i · This fact may have degrading effect on 
performance in the cases where: (1) long columns are used (m >> n) or (2) large number 
of processors in Y direction is used ( extensive use of y _dsum function). We expect that 
the four column rotation will perform well until the times of computation these additional 
gammas will be smaller than the improvements gained from better register reuse. 
4.4.3 Effect on the convergence 
However, the introduction of scaled columns and four column rotation affects the nu-
merical properties of the algorithm. One might expect that the accumulated cosines 
' 
' 
I 
4.4. OPTIMISATIONS ON REGISTER LEVEL 87 
I : 
Matrix Size ( m x n) 128x128 256x256 512x512 2048xl28 4096x128 6144x128 
Conf. 1 X 1 1 X 1 1 X 1 1 X 1 1 X 1 1 X 1 I Time (s) 45.2 357 2948 404 884 1287 
I: Impr. I (%) 5.0 8.7 8.9 -1.0 -2.7 -1.7 
' P=l 
' 
Impr. II (%) 51.5 56.8 57.6 53.9 51.6 51.9 
MFLOPS 2.2 2.3 2.3 2.3 2.0 2.0 
Sweeps 12 13 15 lO 10 10 
1, 
Conf. l x 4 l x 4 2 x 2 4 x l 4xl 4xl 
Time (s) 11.1 90.4 703 95.6 189 277 
1, Impr. I (%) 0.0 13.9 17.6 -0.7 -1.6 -0.4 1, P=4 
Impr. II(%) 48.0 55.2 59 .1 59.2 59.5 58.9 
MFLOPS 2.1 2.0 2.2 2.4 2.4 2.4 
Sweeps 11 12 14 10 10 10 
Conf. 2 x 8 2 x 8 2 x 8 8 x 2 8x2 16 X 1 
Time (s) 4.6 25.7 224 24.7 45.7 69.1 
Impr. I (%) -2.2 
-0.4 5.1 -6.9 -3.2 -1.8 P=16 I Impr. II(%) 45.4 50.1 51.7 62.5 68.5 66.4 
MFLOPS 1.3 1.9 2.0 2.2 2.3 2.4 
' 
Sweeps 11 12 13 10 9 10 
Conf. 8 x 8 4 X 16 4 X 16 16 X 4 16 X 4 16 X 4 
Time (s) 3.0 11.7 63.8 9.0 14.4 20.1 
,,, Impr. I(%) -7.1 -2.6 -5.6 
-8.4 -13.4 -9.2 P=64 
Impr. II (%) 67.9 51.4 52.5 72.7 74.5 77.9 
MFLOPS 0.5 1.1 1.8 1.5 1.9 2.0 
Sweeps 11 12 13 10 10 10 
Table 4.7: The best results for the four-column implementation without partitioning. 
represented by the scaling facto rs Cf and Cj and the complex calculations of the rota-
tions angles in the four-column rotation method, for those columns which already took 
part in the implicit rotation , could potentially introduce the round-off errors, for example. 
Also in the four-column implementation, we depart from the Brent-Luk ordering, 
which is only preserved for block movement , and therefore we may expect some degrading 
effect on the convergence. 
4.4.4 Numerical results 
In this section, we present the results for the four-column implementation without parti-
tioning in Table 4. 7 and with partitioning in Table 4.8. Before we analyse the results, let 
us describe our findings. 
In our first implementation of the four-column rotation method, we rotated four 
columns together if all of them required rotation and we used two-column rotation only 
! 
: 
i 
88 CHAPTER 4. OPTIMISATIONS FOR THE MEMORY HIERARCHY 
I 
Matrix Size ( m x n) 128x128 256x256 512x512 2048x128 4096x128 6144x128 
I 
Conf. 1 X 1 1 X 1 1 X 1 1 X 1 1 X 1 1 X 1 I 
Time (s) 38.2 298 2427 391 877 1272 I 
Impr. I(%) 8.0 10.5 11.9 -4.3 -5 .7 -2.3 i 
P=l I lmpr. II (%) 59.1 64.0 65.1 55.4 51.9 52.5 I 
MFLOPS 2.4 2.6 2.8 2.3 2.0 2.0 I 
Sweeps 12 13 15 10 10 10 
Conf. lx4 lx4 2 x 2 4 x l 4 X 1 4 X 1 I 
Time (s) 11.0 75.3 680 90.9 181 268 
lmpr. I(%) 0.9 7 .5 -1.6 -4.1 -5.2 -4.3 
P= 4 lmpr. II (%) 48.4 62.7 60.3 61.2 61.3 60.3 
MFLOPS 2.1 2.4 2.5 2.5 2.4 2.4 
Sweeps 11 12 14 10 10 10 
Conf. 2x8 2 x 8 2 x 8 8 x 2 8 x 2 16 X 1 
Time (s) 4.6 25.8 181 24.8 44.8 66.8 
lmpr. I(%) -2.2 -0.4 -0.6 -6.4 -1.4 -4.2 
P= 16 lmpr. II (%) 45.4 50.0 61.1 62.4 69.0 67.5 
MFLOPS 1.2 1.9 2.4 2.2 2.3 2.4 
I Sweeps 11 12 13 10 9 10 
Conf. 8x8 4 X 16 4 X 16 16 X 4 16 X 4 16 X 4 
Time (s) 3.0 11.8 64.3 9.1 14.5 20.2 
Impr. I(%) -7.1 -3.5 -6.3 -9.6 -14.2 -9.2 
P=64 lmpr. II(%) 67.4 51.0 52.1 72.5 74.4 77.8 
MFLOPS 0.5 1.1 1.8 1.5 1.9 2.0 
Sweeps 11 12 13 10 10 10 
Table 4.8: The best results for the four-column implementation with partitioning. 
I' if one of the pa.ir of the columns did not need rot ation. This scheme a.ppea.red to us a.s 
the na.tura.l wa.y of implementing four column rotation . 
I During the test runs, we observed tha.t the four-column rotation wa.s faster in full 
sweeps (sweeps where a.ll of the columns need to be rotated) tha.n two columns implemen-
ta.tion (recta.ngula.r implementation), but wa.s negligibly faster or even slower a.t the later I 
sweeps. The reason for this was tha.t although there is a.n a.dva.nta.ge of using four-column 
rotation if a.ll four columns require rotation, the time needed for extra. gammas computa.-
tion prevailed in the cases where some of the four columns did not require rotation. 
The simple fix for this problem was to use switching, i.e. during the full sweeps use 
the four-column rotation, a.nd benefit from the better register reuse a.nd in the later sweep 
!l 
switch to two-column rotation (without the extra. ga.mma.s computations). The switching [I I 
proved tha.t the four-column rotation works but only in the cases for which in reality it I 
I 
was designed for. I I 
I 
I 
I 
' 
I 
I 
' 
,, 
,, 
,, 
i 
I 
i 
I 
I: 
4.5. CONCLUSIONS 89 
From Table 4.7, we see that the four-column rotation method works well for one 
processor and for the cases where the communication in the Y direction is small and in 
practice the number of processors in the Y direction does not exceed two. 
There are two areas where the four-column method clearly fails. These are for the long 
columns where m > > n and for all the cases where there is the Y communication involved 
over a large number of processors. In the first case, the time is consumed by the extra 
two gammas to compute and in the second, by the inefficiency of y_dsum. We need to 
mention that although the procedure which rotates four columns and that which computes 
six gammas are not yet fully optimised and they achieve approximately 3.1 MFLOPS. The 
rough estimate for the performance of those procedures is about 4 MFLOPS. 
The partioning (Table 4.8) improves the results but obviously can not change the 
above noted problems. Overall, we see that the four-column rotation method can be 
applied after some consideration is made on the size of the matrix and the processors 
configuration that this matrix is distributed over. 
For the cases where the four-column rotation worked well, the implementation with 
partitioning was 2-10 % better than for the corresponding implementation for the rectan-
gular configuration also used with partitioning. 
Generally, the convergence ratio was not affected except m three cases where the 
number of sweeps was increased by one. This convinced us that the Hestenes algorithm 
is very stable and presents a very good choice for the parallel implementation. 
4.5 Conclusions 
In this chapter, we presented three methods that improve the performance of the Hestenes 
SVD algorithm. All of these methods address the performance on one or more parallel 
memory hierarchy levels. 
The first method - rectangular processor configuration - utilises the fact that the 
amount of data reuse on the communication level is proportional to the number of columns 
stored per cell. Also the rectangular configuration fits better to the physical mesh like 
topology of the APlOOO communication network. 
The second method - partitioning - is used in conjunction with the rectangular method. 
This method improves the cache reuse by rearranging the way the columns are rotated 
and by working on the smaller sub-sets of data. 
I 
I 
I 
I 
... 
I 
90 CHAPTER 4. OPTIMISATIONS FOR THE MEMORY HIERARCHY 
Matrix Size ( m x n) 128x128 256x256 512x512 2048x128 4096x128 6144x128 
Method F-P F-P F-P R-P R-P R-P 
Conf. 1 X 1 1 X 1 1 X 1 1 X 1 1 X 1 1 X 1 
P=l Time I (s) 93.2 827 6958 877 1826 2677 
Time II (s) 38.2 298 2507 375 830 1243 
Impr. (%) 59.1 64.0 64.1 57.2 54.5 53.6 
Method F-P F-P R-P R-P R-P R-P 
Conf. l x 4 l x 4 2 x 2 4 x l 4 x l 4 X 1 
P=4 Time I (s) 21.3 201.9 1716 234.3 468 674 
Time II (s) 11.0 75.3 669 87.3 172 257 
Impr. (%) 48.4 62 .7 61.0 62.7 63.2 61.9 
Method R R R-P R R-P R-P 
Conf. 2x8 2 x 8 2 x 8 8 x 2 8 x 2 16 X 1 
P=16 Time I (s) 8.4 51.6 464 65.9 144.6 205.4 
Time II (s) 4.5 25.6 180 23.1 44.2 64.1 
Impr. (%) 46.7 50.3 61.2 64.9 69.4 68.8 
Method R R R R R R 
Conf. 8 x 8 4 X 16 4 X 16 16 X 4 16 X 4 16 X 4 
P=64 Time I (s) 9.3 24.0 134.3 32.9 56.6 91.0 
Time II (s) 2.8 11.4 60.4 8.3 12.7 18.4 
Impr. (%) 69.8 52.5 55.0 74.7 77.6 79.7 
Table 4.9: The overall best results for Hestenes SVD algorithm implementation. 
(R- rectangular configuration implementation, R-P - rectangular configuration 
implementation with partitioning, and F-P - four column implementation with 
partitioning.) 
The third method - four-column rotation - improves the performance by a better reuse 
of the registers. It can be used efficiently in the case where all of the four columns require 
rotation and, despite of the extra floating operations introduced, is faster than the two-
column rotation method. Unfortunately the performance degrades in the cases of later 
sweeps from the reasons described earlier. 
Since in this chapter we finalise all the performance improvements presented in this 
thesis, we present the overall, best results obtained with the indication of which method 
was used and the overall improvement achieved. All these results are depicted in Table 4.9. 
Time I denotes the execution time for the basic unimproved block implementation. Time 
II is the shortest of all the methods implemented. 
From the table, we see that the best results are divided into two groups . The first is 
for the square matrices distributed over a small number of processors in the Y direction 
(no more than two) where the four-column implementation with partitioning was used. 
The second is for all the other cases where mostly the rectangular implementation with 
I 
I: 
1: 
I' 
I: 
I! 
' 
i 
i 
4.5. CONCLUSIONS 91 
partitioning was used. The last row of the results, for P = 64, could also be filled by the 
rectangular method with partitioning. The differences between the results for these cases 
were negligible. 
The average improvement ratio is: for the case of P = l, Impr. = 58.75%, for the 
case of P = 4, Impr. = 59.98%, for P = 16, Impr. = 60.22% and for the case of P = 64, 
Impr. = 68.22%. The overall average improvement was 61.79% which corresponds to the 
average reduction in execution time by 2.6 times. In the extreme case, for P = 64 and 
matrix 128 x 6144, the reduction of execution time achieved was nearly 5 times. 
All of the methods presented did not adversely affect the convergence ratio. Although 
the number of sweeps was increased or decreased by one, this did not have measurable im-
pact on the performance. In the next chapter, we will address the issue of the convergence 
in a greater detail. 
I 
I 
I 
I 
I 
Chapter 5 
Convergence 
The major subject of this thesis is to show the steps that lead to the faster implementation 
of the Hestenes SVD algorithm. Although these steps were already shown in previous 
chapters, they do not represent the whole picture. As stated in Chapter 3, apart from the 
speed, we require from any numerical algorithm accuracy. In this chapter we address the 
issues of the accuracy and convergence of the SVD algorithm we implemented. 
We do not aim for a general and theoretical error analysis as given, for example, 
in the classic book by J .H. Wilkinson [ 4 7]. Also it is beyond our scope to analyse the 
order in which the columns are rotated in each of our implementations and then give the 
theoretically expected convergence rate as it is done, for example, for the cyclic Jacobi 
method in the paper of Forsythe and Henrici [15]. We rather will present the accuracy 
obtained and the convergence rate in a practical way by the results of the test runs 
performed on two major methods: rectangular and four-column rotation. 
We also felt, that since all the results presented in previous chapters were obtained 
for the matrices filed with the real random numbers generated with the same seed from 
the range of -10 to 10, it would be appropriate to present the results for other ranges and 
types of the random numbers. That will allow us to investigate how the results depend 
on the initial data distribution in matrix A. 
5.1 General error propagation issues 
The most thorough investigation of both subjects of the error propagation and convergence 
of Jacobi methods is given in the classic book by J.H. Wilkinson [47]. 
93 
I 
' 
I 
I 
I 
,. 
I r 
11 
Ii 
1, 
94 CHAPTERS. CONVERGENCE 
Precision Single Double APlOOO-double 
ibeta 2 2 2 
it 24 53 53 
posep -23 -52 -52 
eps 1.19 X 10-·r 2.22 X 10- lt> 2.22 X 10-16 
negep -24 -53 -53 
eps 5.96 X 10- is 1.11 X 10- H> 1.11 X 10-16 
irnd 5 5 2 
Table 5.1: Floating point parameters for the IEEE compliant computer and the APlOOO. 
Firstly, we need to know the limitations of the APlOOO's floating-point arithmetic 
unit. The great help in establishing this is the routine machar from "Numerical recipes 
in C" [24]. This small program, implemented originally by Cody [7] and Malcom [34], 
determines the useful floating point parameters experimentally. We will not give all of 
them but only the sub-set that we are interested in. Table 5.1 gives the floating point 
parameters for the typical IEEE-compliant machine and the APlOOO which turns out to 
conform to this standard. The meaning of the parameters is: 
ibeta 
it 
posep 
The radix in which numbers are represented . 
The number of base-ibeta digits in the floating-point mantissa. 
The exponent of the smallest power of ibeta that, added to 1.0, gives the 
result which differs from 1.0. 
eps The floating-point number ibeta posep known as the floating-point precision. 
irnd The number which gives the information on what kind of rounding is done in 
addition, and on how the underflow is handled. 
The value of irnd is 2 and this means that the rounding in addition complies with IEEE 
standard. 
Any arithmetic floating-point operation 6.1 performed on the computer is subject to 
a rounding error € [47, page 113] 
z = f1d(x6.y) = (x6.y)(l + €) . (5.1) 
1 Our ~ denotes any of the basic arithmetic operation i.e. plus, minus, multiply and divide. 
,, 
I 
l 
I 
l 
I 
l 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
Ii 
5.1. GENERAL ERROR PROPAGATION ISSUES 95 
In case of the APlOOO, for arithmetic operations on double precision floating point num-
bers, the epsilon is: 
(5.2) 
After repeatedly performing the operation 6, say r times, the bounds in Equation 5.2, 
after simplification,2 can be expressed by: 
(5.3) 
where 
(5.4) 
Now, we can present (without proof) the error bounds for the operations involved in the 
SVD algorithm: 
Inner product. [47, page 114] 
(5.5) 
Plane rotation. [47, pages 131-132] 
cos = cos + ocos sin = sin+ osin R = R + oR (5.6) 
where R is the exact rotation matrix. 
(5.7) 
Multiplication by a sequence of plane rotations. [47, pages 133-141] 
(5.8) 
where N is the number of rotations performed N = n{n2-l). 
All of the above estimates of the error bounds (Equations 5.5, 5.6, 5.7, and 5.8) assume 
that the errors in angle computations do not depend on the errors introduced by the inner 
product computations. They can be useful to estimate the error in one sweep only. 
In the case of the Hestenes algorithm we have a strong, dependency of the errors in 
one arithmetic operation on the others. In the basic Hestenes algorithm (See Figure 1.1), 
in order to orthogonalise two columns, we need to (1) compute three inner products, 
2 The simplification is explained in [47, page 113] . 
I 
' 
I 
I 
I 
I 
I 
I 
: 
;1 
'I 
I 
J 
I 
I 
I 
I 
,. 
11 
1, 
11 
96 CHAPTER 5. CONVERGENCE 
(2) compute sin and cos, and (3) rotate these columns. Operation (1) introduces errors 
to the operation (2) which introduces errors to the rotation (3) . Rotation (3) changes 
the coordinates of the rotated column-vectors and therefore introduces again errors into 
operation (1) . The accurate error estimate in this situation is very difficult. 
Below we will give a simple, upper bound error estimate for the Hestenes algorithm. 
It is worthwhile to notice that at each rotation, we check if two columns are orthogonal 
within the given tolerance. This tolerance is established by the threshold value. Thus, we 
can say that after the Hestenes algorithm completes, any two columns will be orthogonal 
with the error less than the threshold value. 
To sum all of the above facts to some sensible figure, after assuming that n = m and 
the numbers used in the matrix A are the random numbers from the range (-1, 1), we 
have the following error bound for the off (W) 3 for the Hestenes algorithm: 
b,. ~ N € II A IIE (5.9) 
where N is the number of column-pairs ( i , j) such that i =/:. j, and f is the chosen threshold 
value. For n = 100 and f = 10-12 , b,. ~ 10-6 . We need to remember that this is an 
absolute upper bound of the error of the orthogonalisation of the matrix A. 
We need to say that since we, in practice, use the threshold Jacobi method, the accu-
racy of the results will depend on the choice of the threshold value. From our experiments 
on the threshold value, we do not recommend to use the number which is close to eps 
because there will always be some pairs of columns ( different from sweep to sweep) which 
will not satisfy the threshold even for many sweeps. The threshold we chose was 10-12 . 
5.2 General convergence rate issues 
Although the threshold Jacobi method always converges (for a reasonably chosen thresh-
old) because of the threshold imposed, it is very interesting to answer the question of 
how fast the convergence is. The speed of convergence of the cyclic Jacobi method is well 
covered in the publications (46](23](25](15]. 
Hansen, in his paper (23], discussed the convergence properties associated with various 
orderings for the serial Jacobi method. Particularly, he investigated experimentally various 
orderings and the average number of sweeps associated with them needed for the algorithm 
sMatrix Wis the orthogonalised matrix A after algorithm converged. 
I 
I 
I 
I 
I 
' 
,, 
! 
i 
I 
I 
I 
I 
I 
I 
I 
5.3. NUMERICAL RESULTS 97 
to converge. Some of the orderings required smaller number of sweeps than the others to 
converge. For the same matrix size and multiple test runs, he found out that the minimum 
number of sweeps to satisfy the convergence for a particular 'good' ordering was 4.11 and, 
for the 'bad' one, 4.94. According to the results, he assigned to each ordering a preference 
factor which, somewhat unfortunately named, is higher for the ordering which requires 
more sweeps to converge. 
We may expect that for the larger matrices, the standard deviation of the number of 
sweeps for the algorithm to converge will be greater. This fact shows that for the serial 
Jacobi method, it is worthwhile to try to find an optimal ordering. 
Unfortunately, for the parallel implementations of the Jacobi algorithm, it is much 
harder to change the orderings. In many cases it is already quite difficult to find the 
ordering which will conform to the balanced load on a particular architecture. 
In this research, we would like to know how fast two of our implementations, rectangu-
lar and four-column rotation, converge. To do that, we will run test programs and observe 
the reduction in the off(A) (see Chapter 1) quantity at every sweep. We hope that our 
implementation can hold the quadratic convergence as for the cyclic Jacobi method, i.e.: 
(5.10) 
5.3 Numerical results 
In this section, we will show and discuss the results of the test runs which will reveal the 
dependence of the results on the initial data stored in matrix A. Also we will present the 
accuracy in orthogonalising matrix A. Although we do not compute the singular values 
explicitly, their accuracy is dependent on the accurate orthogonalisation of matrix A, since 
each singular value CTi is simply the norm of each column-vector (W(:,i )) of the transformed 
matrix A denoted in Chapter 1 as matrix W. From the test runs, we will also be able to 
establish the number of sweeps required for each implementation to converge. 
5.3.1 Initial data distribution 
The input matrix A used in the test runs m previous chapters was always generated 
with the same random double precision floating point numbers. Now we would like to 
determine the dependency of the results on the different sets of random input data. 
- - --
T 
I 
I 
I 
I 
I 
I 
I 
i 
I 
I 
I 
98 CHAPTER 5. CONVERGENCE 
Type Matrix Size Configuration Average Time Average Sweep Average off(A) 
1 X 1 330.1 ±4.8 12.5 ±0.5 9.23e-14 
1 X 10 37.1 ±0.6 11.9 ±0.4 9.90e-13 
200 X 200 10 X 1 74.2 ±1.9 12.6 ±0.5 9.23e-14 
2x5 37.4 ±1.0 11.6 ±0.5 4.31e-13 
Integer 1 X 1 696.6 ±16.4 9.9 ±0.4 2.38e-12 
1 X 10 83.7 ±2.7 9.8 ±0.5 8.65e-12 
2000 X 120 10 X 1 76.6 ±1.9 9.9 ±0.4 2.38e-12 
5x2 74.3 ±1.9 9.9 ±0.4 l.91e-11 
1 X 1 326.8 ±5.3 12.2 ±0.5 4.05e-13 
1 X 10 37.3 ±0.2 12.0 ±0.0 l.24e-12 
200 X 200 
10 X 1 73.7 ±2.0 12.5 ±0.5 4.08e-13 
2x5 37.7 ±0.9 11.8 ±0.5 6.07e-13 
Positive 1 X 1 700.0 ±4.1 10.0 ±0.0 7.93e-14 
1 X 10 82.6 ±3.4 9.5 ±0.5 3.82e-12 
2000 X 120 
10 X 1 76.9 ±0.5 10.0 ±0.0 l.34e-12 
2x5 73.5 ±1.9 9.8 ±0.5 3.54e-11 
1 X 1 329.6 ±6.8 12.4 ±0.5 5.24e-13 
1 X 10 36.8 ±0.8 11.6 ±0.5 4.13e-12 
200 X 200 10 X 1 73.4 ±2.2 12.4 ±0.5 5.35e-13 
2x5 37.3 ±1.1 11.5 ±0.5 2.95e-12 
Mixed 
1 X 1 695.7 ±15.8 9.9 ±0.4 2.73e-12 
1 X 10 80.1 ±2.1 9.1 ±0.4 4.lOe-13 
2000 X 120 10 X 1 77.0 ±0.5 10.0 ±0.0 2.74e-12 
2x5 72.9 ±2.4 9.6 ±0.5 5.29e-13 
Table 5.2: Test results for the basic block implementation. 
We chose three basic sets of random data: integers, positive real numbers and real 
numbers which were generated in the range 'R = (-10 . .. 10). We ran the test programs 
for four processors configurations: 1 X 1, 1 x 10, 10 X 1 and 2 x 5. The tested matrices 
were of size: 200 X 200 and 2000 X 120. In Tables 5.2, 5.3, and 5.4, we present the 
results obtained. Each figure in the tables is an average from the ten runs, each of which 
performed for a different random seed. Please note that the times shown for the four-
column implementation are not any better than those for the rectangular configuration. 
This is because we ran the four-column implementation without switching in order to see 
clearly the impact of the ordering we used in the four-column rotation which is different 
from that used in the rectangular implementation . 
Firstly, we see that the standard deviations of the execution time is small in all three 
tables. This means that if the random numbers are used to fill the input matrix A, it 
is irrelevant which sub-set of these is used. In other words, we did not bias the results 
5.3. NUMERICAL RESULTS 99 
Type Matrix Size Configuration Average Time Average Sweep Average off( A) 
1 X 1 151.4 ±2.2 12.5 ±0.5 l.63e-13 
200 X 200 
1 X 10 20.6 ±0.3 11.9 ±0.3 l.57e-12 
10 X 1 39.9 ±0.8 12.7 ±0.5 l.63e-13 
Integer 2 x 5 19.0 ±0.5 11.7 ± 0.5 8.lOe-14 
1 X 1 313.7 ±6.8 9.9 ±0.3 4.18e-12 
2000 X 120 
1 X 10 49.8 ±1.5 9.8 ±0.4 l.37e-11 
10 X 1 33.0 ±0.7 9.9 ±0.3 4.18e-12 
2 x 5 36.9 ±1.0 9.8 ±0.4 l.87e-11 
1 X 1 150.8 ±3.1 12.4 ±0.5 l.46e-13 
200 X 200 
1 X 10 20.7 ±0.0 12.0 ±0.0 7.87e-13 
10 X 1 39.7 ±1.0 12.6 ±0.5 l.52e-13 
Positive 
2x5 19.1 ±0.4 11.8 ±0.4 8.59e-13 
1 X 1 314.9 ±1.5 10.0 ±0.0 l.13e-13 
2000 X 120 
1 X 10 49.3 ±1.9 9.6 ±0.5 5.40e-12 
10 X 1 33.1 ±0.2 10.0 ±0.0 l.12e-13 
2x5 36.8 ±0.9 9.8 ±0.4 2.lle-11 
1 X 1 151.4 ±2.9 12.5 ±0.5 8.89e-14 
1 X 10 20.5 ±0.5 11.7 ±0.5 l.05e-12 
200 X 200 10 X 1 39.5 ±1.0 12.5 ±0.5 8.61e-14 
2 x 5 19.0 ±0.5 11.6 ±0.5 7.29e-13 
Mixed 1 X 1 313.5 ±6.9 9.9 ±0.3 l.18e-11 
1 X 10 47.3 ±1.2 9.1 ±0.3 l.75e-12 
2000 X 120 10 X 1 33 .2 ±0.3 10.0 ±0.0 l.18e-11 
2x5 36.1 ±1.3 9.5 ±0.5 6.93e-12 
Table 5.3: Test results for the rectangular implementation. 
presented in previous chapters by choosing a certain 'good' seed. Secondly, the algorithm 
is insensitive to the type of the numbers that are fetched into matrix A. For all the 
integers, positive reals and real numbers, the results are very similar. Thirdly, we also 
tested the sensitivity of the implementations to the different range, R = (-100 . .. 100) 
(the result tables not included), only to find that it did not influence the execution times 
for any of the implementations. 
We may expect some anomalies in real situations where the input matrix A represents 
real non-random data. It is probable that there will be certain configurations of columns 
that will contribute to the increased number of sweeps for convergence. In the worst case, 
we may expect that two or more columns will, in the later sweeps, flip-flop and never 
satisfy the threshold condition, particularly if this one is very low. This issue is very 
difficult to investigate and we only signal that the problems may occur. 
,... 
100 CHAPTER 5. CONVERGENCE 
I 
Type Matrix Size Configuration Average Time Average Sweep Average off(A) I 
1 X 1 150.6 ±2.4 12.7 ± 0.5 3.22e-13 
1 X 10 21.0 ±0.5 11.7 ± 0.5 4.04e-13 I 
200 X 200 10 X 1 41.8 ±0.8 12.8 ±0.4 3.21e-13 ! 
2x5 19.9 ±0.4 11.8 ±0.4 7.52e-13 i Integer 1 X 1 320.1 ±1.9 10.0 ±0.0 4.91e-12 
1 
1 X 10 51.8 ±1.6 9.7 ± 0.5 4.09e-12 i 
2000 X 120 10 X 1 34.9 ±0.2 10.0 ±0.0 4.9le-12 i 
2x5 39.7 ± 0.3 10.0 ±0.0 2.lle-12 I 
1 X 1 148.4 ±2.6 12.2 ± 0.4 3.84e-14 
I 
1 X 10 20.9 ±0.5 11.6 ±0.5 l.07e-13 
200 X 200 10 X 1 40.9 ±1.0 12.3 ±0.5 2.83e-14 
2x5 19.8 ±0.5 11.7 ±0.5 2.06e-13 
Positive 1 X 1 318.7 ±8.2 9.9 ±0.3 8.34e-12 
1 X 10 51.2 ±1.8 9.6 ±0.5 2.16e-11 
2000 X 120 
10 X 1 34.8 ±0.8 9.9 ±0.3 8.32e-12 
2x5 39.2 ±1.1 9.8 ±0.4 2.14e-11 
1 X 1 151.1 ±2.8 12.7 ±0.5 3.76e-13 I I 1 X 10 21.2 ±0.5 11.8 ±0.4 3.20e-13 I 200 X 200 10 X 1 41.9 ±0.8 12.8 ±0.4 3.84e-13 
2x5 19.7 ±0.6 11.6 ±0.5 1.7le-12 
Mixed lxl 320.5 ±0.9 10.0 ±0.0 3.91e-12 
1 X 10 51.3 ±1.9 9.6 ±0.5 9.lle-12 I 
2000 X 120 
10 X 1 34.9 ± 0.1 10.0 ± 0.0 3.91e-12 I , 
1, 
2 x 5 38.7 ±1.2 9.6 ±0.5 l.47e-12 
11 
Table 5.4: Test results for the four-column implementation. 
11 
In the issue of convergence, we would like to know if the changed order of the four-
column rotation causes any deterioration in convergence over the rectangular implemen-
tation. From the results, we see that the average number of sweeps required in the 
four-column implementation and in the rectangular implementation are the same, within 
the standard deviation. 
The interesting finding is that the number of sweeps vary for different processor config-
urations. The number of sweeps required for convergence has its minimum for the linear 
and the 2 X 5 configurations. These variations are of the order of two standard deviations 
which, from the point of view of statistics, is quite normal but this minimum occurs al-
ways at the above mentioned configurations. The possible explanation for this behaviour 
is that for these processor configurations, the orderings are better than for the others. 
In (4], Brent and Luk suggest that the number of sweeps for the Hestenes implemen-
tation can be represented as a slow growing function O(log(n)), where n is the number 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
i 
! 
I 
I 
. --
5.3. NUMERICAL RESULTS 101 
n ... S2 S3 S4 Ss s6 S1 Sa Sg S10 
8 1.le-03 5.5e-04 6.2e-07 1.5e-13 4.6e-14 <€ 
12 5.0e-03 9.6e-04 2.3e-05 3.7e-10 7.le-14 < € 
16 3.6e-03 3.8e-04 1.7e-04 l.3e-07 l.4e-12 6.6e-14 <€ 
20 8.5e-05 l.5e-03 l.4e-04 8.9e-08 l.le-13 l.6e-14 <€ 
24 7.8e-03 9.7e-04 l.7e-04 7.6e-05 6.9e-08 3.4e-14 3.3e-14 <€ 
28 l.5e-03 2.9e-05 2.le-04 5.5e-07 2.le-08 2.9e-13 3.3e-14 <€ 
32 l.5e-03 4.4e-04 l.8e-04 7.0e-06 2.6e-08 3.9e-15 9.4e-15 <€ 
36 2.3e-03 l.8e-03 2.le-04 2.7e-05 l.le-07 l.2e-12 2.6e-15 <€ 
40 7.4e-03 2.0e-03 7.4e-04 3.0e-06 4.2e-07 l.9e-10 5.2e-15 <€ 
44 1.5e-03 l.7e-04 4.8e-04 6.le-05 3.8e-06 5.2e-07 7.9e-11 3.8e-14 <€ 
48 6.6e-03 l.3e-04 l.le-04 7.0e-05 2.8e-06 2.2e-08 5.2e-15 7.8e-15 <€ 
52 2.8e-03 3.0e-03 3.le-04 l.7e-05 l.4e-06 6.0e-10 6.2e-14 3.4e-14 <€ 
56 2.8e-03 2.3e-03 2.2e-05 4.7e-05 l.6e-05 2.le-07 8.4e-11 3.5e-14 <€ 
60 4.0e-03 l.Oe-03 7.0e-04 8.6e-05 3.8e-06 4.6e-08 4.6e-13 6.0e-14 <€ 
64 9.3e-04 2.8e-05 2.7e-04 3.le-06 5.2e-06 3.9e-09 2.2e-13 3.0e-14 <€ 
Table 5.5: Off(A)/N after each sweep for the rectangular implementation. 
of columns. From many runs that we performed while doing this research for different 
number of columns, this suggestion holds for n > 40. 
5.3.2 Accuracy of the methods implemented 
We used the same test runs for the measurements of the accuracy of the orthogonalisation 
of matrix A. The accuracy is measured through computation of the value off(A) which 
simply is the sum of all inner products of column-vectors of A for i =/= j. This product is 
then normalised by the number of rotations Nin one sweep . Each of the tables presents 
the value off(A) for the threshold of 10- 12 . 
We see that the results for the four-column rotation implementation are not infre-
quently, of the order of magnitude, better than the results for the rectangular implemen-
tation. The possible explanation to this is that, with the same number of sweeps in both 
implementations, the four-column implementation converges faster than the two-column 
rotation due to the changed ordering. Using terminology from the paper by Hansen [23), 
the preference factor for the ordering used in the four-column rotation is lower than that 
used in the rectangular implementation. 
Also, the accuracy of the results for the range 100 of random numbers are approxi-
mately two orders of magnitude worse than the results for the range 10. This fact is in 
agreement with our error bound estimation where we have norm of matrix A. 
- - - - ---- -
. 
! 
i 
I 
I 
I 
I 
I 
I 
I 
I 
I 
'I 
i 
r 
I 
I 
I 
,. 
i 
102 CHAPTER 5. CONVERGENCE· 
n ... S2 S3 S4 Ss Ss S1 Ss Sg S10 
8 2.le-03 6.2e-04 3.0e-05 7.6e-11 4.6e-15 <€ 
12 2.8e-03 l.9e-04 7.le-07 l.4e-12 l.7e-14 <€ 
16 7.3e-03 7.9e-04 6.3e-05 3.4e-06 4.6e-11 l.6e-14 <€ 
20 l.2e-03 l.le-04 7.9e-05 2.3e-07 2.8e-11 l.6e-14 <€ 
24 2.4e-04 7.le-04 3.le-04 9.9e-06 3.8e-09 l.Oe-14 <€ 
28 3.6e-03 2.4e-04 l.8e-04 2.le-07 4.0e-10 l.7e-14 < € 
32 7.7e-04 4.0e-04 2.0e-04 4.2e-05 5.0e-07 2.8e-12 6.4e-14 <€ 
36 5.4e-03 l.5e-03 7.9e-04 3.4e-06 4.7e-08 2.7e-12 6.3e-14 <€ 
40 8.9e-04 3.9e-03 l.4e-04 l.6e-05 3.9e-06 l.5e-09 l.7e-13 l.7e-13 <€ 
44 7.5e-03 l.6e-03 8.7e-04 l.le-04 4.2e-06 2.2e-08 6.4e-13 9.6e-15 <€ 
48 2.9e-04 2.8e-04 5.4e-04 8.9e-06 3.2e-06 8.9e-10 4.6e-15 9.9e-15 <€ 
52 l.le-02 3.2e-04 6.4e-04 l.9e-05 l.9e-06 6.6e-09 l.6e-14 9.7e-15 <€ 
56 l.4e-03 2.6e-03 2.le-04 l.4e-05 3.5e-06 l.Oe-07 2.8e-11 l.2e-14 <€ 
60 4.0e-03 8.6e-04 3.5e-04 6.5e-05 4.5e-06 3.le-07 l.7e-11 4.le-14 < € 
64 l.5e-03 9.3e-04 6.9e-05 7.3e-06 l.le-07 2.9e-11 8.7e-14 8.7e-14 <€ 
Table 5.6: Off(A)/N after each sweep for the four-column implementation. 
5.4 Numerical results for convergence rate 
In this section we present results for the speed of convergence for both methods. The 
results presented in Table 5.5 and Table 5.6 were obtained by computing the Off(A) for 
each sweeps and then dividing it by the number of rotations performed in each sweep N. 
The initial data in the matrices were chosen from a uniform distribution in the interval 
(-1, 1). The threshold was 10- 15 . Once more, we can confirm that the four-column rota-
tion does not affect adversely the accuracy and convergence even if the switching between 
four- and two-column rotations is used. Both implementations converge quadratically. 
This fact was confirmed by doing square fit for each n. 
5.5 Conclusions 
In this chapter, we showed that both rectangular and four-column rotation methods do 
not adversely affect either the convergence ratio or the accuracy of the orthogonalisation 
of matrix A and in effect did not decrease the accuracy of the computed singular values. 
The test runs well conformed that fact . Both implementations are very stable mainly 
due to the nature of the Jacobi methods. There are two major facts which influence the 
accuracy. The first is the computation of the rotation angles, which are the functions of 
alphas, and the second is the rotations of the columns. We found out that the first factor 
I 
I 
I 
I 
I 
I 
I 
i 
I 
: 
I 
' 
i 
' 
i 
l 
I 
5.5. CONCLUSIONS 103 
has the ability to correct itself even if there are inaccuracies in computing particularly 
CXij· The inaccuracies in computing alphas will manifest itself in the increased number of 
sweeps and in the extreme case, no convergence at all. The fact that both implementations 
maintain the quadratic convergence also ensures that the improvements presented in this 
thesis are safe to use despite many changes in the data flow and the large reduction of 
floating point operations compared with the original block implementation. 
--
-- ~-
r 
I 
I 
' 
I 
I 
I 
: 
I 
I 
I 
I 
i 
I 
t 
I 
I 
I 
I 
,I 
i 
Chapter 6 
Conclusions 
The main aim of this research was to provide a thorough investigation of porting a singular 
value decomposition algorithm from a single- to a multi-processor environment. 
We believe that the conceptual scheme provided in this thesis is not only restricted 
to the singular value decomposition algorithms but can also be applied to many other 
numerical algorithms. The following key issues need to be resolved in order to make the 
move from a single- to a multi-processor environment the best possible way: 
• Choice of the algorithm. 
• Choice of the parallel architecture. 
• Data distribution and data flow. 
• Performance on each parallel memory hierarchy level. 
Let us summarize the findings of this research regarding each of these key issues. 
We showed, in Chapter 2, by using Amdahl's Law, that in the process of parallelisa-
tion, the overall flop count of a particular algorithm is not that important for the best 
performance. The main guide-line to follow for the best choice of the algorithm to be 
implemented is its parallelisability - the easiness of implementing it onto the parallel com-
puter. The Jacobi like algorithms, the Hestenes algorithm in particular, clearly showed 
its superiority in this regard over the QR based methods due to its inherent parallelism. 
We do acknowledge that the choice of the parallel architecture is not always available. 
If such a choice is possible, it should mirror the algorithm's data flow and its synchronous 
or asynchronous nature . 
105 
106 CHAPTER 6. CONCLUSIONS 
In Chapter 2, we also showed, how the data distribution of the systolic array imple-
mentation of the Hestens algorithm can be easily extended by the simple block approach. 
Thus, the hardware based constrains of the size of the input matrices could be removed. 
By using Brent-Luk ordering, we provided the neat data flow for all of the implementations 
presented. 
In Chapter 3, we gave two specific methods of improving the floating point performance 
of the Hestenes algorithm. Firstly, we reduced the number of floating point operations 
which, from the point of view of the algorithm, could be avoided. We employed two 
ideas: scaled columns which balanced the multiplication to addition ratio in the rotation 
procedures and reduced inner product computations which compute the inner products 
implicitly rather than explicitly. Secondly, we improved the execution code generated by 
the C compiler in order to efficiently use the SPARC's FPU pipelining and its ability to 
execute independently the FADDD and FMULD instructions1 . The execution time was 
improved, at its best, by 54%. We compared the improved versions of the algorithm with 
the basic block implementation. 
We observed that all the implementations which accommodate the improvement of 
the floating point performance did not adversely affect the convergence ratio of the algo-
rithm and the number of sweeps needed for the algorithm to converge generally remained 
unchanged and in few cases was even smaller. 
We need to stress that the improvements of the floating point performance do not 
utilise any of the properties of the parallel environment and can be applied to the single-
processor version of the Hestenes algorithm as well. 
In Chapter 4, we realised the last of the key issues presented earlier - the analysis 
of the algorithm performance on each of the parallel memory hierarchy level. In the 
parallel-processors environment, we distinguish three memory levels: other processors 
level (communication), the cache level, and the register level. Let us describe the results 
of these optimisations. 
-
The first method - rectangular processor configuration utilises - the fact that the 
amount of data reuse on the communication level is proportional to the number of columns 
stored per cell. Also the rectangular configuration fits better the physical mesh like 
topology of the APlOOO communication network . The largest relative execution time 
/ 
1 These principles can be applied to other scalar FPUs, and to a lower extent, vector FPUs. 
107 
improvement achieved was 80% for the matrices with long column-vectors. The best 
configurations for the square matrices corresponded to the square configurations of the 
processors. 
The second method - partitioning - is used in conjunction with the rectangular method. 
This method improves the cache reuse by rearranging the way the columns are rotated 
and by working on the small~r sub-sets of the data. For the matrices with a large number 
of columns per processor, the execution times improved by further 10-20%, relative to 
the results for the rectangular configuration where the partitioning was not used. 
The third method - four-column rotation - improves the performance by a better 
reuse of the registers. It can be used efficiently in the cases where all of the four columns 
require -rotation and, despite extra floating operations introduced, is faster than two-
column rotation . For the cases of large number of columns stored per processor, this 
method gave 2-10% improvement in the execution time over the two column rotation 
used in the partitioned, rectangular implementation. We need to make a remark that the 
code of the four-column rotation can still be improved and has the potential to achieve 
approximately 4 MFLOPS. At present its pick floating performance is approximately 3.1 
MFLOPS. 
The average improvement for the Hestenes algorithm achieved using all of the optimi-
sations presented in this thesis is about 60%, relative to the basic block implementation. 
All of the optimisations presented did not affected adversely the convergence ratio. 
Although the number of sweeps was increased or decreased by one, this did not have a 
measurable impact on the performance. This was also confirmed by the test runs for the 
four-column implementation which uses other ordering than the two-column methods. 
Despite many numerical changes introduced, we retained the main property of the 
cyclic Jacobi methods, i.e. the quadratic convergence. For the chosen threshold of 10-12 , 
the value of otf(A) was in the range of (10-12-10-14) for the different implementations 
and different data-sets used as input . 
We hope that the research presented in this thesis can be helpful for all those who 
wish to implement efficiently any numerical algorithms on the parallel environment. Many 
of the methods and the techniques presented herein can be successfully used on other 
platforms and in particular, on the MIMD based architectures. 
108 CHAPTER 6. CONCLUSIONS 
Future research 
In the near future, we would like to continue the work on the more efficient implementation 
of the four-column method. 
Also during this research, a new direction of the possible, further improvements 
emerged. At each sweep, we traced the behaviour of the cos(8i) for each pair of the 
column-segments. This technique gave us a global history of the process of the orthog-
onalisation of matrix A. We gathered all the plots for each sweep and displayed them 
in a form of the move. We observed for many different input matrices very similar dy-
namic patterns of convergence. We also noticed the group of columns which converged 
faster than the others on the local scale. This gave us an idea to trace, in the future 
implementations, the pairs of columns which lag behind the other and therefore delay the 
convergence on the global scale. We may use the processors farming method to do extra 
rotations of those columns. 
Another possible direction for the future research is to extend the ideas used in this 
thesis for the eigenvalues computations where the rotations need to be performed on both 
the columns and the rows of the input matrix A. 
Bibliography 
[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, and J . Dongarra. LAPACk User's Guide. 
SIAM, Philadelphia, 1992. 
[2] H.C. Andrews and C.L. Patterson. Singular value decompositions and digital image 
processing. IEEE Transactions on Acoustics, Speech and Signal Processing, 24:26-53, 
1976. 
[3] R.P. Brent, A. Czezowski, M. Hegland, P.E. Strazdins, and B.B. Zhou. Linear algebra 
research on the APlOOO. In M. Ishii, editor, Proceedings of the Second Parallel 
Computing Workshop, pages Pl-L. Fujitsu Laboratories LTD, Nov. 1993. 
[4] R.P. Brent and F .T . Luk. The solution of singular-value and eigenvalue problems on 
systolic arrays. In Proceedings of the Centre for Mathematical Analysis, volume 6, 
pages 38-64. Australian National University, 1984. 
[5] R.P. Brent and P.E. Strazdins. Implementation of the BLAS level 3 and LINPACK 
benchmark on the APlOOO. Fujisu Scientific & Technical Journal, 29:61-70, 1993. 
[6] Edited by E.F . Deprettere. SVD and Signal Processing Algorithms, Applications and 
Architectures. Elsevier Science Publishers B.V., North-Holland, 1988. 
[7] E.J. Cody. Algorithm 665 machar: A subroutine to dynamically determine machine 
parameters. acm-cs, 14:303-311, 1988. 
[8] A. Czezowski and P.Strazdins. Optimisations for the Memory Hierarchy of a Singular 
Value Decomposition Algorithm implemented on the MIMD Architecture. In Lecture 
Notes in Computer Science, pages 83-88. Springer Verlag, 1994. 
109 
110 BIBLIOGRAPHY 
[9] A. Czezowski and P. Strazdins. Singular Value Computations on the APlOOO. In 
R.P. Brent, editor, Proceedings of the Second Fujitsu-ANU CAP Workshop, pages 
Jl-J12. Australian National University, Nov. 1991. 
[10] A. Czezowski and P. Strazdins. Ways of enhancing performance of a SVDA on the 
APlOOO array multiprocessor. In J. Hulskamp and D. Jones, editors, Proceedings of 
the 5th Australian Transputer and OCCAM Users Group Conference, pages 83-88. 
Royal Melbourne Institute of Technology, Nov. 1993. 
[11] A. Czezowski and P. Strazdins. Optimisations for the Memory Hierarchy of a Singular 
Value Decomposition Algorithm implemented on the MIMD Architecture. Technical 
Report TR-CS-94-03, Australian National University, 1994. 
[12] J. Demmel and W. Kahan. Accurate singular values of bidiagonal matrices. SIAM 
Journal of Scientific and Statistical Computing, 11:873-912, 1990. 
[13] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. 
Psychometrika, 1:211-218, 1936. 
[14] K. Fan and A. Hoffman. Some metric inequalities in the space of matrices. Proc. 
Amer. Math. Soc., 6:111-116, 1955. 
[15] G.E. Forsythe and P. Henrici. The cyclic jacobi method for computing the principal 
values of a complex matrix. Trans. Amer. Math. Soc., 94:1-23, 1960. 
[16] J. Francis. The QR transformation. A unitary analogue to the LR transformation. 
Journal of Computations?, 4:265-271, 1962. 
[17] W.M. Gentleman. Least squares computations by Givens transformations without 
square roots. J. Inst. Math. Appl., 12, 1973. 
[18] G. H. Golub. Least squares, singular values and matrix approximations. Aplikace 
Matematiky, 12:44-51, 1968. 
[19] G. H. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a 
matrix. SIAM Journal on Numerical Analysis, 2:205-224, 1965. 
BIBLIOGRAPHY 111 
[20] G. H. Golub and C. Reinsch. Singular value decomposition and least squares solu-
tions. J.H Wilkinson and C. Reinish Handbook for Automatic Computation, 2:134-
151, 1971. 
[21] G.H. Golub and C .F . Van Loan. Matrix Computation. The Johns Hopkins University 
Press, Baltimore and London, 1989. 
[22] B. Green. The orthogonal approximation of an oblique structure in factor analysis. 
Psychometrika, 17:429-440, 1952. 
[23] E.R. Hansen. On cyclic jacobi methods. J. Soc. Indust. Appl. Math., 11:448-459, 
1963. 
[24] Warren J. Hehre, Leo Radom, Paul v.R. Schleyer, and John A. Pople. Numerical 
recipes in C: the art of scientific computing. John Wiley and Sons, New York, 1986. 
[25] P. Henrici. On the speed of convergence of cyclic and quasicyclic jacobi methods for 
computing eigenvalues of hermitian matrices. J. Soc. Indust. Appl. Math., 6:144-162, 
1958. 
[26] M.R. Hestenes. Inversion of matrices by biorthogonalization and related results. J. 
Soc. Indust. Appl. Math., 6:51-90, 1958. 
[27] H. Ishihata, T . Horie, et al. An architecture of highly parallel computer APlOOO. 
In IEEE Pacific Rim Conj. on Communications, Computers and Signal Processing, 
pages 13- 16. IEEE, 1991. 
[28] C.G.J . Jacobi. Uber ein leichtes Verfahren die in der Theorie der Sacularstorungen 
vorkommenden Gleichungen numerisch aufzulosen. Journal fur die reine and ange-
wandte Mathematik, 94:30-51, 1846. 
[29] E. G. Kobetliantz. Solution of linear equations by diagonalization of coefficients 
matrix. Quart. Appl. Math., 13:123-132, 1955. 
[30] J. Komorowski. Od Liczb Zespolonych do Tensorow, Spinorow, Algebr Liego i 
Kwadryk. Polskie Wydawnictwo Naukowe, Warszawa, 1978. 
[31] V. N. Kublanovskaja. Some algorithms for the solution of the complete problem of 
eigenvalues. V. Vycisl. Mat. i Mat. Fiz., 1:555-570, 1961. 
112 BIBLIOGRAPHY 
[32] H.T. Kung and C.E. Leioerson. Systolic arrays (for vlsi) . SIAM, pages 256-282, 1979. 
[33] F.T. Luk and H. Park. On parallel jacobi orderings. SIAM Journal of Scientific and 
Statistical Computing, 1988. 
[34] M.A. Malcolm. Algorithms to reveal properties of floating-point arithmetic. acm-tms, 
15:949-951, 1972. 
[35] C. Moler and D. Morrison. Singular value analysis of cryptograms. The American 
Mathematical Monthly, 90:78-87, 1983. 
[36] R. Penrose. A generalized inverse for matrices. Proc. Cambridge Philos. Soc., 51:406-
413, 1955. 
[37] C. Romine and K. Sigmon. Reducing inner product computation in the parallel one 
sided Jacobi algorithm. In Proceedings of the Fifth Distributed Memory Conference. 
IEEE Press, 1990. 
[38] H. Rutishauser. The Jacobi method for real symmetric matrices. J.H Wilkinson and 
C. Reinish Handbook for Automatic Computation, 2:202-211, 1971. 
[39] A.H. Sameh and D.J. Kuck. A Parallel QR Algorithm for Symmetric Tridiagonal 
Matrices. IEEE Transactions on Computers, C-26:147- 153, 1977. 
[40] T. Shimizu, T. Rorie, et al. Performance Evaluation of the APlOOO. Fujisu Scientific 
& Technical Journal, 29:15-24, 1993. 
[41] J.M. Speiser and H.J. Whitehouse. Architectures for real-time matrix operations. In 
Proc. 1980 Government Microcircuits Applications Conj. Houston,Texas, 1980. 
[42] G. W. Stewart. On the early history of the singular value decomposition. Technical 
Report TR-2855, University of Maryland, 1992. 
[43] P.E. Strazdins. Control Structures for Mesh-connected Networks. PhD thesis, The 
Australian National University, 1990. 
[44] P.E. Strazdins and R.P. Brent. The implementation of BLAS level 3 on the APlOOO: 
Preliminary Report. In Proceedings of the Second Fujitsu-ANU CAP Workshop, 
volume 1, pages Hl- H17. ANU, 1991. 
BIBLIOGRAPHY 113 
[45] J .H. Wilkinson . Calculations of the eigenvalues of a symmetric tridiagonal matrix by 
the method of bisection. Numerische Mathematik, 4:362-367, 1962. 
[46] J .H. Wilkinson. Note on the quadratic convergence of the cyclic jacobi process. 
Numerische Mathematik, 4:296-300, 1962. 
[47] J .H. Wilkinson. The Algebraic Eigenvalue Problem. Clarendoc Press - Oxford, Ox-
ford, 1965. 
