TILED ALGORITHMS FOR MATRIX COMPUTATIONS ON MULTICORE
ARCHITECTURES
by
Henricus M Bouwmeester
B.S., Colorado Mesa University, 1998
M.S., Colorado State University, 2000
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Applied Mathematics
2012
arXiv:1303.3182v1 [cs.NA] 13 Mar 2013
This thesis for the Doctor of Philosophy degree by
Henricus M Bouwmeester
has been approved
by
Julien Langou, Advisor
Lynn Bennethum, Chair
Gita Alaghband
Elizabeth R. Jessup
Stephen Billups
October 26, 2012
Bouwmeester, Henricus M (Ph.D., Applied Mathematics)
Tiled Algorithms for Matrix Computations on Multicore Architectures
Thesis directed by Associate Professor Julien Langou
ABSTRACT
Current computer architectures have moved towards the multi/many-core structure.
However, the algorithms in the current sequential dense numerical linear algebra li-
braries (e.g. LAPACK) do not parallelize well on multi/many-core architectures. A
new family of algorithms, the tile algorithms, has recently been introduced to circum-
vent this problem. Previous research has shown that it is possible to write efficient
and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU
factorization, and a QR factorization. The goal of this thesis is to study tiled al-
gorithms in a multi/many-core setting and to provide new algorithms that exploit
the current architecture to improve performance relative to current state-of-the-art
libraries while maintaining the stability and robustness of these libraries.
In Chapter 2, we confront the problem of computing the inverse of a symmetric
positive definite matrix with tiled algorithms. We observe that, using a dynamic
task scheduler, it is relatively painless to translate existing LAPACK code to ob-
tain a ready-to-be-executed tile algorithm. However, we demonstrate that nontrivial
compiler techniques (array renaming, loop reversal and pipelining) need to be ap-
plied to further increase the parallelism of our application, both theoretically and
experimentally.
Chapter 3 revisits existing algorithms for the QR factorization of rectangular
matrices composed of p× q tiles, where p ≥ q, for an unlimited number of processors.
Within this framework, we study the critical paths and performance of algorithms
such as Sameh-Kuck, Fibonacci, Greedy, and those found within PLASMA. We
also provide a monotonically increasing function to transform the elimination list of a
coarse-grain algorithm into one for a tiled algorithm. Although the optimality of the coarse-grain
Greedy algorithm does not carry over to the tiled algorithms, we propose a new
algorithm and show that it is optimal in the tiled context.
In Chapters 2 and 3, our context includes an unbounded number of processors.
The exercise was to find algorithmic variants with short critical paths. Since the
number of resources is unlimited, any task is executed as soon as all its dependencies
are satisfied. In Chapters 4 and 5, we set ourselves in the more realistic context
of a bounded number of processors. In this context, at a given time, the number
of ready-to-go tasks can exceed the number of available resources, and therefore a
schedule, which prescribes which tasks to execute at which time, needs to be defined. For
the Cholesky factorization, we study standard schedules and find that the critical
path schedule is best. We also derive a lower bound on the time to solution of the
optimal schedule. We conclude that the critical path schedule is nearly optimal for
our study. For the QR factorization problem, we study the problem of optimizing
the reduction trees (therefore the algorithm) and the schedule simultaneously. This
is considerably harder than for the Cholesky factorization, where the algorithm is fixed
and we are concerned only with schedules. We provide
a lower bound for the time to solution for any tiled QR algorithm and any schedule.
We also show that, in some cases, the optimal algorithm for an unbounded number of
processors (found in Chapter 3) cannot be scheduled to solve optimally the combined
problem. We compare our algorithms and schedules with our lower bound.
Finally, in Chapter 6 we study a recursive tiled algorithm for matrix multiplication
based on the Strassen-Winograd algorithm, using a dynamic task scheduler.
Whereas most implementations obtain either one or two levels of recursion, our
implementation supports any level of recursion.
The form and content of this abstract are approved. I recommend its publication.
Approved: Julien Langou
TABLE OF CONTENTS
Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2. Cholesky Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Tile in-place matrix inversion . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Algorithmic study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . 15
3. QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1 The QR factorization algorithm . . . . . . . . . . . . . . . . . . . . 20
3.1.1 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.2 Elimination lists . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.3 Execution schemes . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Critical paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Coarse-grain algorithms . . . . . . . . . . . . . . . . . . . . . 30
3.2.1.1 Sameh-Kuck algorithm . . . . . . . . . . . . . . . . . 30
3.2.1.2 Fibonacci algorithm . . . . . . . . . . . . . . . . . . . 31
3.2.1.3 Greedy algorithm . . . . . . . . . . . . . . . . . . . . 31
3.2.2 Tiled algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4. Scheduling of Cholesky Factorization . . . . . . . . . . . . . . . . . . . . . 67
4.1 ALAP Derived Performance Bound . . . . . . . . . . . . . . . . . . 68
4.2 Critical Path Scheduling . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 Scheduling with synchronizations . . . . . . . . . . . . . . . . . . . . 71
4.4 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 Toward an αopt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.7 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . 76
5. Scheduling of QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1 Performance Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Elimination Tree Scheduling . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6. Strassen Matrix-Matrix Multiplication . . . . . . . . . . . . . . . . . . . . 84
6.1 Strassen-Winograd Algorithm . . . . . . . . . . . . . . . . . . . . . 85
6.2 Tiled Strassen-Winograd Algorithm . . . . . . . . . . . . . . . . . . 86
6.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Appendix
A. Integer Programming Formulation of Tiled QR . . . . . . . . . . . . . . . . 99
A.1 IP Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
A.1.1 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
A.1.2 Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
A.1.2.1 Precedence constraints . . . . . . . . . . . . . . . . . . 105
A.1.3 Objective function . . . . . . . . . . . . . . . . . . . . . . . . 107
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
FIGURES
Figure
1.1 Three variants for the Cholesky decomposition . . . . . . . . . . . . . . . 3
1.2 Data layout of tiled matrix . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Three variants of the Cholesky decomposition applied to a tiled matrix of
4× 4 tiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 DAGs of Step 3 of the Tile Cholesky Inversion (t = 4). . . . . . . . . . . 12
2.2 Scalability of Algorithm 2.1 (in place) and its out-of-place variant intro-
duced in § 2.2, using our dynamic scheduler against vecLib, ScaLAPACK
and LAPACK libraries. . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Impact of loop reversal on performance. . . . . . . . . . . . . . . . . . . 15
3.1 Icon representations of the kernels . . . . . . . . . . . . . . . . . . . . . 42
3.2 Critical Path length for the weighted FlatTree on a matrix of 4× 4 tiles. 43
3.3 Illustration of first and second parts of the proof of Theorem 3.14 using
the Fibonacci algorithm on a matrix of 15× 2 tiles. . . . . . . . . . . . 49
3.4 Greedy versus GrASAP on matrix of 15× 2 tiles. . . . . . . . . . . . 50
3.5 Tiled matrices of p × q where the critical path length of GrASAP is
shorter than that of Greedy for 1 ≤ p ≤ 100 and 1 ≤ q ≤ p. . . . . . . . 50
3.6 Upper bound and experimental performance of QR factorization - TT
kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7 Overhead in terms of critical path length and time with respect to
Greedy (Greedy = 1) . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.8 Overhead in terms of critical path length and time with respect to
Greedy (Greedy = 1) . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.9 Kernel performance for double complex precision . . . . . . . . . . . . . 56
3.10 Kernel performance for double precision . . . . . . . . . . . . . . . . . . 57
3.11 Upper bound and experimental performance of QR factorization - All kernels 58
3.12 Overhead in terms of critical path length and time with respect to
Greedy (Greedy = 1) . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.13 Overhead in terms of critical path length and time with respect to
Greedy (Greedy = 1) . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1 ALAP execution for 5× 5 tiles . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Example derivation of task priorities via the Backflow algorithm . . . . . 71
4.3 Theoretical results for matrix of 40× 40 tiles. . . . . . . . . . . . . . . . 74
4.4 Values of α for matrices of t× t tiles where 3 ≤ t ≤ 40 . . . . . . . . . . 75
4.5 Asymptotic efficiency versus α = p/n for LU decomposition and versus
α = p/t2 for Tiled Cholesky factorization. . . . . . . . . . . . . . . . . . 76
5.1 Scheduling comparisons for each of the algorithms versus the ALAP De-
rived bounds on a matrix of 20× 6 tiles. . . . . . . . . . . . . . . . . . . 79
5.2 Tail-end execution using ALAP on unbounded resources for GrASAP
and Fibonacci on a matrix of 15× 4 tiles. . . . . . . . . . . . . . . . . 80
5.3 ALAP Derived bound comparison for all algorithms for a matrix of 15×4
tiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Comparison of speedup for CP Method on GrASAP, ALAP Derived
bound from GrASAP, and optimal schedules for a matrix of 5 × 5 tiles
on 1 to 14 processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.1 Task graph for the Strassen-Winograd Algorithm. Execution time pro-
gresses from left to right. Large ovals depict multiplication and small
ovals addition/subtraction. . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2 Strassen-Winograd DAG for matrix of 4×4 tiles with one recursion. Exe-
cution time progresses from left to right. Large ovals depict multiplication
and small ovals addition/subtraction. . . . . . . . . . . . . . . . . . . . . 88
6.3 Required extra memory allocation for temporary storage for varying re-
cursion levels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.4 Comparison of tuning parameters nb and r. . . . . . . . . . . . . . . . . 94
6.5 Scalability and Efficiency comparisons on 48 threads with matrices of 64×
64 tiles and nb = 200. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.6 Scalabilty and efficiency comparisons executed on 12 threads with matrices
of 64× 64 tiles and nb = 200. . . . . . . . . . . . . . . . . . . . . . . . . 96
TABLES
Table
2.1 Length of the critical path as a function of the number of tiles t. . . . . . 13
3.1 Kernels for tiled QR. The unit of time is n_b^3/3, where n_b is the blocksize. . . 23
3.2 Time-steps for coarse-grain algorithms. . . . . . . . . . . . . . . . . . . . 32
3.3 Time-steps for tiled algorithms. . . . . . . . . . . . . . . . . . . . . . . . 63
3.4 Neither Greedy nor Asap are optimal. . . . . . . . . . . . . . . . . . . 63
3.5 Three schemes applied to a column whose update kernel weight is not an
integer multiple of the elimination kernel weight. . . . . . . . . . . . . . 64
3.6 Greedy versus PT TT and Fibonacci (Theoretical) . . . . . . . . . . . . 65
3.7 Greedy versus PT TT (Experimental Double) . . . . . . . . . . . . . . . 66
3.8 Greedy versus PT TT (Experimental Double Complex) . . . . . . . . . . 66
3.9 Greedy versus Fibonacci (Experimental Double) . . . . . . . . . . . . . . 66
3.10 Greedy versus Fibonacci (Experimental Double Complex) . . . . . . . . 66
4.1 Task Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Upper bound on speedup and efficiency for 5× 5 tiles . . . . . . . . . . . 69
5.1 Schedule lengths for matrix of 5× 5 tiles . . . . . . . . . . . . . . . . . . 82
6.1 Strassen-Winograd Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2 Recursion levels which minimize the number of tasks for a tiled matrix of
size p× p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.3 128× 128 tiles of size nb = 200 . . . . . . . . . . . . . . . . . . . . . . . 90
6.4 Comparison of the total number of tasks and critical path length for matrix
of p× p tiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
1. Introduction
The High-Performance Computing (HPC) landscape is trending more towards a
multi/many-core architecture [8] as is evidenced by the recent projects of major chip
manufacturers and reports of surveys conducted by consulting companies [52]. The
computational algorithms for dense linear algebra need to be re-examined to make
better use of these architectures and provide higher levels of efficiency. Some of these
algorithms may have a straightforward translation from the current state-of-the-art
libraries while others require much more thought and effort to gain performance in-
creases. In this thesis, we endeavor to achieve algorithms that exploit the architecture
to improve performance relative to current state-of-the-art computational libraries
while maintaining the stability and robustness of these libraries.
Our research will make use of the BLAS (Basic Linear Algebra Subprograms) [15]
and the LAPACK (Linear Algebra PACKage) [7] libraries. The BLAS are a stan-
dard to perform basic linear algebra operations involving vectors and matrices while
LAPACK performs the more complicated and higher level linear algebra operations.
The BLAS divide numerical linear algebra operations into three distinct group-
ings based upon the amount of input data and the computational cost. Operations
involving only vectors are considered Level 1; those involving both vectors and ma-
trices are Level 2; and those involving only matrices are Level 3. For matrices of size
n × n, the Level 3 operations are most coveted since they use O(n^2) data for O(n^3)
computations and inherently reduce the amount of memory traffic. Since the BLAS
provide fundamental linear algebra operations, hardware and software vendors such
as Intel, AMD, and IBM provide optimized BLAS libraries for a variety of archi-
tectures. The BLAS library can be multithreaded to make use of multi/many-core
architectures and most of the vendor supplied libraries are multithreaded.
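To make the flops-per-data argument concrete, the back-of-the-envelope Python sketch below (an illustration only; the operation names are the standard BLAS ones, everything else in it is hypothetical) compares the flop count and data volume of one representative operation per level. Only the Level 3 product performs a number of operations per matrix element that grows with n.

# Flop count versus data volume for one representative operation per BLAS level.
# Only Level 3 has a flop-to-data ratio that grows with the problem size n.
def blas_ratios(n):
    levels = {
        "Level 1 (AXPY: y <- a*x + y)": (2 * n,      2 * n),
        "Level 2 (GEMV: y <- A*x + y)": (2 * n * n,  n * n + 2 * n),
        "Level 3 (GEMM: C <- A*B + C)": (2 * n ** 3, 3 * n * n),
    }
    for name, (flops, data) in levels.items():
        print(f"{name}: {flops / data:.1f} flops per data element")

blas_ratios(1000)   # Level 3 reuses each element O(n) times; Levels 1 and 2 do not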
The LAPACK library is a collection of subroutines for solving most of the common
problems in numerical linear algebra and was developed to make use of Level 3 BLAS
operations as much as possible. Algorithms within LAPACK are written to make use
of panels which can be either a block of columns or a block of rows so that updates
are performed using matrix multiplications instead of vector operations. LAPACK
can make use of a multithreaded BLAS to exploit multi/many-core architectures but
this may not be enough to fully exploit the capability of the architecture.
As an example, let us consider the Cholesky decomposition to factorize a symmet-
ric positive definite (SPD) matrix into its triangular factor. There are three variants
for performing the Cholesky decomposition: bordered, left-looking, and right-looking.
All three work on either the upper or lower triangular portion of the matrix and
produce the same triangular factor, but depending on the usage, one may have an
advantage over the others. In Figure 1.1 we depict the steps involved in each variant
using the lower triangular formulations. At the start of each variant, the matrix is
subdivided into blocks of rows and columns, or panels, which will take advantage of
the Level 3 BLAS.
The ‘bordered’ variant, as depicted in Figure 1.1a, involves a loop over three
steps. The first step updates the purple row block using the already factorized green
portion, the second step updates the next triangular block to be factorized (in red),
and the third step performs the factorization of the triangular block. This is then
repeated until the entire matrix is factorized. Note that the lower portion of the
matrix is not touched by the preceding steps.
The ‘left-looking’ variant (see Figure 1.1b) involves four steps. The first step
updates the triangular block in red which is then factorized in the second step, the
third step updates the block column (in cyan) below the triangular block using the
previous columns, and the last step updates the column block using the factorization
of step 2. It is called left-looking since the algorithm does not affect the portion to
the right of the current block of the matrix and only ‘looks’ to the left for its updates.
The topmost triangular portion of the matrix is in its final form and will not change
[Figure panels showing the steps of each variant; not reproduced.]
(a) Bordered variant (b) Left-looking variant (c) Right-looking variant
Figure 1.1: Three variants for the Cholesky decomposition
in the succeeding steps of the algorithm.
The ‘right-looking’ variant (see Figure 1.1c) involves three steps. It performs the
factorization of the red triangular block, updates the block column (in cyan) and then
updates the blue trailing matrix on the right. This is called right-looking since it does
not require anything from the previous factorized matrix and pushes its updates to
the right part of the matrix. The entire matrix to the left of the column block in
which the algorithm is currently working is in its final form.
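As a concrete illustration of the right-looking structure (panel factorization, triangular solve, trailing update), the following minimal NumPy sketch follows the three steps just described on panels of width nb; numpy.linalg.cholesky and numpy.linalg.solve stand in for the unblocked panel kernel and the triangular solve, so this is only a sketch of the structure, not the LAPACK routine.

import numpy as np

def chol_right_looking(A, nb):
    # Blocked right-looking Cholesky (lower).  The lower triangle of A is
    # overwritten by L.  Illustrative sketch only: numpy.linalg.cholesky plays
    # the unblocked panel kernel, numpy.linalg.solve plays the TRSM.
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])                   # step 1: factor block
        if e < n:
            A[e:, k:e] = np.linalg.solve(A[k:e, k:e], A[e:, k:e].T).T   # step 2: panel solve
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T                      # step 3: trailing update
    return np.tril(A)

n, nb = 8, 3
M = np.random.rand(n, n)
A = M @ M.T + n * np.eye(n)                                             # SPD test matrix
L = chol_right_looking(A.copy(), nb)
print(np.allclose(L @ L.T, A))                                          # True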
The advantage of the bordered variant is that it does the least amount of oper-
ations to determine if a matrix is SPD. The advantage of the right-looking variant
is that it provides the most parallelism. A major disadvantage of the left-looking
variant is the added fork-join that it must perform between the steps as compared to
the other two variants, which negatively affects its parallel performance.
The LAPACK scheme of using panels has three distinct disadvantages which
limit its performance. The first can be seen in the third step of the right-looking
Cholesky decomposition (Figure 1.1c), where a potentially large symmetric rank-k
update is performed; the memory architecture will bound the performance of the
algorithm. Second, the synchronizations that must be performed between each step
have some impact. Third, the idea of panels does not allow for fine-grained tasks.
We alleviate the last two of these restrictions through the use of tiled algorithms,
whereas the first is overcome through the use of Level 3 BLAS operations
within the tiled algorithm.
We approach this by tiling the matrix, that is, reordering the data of the
matrix into smaller regions of contiguous memory as depicted in Figure 1.2. By
varying the tile size, this data layout allows us to tune the algorithm such that the
data needed for the kernels is present in the cache of the processor core. Moreover,
we are able to increase the amount of parallelism and minimize the synchronizations.
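The layout change itself is a pure data movement. A small sketch of the idea follows; the helper names to_tile_layout and from_tile_layout are hypothetical, the sketch assumes square tiles that divide N evenly, and PLASMA's own layout-translation routines are more general.

import numpy as np

def to_tile_layout(A, nb):
    # Copy a dense N-by-N matrix into tile layout: a t-by-t grid of nb-by-nb
    # tiles, each tile stored contiguously (as in Figure 1.2).
    N = A.shape[0]
    assert N % nb == 0, "sketch assumes the tile size divides N"
    t = N // nb
    tiles = np.empty((t, t, nb, nb))
    for i in range(t):
        for j in range(t):
            tiles[i, j] = A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
    return tiles

def from_tile_layout(tiles):
    # Inverse translation, back to the standard dense layout.
    t = tiles.shape[0]
    return np.block([[tiles[i, j] for j in range(t)] for i in range(t)])

A = np.arange(36.0).reshape(6, 6)
print(np.array_equal(from_tile_layout(to_tile_layout(A, 2)), A))   # True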
Let us revisit the Cholesky decomposition as described earlier and apply each
Figure 1.2: Data layout of tiled matrix
step to the tiled matrix. In Figure 1.3 we present the directed acyclic graphs (DAG)
for the three variants on a tiled matrix of 4× 4 tiles. In each of the DAGs, the tasks
are represented as the nodes and the data dependencies are the edges. The dashed
horizontal lines designate a full sweep through all of the steps in each algorithm.
[DAG panels not reproduced; nodes are the tile tasks POTRF, TRSM, HERK and GEMM, with edges for the data dependencies.]
(a) Bordered variant (b) Left-looking variant (c) Right-looking variant
Figure 1.3: Three variants of the Cholesky decomposition applied to a tiled matrix
of 4 × 4 tiles.
The first observation that one makes is that the height of each DAG varies ac-
cording to which variant is chosen. The bordered variant is the tallest since the tasks
become almost sequential; the only portion that is not sequential is the row block
update. The left-looking variant is almost of height t^2, where t is the
number of tiles in a column of the tiled matrix. It gains parallelism from being able
to update the column block of the final step within the loop in parallel. As before,
the right-looking variant is the shortest and provides the most parallelism.
However, in the tiled versions, the synchronizations between each step of the
LAPACK algorithms are superfluous and can be removed. By doing so, these three
variants reduce to only the right-looking variant.
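For concreteness, the short sketch below (an illustration with a hypothetical helper, not PLASMA code) enumerates, in sequential submission order, the tile tasks of this right-looking tiled Cholesky; for t = 4 it produces exactly the set of tasks that appear as nodes in Figure 1.3c.

def tiled_cholesky_tasks(t):
    # Tile tasks of the right-looking tiled Cholesky variant, with 1-based tile
    # indices matching the node labels of Figure 1.3.
    tasks = []
    for k in range(1, t + 1):
        tasks.append(f"POTRF-{k}")                    # factor diagonal tile (k,k)
        for i in range(k + 1, t + 1):
            tasks.append(f"TRSM-{i}-{k}")             # triangular solve on tile (i,k)
        for i in range(k + 1, t + 1):
            tasks.append(f"HERK-{i}-{k}")             # symmetric update of tile (i,i)
            for j in range(k + 1, i):
                tasks.append(f"GEMM-{i}-{j}-{k}")     # update of tile (i,j), j < i
    return tasks

print(tiled_cholesky_tasks(4)[:7])
# ['POTRF-1', 'TRSM-2-1', 'TRSM-3-1', 'TRSM-4-1', 'HERK-2-1', 'HERK-3-1', 'GEMM-3-2-1']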
The main difference between the tiled version and the blocked version is the
amount of parallelism that is gained from updates of the trailing matrix. Instead of
performing a large symmetric rank-k update, where k is the number of rows in a row
block or the number of columns in a column block, the operation is decomposed into
smaller symmetric rank-n_b updates and associated matrix multiplications, where n_b
is the size of a tile such that N = t · n_b. In the right-looking variant, for an N × N
matrix the size of the first trailing matrix is (N − k) × (N − k), so that the update
operation for this first matrix has a computational cost of O(kN^2). The tiled update
consists of both symmetric updates and matrix multiplications of tiles of size n_b × n_b,
so that the computational cost per task is O(n_b^3).
The size of the tiles will determine the granularity of the tasks for a tiled algo-
rithm. For a matrix of size N ×N , the tile size of the tiled t× t matrix can vary from
the entire size of the matrix (t = 1) down to a scalar (t = N), but is held constant
throughout the execution of the algorithm. However, a balance must be kept between
the efficiency of the kernel and the amount of data movement.
Therefore, a tiled algorithm does overcome the restriction on the granularity
imposed by the panels concept of LAPACK as well as alleviate some of the synchro-
nizations and associated overhead. The memory bound is still present due to the need
to move the data for the trailing matrix updates.
The Parallel Linear Algebra Software for Multicore Architectures (PLASMA) [2]
library provides the framework for the tiled algorithms. For the experimental por-
tions, our main assumption will be a shared memory machine architecture wherein
each processor has direct access to all portions of the memory in which the matrices
are stored.
In Chapter 2 we describe more fully the implementation of the tiled Cholesky
decomposition and, further, the tiled Cholesky inversion algorithm. It shows that
translating from LAPACK to PLASMA can be straightforward, but that there are
caveats to be taken into account. In Chapter 3 we implement a tiled QR
decomposition and show that the implementation is not straightforward. Moreover,
results from previous work do not translate directly to the tiled algorithm.
Unlike Chapters 2 and 3 where an unbounded number of processors is assumed,
Chapters 4 and 5 restrict the number of processors and provide bounds on the per-
formance of the algorithms. We observe the theoretical speed-up and efficiency and
provide more realistic bounds on the performance.
Finally, Chapter 6 presents a study on a tiled implementation of the Strassen-
Winograd algorithm for matrix-matrix multiplication. Unlike the other algorithms
presented, it is a recursive algorithm which becomes interesting in the scope of tiled
matrices.
2. Cholesky Inversion
In this chapter, we present joint work with Emmanuel Agullo, Jack Dongarra,
Jakub Kurzak, Julien Langou, and Lee Rosenberg [4].
The appropriate direct method to compute the solution of a symmetric positive
definite system of linear equations consists of computing the Cholesky factorization
of that matrix and then solving the underlying triangular systems. It is not recom-
mended to use the inverse of the matrix in this case. However some applications need
to explicitly form the inverse of the matrix. A canonical example is the computation
of the variance-covariance matrix in statistics. Higham [32, p.260,§3] lists more such
applications.
The advent of multicore architectures [50] induces the need for algorithms and
libraries that fully exploit their capacities. A class of such algorithms – called tile
algorithms [18, 19] – has been developed for one-sided dense factorizations (Cholesky,
LU and QR) and made available as part of the Parallel Linear Algebra Software for
Multicore Architectures (PLASMA) library [2]. In this chapter, we extend this class
of algorithms to the case of the (symmetric positive definite) matrix inversion. Besides
constituting an important functionality for a library such as PLASMA, the study of
the matrix inversion on multicore architectures represents a challenging algorithmic
problem. Indeed, first, contrary to standalone one-sided factorizations that have been
studied so far, the matrix inversion exhibits many anti-dependences [6] (Write after
Read). This is a false or artificial dependency, which depends on the name (storage
location) of the data rather than on the actual data flow. For example, given two operations where the
first only reads the data in the matrix A and the second only writes to the location
of A, then in a parallel execution there may be a case where the data being read
by the first operation is wrong since the second may have already written to the
location. By copying the data from A beforehand, both operations can be executed
in parallel. Those anti-dependences can be a bottleneck for parallel processing, which
is critical on multicore architectures. It is thus essential to investigate (and adapt)
well known techniques used in compilation such as using temporary copies of data to
remove anti-dependences to enhance the degree of parallelism of the matrix inversion.
This technique is known as array renaming [6] (or array privatization [28]). Second,
loop reversal [6] is to be investigated. Third, the matrix inversion consists of three
successive steps (first of which is the Cholesky decomposition). In terms of scheduling,
it thus represents an opportunity to study the effects of pipelining [6] those steps on
performance.
The current version of PLASMA (version 2.1) is scheduled statically. Initially de-
veloped for the IBM Cell processor [34], this static scheduling relies on POSIX threads
and simple synchronization mechanisms. It has been designed to maximize data reuse
and load balancing between cores, allowing for very high performance [3] on today’s
multicore architectures. However, in the case of matrix inversion, the design of an
ad-hoc static scheduling is a time consuming task and raises load balancing issues
that are much more difficult to address than for a stand-alone Cholesky decomposi-
tion, in particular when dealing with the pipelining of multiple steps. Furthermore,
the growth of the number of cores and the more complex memory hierarchies make
executions less deterministic. In this chapter, we rely on an experimental in-house
dynamic scheduler [33]. This scheduler is based on the idea of expressing an algorithm
through its sequential representation and unfolding it at runtime using data hazards
(Read after Write, Write after Read, Write after Write) as constraints for parallel
scheduling. The concept is rather old and has been validated by a few successful
projects. We could have as well used schedulers from the Jade project from Stanford
University [42] or from the SMPSs project from the Barcelona Supercomputer Center
[40].
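The core idea of such a scheduler can be sketched in a few lines: tasks are submitted in sequential program order together with the sets of tiles they read and write, and the RAW, WAR, and WAW hazards between those accesses become the edges of the task graph. The helper below (build_dag is a hypothetical name; this is a toy model, not the interface of the scheduler of [33] or of SMPSs/Jade) illustrates the mechanism on two tile tasks that exhibit a Write-after-Read hazard.

def build_dag(tasks):
    # tasks: list of (name, reads, writes) in sequential program order, where
    # reads and writes are sets of tile coordinates.  Returns the dependence
    # edges inferred from data hazards (RAW, WAR, WAW).
    edges = []
    last_writer = {}                 # tile -> task that last wrote it
    readers = {}                     # tile -> tasks reading it since that write
    for name, reads, writes in tasks:
        for tile in reads:
            if tile in last_writer:
                edges.append((last_writer[tile], name))    # Read after Write
        for tile in writes:
            for r in readers.get(tile, []):
                if r != name:
                    edges.append((r, name))                # Write after Read
            if tile in last_writer:
                edges.append((last_writer[tile], name))    # Write after Write
        for tile in reads:
            readers.setdefault(tile, []).append(name)
        for tile in writes:
            last_writer[tile] = name
            readers[tile] = []
    return edges

# Task A reads tile (1,0) that task B later overwrites: a WAR edge is inferred.
tasks = [("A: read (1,0), update (0,0)", {(1, 0), (0, 0)}, {(0, 0)}),
         ("B: overwrite (1,0)",          {(1, 1), (1, 0)}, {(1, 0)})]
print(build_dag(tasks))   # [('A: read (1,0), update (0,0)', 'B: overwrite (1,0)')]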
Our discussions are illustrated with experiments conducted on a dual-socket quad-
core machine based on an Intel Xeon EMT64 processor operating at 2.26 GHz. The
theoretical peak is equal to 9.0 Gflop/s per core or 72.3 Gflop/s for the whole machine,
composed of 8 cores. The machine is running Mac OS X 10.6.2 and is shipped with
the Apple vecLib v126.0 multithreaded BLAS [15] and LAPACK vendor library, as
well as LAPACK [7] v3.2.1 and ScaLAPACK [13] v1.8.0 references.
2.1 Tile in-place matrix inversion
Tile algorithms are a class of Linear Algebra algorithms that allow for fine gran-
ularity parallelism and asynchronous scheduling, enabling high performance on mul-
ticore architectures [3, 18, 19, 41]. The matrix of order n is split into t × t square
submatrices of order b (n = b × t). Such a submatrix is of small granularity (we
fixed b = 200 in this chapter) and is called a tile. So far, tile algorithms have been
essentially used to implement one-sided factorizations [3, 18, 19, 41].
Algorithm 2.1 extends this class of algorithms to the case of the matrix inver-
sion. As in state-of-the-art libraries (LAPACK, ScaLAPACK), the matrix inversion
is performed in-place, i.e., the data structure initially containing matrix A is directly
updated as the algorithm is progressing, without using any significant temporary
extra-storage; eventually, A^{-1} replaces A. Algorithm 2.1 is composed of three steps.
Step 1 is a Tile Cholesky Factorization computing the Cholesky factor L (lower triangular
matrix satisfying A = LL^T). This step was studied in [19]. Step 2 computes
L^{-1} by inverting L. Step 3 finally computes the inverse matrix A^{-1} = L^{-T}L^{-1}.
Each step is composed of multiple fine granularity tasks (since operating on tiles).
These tasks are part of the BLAS (SYRK, GEMM, TRSM, TRMM) and LAPACK
(POTRF, TRTRI, LAUUM) standards. A more detailed description is beyond the
scope of this chapter and is not essential to the understanding of the rest
of the chapter. Indeed, from a high-level point of view, an operation based on tile
algorithms can be represented as a Directed Acyclic Graph (DAG) [22] where nodes
represent the fine granularity tasks in which the operation can be decomposed and the
edges represent the dependences among them. For instance, Figure 2.1a represents
Algorithm 2.1: Tile In-place Cholesky Inversion (lower format). Matrix A is
the on-going updated matrix (in-place algorithm).
Input: A, Symmetric Positive Definite matrix in tile storage (t × t tiles).
Result: A^{-1}, stored in-place in A.
1 Step 1: Tile Cholesky Factorization (compute L such that A = LL^T);
2 for j = 0 to t − 1 do
3     for k = 0 to j − 1 do
4         A_{j,j} ← A_{j,j} − A_{j,k} ∗ A_{j,k}^T (SYRK(j,k));
5     A_{j,j} ← CHOL(A_{j,j}) (POTRF(j));
6     for i = j + 1 to t − 1 do
7         for k = 0 to j − 1 do
8             A_{i,j} ← A_{i,j} − A_{i,k} ∗ A_{j,k}^T (GEMM(i,j,k));
9     for i = j + 1 to t − 1 do
10        A_{i,j} ← A_{i,j} / A_{j,j}^T (TRSM(i,j));
11 Step 2: Tile Triangular Inversion of L (compute L^{-1});
12 for j = t − 1 down to 0 do
13     A_{j,j} ← TRINV(A_{j,j}) (TRTRI(j));
14     for i = t − 1 down to j + 1 do
15         A_{i,j} ← A_{i,i} ∗ A_{i,j} (TRMM(i,j));
16         for k = j + 1 to i − 1 do
17             A_{i,j} ← A_{i,j} + A_{i,k} ∗ A_{k,j} (GEMM(i,j,k));
18         A_{i,j} ← −A_{i,j} ∗ A_{j,j} (TRMM(i,j));
19 Step 3: Tile Product of Lower Triangular Matrices (compute A^{-1} = L^{-T}L^{-1});
20 for i = 0 to t − 1 do
21     for j = 0 to i − 1 do
22         A_{i,j} ← A_{i,i}^T ∗ A_{i,j} (TRMM(i,j));
23     A_{i,i} ← A_{i,i}^T ∗ A_{i,i} (LAUUM(i));
24     for j = 0 to i − 1 do
25         for k = i + 1 to t − 1 do
26             A_{i,j} ← A_{i,j} + A_{k,i}^T ∗ A_{k,j} (GEMM(i,j,k));
27     for k = i + 1 to t − 1 do
28         A_{i,i} ← A_{i,i} + A_{k,i}^T ∗ A_{k,i} (SYRK(i,k));
the DAG of Step 3 of Algorithm 2.1.
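The loop nests of Algorithm 2.1 can also be checked numerically. The sketch below is a minimal sequential NumPy illustration of the tile-level data flow, not the PLASMA implementation: Step 1 is simply delegated to numpy.linalg.cholesky, numpy.linalg.inv stands in for the TRTRI tile kernel, and Steps 2 and 3 follow the tile loop nests of Algorithm 2.1 in place on the lower tiles.

import numpy as np

def tile_cholesky_inverse(A, b):
    # Sketch of Algorithm 2.1 on t x t tiles of order b (lower storage).
    t = A.shape[0] // b
    T = lambda i, j: A[i*b:(i+1)*b, j*b:(j+1)*b]        # view of tile (i, j)
    A[:] = np.linalg.cholesky(A)                        # Step 1: A <- L
    for j in range(t - 1, -1, -1):                      # Step 2: L <- L^{-1}
        T(j, j)[:] = np.linalg.inv(T(j, j))             #   TRTRI(j)
        for i in range(t - 1, j, -1):
            T(i, j)[:] = T(i, i) @ T(i, j)              #   TRMM(i,j)
            for k in range(j + 1, i):
                T(i, j)[:] += T(i, k) @ T(k, j)         #   GEMM(i,j,k)
            T(i, j)[:] = -T(i, j) @ T(j, j)             #   TRMM(i,j)
    for i in range(t):                                  # Step 3: A^{-1} = L^{-T} L^{-1}
        for j in range(i):
            T(i, j)[:] = T(i, i).T @ T(i, j)            #   TRMM(i,j)
        T(i, i)[:] = T(i, i).T @ T(i, i)                #   LAUUM(i)
        for j in range(i):
            for k in range(i + 1, t):
                T(i, j)[:] += T(k, i).T @ T(k, j)       #   GEMM(i,j,k)
        for k in range(i + 1, t):
            T(i, i)[:] += T(k, i).T @ T(k, i)           #   SYRK(i,k)
    return np.tril(A) + np.tril(A, -1).T                # symmetrize the lower result

b, t = 3, 4
M = np.random.rand(b * t, b * t)
A = M @ M.T + b * t * np.eye(b * t)                     # SPD test matrix
print(np.allclose(tile_cholesky_inverse(A.copy(), b), np.linalg.inv(A)))   # True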
[Figure: task DAGs, not reproduced.]
(a) In-place (Algorithm 2.1) (b) Out-of-place (variant introduced in § 2.2)
Figure 2.1: DAGs of Step 3 of the Tile Cholesky Inversion (t = 4).
Algorithm 2.1 is based on the variants used in LAPACK 3.2.1. Bientinesi, Gunter
and van de Geijn [10] discuss the merits of algorithmic variations in the case of the
computation of the inverse of a symmetric positive definite matrix. Although of
definite interest, this is not the focus of this chapter.
We have implemented Algorithm 2.1 using our dynamic scheduler introduced
in the beginning of the chapter. Figure 2.2 shows its performance against state-of-
the-art libraries and the vendor library on the machine described in the beginning
of the chapter. For a matrix of small size, it is difficult to extract parallelism and
have a full use of all the cores [3, 18, 19, 41]. We indeed observe a limited scalabil-
ity (N = 1000, Figure 2.2a). However, tile algorithms (Algorithm 2.1) still benefit
from a higher degree of parallelism than blocked algorithms [3, 18, 19, 41]. There-
fore Algorithm 2.1 (in place) consistently achieves a significantly better performance
than vecLib, ScaLAPACK and LAPACK libraries. A larger matrix size (N = 4000,
Figure 2.2b) allows for a better use of parallelism. In this case, an optimized imple-
mentation of a blocked algorithm (vecLib) competes well against tile algorithms (in
place) on few cores (left part of Figure 2.2b). However, only tile algorithms scale to
a larger number of cores (rightmost part of Figure 2.2b) thanks to a higher degree
of parallelism. In other words, the tile Cholesky inversion achieves a better strong
Table 2.1: Length of the critical path as a function of the number of tiles t.
         In-place case   Out-of-place case
Step 1   3t − 2          3t − 2
Step 2   3t − 3          2t − 1
Step 3   3t − 2          t
scalability than the blocked versions, similarly to what had been observed for the
factorization step [3, 18, 19, 41].
[Plots of Gflop/s versus number of threads (1 to 8) for N = 1000 and N = 4000 with NB = 200; curves: theoretical peak, perfect scalability, out of place, in place, vecLib, ScaLAPACK, LAPACK.]
(a) n = 1000 (b) n = 4000
Figure 2.2: Scalability of Algorithm 2.1 (in place) and its out-of-place variant in-
troduced in § 2.2, using our dynamic scheduler against vecLib, ScaLAPACK and
LAPACK libraries.
2.2 Algorithmic study
In § 2.1, we compared the performance of the tile Cholesky inversion against
state-of-the-art libraries. In this section, we focus on the tile Cholesky inversion and we
discuss the impact of several variants of Algorithm 2.1 on performance.
Array renaming (removing anti-dependences). The dependence between
SYRK(0,1) and TRMM(1,0) in the DAG of Step 3 of Algorithm 2.1 (Figure 2.1a)
represents the constraint that the SYRK operation (l. 28 of Algorithm 2.1) needs
to read A_{k,i} = A_{1,0} before TRMM (l. 22) can overwrite A_{i,j} = A_{1,0}. This anti-
dependence (Write after Read) can be removed thanks to a temporary copy of A_{1,0}.
Similarly, all the SYRK-TRMM anti-dependences, as well as the TRMM-LAUUM and
GEMM-TRMM anti-dependences, can be removed. We have designed a variant of
Algorithm 2.1 that removes all the anti-dependences thanks to the use of a large
working array (this technique is known in compilation as array renaming [6]). The
resulting DAG (Figure 2.1b) is split into multiple pieces, leading to a
shorter critical path (Table 2.1). We also implemented the out-of-place algorithm, based
shorter critical path (Table 2.1). We implemented the out-of-place algorithm, based
on our dynamic scheduler too. Figure 2.2a shows that our dynamic scheduler exploits
its higher degree of parallelism to achieve a much higher strong scalability even on
small matrices (N = 1000). For a larger matrix (Figure 2.2b), the in-place algorithm
already achieved very good scalability. Therefore, using up to 7 cores, their perfor-
mance are similar. However, there is not enough parallelism with a 4000×4000 matrix
to use efficiently all 8 cores with the in-place algorithm; thus the higher performance
of the out-of-place version in this case (leftmost part of Figure 2.2b).
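In code, the renaming is nothing more than taking a private copy of the tile that is about to be overwritten. The snippet below (an illustration with plain NumPy tiles, not PLASMA code) shows the SYRK/TRMM pair discussed above: once SYRK reads a snapshot of tile (1,0), the TRMM that overwrites (1,0) no longer has to wait for it.

import numpy as np

nb = 4
A = {(i, j): np.random.rand(nb, nb) for i in range(2) for j in range(2)}

# With the anti-dependence: SYRK(0,1) must read A[(1,0)] before TRMM(1,0)
# overwrites it, so the two tasks are serialized.
#   SYRK(0,1): A[(0,0)] += A[(1,0)].T @ A[(1,0)]
#   TRMM(1,0): A[(1,0)]  = A[(1,1)].T @ A[(1,0)]

# After renaming: a temporary copy of A[(1,0)] removes the hazard, so the two
# tasks may run concurrently (extra memory traded for parallelism).
A10_copy = A[(1, 0)].copy()
A[(0, 0)] += A10_copy.T @ A10_copy            # SYRK reads the private copy
A[(1, 0)] = A[(1, 1)].T @ A[(1, 0)]           # TRMM can proceed independently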
Loop reversal (exploiting commutativity). The innermost loop of each
step of Algorithm 2.1 (l. 8, l. 17 and l. 26) consists of successive commutative GEMM
operations. Therefore they can be performed in any order, among which increasing
order and decreasing order of the loop index. Their ordering impacts the length of the
critical path. Algorithm 2.1 orders those three loops in increasing (U) or decreasing
(D) order. We had manually chosen these respective orders (UDU)
because they minimize the critical path of each step (values reported in Table 2.1).
A naive approach would have, for example, consisted of consistently ordering
the loops in increasing order (UUU). In this case (UUU), the critical path of TRTRI
would have been equal to t^2 − 2t + 3 (in-place) or (t^2 − t)/2 + 2 (out-of-place) instead
of 3t − 3 (in-place) or 2t − 1 (out-of-place) for (UDU). Figure 2.3 shows how loop
reversal impacts performance.
Pipelining. Pipelining the multiple steps of the inversion reduces the length of
its critical path. For the in-place case, the critical path is reduced from 9t− 7 tasks
[Plots of Gflop/s versus number of threads (1 to 8) for N = 1000 and N = 4000 with NB = 200; curves: peak, perfect scalability, out of place (UDU), out of place (UUU), in place (UDU), in place (UUU).]
(a) n = 1000 (b) n = 4000
Figure 2.3: Impact of loop reversal on performance.
(t is the number of tiles) to 9t − 9 tasks (negligible). For the out-of-place case, it
is reduced from 6t − 3 to 5t − 2 tasks. We studied the effect of pipelining on the
performance of the inversion on an 8000 × 8000 matrix with an artificially large tile
size (b = 2000 and t = 4). As expected, we observed almost no effect on performance
of the in-place case (about 36.4 seconds with or without pipelining). For the out-of-
place case, the elapsed time grows from 25.1 to 29.2 seconds (16% overhead) when
pipelining is prevented.
2.3 Conclusion and future work
We have proposed a new algorithm to compute the inverse of a symmetric positive
definite matrix on multicore architectures. An experimental study has shown both
an excellent scalability of our algorithm and a significant performance improvement
compared to state-of-the-art libraries. Beyond extending the class of so-called tile
algorithms, this study brought back to the fore well known issues in the domain of
compilation. Indeed, we have shown the importance of loop reversal, array renaming
and pipelining.
The use of a dynamic scheduler allowed an out-of-the-box pipeline of the differ-
ent steps whereas loop reversal and array renaming required a manual change to the
algorithm. Future work directions consist of enabling the scheduler to perform
loop reversal and array renaming itself. We exploited the commutativity of GEMM
operations to perform loop reversal. Their associativity would furthermore allow
them to be processed in parallel (following a binary tree); the subsequent impact on perfor-
mance is to be studied. Array renaming requires extra memory. It will be interesting
to address the problem of maximizing performance under a memory constraint.
This work aims to be incorporated into PLASMA.
3. QR Factorization
In this chapter we present joint work with Mathias Jacquelin, Julien Langou, and
Yves Robert [31].
Given an m-by-n matrix A with n ≤ m, we consider the computation of its QR
factorization, which is the factorization A = QR, where Q is an m-by-n unitary
matrix (Q^H Q = I_n), and R is upper triangular.
The QR factorization of an m-by-n matrix with n ≤ m is the time consuming
stage of some important numerical computations. It is needed for solving a linear
least squares problem with m equations (observations) and n unknowns and is used
to compute an orthonormal basis (the Q-factor) of the column span of the initial
matrix A. For example, all block iterative methods (used to solve large sparse linear
systems of equations or computing some relevant eigenvalues of such systems) require
orthogonalizing a set of vectors at each step of the process. For these two usage
examples, while n ≤ m, n can range from n ≪ m to n = m. We note that the
extreme case n = m is also relevant: the QR factorization of a matrix can be used to
solve (square) linear systems of equations. While this requires twice as many flops as
an LU factorization, using a QR factorization (a) is unconditionally stable (Gaussian
elimination with partial pivoting or pairwise pivoting is not) and (b) avoids pivoting
so it may well be faster in some cases.
To obtain a QR factorization, we consider algorithms which apply a sequence of
m-by-m unitary transformations U_i (U_i^H U_i = I), i = 1, . . . , ℓ, on the left of the
matrix A, such that after ℓ transformations the resulting matrix R = U_ℓ · · · U_1 A is
upper triangular, in which case R is indeed the R-factor of the QR factorization. The
Q-factor (if needed) can then be obtained by computing Q = U_1^H · · · U_ℓ^H. These types
of algorithms are in regular use, e.g., in the LAPACK and ScaLAPACK libraries,
and are favored over other algorithms (Cholesky QR or Gram-Schmidt) for their
stability.
The unitary transformation U_i is chosen so as to introduce some zeros in the current
update matrix U_{i−1} · · · U_1 A. The two basic transformations are Givens rotations
and Householder reflections. One Givens rotation introduces one additional zero;
the whole triangularization requires mn − n(n + 1)/2 Givens rotations for n < m.
One elementary Householder reflection simultaneously introduces m − i zeros in
positions i + 1 to m in column i; the whole triangularization requires n Householder
reflections for n < m. (See LAPACK subroutine GEQR2.) The LAPACK GEQRT
subroutine constructs a compact WY representation to apply a sequence of ib Householder
reflections; this enables one to introduce the appropriate zeros in ib consecutive
columns and thus leverage optimized Level 3 BLAS subroutines during the update.
The blocking of Givens rotations is also possible but is more costly in terms of flops.
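The zero-introducing power of a single Householder reflection is easy to demonstrate numerically. The sketch below forms the reflector explicitly and applies it as a full matrix for clarity; it is only an illustration of the count given above, whereas GEQR2/GEQRT keep the reflectors in factored (compact WY) form and never build them explicitly.

import numpy as np

def householder_zero_column(A, i):
    # Apply one Householder reflection that zeroes entries i+1..m-1 of column i
    # (0-based).  Assumes that column is not already zero below row i.
    m = A.shape[0]
    x = A[i:, i].copy()
    v = x.copy()
    v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
    v /= np.linalg.norm(v)
    H = np.eye(m)
    H[i:, i:] -= 2.0 * np.outer(v, v)          # H = I - 2 v v^T on rows/cols i..m-1
    return H @ A

A = np.random.rand(5, 3)
A1 = householder_zero_column(A, 0)             # introduces m - 1 zeros in column 0
print(np.allclose(A1[1:, 0], 0.0))             # True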
The main interest of Givens rotations over Householder transformations is that
one can concurrently introduce zeros using disjoint pairs of rows; in other words, two
transformations U_i and U_{i+1} may be applicable concurrently. This is not possible
using the original Householder reflection algorithm since the transformations work
on whole columns and thus do not exhibit this type of intrinsic parallelism, forcing
this kind of Householder reflections to be applied sequentially. The advantages of
Householder reflections over Givens rotations are that, first, Householder reflections
perform fewer flops, and second, the compact WY transformation enables high se-
quential performance of the algorithm. In a multicore setting, where data locality
and parallelism are crucial algorithmic characteristics for enabling performance, the
tiled QR factorization algorithm combines both ideas: use of Householder reflections
for high sequential performance and use of a scheme such as Givens rotations to en-
able parallelism across cores. In essence, one can think either (i) of the tiled QR
factorization as a Givens rotation scheme but on tiles (mb-by-nb submatrices) instead
of on scalars (1-by-1 submatrices) as in the original scheme, or (ii) of it as a blocked
Householder reflection scheme where each reflection is confined to an extent much
less than the full column span, which enables concurrency with other reflections.
Tiled QR factorization in the context of multicore architectures has been intro-
duced in [18, 19, 41]. Initially the focus was on square matrices and the sequence of
unitary transformations presented was analogous to Sameh-Kuck [45], which cor-
responds to reducing the panels with flat trees. The possibility of using any tree in
order to either maximize parallelism or minimize communication is explained in [26].
The focus of this chapter is on maximizing parallelism. We reduce the communica-
tion (data movement between memory hierarchy) within the algorithm to acceptable
levels by tiling the operations. Stemming from the observation that a binary tree is
best for tall and skinny matrices and a flat tree is best for square matrices, Hadri
et al. [30] propose to use trees which combine flat trees at the bottom level with a
binary tree at the top level in order to exhibit more parallelism. Our theoretical and
experimental work explains that we can adapt Fibonacci [36] and Greedy [24, 25]
to tiles, resulting in yet better algorithms in terms of parallelism. Moreover our new
algorithms do not have any tuning parameter such as the domain size in the case
of [30].
The sequential kernels of the Tiled QR factorization (executed on a core) are made
of standard blocked algorithms, such as those in LAPACK, encoded in kernels; the development
of these kernels is well understood. The focus of this chapter is on improving the
overall degree of parallelism of the algorithm. Given a p-by-q tiled matrix, we seek
to find an appropriate sequence of unitary transformations on the tiled matrix so as
to maximize parallelism (minimize critical path length). We will get our inspiration
from previous work from the 1970s/80s on Givens rotations where the question was
somewhat related: given an m-by-n matrix, find an appropriate sequence of Givens
rotations as to maximize parallelism. This question is essentially answered in [24, 25,
36, 45]; we call this class of algorithms “coarse-grain algorithms.”
Working with tiles instead of scalars introduces four essential differences between
the tiled algorithms and the coarse-grain
algorithms. First, while there are only two states for a scalar (nonzero or zero), a
tile can be in three states (zero, triangle or full). Second, there are more operations
available on tiles to introduce zeros; we have a total of three different tasks which can
introduce zeros in a matrix. Third, in the tiled algorithms, the factorization and the
update are dissociated to enable factorization stages to overlap with update stages;
whereas, in the coarse-grain algorithm, the factorization and the associated update
are considered as a single stage. Lastly, while coarse-grain algorithms have only one
task, we end up with six different tasks: three from the factorizations (zeroing of
tiles) and three for the associated updates (since these have been unlinked
from the factorization). Each of these six tasks has a different computational weight;
this dramatically complicates the critical path analysis of the tiled algorithms.
While the Greedy algorithm is optimal for coarse-grain algorithms, we show
that it is not in the case of tiled algorithms. However, we have devised an optimal
tiled algorithm and proved its optimality.
3.1 The QR factorization algorithm
Tiled algorithms are expressed in terms of tile operations rather than elementary
operations. Each tile is of size nb × nb, where nb is a parameter tuned to squeeze
the most out of arithmetic units and memory hierarchy. Typically, nb ranges from
80 to 200 on state-of-the-art machines [5]. Algorithm 3.1 outlines a naive tiled QR
algorithm, where loop indices represent tiles:
Algorithm 3.1: Generic QR algorithm for a tiled p× q matrix.
1 for k = 1 to min(p, q) do
2 forall the i ∈ {k + 1, . . . , p} using any ordering, do
3 elim(i, piv(i, k), k)
In Algorithm 3.1, k is the panel index, and elim(i, piv(i, k), k) is an orthogonal
transformation that combines rows i and piv(i, k) to zero out the tile in position (i, k).
However, this formulation is somewhat misleading, as there is much more freedom
for QR factorization algorithms than, say, for Cholesky algorithms (and, contrary to
LU elimination algorithms, there are no numerical stability issues). For instance, in
column 1, the algorithm must eliminate all tiles (i, 1) where i > 1, but it can do so
in several ways. Take p = 6. Algorithm 3.1 uses the transformations
elim(2, 1, 1), elim(3, 1, 1), elim(4, 1, 1), elim(5, 1, 1), elim(6, 1, 1)
But the following scheme is also valid:
elim(3, 1, 1), elim(6, 4, 1), elim(2, 1, 1), elim(5, 4, 1), elim(4, 1, 1)
In this latter scheme, the first two transformations elim(3, 1, 1) and elim(6, 4, 1) use
distinct pairs of rows, and they can execute in parallel. On the contrary, elim(3, 1, 1)
and elim(2, 1, 1) use the same pivot row and must be sequentialized. To compli-
cate matters, it is possible to have two orthogonal transformations that execute in
parallel but involve zeroing a tile in two different columns. For instance we can
add elim(6, 5, 2) to the previous transformations and run it concurrently with, say,
elim(2, 1, 1). Any tiled QR algorithm will be characterized by an elimination list,
which provides the ordered list of the transformations used to zero out all the tiles
below the diagonal. This elimination list must obey certain conditions so that the fac-
torization is valid. For instance, elim(6, 5, 2) must follow elim(6, 4, 1) and elim(5, 4, 1)
in the previous list, because there is a flow dependence between these transformations.
Note that, although the elimination list is given as a totally ordered sequence, some
transformations can execute in parallel, provided that they are not linked by a de-
pendence: in the example, elim(6, 4, 1) and elim(2, 1, 1) could have been swapped,
and the elimination list would still be valid.
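Two of the requirements just described, that every subdiagonal tile is zeroed exactly once and that an elimination in column k may only involve rows already zeroed in column k − 1, are easy to check mechanically. The sketch below (check_elimination_list is a hypothetical helper, and it checks only these two conditions, not the full validity criteria formalized in § 3.1.2) extends the column-1 example above with one possible column-2 ordering and rejects a reordering in which elim(6, 5, 2) precedes elim(5, 4, 1).

def check_elimination_list(p, q, elims):
    # Partial validity check for a p-by-q tiled matrix (p >= q), 1-based indices:
    # (1) every subdiagonal tile (i, k), i > k, is zeroed exactly once;
    # (2) elim(i, piv, k) appears only after rows i and piv were zeroed in
    #     column k - 1 (the flow dependences discussed above).
    done = set()
    for i, piv, k in elims:
        if k > 1 and ((i, k - 1) not in done or (piv, k - 1) not in done):
            return False                       # flow dependence violated
        if (i, k) in done:
            return False                       # tile zeroed twice
        done.add((i, k))
    return all((i, k) in done for k in range(1, q + 1) for i in range(k + 1, p + 1))

col1 = [(3, 1, 1), (6, 4, 1), (2, 1, 1), (5, 4, 1), (4, 1, 1)]   # example from the text
col2 = [(6, 5, 2), (5, 4, 2), (4, 3, 2), (3, 2, 2)]
print(check_elimination_list(6, 2, col1 + col2))                 # True
print(check_elimination_list(6, 2, col1[:3] + [(6, 5, 2)]))      # False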
In order to describe more fully the dependencies inherent in the eliminations,
we examine a snippet of an example. In Figure 3.1a, to the left we have the
row identifications, the empty circles represent zeroed elements, and the filled circles
represent the pivots used to zero out the elements. The first column's eliminations
are shown in green and the second in red. From the elimination list, we define I_{s,k} as
the set of rows in column k that are zeroed out at time step s.
(a) Diagram of elimination list (rows 5 to 15; not reproduced)
(b) Elimination list:
elim(13, 10, 1), elim(14, 11, 1), elim(15, 12, 1) : I_{s1,1} ⇒ elim(I_{s1,1}, 1)
elim(9, 5, 1), elim(10, 6, 1), elim(11, 7, 1), elim(12, 8, 1) : I_{s2,1} ⇒ elim(I_{s2,1}, 1)
elim(15, 14, 2) : I_{s1,2} ⇒ elim(I_{s1,2}, 2)
elim(12, 9, 2), elim(13, 10, 2), elim(14, 11, 2) : I_{s2,2} ⇒ elim(I_{s2,2}, 2)
What may not be so evident from the elimination list but is more apparent in
the diagram of the elimination list are the following dependency relationships (note
that ≺ indicates that the operation on the left must finish prior to the operation on
the right starting):
elim(piv(I_{s,k}, k), k − 1) ≺ elim(I_{s,k}, k) (3.1a)
elim(I_{s,k}, k − 1) ≺ elim(I_{s,k}, k) (3.1b)
elim(I_{s−1,k}, k) ≺ elim(I_{s,k}, k) (3.1c)
However, not all of these dependencies may cause an elimination to be locked to a
particular time step. In fact, some dependencies may not be needed for a particular
instance, but the addition of these will not create an artificial lock. For example,
elim(I_{s2,2}, 2) is dependent upon elim(piv(I_{s2,2}), 1), elim(I_{s2,2}, 1), and elim(I_{s1,2}, 2) but
elim(I_{s1,2}, 2) only depends upon elim(piv(I_{s1,2}), 1) and elim(I_{s1,2}, 1).
Table 3.1: Kernels for tiled QR. The unit of time is n_b^3/3, where n_b is the blocksize.

                                       Panel            Update
Operation                              Name     Cost    Name     Cost
Factor square into triangle            GEQRT    4       UNMQR    6
Zero square with triangle on top       TSQRT    6       TSMQR    12
Zero triangle with triangle on top     TTQRT    2       TTMQR    6
Before formally stating the conditions that guarantee the validity of (the elimi-
nation list of) an algorithm, we explain how orthogonal transformations can be im-
plemented.
3.1.1 Kernels
To implement a given orthogonal transformation elim(i, piv(i, k), k), one can use
six different kernels, whose costs are given in Table 3.1. In this table, the unit of time
is the time to perform n_b^3/3 floating-point operations.
There are two main possibilities to implement an orthogonal transformation
elim(i, piv(i, k), k): The first version eliminates tile (i, k) with the TS (Triangle on
top of square) kernels, as shown in Algorithm 3.2:
Algorithm 3.2: Elimination elim(i, piv(i, k), k) via TS (Triangle on top of
square) kernels.
1 GEQRT (piv(i, k), k)
2 TSQRT (i, piv(i, k), k)
3 for j = k + 1 to q do
4 UNMQR(piv(i, k), k, j)
5 TSMQR(i, piv(i, k), k, j)
Here the tile panel (piv(i, k), k) is factored into a triangle (with GEQRT ). The
transformation is applied to subsequent tiles (piv(i, k), j), j > k, in row piv(i, k) (with
UNMQR). Tile (i, k) is zeroed out (with TSQRT ), and subsequent tiles (i, j), j > k,
in row i are updated (with TSMQR). The flop count is 4 + 6 + (6 + 12)(q − k) =
10 + 18(q − k) (expressed in the same time unit as in Table 3.1). Dependencies are the
following:
GEQRT (piv(i, k), k) ≺ TSQRT (i, piv(i, k), k)
GEQRT (piv(i, k), k) ≺ UNMQR(piv(i, k), k, j) for j > k
UNMQR(piv(i, k), k, j) ≺ TSMQR(i, piv(i, k), k, j) for j > k
TSQRT (i, piv(i, k), k) ≺ TSMQR(i, piv(i, k), k, j) for j > k
TSQRT (i, piv(i, k), k) and UNMQR(piv(i, k), k, j) can be executed in parallel, as well
as UNMQR operations on different columns j, j′ > k. With an unbounded number
of processors, the parallel time is thus 4 + 6 + 12 = 22 time-units.
Algorithm 3.3: Elimination elim(i, piv(i, k), k) via TT (Triangle on top of
triangle) kernels.
1 if k > 0 then
2 TTQRT (i, piv(i, k), k)
3 if k < q then
4 for j = k + 1 to q do
5 if k > 0 then
6 TTMQR(i, piv(i, k), k, j)
7 GEQRT (i, k + 1)
8 for j = k + 2 to q do
9 UNMQR(i, k, j)
The second approach to implement the orthogonal transformation
elim(i, piv(i, k), k) is with the TT (Triangle on top of triangle) kernels, as shown
in Algorithm 3.3. Here tile (i, k) is zeroed out (with TTQRT ) and subsequent tiles
(i, j) and (piv(i, k), j), j > k, in rows i and piv(i, k) are updated (with TTMQR).
Immediately following, tile (i, k + 1) is factored into a triangle and the correspond-
ing transformations are applied to the remaining columns in row i. Necessarily, TTQRT requires the triangularizations of both tile (i, k) and tile (piv(i, k), k) to be completed before it can proceed. For the first column there are no updates to apply from previous columns, so the triangularization of these tiles (with GEQRT) can be considered a preprocessing step. The flop count is 2(4 + 6(q − k)) + 2 + 6(q − k) = 10 + 18(q − k), just as before. Dependencies are the
following:
GEQRT (piv(i, k), k) ≺ UNMQR(piv(i, k), k, j) for j > k (3.2a)
GEQRT (i, k) ≺ UNMQR(i, k, j) for j > k (3.2b)
GEQRT (piv(i, k), k) ≺ TTQRT (i, piv(i, k), k) (3.2c)
GEQRT (i, k) ≺ TTQRT (i, piv(i, k), k) (3.2d)
TTQRT (i, piv(i, k), k) ≺ TTMQR(i, piv(i, k), k, j) for j > k (3.2e)
UNMQR(piv(i, k), k, j) ≺ TTMQR(i, piv(i, k), k, j) for j > k (3.2f)
UNMQR(i, k, j) ≺ TTMQR(i, piv(i, k), k, j) for j > k (3.2g)
Now the factor operations in rows piv(i, k) and i can be executed in parallel. Moreover,
the UNMQR updates can be run in parallel with the TTQRT factorization. Thus,
with an unbounded number of processors, the parallel time is 4+6+6 = 16 time-units.
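As a small cross-check of the two analyses above, the following Python sketch (our own helper names, using the kernel weights of Table 3.1) encodes both the total weight of one elimination and the length of its internal dependency chain:

    def elim_cost_flops(q, k):
        # Total weight of one elimination elim(i, piv(i,k), k), in units of nb^3/3.
        # It is the same for the TS variant (Algorithm 3.2) and the TT variant (Algorithm 3.3).
        return 10 + 18 * (q - k)

    def elim_cost_parallel(kind):
        # Longest dependency chain inside one elimination with unbounded processors:
        # GEQRT -> TSQRT -> TSMQR for TS (4 + 6 + 12), GEQRT -> UNMQR -> TTMQR for TT (4 + 6 + 6).
        return {"TS": 4 + 6 + 12, "TT": 4 + 6 + 6}[kind]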
Recall our definition of the set Is,k to be the set of rows in column k that will be
zeroed out at time step s in the coarse-grain algorithm. Thus the following depen-
dencies are a direct consequence of (3.1c) as applied to the zeroing of a tile and the corresponding updates:
TTQRT (Is−1, piv(Is−1, k), k) ≺ TTQRT (Is, piv(Is, k), k) (3.3a)
TTMQR(Is−1, piv(Is−1, k), k, j) ≺ TTMQR(Is, piv(Is, k), k, j) for j > k (3.3b)
In Algorithms 3.2 and 3.3, it is understood that if a tile is already in triangular
form, then the associated GEQRT and update kernels do not need to be applied.
All the new algorithms introduced in this chapter are based on TT kernels. From
an algorithmic perspective, TT kernels are more appealing than TS kernels, as they
offer more parallelism. More precisely, we can always break a TS kernel into two TT
kernels: We can replace a TSQRT (i, piv(i, k), k) (following a GEQRT (piv(i, k), k)) by
a GEQRT (i, k) and a TTQRT (i, piv(i, k), k). A similar transformation can be made
for the updates. Hence a TS -based tiled algorithm can always be executed with TT
kernels, while the converse is not true. However, the TS kernels provide more data
locality, they benefit from a very efficient implementation (see §3.3), and several ex-
isting algorithms use these kernels. For all these reasons, and for comprehensiveness,
our experiments will compare approaches based on both kernel types.
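As an illustration of this TS-to-TT conversion, here is a minimal Python sketch; the task-list representation as (name, args) tuples and the function name ts_to_tt are ours, not part of PLASMA:

    def ts_to_tt(tasks):
        # Rewrite a TS-kernel task list into an equivalent TT-kernel task list.
        out = []
        for name, args in tasks:
            if name == "TSQRT":          # TSQRT(i, piv, k) -> GEQRT(i, k) + TTQRT(i, piv, k)
                i, piv, k = args
                out.append(("GEQRT", (i, k)))
                out.append(("TTQRT", (i, piv, k)))
            elif name == "TSMQR":        # TSMQR(i, piv, k, j) -> UNMQR(i, k, j) + TTMQR(i, piv, k, j)
                i, piv, k, j = args
                out.append(("UNMQR", (i, k, j)))
                out.append(("TTMQR", (i, piv, k, j)))
            else:                        # GEQRT, UNMQR, TTQRT, TTMQR are kept unchanged
                out.append((name, args))
        return out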
At this point, the PLASMA library only contains TS kernels. We have mapped
the PLASMA algorithm to a TT kernel algorithm using this conversion. Going from a
TS kernel algorithm to a TT kernel algorithm is implicitly done by Hadri et al. [11]
when going from their “Semi-Parallel” to their “Fully-Parallel” algorithms.
3.1.2 Elimination lists
As stated above, any algorithm factorizing a tiled matrix of size p× q is charac-
terized by its elimination list. Obviously, the algorithm must zero out all tiles below
the diagonal: for each tile (i, k), i > k, 1 ≤ k ≤ min(p, q), the list must contain
exactly one entry elim(i, ?, k), where ? denotes some row index piv(i, k). There are
two conditions for a transformation elim(i, piv(i, k), k) to be valid:
• both rows i and piv(i, k) must be ready, meaning that all their tiles left
of the panel (of indices (i, k′) and (piv(i, k), k′) for 1 ≤ k′ < k) must
have already been zeroed out: all transformations elim(i, piv(i, k′), k′) and
elim(piv(i, k), piv(piv(i, k), k′), k′) must precede elim(i, piv(i, k), k) in the elim-
ination list
• row piv(i, k) must be a potential annihilator, meaning that tile (piv(i, k), k) has
not been zeroed out yet: the transformation elim(piv(i, k), piv(piv(i, k), k), k)
must follow elim(i, piv(i, k), k) in the elimination list
Any algorithm that factorizes the tiled matrix obeying these conditions is called a
generic tiled algorithm in the following.
Theorem 3.1 No matter what elimination list is used (any combination of TT and TS kernels), the total weight of the tasks for performing a tiled QR factorization algorithm is constant and equal to 6pq^2 − 2q^3.
Proof: We have that the quantity of each kernel is given by the following
L1: GEQRT = TTQRT + q
L2: UNMQR = TTMQR + (1/2)q(q − 1)
L3: TTQRT + TSQRT = pq − (1/2)q(q + 1)
L4: TTMQR + TSMQR = (1/2)pq(q − 1) − (1/6)q(q − 1)(q + 1)
The quantity of TTQRT provides the number of tiles zeroed out via a triangle on
top of a triangle kernel. Thus equation L1 is composed of two parts: necessarily,
the diagonal tiles must be triangularized and each TTQRT must admit one more
triangularization in order to provide a pairing. The number of updates of these trian-
gularizations, given by L2, is simply the sum of the updates from the triangularization
of the tiles on the diagonal and the updates from the zeroed tiles via TTQRT . The
combination of TTQRT and TSQRT , equation L3, is exactly the total number of
tiles that are zeroed, namely every tile below the diagonal. Hence, the total number
of updates, provided by L4, is the number of tiles below the diagonal beyond the first
column minus the sum of the progression through the columns. Now we define
L5 = 4·L1 + 6·L2 + 6·L3 + 12·L4.
Collecting the kernel counts (and moving the TTQRT and TTMQR terms coming from L1 and L2 to the left-hand side), the left-hand side of L5 is
L5 = 4GEQRT + 6TSQRT + 2TTQRT + 6UNMQR + 6TTMQR + 12TSMQR.
As can be noted in L5, the coefficients of each term correspond precisely to the weight
of the kernels as derived from the number of flops each kernel incurs. Simplifying L5, we have our result
L5 = 6pq^2 − 2q^3.
A critical consequence of Theorem 3.1 is that, no matter what elimination list is used, the total weight of the tasks for performing a tiled QR factorization algorithm is constant. Using our unit task weight of nb^3/3, with m = p·nb and n = q·nb, we obtain 2mn^2 − (2/3)n^3 flops, which is exactly the same count as for a standard Householder reflection algorithm as found in LAPACK (e.g., [14]).
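The invariance can also be checked numerically. The sketch below (function names are ours) rebuilds the kernel counts from L1-L4 for an arbitrary, hypothetical number of TT eliminations and TT updates, and confirms that the weighted sum is always 6pq^2 − 2q^3:

    def kernel_counts(p, q, n_ttqrt, n_ttmqr):
        # Kernel counts implied by L1-L4, with n_ttqrt tiles zeroed via TT kernels
        # and n_ttmqr TT updates (treated here as free parameters).
        geqrt = n_ttqrt + q                                                  # L1
        unmqr = n_ttmqr + q * (q - 1) // 2                                   # L2
        tsqrt = p * q - q * (q + 1) // 2 - n_ttqrt                           # L3
        tsmqr = p * q * (q - 1) // 2 - q * (q - 1) * (q + 1) // 6 - n_ttmqr  # L4
        return geqrt, unmqr, tsqrt, tsmqr

    def total_weight(p, q, n_ttqrt=0, n_ttmqr=0):
        # Weighted sum 4*GEQRT + 6*UNMQR + 6*TSQRT + 12*TSMQR + 2*TTQRT + 6*TTMQR.
        geqrt, unmqr, tsqrt, tsmqr = kernel_counts(p, q, n_ttqrt, n_ttmqr)
        return 4 * geqrt + 6 * unmqr + 6 * tsqrt + 12 * tsmqr + 2 * n_ttqrt + 6 * n_ttmqr

    # e.g. total_weight(40, 12, n_ttqrt=100, n_ttmqr=500) == 6*40*12**2 - 2*12**3 == 31104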
3.1.3 Execution schemes
In essence, the execution of a generic tiled algorithm is fully determined by its
elimination list. This list is statically given as input to the scheduler, and the exe-
cution progresses dynamically, with the scheduler executing all required transforma-
tions as soon as possible. More precisely, each transformation involves several kernels,
whose execution starts as soon as they are ready, i.e., as soon as all dependencies have
been enforced. Recall that a tile (i, k) can be zeroed out only after all tiles (i, k′),
with k′ < k, have been zeroed out. Execution progresses as follows:
• Before being ready for elimination, tile (i, k), i > k, must be updated k−1 times,
in order to zero out the k − 1 tiles to its left (of index (i, k′), k′ < k). The last
update is a transformation TTMQR(i, piv(i, k−1), k−1, k) for some row index
piv(i, k− 1) such that elim(i, piv(i, k− 1), k− 1) belongs to the elimination list.
When completed, this transformation triggers the transformation GEQRT (i, k),
which can be executed immediately after the completion of the TTMQR. In
turn, GEQRT (i, k) triggers all updates UNMQR(i, k, j) for all j > k. These
updates are executed as soon as they are ready for execution.
• The elimination elim(i, piv(i, k), k) is performed as soon as possible when both
rows i and piv(i, k) are ready. Just after the completion of GEQRT (i, k) and
GEQRT (piv(i, k), k), kernel TTQRT (i, piv(i, k), k) is launched. When finished,
it triggers the updates TTMQR(i, piv(i, k), k, j) for all j > k.
Obviously, the degree of parallelism that can be achieved depends upon the elim-
inations that are chosen. For instance, if all eliminations in a given column use the
same factor tile, they will be sequentialized. This corresponds to the flat tree elimi-
nation scheme described below: in each column k, it uses elim(i, k, k) for all i > k.
On the contrary, two eliminations elim(i, piv(i, k), k) and elim(i′, piv(i′, k), k) in the
same column can be fully parallelized provided that they involve four different rows.
Finally, note that several eliminations can be initiated in different columns simulta-
neously, provided that they involve different pairs of rows, and that all these rows are
ready (i.e., they have the desired number of leftmost zeros).
The following lemma will prove very useful; it states that we can assume w.l.o.g.
that each tile is zeroed out by a tile above it, closer to the diagonal.
Lemma 3.2 Any generic tiled algorithm can be modified, without changing its exe-
cution time, so that all eliminations elim(i, piv(i, k), k) satisfy i > piv(i, k).
Proof: Define a reverse elimination as an elimination elim(i, piv(i, k), k) where
i < piv(i, k). Consider a generic tiled algorithm whose elimination list contains
some reverse eliminations. Let k0 be the first column to contain one of them. Let
i0 be the largest row index involved in a reverse elimination in column k0. The
elimination list in column k0 may contain several reverse eliminations elim(i1, i0, k0),
elim(i2, i0, k0), . . . , elim(ir, i0, k0), in that order, before row i0 is eventually zeroed out
by the transformation elim(i0, piv(i0, k0), k0). Note that piv(i0, k0) < i0 by definition
of i0. We modify the algorithm by exchanging the roles of rows i0 and i1 in column
k0: the elimination list now includes elim(i0, i1, k0), elim(i2, i1, k0), . . . , elim(ir, i1, k0),
and elim(i1, piv(i0, k0), k0). All dependencies are preserved, and the execution time is
unchanged. Now the largest row index involved in a reverse elimination in column k0
is strictly smaller than i0, and we repeat the procedure until there does not remain any
reverse elimination in column k0. We proceed inductively to the following columns,
until all reverse eliminations have been suppressed.
3.2 Critical paths
In this section we describe several generic tiled algorithms, and we provide their
critical paths, as well as optimality results. These algorithms are inspired by algo-
rithms that were introduced twenty to thirty years ago [45, 36, 25, 24], albeit for
a much simpler, coarse-grain model. In this “old” model, the time-unit is the time
needed to execute an orthogonal transformation across two matrix rows, regardless
of the position of the zero to be created, hence regardless of the length of these rows.
Although the granularity is much coarser in this model, any existing algorithm for
the old model can be transformed into a generic tiled algorithm, just by enforcing
the very same elimination list provided by the algorithm. Critical paths are obtained
using a discrete event based simulator specially developed to this end, based on the
Simgrid framework [47]. It carefully handles dependencies across tiles, and allows for
the analysis of both static and dynamic algorithms.1
3.2.1 Coarse-grain algorithms
We start with a short description of three algorithms for the coarse-grain model.
These algorithms are illustrated in Table 3.2 for a 15× 6 matrix.
3.2.1.1 Sameh-Kuck algorithm
The Sameh-Kuck algorithm [45] uses the panel row for all eliminations in each
column, starting from below the diagonal and proceeding downwards. Time-steps
indicate the time-unit at which the elimination can be done, assuming unbounded
resources. Formally, the elimination list is
{(elim(i, k, k), i = k + 1, k + 2, . . . , p) , k = 1, 2, . . . ,min(p, q)}
1The discrete event based simulator, together with the code for all tiled algorithms, is publicly
available at http://graal.ens-lyon.fr/~mjacquel/tiledQR.html
This algorithm is also referred to as FlatTree.
3.2.1.2 Fibonacci algorithm
The Fibonacci algorithm is the Fibonacci scheme of order 1 in [36]. Let
coarse(i, k) be the time-step at which tile (i, k), i > k, is zeroed out. These val-
ues are computed as follows. In the first column, there are one 5, two 4’s, three 3’s,
four 2’s and four 1’s (we would have had five 1’s with p = 16). Given x as the least
integer such that x(x+ 1)/2 ≥ p− 1, we have coarse(i, 1) = x− y + 1 where y is the
least integer such that i ≤ y(y + 1)/2 + 1. Let the row indices of the z tiles that are
zeroed out at step s, 1 ≤ s ≤ x, range from i to i+z−1. The elimination list for these
tiles is elim(i+j, piv(i+j, 1), 1), with piv(i+j, 1) = i+j−z for 0 ≤ j ≤ z−1. In other
words, to eliminate a bunch of z consecutive tiles at the same time-step, the algorithm
uses the z rows above them, pairing them in the natural order. Now the elimination
scheme of the next column is the same as that of the previous column, shifted down
by one row, and adding two time-units: coarse(i, k) = coarse(i− 1, k − 1) + 2, while
the pairing obeys the same rule.
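The following Python sketch (our own helper, 1-indexed tiles, assuming p ≥ q) computes these coarse-grain time-steps directly from the two rules above; it reproduces, for instance, column 1 of Table 3.2(b) for p = 15:

    def fib_coarse(p, q):
        # coarse[(i, k)] is the time-step at which tile (i, k), i > k, is zeroed out.
        coarse = {}
        # x: least integer with x(x+1)/2 >= p - 1
        x = next(v for v in range(p + 1) if v * (v + 1) // 2 >= p - 1)
        for i in range(2, p + 1):
            # y: least integer with i <= y(y+1)/2 + 1
            y = next(v for v in range(1, p + 1) if i <= v * (v + 1) // 2 + 1)
            coarse[(i, 1)] = x - y + 1
        # each subsequent column is the previous one shifted down by one row, two steps later
        for k in range(2, q + 1):
            for i in range(k + 1, p + 1):
                coarse[(i, k)] = coarse[(i - 1, k - 1)] + 2
        return coarse

    # [fib_coarse(15, 6)[(i, 1)] for i in range(2, 16)]
    #   -> [5, 4, 4, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1]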
3.2.1.3 Greedy algorithm
At each step, the Greedy algorithm [24, 25] eliminates as many tiles as possible
in each column, starting with bottom rows. The pairing for the eliminations is done
exactly as for Fibonacci. There is no closed-form formula to compute coarse(i, k),
the time-step at which tile (i, k) is eliminated, but it is possible to provide recursive
expressions (see [24, 25]).
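Although no closed-form formula is known, the coarse-grain Greedy time-steps are easy to generate by simulation. The sketch below (our own helper, assuming p ≥ q) tracks, for each row, how many of its leading tiles have been zeroed out and, at every step and in every column, eliminates the bottom half of the rows that are ready; it reproduces Table 3.2(c) for a 15 × 6 matrix:

    def greedy_coarse(p, q):
        # Returns {(i, k): step}, the step at which tile (i, k) is zeroed out.
        zeros = [0] * (p + 1)          # zeros[i]: number of leading zeros of row i (1-indexed)
        steps, step = {}, 0
        while any(zeros[i] < min(i - 1, q) for i in range(1, p + 1)):
            step += 1
            snapshot = zeros[:]        # decisions are based on the state at the start of the step
            for k in range(1, q + 1):
                ready = [i for i in range(k, p + 1) if snapshot[i] == k - 1]
                for i in sorted(ready, reverse=True)[: len(ready) // 2]:
                    zeros[i] = k       # the bottom half of the ready rows is zeroed out
                    steps[(i, k)] = step
        return steps

    # e.g. greedy_coarse(15, 6)[(2, 1)] == 4 and greedy_coarse(15, 6)[(15, 6)] == 8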
Consider a rectangular p × q matrix, with p > q. With the coarse-grain model,
the critical path of Sameh-Kuck is p+ q − 2, and that of Fibonacci is x+ 2q − 2,
where x is the least integer such that x(x + 1)/2 ≥ p − 1. There is no closed-form expression for the critical path of Greedy, but it is known to be optimal. For square q × q
matrices, critical paths are slightly different (2q− 3 for Sameh-Kuck, x+ 2q− 4 for
Fibonacci).
Table 3.2: Time-steps for coarse-grain algorithms (15 × 6 tiles; ? marks the diagonal tile).

(a) Sameh-Kuck        | (b) Fibonacci          | (c) Greedy
?                     | ?                      | ?
1 ?                   | 5 ?                    | 4 ?
2 3 ?                 | 4 7 ?                  | 3 6 ?
3 4 5 ?               | 4 6 9 ?                | 3 5 8 ?
4 5 6 7 ?             | 3 6 8 11 ?             | 2 5 7 10 ?
5 6 7 8 9 ?           | 3 5 8 10 13 ?          | 2 4 7 9 12 ?
6 7 8 9 10 11         | 3 5 7 10 12 15         | 2 4 6 9 11 14
7 8 9 10 11 12        | 2 5 7 9 12 14          | 2 4 6 8 10 13
8 9 10 11 12 13       | 2 4 7 9 11 14          | 1 3 5 8 10 12
9 10 11 12 13 14      | 2 4 6 9 11 13          | 1 3 5 7 9 11
10 11 12 13 14 15     | 2 4 6 8 11 13          | 1 3 5 7 9 11
11 12 13 14 15 16     | 1 4 6 8 10 13          | 1 3 4 6 8 10
12 13 14 15 16 17     | 1 3 6 8 10 12          | 1 2 4 6 8 10
13 14 15 16 17 18     | 1 3 5 8 10 12          | 1 2 4 5 7 9
14 15 16 17 18 19     | 1 3 5 7 10 12          | 1 2 3 5 6 8
3.2.2 Tiled algorithms
As stated above, each coarse-grain algorithm can be transformed into a tiled
algorithm, simply by keeping the same elimination list, and triggering the execution
of each kernel as soon as possible. However, because the weights of the factor and
update kernels are not the same, it is much more difficult to compute the critical
paths of the transformed (tiled) algorithms. Table 3.3 is the counterpart of Table 3.2,
and depicts the time-steps at which tiles are actually zeroed out. Note that the tiled
version of Sameh-Kuck is indeed the FlatTree algorithm in PLASMA [18, 19], and
we have renamed it accordingly. As an example, Algorithm 3.4 shows the Greedy
algorithm for the tiled model.
A first (and quite unexpected) result is that Greedy is no longer optimal, as shown in the first two columns of Table 3.4a for a 15 × 2 matrix. In each column and at each step, the Asap algorithm starts the elimination of a tile as soon as there are at least two rows ready for the transformation. When s ≥ 2 eliminations can start simultaneously, Asap pairs the 2s rows just as Fibonacci and Greedy do: the first row (closest to the diagonal) with row s + 1, the second row with row s + 2, and so on. As a matter of fact, when processing the second column, both Asap and Greedy begin with the elimination of lines 10 to 15 (at time step 20). However, once tiles (13, 2), (14, 2) and (15, 2) are zeroed out (i.e., at time step 22), Asap eliminates 4 zeros, in rows 9 through 12. On the contrary, Greedy waits until time step 26 to eliminate 6 zeros in rows 6 through 12. In a sense, Asap is the counterpart of Greedy at the tile level. However, Asap is not optimal either, as shown in Table 3.4a for a 15 × 3 matrix. On larger examples, the critical path of Greedy is better than that of Asap, as shown in Table 3.4b.
We can however use the optimality of the coarse-grain Greedy to devise an
optimal tiled algorithm. Let us define the following algorithm:
Definition 3.3 Given a matrix of p × q tiles, with p > q, the GrASAP(i) algorithm
1. uses Algorithm 3.3 to execute Greedy on the first q − i columns, propagating the updates through column q, and
2. applies the Asap algorithm to columns q − i + 1 through q.
Clearly, if we let i = q we obtain the Asap algorithm. We define GrASAP to be
GrASAP (1), i.e., only the elimination of the last column will differ from Greedy,
and we will show that GrASAP is an optimal tiled algorithm.
Although we cannot provide an elimination list for the entire tiled matrix of
size p × q, we do provide an elimination list for the first q − 1 columns. This tiled
elimination list describes the time-steps at which Algorithm 3.3 is complete, i.e., all of
the factorization kernels are complete for k ≤ q−1 and corresponding update kernels
are complete for all columns k < j ≤ q.
We must make note of one consequence from the coarse-grain elimination list
before proceeding. We will use this repeatedly within the proof of translating a
coarse-grain elimination list to a tiled elimination list.
Lemma 3.4 Given an elimination list from any coarse-grain algorithm, let s = coarse(i, k) be the time step at which element (i, k) is eliminated and let
Is,k = {i | coarse(i, k) = s} .
Then for any s we have
s − 1 = coarse(Is,k, k) − 1 ≥ max { coarse(Is,k, k − 1), coarse(piv(Is,k, k), k − 1) }
and in particular
s1 − 1 = max { coarse(Is1,k, k − 1), coarse(piv(Is1,k, k), k − 1) }
where s1 = min_{k+1 ≤ i ≤ p} coarse(i, k).
Proof: This follows directly from the dependencies given in (3.1a)-(3.1c).
Theorem 3.5 Given the elimination list of a coarse-grain algorithm for a matrix of
size p× q, using Algorithm 3.3, the tiled elimination list for all but the last column is
given by
tiled(i, k) = 10k + 6 · coarse(i, k), 1 ≤ i ≤ p, 1 ≤ k ≤ q − 1
where coarse(i, k) is the elimination list of the coarse-grain algorithm.
Proof: In this analysis, when k is clear, we will use Is instead of Is,k. By abuse of no-
tation, we will write GEQRT (i, k) to denote the time at which the task GEQRT (i, k)
is complete and this will be the same for all of the kernels. Thus we will prove that
tiled(i, k) = TTMQR(i, piv(i, k), k, j) for j > k.
Note that j represents the column in which the updates are applied and all columns
j for j > k have the same update history. In Algorithm 3.3, the two j-loops spawn
mutually independent tasks. Since we have an unbounded number of processors, these
tasks can all run simultaneously. So j represents any one of these columns.
We will proceed by induction on k. For the first column, k = 1, we do not have
any dependencies which concern the GEQRT operations. Thus from (3.2b) we have
for 1 ≤ i ≤ p,
GEQRT (i, 1) = 4 (3.4)
UNMQR(i, 1, j) = 4 + 6 = 10 (3.5)
Since each column in the coarse-grain elimination list is composed of one or more
time steps, we must also proceed with induction on the time steps. Let
s1 = min
2≤i≤p
(coarse(i, 1)) . (3.6)
In the case k = 1, we have that
s1 = 1.
In other words, the first tasks finish at time step 1 for the coarse-grain algorithm.
This is a complicated manner in which to state that s1 = 1, but it will be needed in
the general setting.
So for s1, from (3.2c) and (3.2d) we have
TTQRT (Is1, piv(Is1, 1), 1) = max { GEQRT (piv(Is1, 1), 1), GEQRT (Is1, 1) } + 2 = 4 + 2,
thus
TTQRT (Is1, piv(Is1, 1), 1) = 4 + 2s1. (3.7)
Now from (3.2e), (3.2f), and (3.2g) we have
TTMQR(Is1, piv(Is1, 1), 1, j) = max { TTQRT (Is1, piv(Is1, 1), 1), UNMQR(piv(Is1, 1), 1, j), UNMQR(Is1, 1, j) } + 6
= 10 · 1 + 6s1
= 10 · 1 + 6 · coarse(Is1, 1)
Therefore,
TTMQR(Is1, piv(Is1, 1), 1, j) = tiled(Is1, 1). (3.8)
Assume that for 1 ≤ t ≤ s− 1 we have
TTQRT (It, piv(It, 1), 1) = 4 + 2t (3.9)
TTMQR(It, piv(It, 1), 1, j) = 10 + 6t (3.10)
then from (3.2c), (3.2d), and (3.3a) we have
TTQRT (Is, piv(Is, 1), 1) = max { GEQRT (piv(Is, 1), 1), GEQRT (Is, 1), TTQRT (Is−1, piv(Is−1, 1), 1) } + 2
= max { 4, 4, 4 + 2(s − 1) } + 2
Thus
TTQRT (Is, piv(Is, 1), 1) = 4 + 2s. (3.11)
From (3.2e), (3.2f), (3.2g), and (3.3b) we have
TTMQR(Is, piv(Is, 1), 1, j) = max { TTQRT (Is, piv(Is, 1), 1), UNMQR(piv(Is, 1), 1, j), UNMQR(Is, 1, j), TTMQR(Is−1, piv(Is−1, 1), 1, j) } + 6
= max { 4 + 2s, 10, 10, 10 + 6(s − 1) } + 6
= 10 + 6(s − 1) + 6
= 10 + 6s
= 10 · 1 + 6 · coarse(Is, 1)
Thus
TTMQR(Is, piv(Is, 1), 1, j) = tiled(Is, 1) (3.12)
establishing our base case for the induction on k.
Now assume that for 1 ≤ h ≤ k − 1 we have, for any s in column h,
TTMQR(Is, piv(Is, h), h, j) = tiled(Is, h).
In order to start the elimination of the next column, we must have that all updates
from the elimination of the previous column are complete. Thus using the induction
assumption, we have
GEQRT (i, k) = TTMQR(i, piv(i, k − 1), k − 1, k) + 4
so that
GEQRT (i, k) = 10(k − 1) + 6 · coarse(i, k − 1) + 4 (3.13)
and
UNMQR(i, k, j) = max { GEQRT (i, k), TTMQR(i, piv(i, k − 1), k − 1, j) } + 6
so that
UNMQR(i, k, j) = 10k + 6 · coarse(i, k − 1). (3.14)
Again, we must proceed with an induction on the time steps in column k. Let
s1 = min_{k+1 ≤ i ≤ p} coarse(i, k) . (3.15)
From (3.2c) and (3.2d) we have
TTQRT (Is1, piv(Is1, k), k) = max { GEQRT (piv(Is1, k), k), GEQRT (Is1, k) } + 2
= max { 10(k − 1) + 6 · coarse(piv(Is1, k), k − 1) + 4, 10(k − 1) + 6 · coarse(Is1, k − 1) + 4 } + 2
= 10(k − 1) + 6 max { coarse(piv(Is1, k), k − 1), coarse(Is1, k − 1) } + 4 + 2
From the application of Lemma 3.4, we have
coarse(Is1, k) − 1 = max { coarse(Is1, k − 1), coarse(piv(Is1, k), k − 1) }
such that
TTQRT (Is1, piv(Is1, k), k) = 10(k − 1) + 6 [coarse(Is1, k) − 1] + 4 + 2.
Therefore,
TTQRT (Is1, piv(Is1, k), k) = 10(k − 1) + 6s1. (3.16)
For the updates, we must again examine the three dependencies which result from (3.2e), (3.2f), and (3.2g), such that we have
TTMQR(Is1, piv(Is1, k), k, j) = max { UNMQR(piv(Is1, k), k, j), UNMQR(Is1, k, j), TTQRT (Is1, piv(Is1, k), k) } + 6
= max { 10k + 6 · coarse(Is1, k − 1), 10k + 6 · coarse(piv(Is1, k), k − 1), 10(k − 1) + 6s1 } + 6
Using Lemma 3.4, we have
TTMQR(Is1, piv(Is1, k), k, j) = max { 10k + 6(s1 − 1), 10(k − 1) + 6s1 } + 6
= 10k + max { 6s1 − 6, 6s1 − 10 } + 6
= 10k + 6s1
Therefore
TTMQR(Is1, piv(Is1, k), k, j) = tiled(Is1, k). (3.17)
Now assume that for s1 ≤ t ≤ s− 1 we have
TTQRT (It, piv(It, k), k) ≤ 10(k − 1) + 6t (3.18)
TTMQR(It, piv(It, k), k, j) = 10k + 6t. (3.19)
and note that, by Lemma 3.4, we do not necessarily have equality for t > s1. From (3.2c), (3.2d), and (3.3a) we have
TTQRT (Is, piv(Is, k), k) = max { GEQRT (piv(Is, k), k), GEQRT (Is, k), TTQRT (Is−1, piv(Is−1, k), k) } + 2
≤ max { 10(k − 1) + 6 · coarse(piv(Is, k), k − 1) + 4, 10(k − 1) + 6 · coarse(Is, k − 1) + 4, 10(k − 1) + 6(s − 1) } + 2
Note that from Lemma 3.4
s − 1 ≥ max { coarse(piv(Is, k), k − 1), coarse(Is, k − 1) }
such that
TTQRT (Is, piv(Is, k), k) ≤ 10(k − 1) + 6(s − 1) + 4 + 2.
Thus
TTQRT (Is, piv(Is, k), k) ≤ 10(k − 1) + 6s. (3.20)
For the updates, we must examine the four dependencies which result from (3.2e), (3.2f), (3.2g), and (3.3b), such that we have
TTMQR(Is, piv(Is, k), k, j) = max { UNMQR(piv(Is, k), k, j), UNMQR(Is, k, j), TTQRT (Is, piv(Is, k), k), TTMQR(Is−1, piv(Is−1, k), k, j) } + 6
= max { 10k + 6 · coarse(Is, k − 1), 10k + 6 · coarse(piv(Is, k), k − 1), 10(k − 1) + 6s, 10k + 6(s − 1) } + 6.
As before, Lemma 3.4 allows us to write
TTMQR(Is, piv(Is, k), k, j) = max { 10k + 6(s − 1), 10(k − 1) + 6s } + 6
= 10k + max { 6s − 6, 6s − 10 } + 6
= 10k + 6s
Therefore
TTMQR(Is, piv(Is, k), k, j) = tiled(Is, k). (3.21)
Corollary 3.6 Given an elimination list for a coarse-grain algorithm on a matrix of
size p× q where p > q, the critical path length of the corresponding tiled algorithm is
bounded by
tiled(p, q − 1) + 4 + 2 ≤ CP (p, q) < tiled(p, q).
Proof: For any tiled matrix, the last column must still be factorized, which accounts for the additional four time steps, and since p > q at least one TTQRT will be present, which accounts for the additional two time steps; this establishes the lower bound. For the upper bound, tiled(p, q) includes not only the factorization of column q but also the corresponding updates onto a column q + 1, as if one more column were present, so the critical path of the p × q tiled matrix must be strictly smaller.
Corollary 3.7 Given an elimination list for a coarse-grain algorithm on a matrix of
size p× q where p = q, the critical path length of the corresponding tiled algorithm is
CP (p, q) = tiled(p, q − 1) + 4.
Proof: In the last column, we need only to factorize the diagonal tile which explains
the additional four time steps. Moreover, there are no further columns to apply any
updates to nor any tiles below the diagonal that need to be eliminated. Thus the
result is obtained.
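Combining Theorem 3.5 with Corollaries 3.6 and 3.7 gives, for any coarse-grain scheme, a direct way to bound the critical path of its tiled counterpart. Below is a minimal Python sketch (our own helpers), assuming coarse is a dictionary of coarse-grain time-steps such as the ones produced by the greedy_coarse sketch above, and q ≥ 2:

    def tiled_time(coarse, i, k):
        # tiled(i, k) = 10k + 6 * coarse(i, k)   (Theorem 3.5, valid for k <= q - 1)
        return 10 * k + 6 * coarse[(i, k)]

    def cp_bounds(coarse, p, q):
        # Returns (lower, upper) bounds on the tiled critical path; they coincide when p == q.
        last = tiled_time(coarse, p, q - 1)
        if p == q:
            return (last + 4, last + 4)                      # Corollary 3.7
        return (last + 4 + 2, 10 * q + 6 * coarse[(p, q)])   # Corollary 3.6

    # e.g. lo, hi = cp_bounds(greedy_coarse(15, 6), 15, 6)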
In the remainder of this chapter, we will make use of diagrams to clarify certain
aspects of the proofs and provide examples to further illustrate the points being made.
These diagrams make use of the kernel representations as shown in Figure 3.1.
Figure 3.1: Icon representations of the kernels and their weights: GEQRT (4), TTQRT (2), UNMQR (6), TTMQR (6).
We have a closed-form expression for the critical path of tiled FlatTree for all
three cases: single tiled column, square tiled matrix, and rectangular tiled matrix of
more than one column.
Proposition 3.8 Consider a tiled matrix of size p×q, where p ≥ q ≥ 1. The critical
path length of FlatTree is
CPft(p, q) =
    2p + 2,           if q = 1;
    22p − 24,         if p = q > 1;
    6p + 16q − 22,    if p > q > 1.
Proof: Consider first the case q = 1. We shall proceed by induction on p to
show that the critical path of FlatTree is of length 2p + 2. If p = 1, then from
Table 3.1 the result is obtained since only GEQRT (1, 1) is required. With the base
case established, now assume that this holds for all p − 1 > q = 1. Thus at time
t = 2(p− 1) + 2 = 2p, we have that for all p− 1 ≥ i ≥ 1 tile (i, 1) has been factorized
into a triangle and for all p−1 ≥ i > 1, tile (i, 1) has been zeroed out. Therefore, tile
(p, 1) will be zeroed out with TTQRT (p, 1) at time t+ 2 = 2(p− 1) + 2 + 2 = 2p+ 2.
Considering the second case p = q > 1, we use Figure 3.2 to illustrate. We initialize with a triangularization of the first column and send the update to the remaining column(s), 10 time units. Then we fill the pipeline with the updates onto the remaining column(s) from the zeroing operations of the first column, 6(p − 1) time units. Then for each column after the first, except the last one, we fill the pipeline with the triangularization, the update of the triangularization, and the update of the zeroing for the bottommost tile, (4 + 6 + 6)(p − 2) time units. In the last column, we then triangularize the bottommost tile, 4 time units. Thus
10 + 6(p − 1) + (4 + 6 + 6)(q − 2) + 4 = 6p + 16q − 24 = 22p − 24.
The third case is analogous to the second case, but we still need to zero out the bottommost tile in the last column, which explains the difference of 2 in the formula relative to the square case.
Figure 3.2: Critical path length for the weighted FlatTree on a matrix of 4 × 4 tiles (phases: initial, 10; fill the pipeline, 6(p − 1); pipeline, 16(q − 2); end, 4).
We recall that, for the coarse-grain algorithm,
coarse(p, q) =
    0,             if q < 1;
    0,             if p = q = 1;
    p + q − 2,     if p > q ≥ 1;
    2p − 3,        if p = q > 1.
So, for a tiled matrix of size p × q with p ≥ q ≥ 1, the critical path length of FlatTree can also be written as
CP (p, q) = 10(q − 1) + 6 coarse(p, q − 1) + 4 + 2 (coarse(p, q) − coarse(p, q − 1)) .
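The closed form of Proposition 3.8 and this coarse-grain rewriting can be cross-checked with a few lines of Python (a small sketch; coarse_flattree and cp_flattree are our names):

    def coarse_flattree(p, q):
        # Coarse-grain critical path of Sameh-Kuck / FlatTree, as recalled above.
        if q < 1 or (p == q == 1):
            return 0
        return 2 * p - 3 if p == q else p + q - 2

    def cp_flattree(p, q):
        # Closed-form tiled FlatTree critical path (Proposition 3.8).
        if q == 1:
            return 2 * p + 2
        return 22 * p - 24 if p == q else 6 * p + 16 * q - 22

    # e.g. for p = 40, q = 6:
    #   10*(q-1) + 6*coarse_flattree(p, q-1) + 4 + 2*(coarse_flattree(p, q) - coarse_flattree(p, q-1))
    #   == cp_flattree(p, q) == 314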
Proposition 3.9 The critical path length of Fibonacci is greater than 22q − 30 and less than 22q + 6⌈√(2p)⌉.
Proof: The critical path length of the coarse-grain Fibonacci algorithm for a p × q matrix is
coarse(p, q) = x + 2q − 2.
Thus from Corollary 3.6 we have
10(q − 1) + 6(x + 2(q − 1) − 2) + 4 ≤ CP (p, q) < 10q + 6(x + 2q − 2).
Recall that x is the least integer such that x(x + 1)/2 ≥ p − 1, whereby
x = ⌈ −1/2 + √(8p − 7)/2 ⌉.
Thus x ≤ ⌈√(2p)⌉ and therefore
22q − 30 < CP (p, q) < 22q − 12 + 6⌈√(2p)⌉.
Similarly to [25], in which an iterate of a column is defined for the coarse-grain algorithms, we define a weighted iterate, and our notation follows the same conventions.
A column of length n is a sequence of n integers:
a = a_1^{n_1} a_2^{n_2} · · · a_q^{n_q}
where exponentiation denotes repetition (concatenation), with the following restrictions:
a_1 ≥ 0;   a_{i+1} > a_i, 1 ≤ i ≤ q − 1;
n_i > 0, 1 ≤ i ≤ q;   n_1 + · · · + n_q = n.
We define on the set of columns of length n the classical partial ordering of R^n:
x ≤ y ⇐⇒ (x_i ≤ y_i, 1 ≤ i ≤ n)
and the s-truncate (1 ≤ s ≤ n) of a is the column of length s composed of the first s elements of a, denoted a_s.
Definition 3.10 Given a task weight w and a column a = a_1^{n_1} · · · a_q^{n_q}, the column c = c_1^{m_1} · · · c_p^{m_p} is called an iterate of a, written c = iter(a), if
(i) c is a column of length n − 1;
(ii) a_1 + w ≤ c_1;
  (a) if a_1 + w ≤ c_1 ≤ a_2 then m_1 ≤ ⌊n_1/2⌋;
  (b) if there exists an h such that a_{k−1} + w ≤ c_h ≤ a_k then
      m_h ≤ ⌊(n_1 + · · · + n_{k−1} − m_1 − · · · − m_{h−1})/2⌋
      for 2 ≤ k ≤ q and 1 ≤ h ≤ p, with m_0 = 0;
  (c) else a_j ≤ c_h and
      m_h ≤ ⌊(n_1 + · · · + n_j − m_1 − · · · − m_{h−1})/2⌋,
      where j = min(p + 1, q).
Definition 3.11 Given a task weight w, let a = a_1^{n_1} · · · a_q^{n_q} be a column of length n; then the sequence b = b_1^{m_1} · · · b_p^{m_p}, written b = optiter(a), is defined as
(i) for b_1 and m_1:
  (a) if n_1 = 1, then b_1 = a_2 + w and m_1 = ⌊(n_1 + n_2)/2⌋;
  (b) if n_1 > 1, then b_1 = a_1 + w and m_1 = ⌊n_1/2⌋.
(ii) if there exists k such that a_{k−1} + w ≤ b_{i−1} ≤ a_k, then
  r_{i−1} = n_1 + · · · + n_{k−1} − m_1 − · · · − m_{i−1} ≥ 1.
  (a) if b_{i−1} < a_k and r_{i−1} > 1, then b_i = b_{i−1} + w and m_i = ⌊r_{i−1}/2⌋;
  (b) else b_i = a_k + w and m_i = ⌊(n_k + r_{i−1})/2⌋.
(iii) if b_{i−1} > a_j, where j = min(i, q), then b_i = b_{i−1} + w and
  m_i ≤ ⌊(n_1 + · · · + n_j − m_1 − · · · − m_{i−1})/2⌋.
Proposition 3.12 Given an iterated column a of length n, the sequence
b = b_1^{m_1} · · · b_p^{m_p},
with b = optiter(a), is an iterate of a.
Proof: The proof follows directly from the definition.
Proposition 3.13
(i) Let a^n be a column of length n and c^{n−1} = iter(a^n) an iterated column of a^n. Then
b^{n−1} = optiter(a^n) ≤ iter(a^n) = c^{n−1}.
(ii) Let a^n and c^n be two columns of length n such that a^n ≤ c^n. Then
optiter(a^n) ≤ optiter(c^n).
Proof:
(i) Clearly, b1 ≤ c1 by definition since b1 is chosen to be as small as possible.
Moreover, by definition b_i ≤ c_j for i ≤ j since (i) if b_{i−1} < a_k and r_{i−1} > 1, meaning there are enough elements available to perform a pairing, then b_i is again chosen as small as possible, (ii) otherwise b_i = a_k + w, which is again the smallest, and (iii) else b_{i−1} > a_j and b_i is chosen as the next smallest element. Thus
b^{n−1}_{m_1+···+m_i} ≤ c^{n−1}_{m_1+···+m_i}, 1 ≤ i ≤ p,
so that b^{n−1} ≤ c^{n−1}.
(ii) This is another direct application of the definition and follows along the same
argument.
Clearly, letting w = 1 recovers the definitions of iter and optiter for the coarse-grain algorithms as presented in [25]. Definition 3.11 is the Asap algorithm on a single tiled column; it can be viewed as the counterpart of the coarse-grain Greedy algorithm in the tiled case and follows a bottom-to-top elimination of the tiles. In order to preserve the bottom-to-top elimination, the weight of the updates must be an integer multiple of the iterated column weight.
Theorem 3.14 Given a matrix of p× q tiles, a factorization kernel weight of γ, an
elimination kernel weight of α, and an update kernel weight of β = nα for some
n ∈ N, the GrASAP algorithm is optimal in the context of the class of algorithms
that progress left to right, bottom to top.
Proof: From Theorem 3.5 we have a direct translation from any coarse-grain al-
gorithm in this class to the tiled algorithm for the first q − 1 columns. Thus we are
given the time steps at which rows in column q are available for elimination. Now
we fix the time steps for the elimination of the last column and follow whatever tree
the algorithm provides for this last column. We can replace the elimination of the
first q − 1 columns and updates from these eliminations onto the remaining columns
with the tiled Greedy algorithm. This is possible since the translation function is
monotonically increasing and we know that Greedy is optimal for the coarse-grain
algorithms and therefore optimal for the first q−1 columns in the tiled algorithms. In
another manner of speaking, we slow down the eliminations and updates on the first
q− 1 columns when not using the tiled Greedy algorithm. (An illustrative example
is shown in Figure 3.3b.)
Let c be the next-to-last column of the coarse-grain elimination table, which is of length p − (q − 1) + 1. Now letting
a = (γ + β)(q − 1) + β · c + γ   (applied entrywise to c)
provides an iterated column of length p − (q − 1) + 1 for the tiled algorithm. With w = α we have that b = optiter(a) is an optimal iterated column of length p − q + 1
with the elimination progressing from bottom to top. This can be applied to any tiled
algorithm in this class since we only concern ourselves with the time steps at which the
last column’s elements are available for elimination. In other words, this is a speeding
up of the elimination of the last column while adhering to any restrictions incurred
from the previous columns. (An illustrative example is shown in Figure 3.3c.)
Combining these two ideas, let Greedy be performed on the first q− 1 columns
and then Asap on the remaining column q. This provides an optimal algorithm in
this class of algorithms.
In Figure 3.4 we provide an illustrated example of the Greedy and GrASAP
algorithms on a matrix of 15× 2 tiles where the operations are given by Figure 3.1.
It can be seen that GrASAP finishes before Greedy since Greedy must wait
and progress with the same elimination scheme as the coarse-grain algorithm while
GrASAP can begin eliminating in the last column as soon as a pair of tiles becomes
available. (The elimination of the first column is shown in light gray.)
We have analyzed the critical path length of GrASAP versus that of Greedy
for tiled matrices p × q where 1 ≤ p ≤ 100 and 1 ≤ q ≤ p (see Figure 3.5). In all
cases where there is a difference (which is just over 44% of the cases), the difference
is always two time steps.
Figure 3.3: Illustration of the first and second parts of the proof of Theorem 3.14 using the Fibonacci algorithm on a matrix of 15 × 2 tiles. (a) Fibonacci; (b) Greedy on the first q − 1 columns; (c) Asap on column q.
We now show that without having the update kernel weight an integer multiple
of the elimination kernel weight, the bottom to top progression is nullified and we
cannot provide optimality of the algorithm.
Assume that the update kernel weight is 3 and the elimination kernel weight is 2. Let a^{11} = 3^7 6^4 be a column from some elimination scheme. We shall apply three iterated schemes to this column: (1) an Asap scheme that progresses from bottom to top, (2) an Asap scheme that can progress in any manner, and (3) an Asap scheme which may introduce a lag.
In Table 3.5 we clearly see that elimination scheme (3) provides the best time for
the algorithm. The reason is that enough of a lag was provided such that a binomial
tree could progress without hindrance. Therefore without integer multiple weights on
the update kernel, we cannot know which scheme will be optimal.
Figure 3.4: Greedy versus GrASAP on a matrix of 15 × 2 tiles. (a) Greedy; (b) GrASAP. (The elimination of the first column is shown in light gray.)
Figure 3.5: Tiled matrices of size p × q for which the critical path length of GrASAP is shorter than that of Greedy, for 1 ≤ p ≤ 100 and 1 ≤ q ≤ p.
The PLASMA library provides more algorithms, which can be informally described as trade-offs between FlatTree and BinaryTree. (We remind the reader that FlatTree is the same algorithm as Sameh-Kuck.) These algorithms are referred to as PlasmaTree in all the following, and differ by the value of an input parameter called the domain size BS. This domain size can be any value between 1 and p, inclusive. Within a domain, which includes BS consecutive rows, the algorithm works just as FlatTree: the first row of each domain acts as a local panel and is used to
zero out the tiles in all the other rows of the domain. Then the domains are merged:
the panel rows are zeroed out by a binary tree reduction, just as in BinaryTree.
As the algorithm progresses through the columns, the domain on the very bottom is
reduced accordingly, until such time that there is one less domain. For the case that
BS = 1, PlasmaTree follows a binary tree on the entire column, and for BS = p,
the algorithm executes a flat tree on the entire column. It seems very difficult for a
user to select the domain size BS leading to best performance, but it is known that
BS should increase as q increases. Table 3.3 shows the time-steps of PlasmaTree
with a domain size of BS = 5. In the experiments of §3.3, we use all possible values
of BS and retain the one leading to the best value.
3.3 Experimental results
All experiments were performed on a 48-core machine composed of eight hexa-
core AMD Opteron 8439 SE (codename Istanbul) processors running at 2.8 GHz.
Each core has a theoretical peak of 11.2 Gflop/s with a peak of 537.6 Gflop/s for
the whole machine. The Istanbul micro-architecture is a NUMA architecture where
each socket has 6 MB of level-3 cache and each processor has a 512 KB level-2 cache
and a 128 KB level-1 cache. After having benchmarked the AMD ACML and Intel
MKL BLAS libraries, we selected MKL (10.2) since it appeared to be slightly faster
in our experimental context. Linux 2.6.32 and Intel Compilers 11.1 were also used in
conjunction with PLASMA 2.3.1.
For all results, we show both double and double complex precision, using all 48
cores of the machine. The matrices are of size m = 8000 and 200 ≤ n ≤ 8000. The
tile size is kept constant at nb = 200, so that the matrices can also be viewed as
p× q tiled matrices where p = 40 and 1 ≤ q ≤ 40. All kernels use an inner blocking
parameter of ib = 32.
In double precision, an FMA (“fused multiply-add”, y ← αx + y) involves three
double precision numbers for two flops, but these two flops can be combined into one
FMA and thus completed in one cycle. In double complex precision, the operation
51
  
FlatTree (TT)
PlasmaTree (TT) (best)
Fibonacci (TT)
Greedy
Best domain size for PlasmaTree (TT) = [ 1 3 5 5 5 10 10 10 10 10 20 . . . 20 ]
P
re
d
ic
te
d
G
F
L
O
P
/s
q
1 2 3 4 5 6 7 8 9 10 20 30 40
20
40
60
80
100
120
140
160
(a) Upper bound (double complex)
 
 
FlatTree (TT)
PlasmaTree (TT) (best)
Fibonacci (TT)
Greedy
Best domain size for PlasmaTree (TT) = [1 5 5 5 17 28 8]
G
F
L
O
P
/s
q
1 2 3 4 5 6 7 8 9 10 20 30 40
0
20
40
60
80
100
120
140
160
(b) Experimental (double complex)
 
 
FlatTree (TT)
PlasmaTree (TT) (best)
Fibonacci (TT)
Greedy
Best domain size for PlasmaTree (TT) = [ 1 3 5 5 5 10 10 10 10 10 20 . . . 20 ]
P
re
d
ic
te
d
G
F
L
O
P
/s
q
1 2 3 4 5 6 7 8 9 10 20 30 40
20
40
60
80
100
120
140
160
180
(c) Upper bound (double)
 
 
FlatTree (TT)
PlasmaTree (TT) (best)
Fibonacci (TT)
Greedy
Best domain size for PlasmaTree (TT) = [1 3 10 5 17 27 19]
G
F
L
O
P
/s
q
1 2 3 4 5 6 7 8 9 10 20 30 40
0
20
40
60
80
100
120
140
160
180
(d) Experimental (double)
Figure 3.6: Upper bound and experimental performance of QR factorization - TT
kernels
y ← αx + y involves six double precision numbers for eight flops; there is no FMA.
The ratio of computation/communication is therefore, potentially, four times higher in
double complex precision than in double precision. Communication aware algorithms
are much more critical in double precision than in double complex precision.
For each experiment, we provide a comparison of the theoretical performance to
the actual performance. The theoretical performance is obtained by modeling the
limiting factor of the execution time as either the critical path, or the sequential time
divided by the number of processors. This is similar in approach to the Roofline model [53].

Figure 3.7: Overhead in terms of critical path length and time with respect to Greedy (Greedy = 1). (a) Theoretical CP length; (b) Experimental (double complex); (c) Experimental (double). Algorithms shown: FlatTree (TT), PlasmaTree (TT) (best), Fibonacci (TT), and Greedy.

Taking γseq as the sequential performance, T as the total number of flops, cp as the length of the critical path, and P as the number of processors, the upper bound on performance, γub, is
γub = γseq · T / max(T/P, cp).
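A minimal Python sketch of this bound follows (our own helper; it assumes T and cp are expressed in the same unit of work, e.g. the nb^3/3 task unit, and γseq in Gflop/s):

    def perf_upper_bound(gamma_seq, total_work, cp, nprocs):
        # gamma_ub = gamma_seq * T / max(T / P, cp)
        return gamma_seq * total_work / max(total_work / nprocs, cp)

    # Hypothetical example on 48 cores, in units of nb^3/3 (p = q = 40 tiles, FlatTree):
    #   T = 6 * 40 * 40**2 - 2 * 40**3            # total weight (Theorem 3.1)
    #   perf_upper_bound(gamma_seq=4.0, total_work=T, cp=22*40 - 24, nprocs=48)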
Figures 3.6a and 3.6c depict the upper bound on performance of all algorithms which
use the Triangle on top of triangle kernels. Since PlasmaTree provides an addi-
tional tuning parameter of the domain size, we show the results for each value of this
parameter as well as the composition of the best of these domain sizes. Again, it
is not evident what the domain size should be for the best performance, hence our
exhaustive search.
Part of our comprehensive study also involved comparisons made to the Semi-
Parallel Tile and Fully-Parallel Tile CAQR algorithms found in [11] which are much
the same as those found in PLASMA. As with PLASMA, the tuning parameter
BS controls the domain size upon which a flat tree is used to zero out tiles below
the root tile within the domain and a binary tree is used to merge these domains.
Unlike PLASMA, it is not the bottom domain whose size decreases as the algorithm
progresses through the columns, but instead is the top domain. In this study, we found
that the PLASMA algorithms performed identically or better than these algorithms
and therefore we do not report these comparisons.
Figures 3.6b and 3.6d illustrate the experimental performance reached by Greedy, Fibonacci, and PlasmaTree algorithms using the TT (Triangle on top of trian-
gle) kernels. In both cases, double or double complex precision, the performance
of Greedy is better than PlasmaTree even for the best choice of domain size.
Moreover, as expected from the analysis in §3.2.2, Greedy outperforms Fibonacci
the majority of the time. Furthermore, we see that, for rectangular matrices, the
experimental performance in double complex precision matches the upper bound on
performance. This is not the case for double precision because communications have
higher impact on performance.
While it is apparent that Greedy does achieve higher levels of performance, the
percentage may not be as obvious. To that end, taking Greedy as the baseline, we
present in Figure 3.7 the theoretical, double, and double complex precision overhead
for each algorithm that uses the Triangle on top of triangle kernel as compared to
Greedy. These overheads are respectively computed in terms of critical path length
and time. At a smaller scale (Figure 3.13), it can be seen that Greedy can perform
up to 13.6% better than PlasmaTree.
  
Figure 3.8: Overhead in terms of critical path length and time with respect to Greedy (Greedy = 1). (a) Theoretical CP length; (b) Experimental (double complex); (c) Experimental (double). Algorithms shown: FlatTree (TT), PlasmaTree (TT) (best), Fibonacci (TT), and Greedy.
For all matrix sizes considered, p = 40 and 1 ≤ q ≤ 40, in the theoretical model, the critical path length of Greedy is either the same as that of PlasmaTree (q = 1) or up to 25% shorter than that of PlasmaTree (q = 6). Analogously, the critical path length of Greedy is between 2% and 27% shorter than that of Fibonacci. In the experiments, the matrix sizes considered were p = 40 and q ∈ {1, 2, 4, 5, 10, 20, 40}. In double precision, Greedy is at most 1.5% slower than the best PlasmaTree (q = 1) and gains up to 12.8% over the best PlasmaTree (q = 5). In double complex precision, Greedy is at most 1.5% slower than the best PlasmaTree (q = 1) and gains up to 13.6% over the best PlasmaTree (q = 2). Similarly, in double precision, Greedy provides a gain of 2.6% to 28.1% over Fibonacci, and in double complex precision, Greedy is at most 2.1% slower and gains up to 28.2% over Fibonacci.
Although it is evidenced that PlasmaTree does not vary too far from Greedy
or Fibonacci, one must keep in mind that there is a tuning parameter involved and
we choose the best of these domain sizes for PlasmaTree to create the composite
result, whereas with Greedy, there is no such parameter to consider. Of particular interest is the fact that Greedy always performs better than any other algorithm2 for p ≫ q. In the scope of PlasmaTree, a domain size BS = 1 will force the use of
a binary tree so that both Greedy and PlasmaTree behave the same. However,
as the matrix tends more to a square, i.e., q tends toward p, we observe that the
performance of all of the algorithms, including FlatTree, are on par with Greedy.
As more columns are added, the parallelism of the algorithm is increased and the
critical path becomes less of a limiting factor, so that the performance of the kernels
is brought to the forefront. Therefore, all of the algorithms are performing similarly
since they all share the same kernels.
 
 
Figure 3.9: Kernel performance for double complex precision (GFLOP/s versus tile size, in cache and out of cache). (a) Factorization kernels: TSQRT, GEQRT, TTQRT, GEQRT+TTQRT, GEMM; (b) Update kernels: TSMQR, UNMQR, TTMQR, UNMQR+TTMQR, GEMM.
2When q = 1, Greedy and FlatTree exhibit close performance. They both perform a binary
tree reduction, albeit with different row pairings.
  
Figure 3.10: Kernel performance for double precision (GFLOP/s versus tile size, in cache and out of cache). (a) Factorization kernels: TSQRT, GEQRT, TTQRT, GEQRT+TTQRT, GEMM; (b) Update kernels: TSMQR, ORMQR, TTMQR, ORMQR+TTMQR, GEMM.
In order to accurately assess the impact of the kernel selection towards the per-
formance of the algorithms, Figures 3.9 and 3.10 show both the in cache and out
of cache performance using the No Flush and MultCallFlushLRU strategies as pre-
sented in [29, 51]. Since an algorithm using TT kernels will need to call GEQRT
as well as TTQRT to achieve the same as the TS kernel TSQRT , the comparison
is made between GEQRT + TTQRT and TSQRT (and similarly for the updates).
For nb = 200, the observed ratio for in cache kernel speed for TSQRT to GEQRT
+ TTQRT is 1.3374, and for TSMQR to UNMQR + TTMQR is 1.3207. For out
of cache, the ratio for TSQRT to GEQRT + TTQRT is 1.3193 and for TSMQR
to UNMQR + TTMQR it is 1.3032. Thus, we can expect about a 30% difference
between the selection of the kernels, since we will have instances of using in cache
and out of cache throughout the run. Most of this difference is due to the higher
efficiency and data locality within the TT kernels as compared to the TS kernels.
Having seen that kernel performance can have a significant impact, we also com-
pare the TT based algorithms to those using the TS kernels. The goal is to provide
a complete assessment of all currently available algorithms, as shown in Figure 3.11.
Figure 3.11: Upper bound and experimental performance of QR factorization - All kernels. (a) Upper bound (double complex); (b) Experimental (double complex); (c) Upper bound (double); (d) Experimental (double). Algorithms shown: FlatTree (TS), PlasmaTree (TS) (best), FlatTree (TT), PlasmaTree (TT) (best), Fibonacci (TT), and Greedy. Best domain sizes for PlasmaTree (TS): [1 1 1 1 5 5 5 5 5 5 10 ... 10] (upper bound), [1 3 5 10 11 8 4] (experimental, double complex), [1 3 6 11 12 18 32] (experimental, double); for PlasmaTree (TT): [1 3 5 5 5 10 10 10 10 10 20 ... 20] (upper bound), [1 5 5 5 17 28 8] (experimental, double complex), [1 3 10 5 17 27 19] (experimental, double).

For double precision, the observed difference in kernel speed is 4.976 GFLOP/sec for the TS kernels versus 3.844 GFLOP/sec for the TT kernels, which provides a ratio of 1.2945 and is in accordance with our previous analysis. It can be seen that as the number of columns increases, and with it the amount of parallelism, the effect of the kernel performance outweighs the benefit provided by the extra parallelism afforded by the TT algorithms. Comparatively, in double complex precision,
Greedy does perform better, even against the algorithms using the TS kernels. As
before, one must keep in mind that Greedy does not require the tuning parameter
of the domain size to achieve this better performance.
  
Figure 3.12: Overhead in terms of critical path length and time with respect to Greedy (Greedy = 1). (a) Theoretical CP length; (b) Experimental (double complex); (c) Experimental (double). Algorithms shown: FlatTree (TS), PlasmaTree (TS) (best), FlatTree (TT), PlasmaTree (TT) (best), Fibonacci (TT), and Greedy.
From these experiments, we showed that in double complex precision, Greedy demonstrates better performance than any of the other algorithms, and moreover it does so without the need to specify a domain size, as opposed to the algorithms in PLASMA. In addition, in double precision, for matrices where p ≫ q, Greedy continues to excel over any other algorithm using the TT kernels, and continues to do so as the matrices become more square.
3.4 Conclusion
  
Figure 3.13: Overhead in terms of critical path length and time with respect to Greedy (Greedy = 1). (a) Theoretical CP length; (b) Experimental (double complex); (c) Experimental (double). Algorithms shown: FlatTree (TS), PlasmaTree (TS) (best), FlatTree (TT), PlasmaTree (TT) (best), Fibonacci (TT), and Greedy.
In this chapter, we have presented Fibonacci and Greedy, two new algorithms
for tiled QR factorization. These algorithms exhibit more parallelism than state-
of-the-art implementations based on reduction trees. We have provided accurate
estimations for the length of their critical path.
Comprehensive experiments on multicore platforms confirm the superiority of
the new algorithms for p × q matrices, as soon as, say, p ≥ 2q. This holds true
when comparing not only with previous algorithms using TT (Triangle on top of
triangle) kernels, but also with all known algorithms based on TS (Triangle on top
of square) kernels. Given that TS kernels offer more locality, and benefit from better elementary arithmetic performance, than TT kernels, the better performance of the new algorithms is even more striking, and further demonstrates that a large degree of parallelism was not exploited in previously published solutions.
Future work will investigate several promising directions. First, using rectangular
tiles instead of square tiles could lead to efficient algorithms, with more locality and
still the same potential for parallelism. Second, refining the model to account for
communications, and extending it to fully distributed architectures, would lay the groundwork for the design of MPI implementations of the new algorithms, unleashing
their high level of performance on larger platforms. Finally, the design of robust
algorithms, capable of achieving efficient performance despite variations in processor
speeds, or even resource failures, is a challenging but crucial task to fully benefit from
future platforms with a huge number of cores.
Algorithm 3.4: Greedy algorithm via TT kernels.
1 for j = 1 to q do
/* nz(j) is the number of tiles which have been eliminated in
column j */
2 nZ(j) = 0
/* nT (j) is the number of tiles which have been triangularized
in column j */
3 nT (j) = 0
4 while column q is not finished do
5 for j = q down to 1 do
6 if j == 1 then
/* Triangularize the first column if not yet done */
7 nTnew = nT (j) + (p− nT (j))
8 if p− nT (j) > 0 then
9 for k = p down to 1 do
10 GEQRT (k, j)
11 for jj = j + 1 to q do
12 UNMQR(k, j, jj)
13 else
/* Triangularize every tile having a zero in the
previous column */
14 nTnew = nZ(j − 1)
15 for k = nT (j) to nTnew − 1 do
16 GEQRT (p− k, j)
17 for jj = j + 1 to q do
18 UNMQR(p− k, j, jj)
/* Eliminate every tile triangularized in the previous step
*/
19 nZnew = nZ(j) + ⌊(nT (j)− nZ(j))/2⌋
20 for kk = nZ(j) to nZnew − 1 do
21 piv(p− kk) = p− kk − nZnew + nZ(j)
22 TTQRT (p− kk, piv(p− kk), j)
23 for jj = j + 1 to q do
24 TTMQR(p− kk, piv(p− kk), j, jj)
/* Update the number of triangularized and eliminated tiles
at the next step */
25 nT (j) = nTnew
26 nZ(j) = nZnew
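For experimentation it is convenient to have Algorithm 3.4 in executable form. The following is a minimal Python transcription of the column-wise bookkeeping above (the function name greedy_tt is ours), not the PLASMA implementation; it assumes p > q and that column q is considered finished once nZ(q) = p − q, and it simply records, sweep by sweep, the kernel calls that the algorithm issues.

# Minimal transcription of Algorithm 3.4 (Greedy elimination via TT kernels).
# Assumption: p > q, and column q is "finished" once p - q of its tiles are eliminated.
def greedy_tt(p, q):
    nZ = [0] * (q + 1)   # nZ[j]: tiles eliminated in column j (1-based)
    nT = [0] * (q + 1)   # nT[j]: tiles triangularized in column j (1-based)
    ops = []             # recorded kernel calls
    while nZ[q] < p - q:
        for j in range(q, 0, -1):
            if j == 1:
                nTnew = p
                if p - nT[j] > 0:                 # triangularize the first column once
                    for k in range(p, 0, -1):
                        ops.append(("GEQRT", k, j))
                        for jj in range(j + 1, q + 1):
                            ops.append(("UNMQR", k, j, jj))
            else:
                nTnew = nZ[j - 1]                 # tiles with a zero in the previous column
                for k in range(nT[j], nTnew):
                    ops.append(("GEQRT", p - k, j))
                    for jj in range(j + 1, q + 1):
                        ops.append(("UNMQR", p - k, j, jj))
            # eliminate half of the already triangularized, not yet eliminated tiles
            nZnew = nZ[j] + (nT[j] - nZ[j]) // 2
            for kk in range(nZ[j], nZnew):
                piv = p - kk - nZnew + nZ[j]
                ops.append(("TTQRT", p - kk, piv, j))
                for jj in range(j + 1, q + 1):
                    ops.append(("TTMQR", p - kk, piv, j, jj))
            nT[j], nZ[j] = nTnew, nZnew
    return ops

# Example: kernels issued for a 4 x 2 tiled matrix.
for op in greedy_tt(4, 2):
    print(op)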
Table 3.3: Time-steps for tiled algorithms.
(a) Sameh-Kuck (b) Fibonacci (c) Greedy (d) BinaryTree (e) PlasmaTree (BS = 5)
? ? ? ? ?
6 ? 14 ? 12 ? 6 ? 6 ?
8 28 ? 12 48 ? 10 42 ? 8 28 ? 8 28 ?
10 34 50 ? 12 46 70 ? 10 40 64 ? 6 36 56 ? 10 34 50 ?
12 40 56 72 ? 10 42 68 92 ? 8 36 62 86 ? 10 34 70 90 ? 12 40 56 72 ?
14 46 62 78 94 ? 10 40 64 90 114 ? 8 34 56 84 106 ? 6 44 68 104 124 ? 14 46 62 78 94 ?
16 52 68 84 100 116 10 40 62 86 112 136 8 34 56 78 102 128 8 28 78 102 138 158 6 54 74 90 106 122
18 58 74 90 106 122 8 36 62 84 108 134 8 30 52 78 100 122 6 42 62 112 136 172 8 28 82 102 118 134
20 64 80 96 112 128 8 34 58 84 106 130 6 28 50 72 100 118 12 40 76 96 146 170 10 34 50 110 130 146
22 70 86 102 118 134 8 34 56 80 106 128 6 28 50 72 94 116 6 46 74 110 130 180 12 40 56 72 138 158
24 76 92 108 124 140 8 34 56 78 102 128 6 28 50 68 94 116 8 28 80 108 144 164 16 52 68 84 100 166
26 82 98 114 130 146 6 28 56 78 100 122 6 28 44 66 88 110 6 36 56 114 142 178 6 56 80 96 112 128
28 88 104 120 136 152 6 28 50 78 100 122 6 22 44 66 88 110 10 34 64 84 148 176 8 28 84 108 124 140
30 94 110 126 142 158 6 28 44 72 100 122 6 22 44 60 82 104 6 38 62 92 112 182 10 34 50 112 136 152
32 100 116 132 148 164 6 22 44 60 94 116 6 22 38 60 76 98 8 28 66 90 114 134 12 40 56 72 140 164
Table 3.4: Neither Greedy nor Asap is optimal.
(a) Neither Greedy nor Asap is optimal.
(a) Greedy (b) Asap
? ?
12 ? 12 ?
10 42 ? 10 40 ?
10 40 64 10 36 86
8 36 62 8 34 80
8 34 56 8 32 74
8 34 56 8 30 68
8 30 52 8 28 62
6 28 50 6 28 56
6 28 50 6 26 50
6 28 50 6 24 46
6 28 44 6 24 44
6 22 44 6 22 44
6 22 44 6 22 40
6 22 38 6 22 38
(b) Greedy generally outperforms Asap.
p \ q            q=16   q=32   q=64   q=128
p=16   Greedy     310
       Asap       310
p=32   Greedy     360    650
       Asap       402    656
p=64   Greedy     374    726    1342
       Asap       588    844    1354
p=128  Greedy     396    748    1452   2732
       Asap       966    1222   1748   2756
Table 3.5: Three schemes applied to a column whose update kernel weight is not an
integer multiple of the elimination kernel weight.
a11 (1) (2) (3)
6
6 13 14 12
6 11 8 10
6 9 8 10
6 9 12 8
6 9 9 8
6 7 7 8
6 7 7 8
6 5 5 5
6 5 5 5
6 5 5 5
Table 3.6: Greedy versus PT TT and Fibonacci (Theoretical)
p q GREEDY PT TT BS Overhead Gain Fib Overhead Gain
40 1 16 16 1 1.0000 0.0000 22 1.3750 0.2727
40 2 54 60 3 1.1111 0.1000 72 1.3333 0.2500
40 3 74 98 5 1.3243 0.2449 94 1.2703 0.2128
40 4 104 132 5 1.2692 0.2121 116 1.1154 0.1034
40 5 126 166 5 1.3175 0.2410 138 1.0952 0.0870
40 6 148 198 10 1.3378 0.2525 160 1.0811 0.0750
40 7 170 226 10 1.3294 0.2478 182 1.0706 0.0659
40 8 192 254 10 1.3229 0.2441 204 1.0625 0.0588
40 9 214 282 10 1.3178 0.2411 226 1.0561 0.0531
40 10 236 310 10 1.3136 0.2387 248 1.0508 0.0484
40 11 258 336 20 1.3023 0.2321 270 1.0465 0.0444
40 12 280 358 20 1.2786 0.2179 292 1.0429 0.0411
40 13 302 380 20 1.2583 0.2053 314 1.0397 0.0382
40 14 324 402 20 1.2407 0.1940 336 1.0370 0.0357
40 15 346 424 20 1.2254 0.1840 358 1.0347 0.0335
40 16 368 446 20 1.2120 0.1749 380 1.0326 0.0316
40 17 390 468 20 1.2000 0.1667 402 1.0308 0.0299
40 18 412 490 20 1.1893 0.1592 424 1.0291 0.0283
40 19 432 512 20 1.1852 0.1562 446 1.0324 0.0314
40 20 454 534 20 1.1762 0.1498 468 1.0308 0.0299
40 21 476 554 20 1.1639 0.1408 490 1.0294 0.0286
40 22 498 570 20 1.1446 0.1263 512 1.0281 0.0273
40 23 520 586 20 1.1269 0.1126 534 1.0269 0.0262
40 24 542 602 20 1.1107 0.0997 556 1.0258 0.0252
40 25 564 618 20 1.0957 0.0874 578 1.0248 0.0242
40 26 586 634 20 1.0819 0.0757 600 1.0239 0.0233
40 27 608 650 20 1.0691 0.0646 622 1.0230 0.0225
40 28 630 666 20 1.0571 0.0541 644 1.0222 0.0217
40 29 652 682 20 1.0460 0.0440 666 1.0215 0.0210
40 30 668 698 20 1.0449 0.0430 688 1.0299 0.0291
40 31 684 714 20 1.0439 0.0420 710 1.0380 0.0366
40 32 700 730 20 1.0429 0.0411 732 1.0457 0.0437
40 33 716 746 20 1.0419 0.0402 754 1.0531 0.0504
40 34 732 762 20 1.0410 0.0394 776 1.0601 0.0567
40 35 748 778 20 1.0401 0.0386 798 1.0668 0.0627
40 36 764 794 20 1.0393 0.0378 820 1.0733 0.0683
40 37 780 810 20 1.0385 0.0370 842 1.0795 0.0736
40 38 796 826 20 1.0377 0.0363 862 1.0829 0.0766
40 39 812 842 20 1.0369 0.0356 878 1.0813 0.0752
40 40 826 856 20 1.0363 0.0350 892 1.0799 0.0740
Table 3.7: Greedy versus PT TT (Experimental Double)
p q GREEDY(d) PT TT(d) BS Overhead Gain
40 1 36.9360 37.5020 1 1.0153 -0.0153
40 2 58.5090 52.7180 3 0.9010 0.0990
40 4 103.2670 90.7940 10 0.8792 0.1208
40 5 115.3060 100.5540 5 0.8721 0.1279
40 10 153.5180 145.8200 17 0.9499 0.0501
40 20 170.8730 171.8270 27 1.0056 -0.0056
40 40 184.5220 182.8160 19 0.9908 0.0092
Table 3.8: Greedy versus PT TT (Experimental Double Complex)
p q GREEDY(z) PT TT(z) BS Overhead Gain
40 1 42.0710 42.7120 1 1.0152 -0.0152
40 2 60.4420 52.1970 5 0.8636 0.1364
40 4 95.1820 84.1120 5 0.8837 0.1163
40 5 107.6370 96.7530 5 0.8989 0.1011
40 10 135.0270 128.4320 17 0.9512 0.0488
40 20 144.4010 146.4220 28 1.0140 -0.0140
40 40 152.9280 151.9090 8 0.9933 0.0067
Table 3.9: Greedy versus Fibonacci (Experimental Double)
p q GREEDY(d) FIB(d) Overhead Gain
40 1 36.9360 26.5610 0.7191 0.2809
40 2 58.5090 49.4870 0.8458 0.1542
40 4 103.2670 100.1440 0.9698 0.0302
40 5 115.3060 115.0020 0.9974 0.0026
40 10 153.5180 152.0090 0.9902 0.0098
40 20 170.8730 170.4780 0.9977 0.0023
40 40 184.5220 180.2990 0.9771 0.0229
Table 3.10: Greedy versus Fibonacci (Experimental Double Complex)
p q GREEDY(z) FIB(z) Overhead Gain
40 1 42.0710 30.2280 0.7185 0.2815
40 2 60.4420 48.9570 0.8100 0.1900
40 4 95.1820 97.1650 1.0208 -0.0208
40 5 107.6370 105.9610 0.9844 0.0156
40 10 135.0270 134.5500 0.9965 0.0035
40 20 144.4010 145.5530 1.0080 -0.0080
40 40 152.9280 150.0980 0.9815 0.0185
4. Scheduling of Cholesky Factorization
In Chapter 2 we studied the Cholesky Inversion algorithm which consists of the
three steps: Cholesky factorization, inversion of the factor, and the multiplication of
two triangular matrices. In this chapter, we will focus on the Cholesky factorization
but unlike the previous work where the number of processors was unbounded, we will
consider the factorization in the context of a finite number of processors. By limiting
the number of processors, the scheduling of the tasks becomes an issue. Moreover, the
weight (processing time) of each task must be taken into consideration when creating
the schedule.
As before, we will be considering the critical path length for the algorithm, but not as a function of the number of tiles; rather, as a function of the weights of the tasks.
are provided in Table 4.1. A more in-depth analysis of the length of the critical path
with weighted tasks for the Cholesky Inversion algorithm can be found in [16] which
also provides 9t−10 as the weighted critical path length for the Cholesky factorization
of a matrix of t× t tiles.
Table 4.1: Task Weights

Kernel   # flops                      Weight (in units of (1/3) n_b^3 flops)
POTRF    (1/3) n_b^3 + O(n_b^2)       1
TRSM     n_b^3                        3
SYRK     n_b^3 + O(n_b^2)             3
GEMM     2 n_b^3 + O(n_b^2)           6
The upper bound on performance of perfect speedup and critical path introduced
in Chapter 2 remains too optimistic and does not take into account any information
which can be garnered from the DAG of the algorithm. This work makes progress
towards providing a more representative bound on the performance of the Cholesky
factorization in the tiled setting.
We also make progress toward a bound on the minimum number of processors required to obtain the shortest possible weighted critical path (minimum make span) for the Cholesky factorization of a matrix of t × t tiles.
4.1 ALAP Derived Performance Bound
To obtain our bounds, we calculate the latest possible start time for each task
(ALAP) and consider an unbounded number of processors without any costs for
communication. If we did account for communication, we might see the critical path
length increase which would in turn decrease our upper bound. We start at the final
tasks and consider how many processors are needed to execute these tasks without
increasing the length of the critical path. We step backwards in time until such a
point that there are more processors needed to keep the critical path length constant.
Thus we must add enough processors to execute the tasks and in turn create more
idle time for the execution of tasks which are successors. At a certain point, there is
no more need to add processors and this is then the number of processors needed to
obtain the constant length critical path.
By forcing as late as possible (ALAP) start times, any schedule will keep as
many or fewer processors active as the ALAP execution on an unbounded number of
processors. Thus by evaluating the Lost Area (LA), or idle time, for a given number
of processors, p, at the end of the ALAP execution on an unbounded number of
processors, we can increase the sequential time by the amount of LA and divide this
result by p to obtain the best possible execution time, i.e.,

T_p = (T_seq + LA_p) / p    (4.1)

and we define this to be the ALAP Derived Performance Bound. Hence the maximum speedup that we can expect is given by T_seq · p / (T_seq + LA_p).
An example will help to further illustrate this technique. In Figure 4.1 we are
given the ALAP execution of a 5× 5 tiled matrix which has Tseq = 125. The ordered
pairs indicated provide the number of processors and idle time, respectively, and
in Table 4.2 are given the values for Tp, speedup, and efficiency. For more than
four processors, there are enough processors to obtain the critical path length which
becomes our limiting factor.
[Figure 4.1 shows the ALAP execution (processors versus time, time steps 0 to 35) of the POTRF, TRSM, SYRK, and GEMM tasks, annotated with the ordered pairs (1,0), (2,4), (3,11), (4,24), (5,45) of processor count and accumulated idle time.]
Figure 4.1: ALAP execution for 5× 5 tiles
Table 4.2: Upper bound on speedup and efficiency for 5× 5 tiles
p Tp Sp Ep
1 125.00 1.00 1.00
2 64.50 1.94 0.97
3 45.33 2.76 0.92
4 37.25 3.36 0.84
5 35.00 3.57 0.71
6 35.00 3.57 0.60
7 35.00 3.57 0.51
8 35.00 3.57 0.45
9 35.00 3.57 0.40
10 35.00 3.57 0.36
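The entries of Table 4.2 follow mechanically from (4.1) once the Lost Area values are read off Figure 4.1. The sketch below is a small Python check of that computation; it takes Tseq = 125, the pairs (1,0), (2,4), (3,11), (4,24), (5,45) from Figure 4.1, and the weighted critical path 9t − 10 = 35 for t = 5 as the floor once enough processors are available.

# Reproduce Table 4.2 from the ALAP execution of Figure 4.1 (matrix of 5 x 5 tiles).
T_seq = 125                                    # sequential time
cp = 9 * 5 - 10                                # weighted critical path, 9t - 10 = 35
lost_area = {1: 0, 2: 4, 3: 11, 4: 24, 5: 45}  # (p, LA_p) pairs from Figure 4.1

for p in range(1, 11):
    if p in lost_area:
        Tp = max(cp, (T_seq + lost_area[p]) / p)   # ALAP Derived bound (4.1), floored by the CP
    else:
        Tp = cp                                    # enough processors: the CP is the limit
    Sp = T_seq / Tp                                # maximum speedup
    Ep = Sp / p                                    # maximum efficiency
    print(f"p={p:2d}  Tp={Tp:6.2f}  Sp={Sp:4.2f}  Ep={Ep:4.2f}")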
4.2 Critical Path Scheduling
In order to provide a critical path schedule, we use the Backflow algorithm to
assign priorities to the tasks of a DAG such that each task's priority adheres to its dependencies. The algorithm is described in four steps:
STEP 1 : Beginning at the final task in the DAG, set its priority
to its processing time.
STEP 2 : Moving in the reverse direction, set each incidental
task's priority to the sum of its processing time and
the final task’s priority.
STEP 3 : For each task in STEP 2, moving in the reverse direc-
tion, set each incidental task’s priority to the sum of
its processing time and the maximum priority of any
incidental successor task.
STEP 4 : Repeat the procedure until all tasks have been assigned
a priority.
An example is given in Figure 4.2. The processing times are given in parentheses
and the assigned priorities (cp) are designated in square brackets. Tasks A and B
will be assigned a priority of 16 since cp(A) = 3 + max (cp(C), cp(D)) and cp(B) =
3 + max (cp(D), cp(E)).
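The priorities reported in Figure 4.2 can be recomputed with a short reverse sweep over the weighted DAG. The sketch below is a minimal Python rendering of the Backflow rule, priority(v) = weight(v) + max over successors of their priorities; only the arcs out of A and B are stated explicitly in the text, so the remaining edges listed here are our reading of the figure and should be taken as illustrative input data.

# Backflow priorities: priority(v) = weight(v) + max(priority(s) for successors s of v).
weights = {"START": 1, "A": 3, "B": 3, "C": 3, "D": 6,
           "E": 3, "F": 1, "G": 3, "H": 3, "END": 1}
# Edge list consistent with the priorities shown in Figure 4.2(b).
succ = {"START": ["A", "B"], "A": ["C", "D"], "B": ["D", "E"],
        "C": ["F"], "D": ["G"], "E": ["H"], "F": ["G"],
        "G": ["H"], "H": ["END"], "END": []}

priority = {}
def backflow(v):
    # Memoized reverse traversal toward the final task.
    if v not in priority:
        tail = max((backflow(s) for s in succ[v]), default=0)
        priority[v] = weights[v] + tail
    return priority[v]

backflow("START")
print({v: priority[v] for v in weights})
# {'START': 17, 'A': 16, 'B': 16, 'C': 11, 'D': 13, 'E': 7, 'F': 8, 'G': 7, 'H': 4, 'END': 1}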
By following the path with the highest priorities, a critical path can be discerned
from the weighted DAG. Thus any schedule which then chooses from the available
tasks those with the highest priorities to execute first inherently follows the critical
path. It is well known that a critical path scheduling is not always optimal. As an
example [43], take two processors and four tasks. Let tasks A, B, C, and D have
weights of 3, 3, 1, and 1, respectively and let the only relationship between tasks be
that C is a predecessor of D. Then cp(A) = cp(B) = 3, cp(C) = 2, and cp(D) = 1.
A critical path schedule would choose to schedule tasks A and B simultaneously and
follow up with C and then D, resulting in a schedule of length five. However, if A
and C are scheduled simultaneously and then D follows A on the same processor and
B follows C on the other processor, the length of the schedule is four.
[Figure 4.2: the DAG has tasks START (1), A (3), B (3), C (3), D (6), E (3), F (1), G (3), H (3), END (1); panel (a) shows the state at the start with no priorities assigned, and panel (b) the state at the finish with priorities START [17], A [16], B [16], C [11], D [13], E [7], F [8], G [7], H [4], END [1].]
Figure 4.2: Example derivation of task priorities via the Backflow algorithm
We will use this critical path information to analyze three schedules by choosing
available tasks via max(cp), rand(cp), or min(cp). The max(cp) will naturally follow
the critical path by scheduling tasks with the highest cp first and, vice versa, the
min(cp) will schedule from the available tasks those with the minimum cp. Between
these two, we also choose randomly amongst the available tasks with rand(cp).
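The three strategies differ only in how the ready tasks are ranked, so a single list scheduler covers all of them. The sketch below is a small Python simulator, not the scheduler used in the experiments: it keeps a ready set, fills idle processors with the highest-ranked ready tasks, and advances time to the next completion. On the four-task example above with two processors, ranking by cp reproduces the length-five schedule, while a ranking that places C ahead of B recovers the length-four schedule.

import heapq

def list_schedule(weights, succ, nprocs, key):
    # Simulate a list schedule; 'key' ranks the ready tasks (larger = scheduled first).
    preds = {v: 0 for v in weights}
    for v, ss in succ.items():
        for s in ss:
            preds[s] += 1
    ready = [v for v in weights if preds[v] == 0]
    running = []                      # heap of (finish_time, task)
    time = makespan = 0
    while ready or running:
        ready.sort(key=key, reverse=True)
        while ready and len(running) < nprocs:
            t = ready.pop(0)
            heapq.heappush(running, (time + weights[t], t))
        time, done = heapq.heappop(running)   # advance to the next completion
        makespan = time
        for s in succ.get(done, []):
            preds[s] -= 1
            if preds[s] == 0:
                ready.append(s)
    return makespan

# Counterexample from the text: weights A=3, B=3, C=1, D=1 and the single edge C -> D.
weights = {"A": 3, "B": 3, "C": 1, "D": 1}
succ = {"C": ["D"]}
cp = {"A": 3, "B": 3, "C": 2, "D": 1}
print(list_schedule(weights, succ, 2, key=lambda t: cp[t]))                                # 5
print(list_schedule(weights, succ, 2, key=lambda t: {"A": 3, "C": 2, "B": 1, "D": 1}[t]))  # 4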
4.3 Scheduling with synchronizations
The right-looking version of the LAPACK Cholesky factorization as depicted
in Figure 1.1c provides an alternative schedule which can be easier to analyze and
understand. We will apply the three steps of the algorithm to a matrix of t × t
tiles. In the tiled setting, we can provide synchronization points between the varying
tasks of each step and simply schedule any of the available tasks since there are no
dependencies between the tasks in each grouping. By adding these synchronizations,
this schedule is not able to obtain the critical path no matter how many processors
are available. Algorithm 4.1 is the right-looking variant of the Cholesky factorization
with added synchronization points.
Algorithm 4.1: Schedule for tiled right-looking Cholesky factorization with
added synchronizations to allow for grouping.
1 Tile Cholesky Factorization (compute L such that A = LLT );
2 for i = 0 to t− 1 do
3 schedule POTRF(i) ;
4 synchronize;
5 for j = i+ 1 to t− 1 do
6 schedule TRSM(j,i) ;
7 synchronize;
8 for j = i+ 1 to t− 1 do
9 for k = i+ 1 to j − 1 do
10 schedule GEMM(j,i,k) ;
11 synchronize;
12 for j = i+ 1 to t− 1 do
13 schedule SYRK(j,i) ;
14 synchronize;
Naturally, we can improve upon the above schedule by removing the synchro-
nization between some of the groupings (Algorithm 4.2). The update of the trailing
matrix is composed of two groupings, namely the GEMMs and the SYRKs, which
can be executed in parallel if enough processors are available. Moreover, the added
synchronization point between the update of the trailing matrix and the factorization
of the next diagonal tile may also be removed. This schedule does become more com-
plex, but given enough processors, the schedule is able to obtain the critical path as
the limiting factor to performance. The minimum number of processors, p, needed to
obtain the critical path is p = ⌈(t − 1)^2 / 2⌉ for a matrix of t × t tiles, since the highest degree of parallelism is realized for the update of the first trailing matrix, which is of size (t − 1) × (t − 1).
Both of these schedules differ from the critical path scheduling due to the added
synchronization points and will show lower theoretical performance. In the theoretical
results, we only show Algorithm 4.2.
Algorithm 4.2: Improvement upon Algorithm 4.1
1 Tile Cholesky Factorization (compute L such that A = LLT );
2 for i = 0 to t− 1 do
3 schedule POTRF(i) ;
4 synchronize;
5 for j = i+ 1 to t− 1 do
6 schedule TRSM(j,i) ;
7 synchronize;
8 for j = i+ 1 to t− 1 do
9 for k = i+ 1 to j − 1 do
10 schedule GEMM(j,i,k) ;
11 schedule SYRK(j,i) ;
4.4 Theoretical Results
In the following figures, our Rooftop bound will be that which uses the perfect
speedup until the weighted critical path is the limiting factor, i.e.,
Rooftop Bound = max( Critical Path, T_seq / p )    (4.2)
We will compare this to our ALAP Derived bound which was derived using the ALAP
execution, our various scheduling strategies, and the following lower bound. From [20,
p.221,§7.4.2], given our DAG, we know that the make span, MS, of any list schedule,
σ, for a given number of processors, p, is
MS(σ, p) ≤ (2 − 1/p) · MS_opt(p)

where MS_opt is the make span of the optimal list schedule without communication costs. However, we do not know MS_opt and must therefore substitute the make span
of the Critical Path Scheduling using the maximum strategy to compute our lower
bound.
The ALAP Derived bound improves upon the Rooftop bound precisely in the area
that is of most concern, namely where there is enough parallelism but not enough
processors to fully exploit that parallelism.
[Figure 4.3 (three panels for a 40 × 40 tile matrix): (a) speedup and (b) efficiency versus the number of threads (1 to 432), and (c) the ratio of each schedule to the ALAP Derived bound; the curves are the Rooftop bound, the ALAP Derived bound, Critical Path max, Critical Path random, Critical Path min, Algorithm 4.2, and (2 − 1/p)·MS_opt.]
Figure 4.3: Theoretical results for matrix of 40× 40 tiles.
Figure 4.3(a) shows that the Critical Path schedule is actually quite decent and
comes to within two percent of the ALAP Derived bound. Moreover, the ALAP
Derived bound has significantly reduced the gap between the Rooftop bound and any
of our list schedules.
4.5 Toward an αopt
It is interesting to know how many processors one needs to be able to schedule
all of the tasks and maintain the weighted critical path. We will view this problem
in terms of tiles and state the problem as follows:
Given a matrix of t×t tiles, determine the minimum number of processors,
popt, needed to schedule all tasks and achieve an execution time equal to
the weighted critical path.
Toward that end, we will let p = αt^2 where 0 < α ≤ 1. Our analysis will be asymptotic in which we let t tend to infinity. From the analysis of Algorithm 4.2, we already know that α_opt ≤ 1/2. Using MATLAB, we have calculated the α value for matrices of t × t tiles as shown in Figure 4.4. It is our conjecture that α_opt ≈ 1/5.
[Figure 4.4 plots, for 3 ≤ t ≤ 40, the number of threads divided by (t − 1)^2 needed by the ALAP Derived execution and by the Critical Path max schedule, together with their percentage difference, as a function of the number of tiles.]
Figure 4.4: Values of α for matrices of t× t tiles where 3 ≤ t ≤ 40
4.6 Related Work
For the LU decomposition with partial pivoting, much work has been accom-
plished to discern asymptotically optimal algorithms for all values of α [37, 35, 43].
They consider a problem of size n and assume p = αn processors on which to schedule
the LU decomposition. The critical path length of the optimal schedule in this case
is n^2 and α_opt ≈ 0.347, where α_opt is a solution to the equation 3α − α^3 = 1.
[Figure 4.5, panel (a): efficiency versus the ratio α = p/n for the LU decomposition, showing the upper bound and the optimal algorithm; panel (b): efficiency versus α = p/t^2 for the tiled Cholesky factorization, showing the Rooftop bound, the ALAP Derived bound, and Critical Path max.]
Figure 4.5: Asymptotic efficiency versus α = p/n for LU decomposition and versus α = p/t^2 for Tiled Cholesky factorization.
In Figure 4.5, we make a comparison between the algorithmic efficiency of the LU
decomposition and the tiled Cholesky factorization. In the case of the LU decompo-
sition, the attainable upper bound on efficiency closely resembles our previous bound
of perfect speedup which is then limited by the critical path. On the other hand, the
tiled Cholesky factorization does not exhibit this type of efficiency which can be seen
from the gap between our Rooftop bound and the ALAP Derived bound. Unlike the
work in [37], we do not provide an algorithm which attains the ALAP Derived bound.
4.7 Conclusion and future work
In many research papers, the performance of an algorithm is usually compared to
either the performance of the GEMM kernel or against perfect scalability resulting in
large discrepancies between the peak performance of the algorithm and these upper
bounds. If an algorithm displays a DAG such as that of the tiled Cholesky factor-
ization, it is unrealistic to expect perfect scalability or even make comparisons to the
performance of the GEMM kernel. Thus one needs to consider a new bound which
is more representative of the algorithm and accounts for the structure of the DAG.
Without such a bound it is difficult to assess whether there are any performance
gains to be achieved. Although we do not have a closed-form expression for this new
bound, we have shown that such a bound exists. Moreover, we have also shown that
any algorithm which schedules the tiled Cholesky factorization while maintaining the
weighted critical path will require O(t2) processors for a matrix of t× t tiles and the
coefficient is somewhere around 0.2.
In this chapter, we have applied a combination of existing techniques to a tiled
Cholesky factorization algorithm to discover a more realistic bound on the perfor-
mance and efficiency. We did so by considering an ALAP execution on an unbounded
number of processors and used this information to calculate the idle time for any
list schedule on a bounded number of processors. This is then used to calculate the
maximum speedup and efficiency that may be observed.
Further work is necessary to provide a closed-form expression of the new bound
dependent upon the number of processors. In addition, we need to include communi-
cation costs in the bound to make it more reflective of the actual scheduling problem
on parallel machines. As can be seen in Figure 4.3(c), the Critical Path Schedule is
within 2% of our ALAP Derived bound. Although scheduling a DAG on a bounded
number of processors is an NP-complete problem, it may not be the case for the DAG of the tiled Cholesky factorization. Further investigation might show that the
Critical Path Scheduling is the optimal schedule.
5. Scheduling of QR Factorization
In this chapter, we present collaborative work with Jeffrey Larson. We revisit
the tiled QR factorization as presented in Chapter 3 but do so in the context of a
bounded number of resources. (Chapter 3 was concerned with finding the optimal
elimination tree on an unlimited number of processors.) We will be using the same
analytical tools as in Chapter 4 to derive good schedules and to improve upon the
Rooftop bound. The Cholesky factorization has just one DAG; therefore, Chapter 4 addresses a standard scheduling problem, i.e., how to schedule a DAG on a finite number of processors. Unlike the previous chapter, we will need to consider all of the various
algorithms (i.e., elimination trees) and cannot distill the analysis down to a single
DAG.
5.1 Performance Bounds
Each of the algorithms studied in Chapter 3, namely FlatTree, Fibonacci,
Greedy, and GrASAP, produces a unique DAG for a matrix of p× q tiles. In turn,
the ALAP Derived bounds (4.1) for each elimination tree will also be unique. In Fig-
ure 5.1, we give the computed upper bounds and make comparisons to the scheduling
strategies of maximum, random, and minimum via the Critical Path Method for a
matrix of 20× 6 tiles. The matrix size was chosen such that the critical path length
of Greedy is 136 and the critical path length of GrASAP is 134 (see Figure 3.5 in
§ 3.2.2).
The GrASAP algorithm for a tiled matrix is optimal in the scope of unbounded
resources. However by the manner in which the ALAP Derived bound is computed,
the bound created by using GrASAP cannot hold for all of the other algorithms.
Consider the ALAP execution of the Fibonacci and GrASAP algorithms on a
matrix of 15 × 4 tiles. In Figure 5.2, we show the execution of the last tasks for
GrASAP on the left and Fibonacci on the right. More of the tasks in the ALAP
execution for Fibonacci are pushed towards the end of the execution which means
[Figure 5.1 (four panels, one each for FlatTree, Fibonacci, Greedy, and GrASAP): speedup versus the number of threads (1 to 72) for p = 20, q = 6, comparing the Rooftop bound, the ALAP Derived bound, and the Critical Path max, random, and min schedules.]
Figure 5.1: Scheduling comparisons for each of the algorithms versus the ALAP
Derived bounds on a matrix of 20× 6 tiles.
the ALAP Derived bound will be higher than that of GrASAP for a schedule that
uses fewer than 10 processors. In other words, as we add more processors, the Lost
Area (LA) increases much faster for GrASAP than it does for Fibonacci. Since
the critical path length for Fibonacci is greater than that of GrASAP, after a
certain number of processors, the ALAP Derived bound for Fibonacci falls below
that of GrASAP. These observations are evident in Figure 5.3 where we show a
comparison of the bound for each algorithm. Recall that the Rooftop bound (4.2) only
takes into account the critical path length of an algorithm such that for GrASAP
[Figure 5.2: ALAP execution of the tail-end GEQRT, TTQRT, UNMQR, and TTMQR tasks on unbounded resources; the GrASAP panel spans time steps up to 84 over roughly 36 processor rows, the Fibonacci panel up to 92 time steps over roughly 30 processor rows.]
Figure 5.2: Tail-end execution using ALAP on unbounded resources for GrASAP
and Fibonacci on a matrix of 15× 4 tiles.
it can be considered a bound for all the algorithms since GrASAP is optimal for
unlimited resources and thus has the shortest critical path length.
5.2 Optimality
There is no reason for the tree found optimal in Chapter 3 on an unbounded number of resources to be optimal on a bounded number of resources. We cast the problem
of finding the optimal tree with the optimal schedule as an integer programming
problem. (A complete description of the formulation can be found in Appendix A.)
Similarly, in [1] a Mixed-Integer Linear Programming approach was used to provide
an upper bound on performance. However, the integer programming problem size
grows exponentially as the matrix size increases, thus the largest feasible problem
size was a matrix of 5× 5 tiles. In Figure 5.4 we show the speedup of the GrASAP
algorithm with its bound and make comparisons to an optimal tree with an optimal
schedule and Table 5.1 provides the actual schedule lengths for all of the algorithms
using the CP Method for the matrix of 5 × 5 tiles.
[Figure 5.3 (two panels): (a) speedup and (b) efficiency versus the number of threads (1 to 32) for p = 15, q = 4, comparing the Rooftop bound with the ALAP Derived bounds of GrASAP, Greedy, Flat Tree, and Fibonacci.]
Figure 5.3: ALAP Derived bound comparison for all algorithms for a matrix of 15×4
tiles.
5.3 Elimination Tree Scheduling
We pair up the choice of the elimination tree with a type of scheduling strategy
to obtain the following bounds, where each pair denotes an elimination tree together with a schedule or bound and the comparison is in terms of execution length:

(GrASAP, Rooftop bound) ≤ (optimal tree, optimal schedule) ≤ (GrASAP, optimal schedule) ≤ (GrASAP, CP schedule)

Moreover, we also have

(GrASAP, Rooftop bound) ≤ (GrASAP, ALAP Derived bound) ≤ (GrASAP, optimal schedule) ≤ (GrASAP, CP schedule)
Combining these inequalities with Table 5.1 gives rise to the following questions:
Given an optimal elimination tree for the tiled QR factorization on an
unbounded number of resources
(Q1) does there always exist a scheduling strategy such that the schedule
on limited resources is optimal?
(Q2) does the ALAP Derived bound for this elimination tree hold true
for any scheduling strategy on any other elimination tree?
[Figure 5.4: speedup versus the number of threads (1 to 14) for p = 5, q = 5, comparing the Rooftop bound, GrASAP under the CP Method, the Integer Program (optimal tree and schedule), and the ALAP Derived bound.]
Figure 5.4: Comparison of speedup for CP Method on GrASAP, ALAP Derived
bound from GrASAP, and optimal schedules for a matrix of 5 × 5 tiles on 1 to 14
processors
Table 5.1: Schedule lengths for matrix of 5× 5 tiles
ALAP Derived Optimal CP Method
Procs Bound(GrASAP) Tree/Schedule GrASAP Greedy Fibonacci FlatTree
1 500 500 500 500 500 500
2 255 256 256 256 256 256
3 176 176 178 178 178 176
4 138 138 140 140 140 140
5 116 116 118 118 118 116
6 102 104 104 104 104 104
7 92 94 94 94 94 94
8 86 88 88 88 88 88
9 82 84 84 84 86 86
10 80 80 82 82 86 86
11 80 80 80 80 86 86
12 80 80 80 80 86 86
13 80 80 80 80 86 86
14 80 80 80 80 80 86
From Chapter 3 we have that GrASAP is an optimal elimination tree for the tiled
QR factorization. We know that the length of an optimal schedule for GrASAP
on p processors will necessarily be greater than or equal to the ALAP Derived bound for
GrASAP on p processors by way of construction of the ALAP Derived bound. Thus
(Q1) implies (Q2). We cannot address the first question directly since the size of
the matrix needed to produce a counterexample is too large for verification with the
integer programming formulation.
Therefore we need to find a matrix size for which a schedule exists whose execution
length is smaller than the ALAP Derived bound from GrASAP on the same matrix.
As we have seen in Figure 5.3, the Fibonacci elimination tree on a tall and skinny
tiled matrix provides the best hope.
Consider a matrix of 34 × 4 tiles on 10 processors. The ALAP Derived bound
from GrASAP is 188. Using the Critical Path method to schedule the Fibonacci
elimination tree we obtain a schedule length of 184. Therefore the ALAP Derived
bound from GrASAP does not hold for this schedule. By implication, we have that
(Q1) is false. However, the Rooftop bound from GrASAP is still a valid bound for
all of the schedules.
5.4 Conclusion
In this chapter we have applied the same tools used in Chapter 4 to provide
performance bounds for the tiled QR factorization. Further, we have shown that the
ALAP Derived bound is algorithm dependent. This leaves the Rooftop bound, as computed using the GrASAP algorithm, as the only bound we have that holds for all algorithms.
The analysis in this chapter has also shown that an optimal algorithm for an
unbounded number of resources does not imply that a scheduling strategy exists such
that it can be scheduled optimally.
6. Strassen Matrix-Matrix Multiplication
Matrix multiplication is the underlying operation in many if not most of the
applications in numerical linear algebra and as such, it has garnered much attention.
Algorithms such as the Cholesky factorization, LU decomposition and more recently
the QR-based Dynamically Weighted Halley iteration for polar decomposition [38],
spend a majority of their computational cost in matrix-matrix multiplication. The
conventional BLAS Level 3 subprogram for matrix-matrix multiplication has a computational cost of O(n^α), where α = log_2 8 = 3, but there exist algorithms of subcubic computational cost. In 1969, Volker Strassen [48] presented an algorithm that computes
the product of two square matrices of size n × n, where n is even, using only 7
matrix multiplications at the cost of needing 18 matrix additions/subtractions which
then can be called recursively for each of the 7 multiplications. This compares to
the standard cubic algorithm which requires 8 matrix multiplications and only 4
matrix additions. When Strassen’s algorithm is applied recursively down to a constant
size, the computational cost is O(n^α) where α = log_2 7 ≈ 2.807. Two years later, Shmuel Winograd showed that 7 matrix multiplications are the minimum required for the product of two 2 × 2 matrices and that 15 matrix additions/subtractions, fewer than Strassen's 18, suffice; see [54]. These discoveries have spawned a flurry of
research over the years. In 1978, Pan [39] showed that α < 2.796. In the late 1970's and early 1980's, Bini [12] provided α < 2.78, with Schönhage [46] following up by showing α < 2.522, only to be usurped the following year by Romani [44], who discerned α < 2.517. In 1986, Strassen brought forth a new approach which led to α < 2.497. In 1990, Coppersmith and Winograd [23] improved upon Strassen's result, providing the asymptotic exponent α < 2.376. This final result still stands, but it is conjectured that α = 2 + ε where ε can be made as small as desired. Although the Coppersmith-Winograd algorithm may be reasonable to implement, because its constant is huge and it will not provide an advantage except for very
large matrices, we will not consider it and instead focus on the Strassen-Winograd
Algorithm.
6.1 Strassen-Winograd Algorithm
Here we discuss the algorithm as it would be implemented to compute the product
of two matrices A and B where the result is stored into matrix C. The algorithm is
recursive, thus we describe one step. Given the input matrices A, B, and C, divide
them into four submatrices,
A =
A11 A12
A21 A22
 , B =
B11 B12
B21 B22
 , C =
C11 C12
C21 C22

then the 7 matrix multiplications and 15 matrix additions/subtractions are computed
as depicted in Table 6.1 and Figure 6.1 shows the task graph of the Strassen-Winograd
algorithm for one level of recursion.
Table 6.1: Strassen-Winograd Algorithm
Phase 1 T1 = A21 + A22 T5 = B12 −B11
T2 = T1 − A11 T6 = B22 − T5
T3 = A11 − A21 T7 = B22 −B12
T4 = A12 − T2 T8 = T6 −B21
Phase 2 Q1 = T2 × T6 Q5 = T1 × T5
Q2 = A11 ×B11 Q6 = T4 ×B22
Q3 = A12 ×B21 Q7 = A22 × T8
Q4 = T3 × T7
Phase 3 U1 = Q1 +Q2 C11 = Q2 +Q3
U2 = U1 +Q4 C12 = U1 + U3
U3 = Q5 +Q6 C21 = U2 −Q7
C22 = U2 +Q5
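Table 6.1 can be checked numerically on small matrices. The sketch below carries out one level of the Strassen-Winograd recursion with NumPy blocks, following the T/Q/U naming of the table, and compares the result against the ordinary product; it is a verification of the formulas (the function name is ours), not the tiled implementation described later in this chapter.

import numpy as np

def strassen_winograd_one_level(A, B):
    # One level of Strassen-Winograd (Table 6.1) on an even-order matrix.
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    # Phase 1: 8 additions/subtractions
    T1 = A21 + A22; T2 = T1 - A11; T3 = A11 - A21; T4 = A12 - T2
    T5 = B12 - B11; T6 = B22 - T5; T7 = B22 - B12; T8 = T6 - B21
    # Phase 2: 7 multiplications (these become recursive calls in the full algorithm)
    Q1 = T2 @ T6; Q2 = A11 @ B11; Q3 = A12 @ B21; Q4 = T3 @ T7
    Q5 = T1 @ T5; Q6 = T4 @ B22; Q7 = A22 @ T8
    # Phase 3: 7 additions/subtractions
    U1 = Q1 + Q2; U2 = U1 + Q4; U3 = Q5 + Q6
    C11 = Q2 + Q3; C12 = U1 + U3; C21 = U2 - Q7; C22 = U2 + Q5
    return np.block([[C11, C12], [C21, C22]])

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
print(np.allclose(strassen_winograd_one_level(A, B), A @ B))   # True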
In essence, Strassen’s approach is very similar to the observation that Gauss made
concerning the multiplication of two complex numbers. The product of (a + bi)(c +
di) = ac− bd+ (bc+ad)i would naively take four multiplications, but can actually be
[Figure 6.1: task graph in which the Phase 1 additions T1 through T8 feed the seven multiplications Q1 through Q7, which in turn feed U1, U2, U3 and the outputs C11, C12, C21, C22.]
Figure 6.1: Task graph for the Strassen-Winograd Algorithm. Execution time pro-
gresses from left to right. Large ovals depict multiplication and small ovals addi-
tion/subtraction.
accomplished via three multiplications by discerning that bc + ad = (a + b)(c + d) − ac − bd.
6.2 Tiled Strassen-Winograd Algorithm
In our tiled version, the matrices are subdivided such that each submatrix is of
the form
M_ij = [ M_ij11  M_ij12  ...  M_ij1n ; M_ij21  M_ij22  ...  M_ij2n ; ... ; M_ijn1  M_ijn2  ...  M_ijnn ]
where the matrices M_ijkl are tiles of size nb × nb. As before, one can proceed with full recursion; unlike before, this would not terminate at the scalar level, but rather with the multiplication of two tiles using a sequential BLAS Level 3 matrix-matrix multiplication.
at which point the tiled matrix multiplication of Algorithm 6.1 computes the result-
ing multiplication. For the addition/subtraction of the submatrices in Phase 1 and
Phase 3 of the Strassen-Winograd algorithm, a similar approach is used which is also
executed in parallel.
Algorithm 6.1: Tiled Matrix Multiplication (tiled gemm)
/* Input: n× n tiled matrices A and B, Output: n× n tiled
matrix C such that C = A×B */
1 for i = 1 to n do
2 for j = 1 to n do
3 for k = 1 to n do
4 Cij ← Aik ×Bkj + Cij
If the cutoff for the recursion occurs before the tile level, the computation for each
Cij can be executed in parallel. Therefore our tiled implementation of the Strassen-
Winograd algorithm exploits two levels of parallelism. Moreover, this allows some
parts of the matrix multiplications to occur early on as can be seen in Figure 6.2
which shows the DAG for a matrix of 4 × 4 tiles with one level of recursion. Both
Figure 6.2 and Figure 6.1 illustrate one level of recursion but the tiled task graph of
a 4× 4 tiled matrix clearly portrays the high degree of parallelism.
The conventional matrix-matrix multiplication algorithm requires 8 multiplica-
tions and 4 additions whereas the Strassen-Winograd algorithm requires 7 multipli-
cations and 15 additions/subtractions for each level of recursion. Therefore, there
are more tasks for the Strassen-Winograd algorithm as compared to the conven-
tional matrix-matrix multiplication and it would behoove us to reduce the number
of tasks which would also reduce the algorithmic complexity. On the other hand,
since we are reducing the number of multiplications, the computational cost is also reduced, as each multiplication is a cubic operation versus the quadratic cost of an addition/subtraction.
The total number of tasks, T , of the Strassen-Winograd algorithm is given by
T = 7^r (p/2^r)^3 + 15 Σ_{i=0}^{r−1} 7^{r−i−1} (p/2^{r−i})^2 = p^3 (7/8)^r + 5 p^2 ((7/4)^r − 1)
[Figure 6.2: DAG omitted; the Phase 1 and Phase 3 additions operate on blocks of tiles, and each of the seven products expands into tile-level DGEMM tasks, exposing parallelism both across the seven branches and within each tiled multiplication.]
Figure 6.2: Strassen-Winograd DAG for matrix of 4 × 4 tiles with one recursion.
Execution time progresses from left to right. Large ovals depict multiplication and
small ovals addition/subtraction.
Table 6.2: Recursion levels which minimize the number of tasks for a tiled matrix of
size p× p
p rmin Gflop(SW) Gflop(GEMM)
4 1 8.96e-01 1.02e+00
8 1 7.15e+00 8.18e+00
16 1 5.72e+01 6.55e+01
32 1 4.57e+02 5.24e+02
64 2 3.20e+03 4.19e+03
128 3 2.24e+04 3.35e+04
256 4 1.57e+05 2.68e+05
512 5 1.09e+06 2.14e+06
1024 6 7.69e+06 1.71e+07
which is minimized at recursion level rmin when
r_min = ⌈ ln( p ln(8/7) / (5 ln(7/4)) ) / ln 2 ⌉ ,
and the total number of flops, F , is given by
F = m 7^r (p/2^r)^3 + 15 a Σ_{i=0}^{r−1} 7^{r−i−1} (p/2^{r−i})^2 = m p^3 (7/8)^r + 5 a p^2 ((7/4)^r − 1),

where m = 2 n_b^3 − n_b^2 for the multiplications and a = n_b^2 for the additions/subtractions.
As we increase the recursion, the number of tasks will decrease up to a certain
point, rmin. The reason for this is that at each recursion we reduce the number of p^3 tasks by a factor of 1/8 while increasing the number of p^2 tasks by 15.
In our experiments, letting the tile size nb = 200 provided the best performance.
Thus Table 6.2 shows the corresponding values for the minimizing recursion level for
various number of tiles. However, as can be seen in Table 6.3, the rmin which provides
the minimum number of tasks does not provide the least amount of computational
cost. The computational costs will be minimized at the full recursion.
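Both counts are easy to tabulate. The sketch below evaluates T(r), F(r), and rmin in Python for nb = 200 and compares against the conventional tiled product (2n^3 flops); for p = 64 it reports rmin = 2 and Gflop values close to those of Table 6.2. The clamp to at least one level of recursion is our addition, since the closed-form expression for rmin drops below one for small p.

import math

def tasks(p, r):
    # Total number of tasks T for recursion depth r on a p x p tiled matrix.
    return p**3 * (7 / 8)**r + 5 * p**2 * ((7 / 4)**r - 1)

def flops(p, r, nb):
    # Total flop count F; m = 2*nb^3 - nb^2 per tile multiply, a = nb^2 per tile add/subtract.
    m, a = 2 * nb**3 - nb**2, nb**2
    return m * p**3 * (7 / 8)**r + 5 * a * p**2 * ((7 / 4)**r - 1)

def r_min(p):
    # Recursion depth minimizing the task count, clamped to at least one level.
    r = math.log(p * math.log(8 / 7) / (5 * math.log(7 / 4))) / math.log(2)
    return max(1, math.ceil(r))

nb = 200
for p in (4, 8, 16, 32, 64, 128):
    r = r_min(p)
    gemm_gflop = 2 * (p * nb)**3 / 1e9          # conventional multiplication
    print(f"p={p:4d}  rmin={r}  SW: {flops(p, r, nb)/1e9:.2e} Gflop  "
          f"GEMM: {gemm_gflop:.2e} Gflop  tasks: {tasks(p, r):.0f}")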
Even though the number of tasks and thereby the computational complexity are
minimized for rmin, Table 6.4 shows that the critical path length increases exponentially
Table 6.3: 128× 128 tiles of size nb = 200
algorithm recursion tasks Gflop
strassen winograd
1 1,896,448 2.92e+04
2 1,774,592 2.56e+04
3 1,762,048 2.24e+04
4 1,915,712 1.96e+04
5 2,338,288 1.72e+04
6 3,212,252 1.51e+04
7 4,859,338 1.33e+04
tiled DGEMM 4,177,920 3.36e+04
Table 6.4: Comparison of the total number of tasks and critical path length for matrix
of p× p tiles.
p r # tasks CP Ratio
4
0 64 2 32.0
1 116 7 16.6
8
0 512 3 170.7
1 688 9 76.4
2 1,052 52 20.2
16
0 4,096 4 1,023.5
1 4,544 13 349.5
2 5,776 66 87.5
3 8,324 361 23.1
32
0 32,768 5 6,553.6
1 32,512 21 1,548.2
2 35,648 94 379.2
3 44,272 459 96.5
4 62,108 2,524 24.6
64
0 262,144 6 43,690.7
1 244,736 37 6,614.5
2 242,944 150 1,619.6
3 264,896 655 404.4
4 325,264 3,210 101.3
with the recursion levels. Thus the amount of parallelism is likewise reduced
for each recursion level.
6.3 Related Work
In practice the Strassen-Winograd algorithm imparts a large amount of overhead
for small matrices, thus it is customary to overcome this issue by using it in conjunc-
tion with a conventional matrix multiplication operation. The key idea is to provide
a recursion cutoff point such that once this is reached, the algorithm switches from
calling itself recursively to calling, e.g., the BLAS Level 3 matrix multiplication. The
recursion cutoff point is a tuning parameter which depends upon the machine archi-
tecture and can be either dynamic or static. In [49], one level of recursion is used
while Chou [21] provides two levels of recursion by explicitly coding the 49 matrix
multiplications which then is processed by either 7 or 49 processors. Our approach is
to provide the recursion cutoff point as a parameter which can be set by the user.
There are various methods which can be used to deal with non-square matrices
and/or matrices of odd order. Methods such as static padding, dynamic padding,
and dynamic peeling all provide these mechanisms. In this chapter, we only study
matrices of tile order p = 2k, i.e., n = 2k · nb.
In [17], Boyer et al. propose schedules for both in-place and out-of-place imple-
mentations without the need for extra computations. Discovery of these algorithms
was accomplished by using an exhaustive search algorithm. Their out-of-place algo-
rithm makes use of the resultant matrix for temporary storage for the intermediate
computations. This introduces unwanted data dependencies and more data move-
ment leading to a loss of parallelism. Hence, we do not consider their out-of-place
algorithm and focus on the classical Strassen-Winograd algorithm by using temporary
storage allocations for all of the intermediate computations. As an example, given
a tiled matrix of 128 × 128 tiles of size 200 × 200, the input matrices and resultant
matrix require 1.96608 GB of storage and allowing full recursion for our algorithm
Algorithm 6.2: Tiled Strassen-Winograd (tiled gesw)
/* p is equal to the recursion cutoff, do the multiplication */
1 if p = r then
2 tiled gemm( p,A,B,C )
3 else
/* p is greater than the recursion cutoff, so we split the
problem in half */
4 p = p/2
5
/* Phase 1 */
6 tiled geadd( p,A21, A22, T1 )
7 tiled geadd( p, T1, A11, T2 )
8 tiled geadd( p,A11, A21, T3 )
9 tiled geadd( p,A12, T2, T4 )
10 tiled geadd( p,B12, B11, T5 )
11 tiled geadd( p,B22, T5, T6 )
12 tiled geadd( p,B22, B12, T7 )
13 tiled geadd( p, T6, B21, T8 )
14
/* Phase 2 */
15 tiled gesw( p, T2, T6, Q1 )
16 tiled gesw( p,A11, B11, Q2 )
17 tiled gesw( p,A12, B21, Q3 )
18 tiled gesw( p, T3, T7, Q4 )
19 tiled gesw( p, T1, T5, Q5 )
20 tiled gesw( p, T4, B22, Q6 )
21 tiled gesw( p,A22, T8, Q7 )
22
/* Phase 3 */
23 tiled geadd( p,Q1, Q2, U1 )
24 tiled geadd( p, U1, Q4, U2 )
25 tiled geadd( p,Q5, Q6, U3 )
26 tiled geadd( p,Q2, Q3, C11 )
27 tiled geadd( p, U1, U3, C12 )
28 tiled geadd( p, U2, Q7, C21 )
29 tiled geadd( p, U2, Q5, C22 )
would require an additional 3.2766 GB of temporary storage (see Figure 6.3).
[Figure 6.3 plots the gigabytes of temporary storage required against the recursion level (1 to 7) for n = 25600, p = 128.]
Figure 6.3: Required extra memory allocation for temporary storage for varying re-
cursion levels.
In [27] a sequential implementation of Strassen-Winograd is studied by Douglas
et al. which also provides means for a hybrid parallel implementation where the lower
level is the sequential Strassen-Winograd and the upper level algorithm is the standard
subdivided matrix multiplication. Comparisons are made to sequential implementa-
tions available in the IBM’s ESSL and Cray’s Scientific and Math library. Although
this would have been an interesting project in and of itself within a tiled framework,
the emphasis in this chapter was to provide the Strassen-Winograd at the upper level.
After this work was completed, Ballard et al. published work which places
Strassen-Winograd in a parallel distributed environment [9]. The paper is well writ-
ten and they do show improvement over the standard algorithm for matrices of order
over 94,000. Their algorithm is communication optimal and applies better to the
distributed environment than the shared memory environment seeing that we do not
have as much control over the memory distribution.
6.4 Experimental results
All experiments were performed on a 48-core machine composed of eight hexa-
core AMD Opteron 8439 SE (codename Istanbul) processors running at 2.8 GHz.
Each core has a theoretical peak of 11.2 Gflop/s with a peak of 537.6 Gflop/s for
the whole machine. The Istanbul micro-architecture is a NUMA architecture where
each socket has 6 MB of level-3 cache and each processor has a 512 KB level-2 cache
and a 128 KB level-1 cache. After having benchmarked the AMD ACML and Intel
MKL BLAS libraries, we selected MKL (10.2) since it appeared to be slightly faster
in our experimental context. Using MKL, for DGEMM each core has a peak of 9.7
Gflop/s with a peak of 465.6 Gflop/s for the whole machine. Linux 2.6.32 and Intel
Compilers 11.1 were also used in conjunction with PLASMA 2.3.1.
The parameter for tile size has a direct effect on the amount of data movement
and the efficiency of the kernels. Figure 6.4a presents the performance comparison
for varying tile sizes indicating nb = 200 as the most efficient. As expected, it is
also evidenced that the efficiency of the Strassen-Winograd algorithm is dependent
upon the efficiency of the tiled DGEMM. When running on 48 threads, increasing the
recursion level decreases the performance of the algorithm as depicted in Figure 6.4
(r = 0 is the tiled matrix-matrix multiplication).
[Figure 6.4, panel (a): Gflop/s (and percentage of GEMM) versus tile size nb (200 to 1600) on 12 threads with n = 6400, for strassen_winograd with r = 1 and tiled_dgemm; panel (b): Gflop/s versus recursion level (0 to 5) on 48 threads for 64 × 64 tiles of size 200 × 200.]
Figure 6.4: Comparison of tuning parameters nb and r.
Our implementation allows for the tuning of the recursion level and can range
from one recursion up to full recursion. In Figure 6.4b, matrices of 64 × 64 tiles are
used and the recursion level ranges from one to five. Although rmin = 2 for 64 × 64
tiles, the best performance using 48 threads is seen at r = 1. This is due to the
amount of parallelism lost as the critical path length increases with the recursion level, which offsets any gains from the reduction in the number of tasks and in the computational cost and complexity.
[Figure 6.5, panel (a): Gflop/s (and percentage of GEMM peak) versus the number of threads (up to 48) for n = 12800, p = 64, comparing strassen_winograd with r = 1, tiled_dgemm, and GEMM; panel (b): the corresponding efficiency.]
Figure 6.5: Scalability and Efficiency comparisons on 48 threads with matrices of
64× 64 tiles and nb = 200.
Figures 6.5a and 6.5b illustrate the performance and efficiency reached by the
Strassen-Winograd implementation, with r = 1, as compared to the multithreaded
MKL DGEMM and the tiled DGEMM implementation. The Strassen-Winograd im-
plementation outperforms the tiled DGEMM up to the point where we lose par-
allelism. However, both show sub-par performance when compared to the multi-
threaded MKL implementation.
If we run on 12 cores (typical current architecture for a node) then we do outper-
form tiled DGEMM (Algorithm 6.1) if a loss of parallelism is not a factor and there is
not too much data movement, i.e., keep nb small enough so that more tiles fit into the
cache but large enough to retain the efficiency of the GEMM kernel. Depicted in Fig-
ure 6.6, the performance of the Strassen-Winograd algorithm is best when r = rmin
[Figure 6.6, panel (a): Gflop/s (and percentage of GEMM) versus the number of threads (1 to 12) for n = 12800, p = 64, comparing recursion levels r = 1 through 4 against tiled_dgemm; panel (b): the corresponding efficiency.]
Figure 6.6: Scalability and efficiency comparisons executed on 12 threads with ma-
trices of 64× 64 tiles and nb = 200.
since the number of tasks and the computational complexity are minimized, which reflects
the analysis of § 6.2.
6.5 Conclusion
In this chapter we have shown and analyzed an implementation of the Strassen-
Winograd algorithm in a tiled framework and provided comparisons to the multi-
threaded MKL standard library as well as a tiled matrix multiplication. The interest
in this implementation is that it can support any level of recursion and any level of
parallelism through the use of the recursion and tile size parameters.
Although our implementation did not perform as well as the highly tuned and optimized multithreaded MKL library, on 12 cores with 2 levels of recursion its performance was only about 2% lower. Ultimately, it is a formidable task
to surpass the MKL implementation considering the computational complexity of a
recursive tiled algorithm.
7. Conclusion
In this thesis, we have studied tiled algorithms both theoretically and experimentally. Our aim was to alleviate the constraints of memory boundedness, task granularity, and synchronicity imposed by the LAPACK library, as brought to light in the Cholesky decomposition example in the introduction. Moreover, we have shown that in some cases one may translate the LAPACK algorithms directly into tiled algorithms, while in other cases a new approach provides better performance gains.
In the study of the Cholesky Inversion, the tiled algorithms provided a unique
insight into the interaction of the three distinct steps involved in the algorithm and
how these may be intertwined. We observed that the choice of the variant in the inversion of the triangular factor (Step 2) has a great impact on the performance of the algorithm. In the scope of an unlimited number of processors, this choice can
lead to an algorithm which performs the inversion of the matrix in almost the same
amount of time it takes to do the Cholesky factorization. Moreover, we note that
the combination of the variants with the shortest critical path length does not trans-
late into the best performing pipelined algorithm. Even though variant 3 of Step 2
(the triangular inversion) did not provide us with the shortest critical path length
on its own, when combined with the other two steps it does provide a Cholesky Inversion
algorithm that performed the best overall.
We have also observed that a simple translation from the LAPACK routines may
not provide a tiled algorithm with the best performance. We showed that loop-
reversals are needed to alleviate anti-dependencies which negatively affect the perfor-
mance of the tiled algorithms.
In the case where a tiled algorithm already existed, namely the QR Factorization,
we made use of a new tiled algorithm to improve upon performance. These algorithms
exhibit more parallelism than state-of-the-art implementations based on elimination
trees. Using ideas from the 1970/80’s, we have theoretically shown that the new
algorithm, GrASAP, is optimal in the scope of unbounded number of processors.
We have provided accurate estimations for the length of the critical paths for all
of these new tiled algorithms and have provided explicit formulas for some of the
algorithms.
In the framework of bounded number of processors, our theoretical work has af-
forded a new bound which more accurately reflects the performance expectations of
the tiled algorithms. In the case of the Cholesky Factorization, the schedule produced
using the Critical Path Method proved to be within seven percent of the ALAP De-
rived bound indicating that this scheduling strategy is well suited for this application.
The ALAP Derived bound has also been used as a tool to show that the optimality
in the scope of an unbounded number of processors does not translate to optimality in the
scheduling on a bounded number of processors.
Overall, the theoretical and experimental portions of this thesis give credence to
the impact of tiled algorithms for multi/many-core architectures.
APPENDIX A. Integer Programming Formulation of Tiled QR
A.1 IP Formulation
We formulate the problem in question using integer programming. An equivalent
binary programming formulation was also constructed (with time as an additional
index), but we found the integer formulations were more quickly solved.
Let i, j, and h denote rows (ranging 1, . . . , p); k, l, l1 denote columns (ranging
1, . . . , q); r, s, and t denote time steps (ranging 1, . . . , T ). The upper bound on the
number of time steps, T, may come from any existing algorithm (Greedy, Asap, GrASAP, etc.). Let the decision variables be defined as follows:
A.1.1 Variables
Let all t be an integer ranging from zero to T denoting the time we complete the
following tasks (and t = 0 means the task is never performed).
• UNMQR: Complete applying the reflectors from GEQRT across the row. (Re-
quires 6 units of time.)
w_ikl = t ∈ [0, T ] : if we finish the update of tile (i, k) at time t. (A.1)
(This update was necessitated by x_il = s : l < k, s < t.)
• GEQRT: Factor a square tile into a triangle. (Requires 4 units of time.)
x_ik = t ∈ [0, T ] : if we complete triangularization of tile (i, k) at the end of time unit t. (A.2)
• TTMQR: Update the entries in two rows after TTQRT. (Requires 6 units of
time.)
y_ijkl = t ∈ [0, T ] : if we finish the update of tile (i, k) and (j, k) at the end of time unit t. (A.3)
(This update was necessitated by z_ijl = s : l < k, s < t.)
• TTQRT: Cancel one triangular tile using another triangle tile. (Requires 2
units of time.)
z_ijk = t ∈ [0, T ] : if we complete zeroing tile (i, k) using tile (j, k) at the end of time unit t. (A.4)
For the y and z actions, it is useful to have a binary variable which is 1 if the action
occurs. Explicitly,
ŷ_ijkl = 1 if y_ijkl > 0, 0 otherwise;    ẑ_ijk = 1 if z_ijk > 0, 0 otherwise. (A.5)
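For concreteness, the variable declarations and the linking of y and z to their indicator counterparts in (A.5) can be written with a generic mixed-integer modeling layer. The sketch below uses the PuLP library purely as an illustration, on a deliberately tiny instance; PuLP is not the tool used for the results in this thesis, and the linking constraints y ≤ T·ŷ and ŷ ≤ y are our explicit rendering of definition (A.5). Only one timing family, (A.6), is spelled out; the remaining families follow the same pattern.

from itertools import product
from pulp import LpProblem, LpVariable, LpMinimize

p, q, T = 3, 2, 40                    # tiny instance; T is any valid upper bound on the time steps
rows, cols = range(1, p + 1), range(1, q + 1)
prob = LpProblem("tiled_qr", LpMinimize)

# Completion-time variables (0 means the task is never performed), cf. (A.1)-(A.4).
w = LpVariable.dicts("w", (rows, cols, cols), lowBound=0, upBound=T, cat="Integer")
x = LpVariable.dicts("x", (rows, cols), lowBound=0, upBound=T, cat="Integer")
y = LpVariable.dicts("y", (rows, rows, cols, cols), lowBound=0, upBound=T, cat="Integer")
z = LpVariable.dicts("z", (rows, rows, cols), lowBound=0, upBound=T, cat="Integer")
y_hat = LpVariable.dicts("yhat", (rows, rows, cols, cols), cat="Binary")
z_hat = LpVariable.dicts("zhat", (rows, rows, cols), cat="Binary")

# Indicator linking, cf. (A.5): yhat = 1 exactly when y > 0 (similarly for z).
for i, j, k, l in product(rows, rows, cols, cols):
    prob += y[i][j][k][l] <= T * y_hat[i][j][k][l]
    prob += y_hat[i][j][k][l] <= y[i][j][k][l]
for i, j, k in product(rows, rows, cols):
    prob += z[i][j][k] <= T * z_hat[i][j][k]
    prob += z_hat[i][j][k] <= z[i][j][k]

# Timing family (A.6): w-updates in a row must come at least 3 time units after earlier ones.
for k in cols:
    if k >= 2:
        for i in rows:
            for l in range(1, k):
                for l1 in range(1, l):
                    if i >= l:
                        prob += w[i][k][l] >= w[i][k][l1] + 3
# Constraint families (A.7)-(A.28) and an objective (e.g. minimizing the final
# completion time) would be added in the same manner.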
A.1.2 Constraints
1. Time constraints for each of the four actions:
(a) Time for $w_{ikl}$
i. $w_{ikl}$ must occur at least 3 time steps after earlier $w$ updates.
$w_{ikl} \geq w_{ikl_1} + 3 \quad \forall\; k \geq 2,\ i \geq l,\ l_1 < l < k$  (A.6)
ii. $w_{ikl}$ must occur at least 3 time steps after earlier $y$ updates.
$w_{ikl} \geq y_{ijkl_1} + y_{jikl_1} + 3 \quad \forall\; k \geq 2,\ i \geq l,\ j,\ l_1 < l < k$  (A.7)
iii. $w_{ikl}$ must occur at least 3 time steps before $y$ updates in the current column.
$w_{ikl} + 3 \leq y_{ijkl} + y_{jikl} + (1 - \hat{y}_{ijkl} - \hat{y}_{jikl})T \quad \forall\; i, j,\ l < k,\ k \geq 2$  (A.8)
iv. $w_{ikl}$ must occur at least 2 time steps before $x$.
$w_{ikl} + 2 \leq x_{ik} \quad \forall\; i \geq k \geq 2,\ l < k$  (A.9)
v. $w_{ikl}$ must occur at least 1 time step before $z$ actions.
$w_{ikl} + 1 \leq z_{ijk} + z_{jik} + (1 - \hat{z}_{ijk} - \hat{z}_{jik})T \quad \forall\; i, j,\ k \geq 2,\ l < k$  (A.10)
(b) Time for $x_{ik}$
i. $x_{ik}$ must occur 2 time steps after any $w$ updates. (Identical to equation (A.9).)
ii. $x_{ik}$ must occur 2 time steps after any $y$ updates.
$x_{ik} \geq y_{ijkl} + y_{jikl} + 2 \quad \forall\; j, l,\ i \geq k,\ l < k$  (A.11)
iii. $x_{ik}$ must occur 1 time step before any $z$ action.
$x_{ik} + 1 \leq z_{ijk} + z_{jik} + (1 - \hat{z}_{ijk} - \hat{z}_{jik})T \quad \forall\; i \geq k,\ j$  (A.12)
(c) Time for $y_{ijkl}$
i. $y_{ijkl}$ must occur 3 time steps after $w$ updates originating from the same column. (Identical to equation (A.8).)
ii. $y_{ijkl}$ must occur 2 time steps before any $x_{ik}$ or $x_{jk}$. (Identical to equation (A.11).)
iii. $y_{ijkl}$ must be 3 time steps before or after any $y$ action involving rows $i$ or $j$.
We must enforce
$y_{ijkl} + y_{jikl} + 3 \leq y_{hikl} + y_{ihkl} + (1 - \hat{y}_{hikl} - \hat{y}_{ihkl})T$
or
$y_{hikl} + y_{ihkl} + 3 \leq y_{ijkl} + y_{jikl} + (1 - \hat{y}_{ijkl} - \hat{y}_{jikl})T$
and
$y_{ijkl} + y_{jikl} + 3 \leq y_{hjkl} + y_{jhkl} + (1 - \hat{y}_{hjkl} - \hat{y}_{jhkl})T$
or
$y_{hjkl} + y_{jhkl} + 3 \leq y_{ijkl} + y_{jikl} + (1 - \hat{y}_{ijkl} - \hat{y}_{jikl})T$
We define the binary variables $\delta^{1}_{hijkl}$, $\delta^{2}_{hijkl}$, $\delta^{3}_{hijkl}$, and $\delta^{4}_{hijkl}$, and include
the disjunctive constraints (a sketch of how such either-or constraints can be expressed with a solver follows this list of constraints):
$y_{ijkl} + y_{jikl} + 3 \leq y_{hikl} + y_{ihkl} + (1 - \hat{y}_{hikl} - \hat{y}_{ihkl})T + \delta^{1}_{hijkl}T \quad \forall\; h, i, j,\ l < k,\ k \geq 2$  (A.13)
$y_{hikl} + y_{ihkl} + 3 \leq y_{ijkl} + y_{jikl} + (1 - \hat{y}_{ijkl} - \hat{y}_{jikl})T + \delta^{2}_{hijkl}T \quad \forall\; h, i, j,\ l < k,\ k \geq 2$  (A.14)
$\delta^{1}_{hijkl} + \delta^{2}_{hijkl} \geq 1 \quad \forall\; h, i, j,\ l < k,\ k \geq 2$  (A.15)
and
$y_{ijkl} + y_{jikl} + 3 \leq y_{hjkl} + y_{jhkl} + (1 - \hat{y}_{hjkl} - \hat{y}_{jhkl})T + \delta^{3}_{hijkl}T \quad \forall\; h, i, j,\ l < k,\ k \geq 2$  (A.16)
$y_{hjkl} + y_{jhkl} + 3 \leq y_{ijkl} + y_{jikl} + (1 - \hat{y}_{ijkl} - \hat{y}_{jikl})T + \delta^{4}_{hijkl}T \quad \forall\; h, i, j,\ l < k,\ k \geq 2$  (A.17)
$\delta^{3}_{hijkl} + \delta^{4}_{hijkl} \geq 1 \quad \forall\; h, i, j,\ l < k,\ k \geq 2$  (A.18)
iv. $y_{ijkl}$ must occur 3 time steps before any $z$ action involving rows $i$ or $j$.
$y_{ijkl} + 3 \leq z_{hik} + z_{ihk} + (1 - \hat{z}_{hik} - \hat{z}_{ihk})T \quad \forall\; h, i, j,\ k > l,\ k \geq 2$
$y_{ijkl} + 3 \leq z_{hjk} + z_{jhk} + (1 - \hat{z}_{hjk} - \hat{z}_{jhk})T \quad \forall\; h, i, j,\ k > l,\ k \geq 2$  (A.19)
(d) Time for $z_{ijk}$
i. $z_{ijk}$ must occur 1 time step after any $w$ action. (Identical to equation (A.10).)
ii. $z_{ijk}$ must occur 3 time steps after any $y$ action. (Identical to equation (A.19).)
iii. $z_{ijk}$ must occur 1 time step after any $x$ action. (Identical to equation (A.12).)
iv. $z_{ijk}$ must occur 1 time step before or after any other $z$ action.
A. Case 1: we use $(i, k)$ to zero $(j, k)$ and $(i, k)$ to zero $(h, k)$.
We need to enforce
$z_{jik} + 1 \leq z_{hik} + (1 - \hat{z}_{hik})T$
or
$z_{hik} + 1 \leq z_{jik} + (1 - \hat{z}_{jik})T$
To do this, we define binary variables $\delta^{5}_{hijk}$ and $\delta^{6}_{hijk}$ and include the disjunctive constraints as follows.
$z_{jik} + 1 \leq z_{hik} + (1 - \hat{z}_{hik})T + \delta^{5}_{hijk}T \quad \forall\; h, i, j, k$  (A.20)
$z_{hik} + 1 \leq z_{jik} + (1 - \hat{z}_{jik})T + \delta^{6}_{hijk}T \quad \forall\; h, i, j, k$  (A.21)
$\delta^{5}_{hijk} + \delta^{6}_{hijk} \geq 1 \quad \forall\; h, i, j, k$  (A.22)
B. Case 2: we use $(i, k)$ to zero $(j, k)$ and $(h, k)$ to zero $(i, k)$. We need to enforce
$z_{jik} + 1 \leq z_{ihk} + (1 - \hat{z}_{ihk})T \quad \forall\; h, i, j, k$ ($i > j$?)  (A.23)
2. A tile cannot zero itself (and hence we cannot update just a single row afterwards).
$z_{iik} = 0 \quad \forall\; i, k \qquad y_{iikl} = 0 \quad \forall\; i, k, l$  (A.24)
3. Both tiles involved in a TTQRT must be triangular before one can zero the other.
$x_{ik} \leq (1 - \hat{z}_{ijk})T + z_{ijk} \quad \forall\; i, j, k \qquad x_{jk} \leq (1 - \hat{z}_{ijk})T + z_{ijk} \quad \forall\; i, j, k$  (A.25)
4. Force updates after triangularization and zeroing actions.
(a) After a tile is triangularized, updates must occur in the next column.
$x_{ik} \leq w_{ilk} - 3 \quad \forall\; i,\ k < q,\ i \geq k,\ l > k$  (A.26)
(b) After a tile is zeroed, updates must occur in the next column.
$z_{ijk} \leq y_{ijlk} + y_{jilk} \quad \forall\; i, j,\ k < l$  (A.27)
5. The updates of $(i, k)$ (arising from pivot $l$) from triangularizing must occur
before updates from zeroing involving $(i, k)$ (also arising from pivot $l$).
$w_{ikl} \leq (1 - \hat{y}_{ijkl} - \hat{y}_{jikl})T + y_{ijkl} + y_{jikl} \quad \forall\; i, j,\ k > l$  (A.28)
6. No updates to a tile can occur after triangularization.
$x_{ik} \geq w_{ikl} \quad \forall\; i \geq k,\ k > l \qquad x_{ik} \geq y_{ijkl} + y_{jikl} \quad \forall\; i \geq k,\ j,\ k > l$  (A.29)
7. After a tile $(i, k)$ is zeroed, we cannot use it to zero another tile.
$z_{ijk} \geq z_{hik} \quad \forall\; h, i, j, k$  (A.30)
8. Tiles on or below the diagonal must be triangularized at some point (and we
cannot finish a triangularization until time step 2).
$x_{ik} \geq 2 \quad \forall\; i \geq k$  (A.31)
9. Tiles strictly below the diagonal must be zeroed at some point.
$\sum_{j} \hat{z}_{ijk} = 1 \quad \forall\; i > k$  (A.32)
10. We cannot triangularize above the diagonal.
$x_{ik} = 0 \quad \forall\; i < k$  (A.33)
11. Force the binary variables to agree with the integer variables.
$\hat{y}_{ijkl} \leq y_{ijkl} \qquad \hat{y}_{ijkl} \cdot T \geq y_{ijkl} \quad \forall\; i, j, k, l$  (A.34)
$\hat{z}_{ijk} \leq z_{ijk} \qquad \hat{z}_{ijk} \cdot T \geq z_{ijk} \quad \forall\; i, j, k$  (A.35)
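Constraints (A.6) through (A.35) rely on two standard integer-programming devices: the big-M construction with M = T for either-or (disjunctive) requirements, as referenced in constraint (c)iii above, and the linking of binary indicators to integer completion times, as in (A.34) and (A.35). The sketch below is illustrative only; it reuses the hypothetical prob, z, zhat, and T objects from the earlier sketch and shows both patterns in PuLP. The remaining big-M constraints in this section follow the same two patterns.

# Illustrative sketch (assumption): big-M patterns used throughout A.1.2, with M = T.
from pulp import LpVariable, LpBinary

# (1) Indicator linking, cf. (A.35): zhat_{ijk} = 1 exactly when z_{ijk} > 0.
for key, zvar in z.items():
    prob += zhat[key] <= zvar           # zhat may be 1 only if z > 0
    prob += zhat[key] * T >= zvar       # z > 0 forces zhat to be 1

# (2) Disjunctive constraint, cf. (A.20)-(A.22): either z_jik + 1 <= z_hik or
#     z_hik + 1 <= z_jik, each side deactivated by T when its zeroing does not occur.
h, i, j, k = 3, 2, 4, 1                 # example indices (assumed to be in range)
d5 = LpVariable("delta5_%d_%d_%d_%d" % (h, i, j, k), cat=LpBinary)
d6 = LpVariable("delta6_%d_%d_%d_%d" % (h, i, j, k), cat=LpBinary)
prob += z[(j, i, k)] + 1 <= z[(h, i, k)] + (1 - zhat[(h, i, k)]) * T + d5 * T
prob += z[(h, i, k)] + 1 <= z[(j, i, k)] + (1 - zhat[(j, i, k)]) * T + d6 * T
prob += d5 + d6 >= 1                    # at least one branch of the disjunction holds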
A.1.2.1 Precedence constraints
The most cumbersome constraints to formulate are those forcing the correspond-
ing TTQRT and TTMQR operations to be executed in the same order.
For each tile $(i, k)$ involved in a zeroing process with $(j, k)$ and $(h, k)$, the order
of the updates must follow the order of the zeroing processes. We want $z_{jik} < z_{ihk}$
(for some $h$) or $z_{jik} < z_{hik}$ (for some $h$) to force $y_{ij(k+1)k} < y_{ih(k+1)k}$.
1. $a^1$
$a^1_{hijk} = \begin{cases} 1 & \text{if we [use } (i, k) \text{ to zero } (h, k)] \text{ and also [use } (i, k) \text{ to zero } (j, k)] \\ 0 & \text{otherwise} \end{cases}$  (A.36)
$a^1_{hijk} \leq \hat{z}_{hik}$
$a^1_{hijk} \leq \hat{z}_{jik}$
$a^1_{hijk} + 1 \geq \hat{z}_{hik} + \hat{z}_{jik}$  (A.37)
2. $a^2$
$a^2_{hijk} = \begin{cases} 1 & \text{if we [use } (h, k) \text{ to zero } (i, k)] \text{ after [using } (i, k) \text{ to zero } (j, k)] \\ 0 & \text{otherwise} \end{cases}$  (A.38)
$a^2_{hijk} \leq \hat{z}_{ihk}$
$a^2_{hijk} \leq \hat{z}_{jik}$
$a^2_{hijk} + 1 \geq \hat{z}_{ihk} + \hat{z}_{jik}$  (A.39)
3. $b$
$b_{hijk} = \begin{cases} 1 & \text{if } z_{hik} > z_{jik}, \text{ regardless of whether } z_{hik} = 0 \\ 0 & \text{otherwise} \end{cases}$  (A.40)
$T b_{hijk} \geq z_{hik} - z_{jik}$
$T(b_{hijk} - 1) \leq z_{hik} - z_{jik}$  (A.41)
4. $c^1$
$c^1_{hijk} = \begin{cases} 1 & \text{if } a^1_{hijk} = 1 \text{ and } z_{hik} > z_{jik} \\ 0 & \text{otherwise} \end{cases}$  (A.42)
$c^1_{hijk} \leq a^1_{hijk}$
$c^1_{hijk} \leq b_{hijk}$
$c^1_{hijk} + 1 \geq a^1_{hijk} + b_{hijk}$  (A.43)
5. $c$
$c_{hijk} = \begin{cases} 1 & \text{if the zeroing actions in column } k \text{ of rows } h \text{ and } i \text{ occur after the zeroing actions of rows } i \text{ and } j \\ 0 & \text{otherwise} \end{cases}$  (A.44)
$c_{hijk} \geq c^1_{hijk}$
$c_{hijk} \geq a^2_{hijk}$
$c_{hijk} \leq c^1_{hijk} + a^2_{hijk}$  (A.45)
So we now have a variable $c_{hijk}$ that is 1 when the zeroing of rows $h$ and $i$ comes after
the zeroing of rows $i$ and $j$, and hence when the updates of rows $h$ and $i$ must come
after the updates of rows $i$ and $j$. We can define similar variables for the updates
rather than the zeroings.
6. $d$
$d_{hijk} = \begin{cases} 1 & \text{if we [update } (i, k) \text{ and } (h, k) \text{ together] and also [update } (i, k) \text{ and } (j, k) \text{ together]; all updates occur because of zeroing in column } k - 1 \\ 0 & \text{otherwise} \end{cases}$  (A.46)
$d_{hijk} \leq \hat{y}_{hik(k-1)} + \hat{y}_{ihk(k-1)}$
$d_{hijk} \leq \hat{y}_{jik(k-1)} + \hat{y}_{ijk(k-1)}$
$d_{hijk} + 1 \geq \hat{y}_{hik(k-1)} + \hat{y}_{ihk(k-1)} + \hat{y}_{jik(k-1)} + \hat{y}_{ijk(k-1)}$  (A.47)
7. $e$
$e_{hijk} = \begin{cases} 1 & \text{if the updates in column } k \text{ of rows } h \text{ and } i \text{ occur after the updates of rows } i \text{ and } j, \text{ or the updates of rows } i \text{ and } j \text{ never happen} \\ 0 & \text{otherwise} \end{cases}$  (A.48)
$T e_{hijk} \geq \left(y_{hik(k-1)} + y_{ihk(k-1)}\right) - \left(y_{jik(k-1)} + y_{ijk(k-1)}\right)$
$T(e_{hijk} - 1) \leq \left(y_{hik(k-1)} + y_{ihk(k-1)}\right) - \left(y_{jik(k-1)} + y_{ijk(k-1)}\right)$  (A.49)
8. $f$
$f_{hijk} = \begin{cases} 1 & \text{if the updates in column } k \text{ of rows } h \text{ and } i \text{ occur after the updates of rows } i \text{ and } j \\ 0 & \text{otherwise} \end{cases}$  (A.50)
$f_{hijk} \geq d_{hijk}$
$f_{hijk} \geq e_{hijk}$
$f_{hijk} \leq d_{hijk} + e_{hijk}$  (A.51)
Lastly, to force the updates in order:
$c_{hijk} \leq f_{hijl} \quad \forall\; h, i, j,\ k < l$  (A.52)
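The logical variables above are linearized with two standard tricks: the conjunction of binary variables (as in $a^1$, $c^1$, and $d$) and the indicator of a positive difference (as in $b$ and $e$). A minimal, purely illustrative PuLP sketch of both tricks, under the same assumed prob, z, zhat, and T objects as in the earlier sketches, might look as follows.

# Illustrative sketch (assumption): linearizing the logical building blocks of A.1.2.1.
from pulp import LpVariable, LpBinary

h, i, j, k = 3, 2, 4, 1                 # example indices (assumed to be in range)

# a1_{hijk} = 1 iff tile (i,k) zeroes both (h,k) and (j,k), cf. (A.36)-(A.37) [AND of binaries].
a1 = LpVariable("a1_%d_%d_%d_%d" % (h, i, j, k), cat=LpBinary)
prob += a1 <= zhat[(h, i, k)]
prob += a1 <= zhat[(j, i, k)]
prob += a1 + 1 >= zhat[(h, i, k)] + zhat[(j, i, k)]

# b_{hijk} = 1 iff z_{hik} > z_{jik}, cf. (A.40)-(A.41) [indicator of a positive difference].
b = LpVariable("b_%d_%d_%d_%d" % (h, i, j, k), cat=LpBinary)
prob += T * b >= z[(h, i, k)] - z[(j, i, k)]
prob += T * (b - 1) <= z[(h, i, k)] - z[(j, i, k)]

# c1_{hijk} = a1 AND b, cf. (A.42)-(A.43); c, d, e, and f are built from the same two
# patterns, and (A.52) then couples c in column k to f in later columns l > k.
c1 = LpVariable("c1_%d_%d_%d_%d" % (h, i, j, k), cat=LpBinary)
prob += c1 <= a1
prob += c1 <= b
prob += c1 + 1 >= a1 + b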
A.1.3 Objective function
Let $\mathit{total\_time}$ be a variable such that
$\mathit{total\_time} \geq w_{ikl} \quad \forall\; i, k, l$
$\mathit{total\_time} \geq x_{ik} \quad \forall\; i, k$
$\mathit{total\_time} \geq y_{ijkl} \quad \forall\; i, j, k, l$
$\mathit{total\_time} \geq z_{ijk} \quad \forall\; i, j, k$
Then our objective function is:
$\min\ \mathit{total\_time}$  (A.53)
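In the same hypothetical PuLP model used in the sketches above, the makespan objective (A.53) amounts to bounding a single total_time variable from below by every completion-time variable and minimizing it. The fragment below, which shows only the x and z variables, is again an illustration under those assumptions rather than the thesis implementation.

# Illustrative sketch (assumption): the makespan objective of A.1.3.
from pulp import LpVariable, LpInteger, value

total_time = LpVariable("total_time", lowBound=0, upBound=T, cat=LpInteger)

# total_time must dominate every task completion time (only x and z are shown here).
for var in list(x.values()) + list(z.values()):
    prob += total_time >= var

# Objective: minimize the overall schedule length, cf. (A.53).
prob += total_time

prob.solve()
print("schedule length:", value(total_time))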
REFERENCES
[1] E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, and
S. Tomov. QR Factorization on a Multicore Node Enhanced with Multiple
GPU Accelerators. In Parallel Distributed Processing Symposium (IPDPS), 2011
IEEE International, pages 932–943, May 2011.
[2] E. Agullo, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, J. Langou, and H. Ltaief.
PLASMA Users’ Guide. Technical report, ICL, UTK, 2009.
[3] E. Agullo, B. Hadri, H. Ltaief, and J. Dongarra. Comparative study of one-
sided factorizations with multiple software packages on multi-core hardware. In
SC’09: Proceedings of the Conference on High Performance Computing Network-
ing, Storage and Analysis, pages 1–12, New York, NY, USA, 2009. ACM.
[4] Emmanuel Agullo, Henricus Bouwmeester, Jack Dongarra, Jakub Kurzak, Julien
Langou, and Lee Rosenberg. Towards an efficient tile matrix inversion of sym-
metric positive definite matrices on multicore architectures. In Proceedings of the
9th international conference on High performance computing for computational
science, VECPAR’10, pages 129–138, Berlin, Heidelberg, 2011. Springer-Verlag.
[5] Emmanuel Agullo, Bilel Hadri, Hatem Ltaief, and Jack Dongarra. Comparative
study of one-sided factorizations with multiple software packages on multi-core
hardware. In Proceedings of the Conference on High Performance Computing
Networking, Storage and Analysis (SC ’09), pages 1–12. IEEE Computer Society
Press, 2009.
[6] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A
Dependence-based Approach. Morgan Kaufmann, 2001.
[7] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. W. Demmel, J. Dongarra,
J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen.
LAPACK Users’ Guide. SIAM, 1992.
[8] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer,
John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John
Wawrzynek, David Wessel, and Katherine Yelick. A view of the parallel com-
puting landscape. Commun. ACM, 52(10):56–67, October 2009.
[9] Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded
Schwartz. Communication-optimal parallel algorithm for Strassen’s matrix mul-
tiplication. In Proceedings of the 24th ACM symposium on Parallelism in algo-
rithms and architectures, SPAA ’12, pages 193–204, New York, NY, USA, 2012.
ACM.
[10] P. Bientinesi, B. Gunter, and R. van de Geijn. Families of algorithms related to
the inversion of a symmetric positive definite matrix. ACM Trans. Math. Softw.,
35(1):1–22, 2008.
[11] Bilel Hadri and Hatem Ltaief and Emmanuel Agullo and Jack Dongarra. Enhanc-
ing Parallelism of Tile QR Factorization for Multicore Architectures. Technical
Report 222, LAPACK Working Note, 2009.
[12] D. Bini, M. Capovani, F. Romani, and G. Lotti. O(n2.7799) complexity for n× n
approximate matrix multiplication. Information Processing Letters, 8(5):234–
235, 1979.
[13] L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, J. Don-
garra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C.
Whaley. ScaLAPACK Users’ Guide. SIAM, 1997.
[14] Susan Blackford and Jack J. Dongarra. Installation Guide for LAPACK. Tech-
nical Report 41, LAPACK Working Note, June 1999. Originally released March
1992.
[15] BLAS. Basic Linear Algebra Subprograms. http://www.netlib.org/blas/.
[16] Henricus Bouwmeester and Julien Langou. A Critical Path Approach to Ana-
lyzing Parallelism of Algorithmic Variants. Application to Cholesky Inversion.
CoRR, abs/1010.2000, 2010.
[17] Brice Boyer, Jean-Guillaume Dumas, Clément Pernet, and Wei Zhou. Mem-
ory efficient scheduling of Strassen-Winograd’s matrix multiplication algorithm.
In Proceedings of the 2009 international symposium on Symbolic and algebraic
computation, ISSAC ’09, pages 55–62, New York, NY, USA, 2009. ACM.
[18] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. Parallel tiled QR fac-
torization for multicore architectures. Concurrency Computat.: Pract. Exper.,
20(13):1573–1590, 2008.
[19] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A class of parallel tiled linear
algebra algorithms for multicore architectures. Parallel Computing, 35(1):38–53,
2009.
[20] H. Casanova, A. Legrand, and Y. Robert. Parallel algorithms. Chapman &
Hall/CRC numerical analysis and scientific computing. CRC Press, 2009.
[21] Chung-Chiang Chou, Yuefan Deng, Gang Li, and Yuan Wang. Parallelizing
Strassen’s Method for Matrix Multiplication on Distributed-Memory MIMD Ar-
chitectures. In Computers for Mathematics with Applications, 1994.
[22] N. Christofides. Graph Theory: An Algorithmic Approach. Academic Press,
1975.
[23] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progres-
sions. In Proceedings of the nineteenth annual ACM symposium on Theory of
computing, STOC ’87, pages 1–6, New York, NY, USA, 1987. ACM.
[24] Michel Cosnard, Jean-Michel Muller, and Yves Robert. Parallel QR decomposi-
tion of a rectangular matrix. Numerische Mathematik, 48:239–249, 1986.
[25] Michel Cosnard and Yves Robert. Complexity of parallel QR factorization. Jour-
nal of the A.C.M., 33(4):712–723, 1986.
[26] J. W. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-
avoiding parallel and sequential QR and LU factorizations: theory and practice.
Technical Report 204, LAPACK Working Note, 2008.
[27] Craig C. Douglas, Michael Heroux, Gordon Slishman, and Roger M. Smith.
GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix
Multiply Algorithm, 1994.
[28] R. Eigenmann, J. Hoeflinger, and D. Padua. On the automatic parallelization of
the Perfect Benchmarks®. IEEE Trans. Parallel Distrib. Syst., 9(1):5–23, 1998.
[29] Emmanuel Agullo and Jack Dongarra and Rajib Nath and Stanimire Tomov. A
Fully Empirical Autotuned Dense QR Factorization For Multicore Architectures.
Technical Report 242, LAPACK Working Note, 2011.
[30] Bilel Hadri, Hatem Ltaief, Emmanuel Agullo, and Jack Dongarra. Tile QR
Factorization with Parallel Panel Processing for Multicore Architectures. In
24th IEEE Int. Parallel Distributed Processing Symposium IPDPS’10, 2010.
[31] Henricus Bouwmeester and Mathias Jacquelin and Julien Langou and Yves
Robert. Tiled QR factorization algorithms. Technical Report 7601, INRIA,
France, apr 2011. Available at http://hal.inria.fr/docs/00/58/62/39/PDF/
RR-7601.pdf.
[32] N. J. Higham. Accuracy and Stability of Numerical Algorithms. Society for
Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition,
2002.
[33] J. Kurzak and J. Dongarra. Fully dynamic scheduler for numerical computing on
multicore processors. University of Tennessee CS Tech. Report, UT-CS-09-643,
2009.
[34] J. Kurzak and J. Dongarra. QR factorization for the Cell Broadband Engine.
Sci. Program., 17(1-2):31–42, 2009.
[35] R. E. Lord, J. S. Kowalik, and S. P. Kumar. Solving Linear Algebraic Equations
on an MIMD Computer. J. ACM, 30(1):103–117, January 1983.
[36] J.J. Modi and M.R.B. Clarke. An alternative Givens ordering. Numerische
Mathematik, 43:83–90, 1984.
[37] Mounir Marrakchi and Yves Robert. Optimal algorithms for Gaussian elimina-
tion on an MIMD computer. Parallel Computing, 12(2):183 – 194, 1989.
[38] Yuji Nakatsukasa and Nicholas J. Higham. Stable and Efficient Spectral Divide
and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the
SVD. http://eprints.ma.man.ac.uk/1824/, preprint.
[39] V. Ya. Pan. Strassen's algorithm is not optimal: trilinear technique of aggregating,
uniting and canceling for constructing fast algorithms for matrix operations.
In Foundations of Computer Science, 1978., 19th Annual Symposium on, pages
166–176, Oct. 1978.
[40] J. M. Perez, P. Bellens, R. M. Badia, and J. Labarta. CellSs: Making it easier to
program the Cell Broadband Engine processor. IBM J. Res. & Dev., 51(5):593–
604, 2007.
[41] G. Quintana-Ortí, E. S. Quintana-Ortí, R. A. van de Geijn, F. G. Van Zee, and
Ernie Chan. Programming matrix algorithms-by-blocks for thread-level paral-
lelism. ACM Transactions on Mathematical Software, 36(3), 2009.
[42] M. C. Rinard, D. J. Scales, and M. S. Lam. Jade: A high-level, machine-
independent language for parallel programming. Computer, 6:28–38, 1993.
[43] Yves Robert. The Impact of Vector and Parallel Architectures on the Gaussian
Elimination Algorithm. Manchester University Press, 1991.
[44] F. Romani. Some Properties of Disjoint Sums of Tensors Related to Matrix
Multiplication. SIAM Journal on Computing, 11(2):263–267, 1982.
[45] A.H. Sameh and D.J. Kuck. On stable parallel linear systems solvers. J. ACM,
25:81–91, 1978.
[46] A. Schönhage. Partial and Total Matrix Multiplication. SIAM Journal on Com-
puting, 10(3):434–455, 1981.
[47] SimGrid. URL: http://simgrid.gforge.inria.fr.
[48] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik,
13:354–356, 1969. 10.1007/BF02165411.
[49] Frédéric Suter. Mixed Parallel Implementations of the Top Level Step of Strassen
and Winograd Matrix Multiplication Algorithms. In Proceedings of the 15th In-
ternational Parallel and Distributed Processing Symposium (IPDPS’01) - Volume
1, IPDPS ’01, pages 10008.2–, Washington, DC, USA, 2001. IEEE Computer So-
ciety.
[50] H. Sutter. A fundamental turn toward concurrency in software. Dr. Dobb’s
Journal, 30(3), 2005.
[51] R. Clint Whaley and Anthony M. Castaldo. Achieving accurate and context-
sensitive timing for code optimization. Softw. Pract. Exper., 38:1621–1642, De-
cember 2008.
[52] Christopher G. Willard, Addison Snell, and Laura Segervall. HPC User Site
Census: Systems. http://www.intersect360.com/industry/reports.php?
id=67, 2012.
[53] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an in-
sightful visual performance model for multicore architectures. Commun. ACM,
52:65–76, April 2009.
[54] Shmuel Winograd. On Multiplication of 2× 2 Matrices. Linear Algebra and Its
Applications, 4:381–388, 1971.
