suCAQR: A Simplified Communication-Avoiding QR Factorization Solver Using the TBLAS Framework by Zheng, Weijian et al.
suCAQR: A Simplified Communication-Avoiding
QR Factorization Solver Using the TBLAS
Framework
Weijian Zheng and Fengguang Song
Department of Computer Science
Indiana University Purdue University Indianapolis
Email: {wz26, fgsong}@iupui.edu
Lan Lin
Department of Computer Science
Ball State University
Email: llin4@bsu.edu
Zizhong Chen
Department of Computer Science
University of California at Riverside
Email: chen@cs.ucr.edu
Abstract—The scope of this paper is to design and implement
a scalable QR factorization solver that can deliver the fastest
performance for tall and skinny matrices and square matrices on
modern supercomputers. The new solver, named scalable univer-
sal communication-avoiding QR factorization (suCAQR), introduces
a simplified and tuning-less way to realize the communication-
avoiding QR factorization algorithm to support matrices of any
shapes. The software design includes a mixed usage of physical
and logical data layouts, a simplified method of dynamic-root
binary-tree reduction, a dynamic dataflow implementation, and
an analytical model to determine an optimal number of fac-
torization domains. Compared with the existing communication
avoiding QR factorization implementations, suCAQR has the
benefits of being simpler, more general, and more efficient. By
balancing the degree of parallelism and the proportion of faster
computational kernels, it is able to achieve scalable performance
on clusters of multicore nodes. The software essentially combines
the strengths of both synchronization-reducing approach and
communication-avoiding approach to achieve high performance.
Based on the experimental results using 1,024 CPU cores, suCAQR
is faster than DPLASMA by up to 30%, and faster than
ScaLAPACK by up to 30 times.
Index Terms—high performance computing; computational
science application; performance modeling and optimization;
dataflow runtime system.
I. INTRODUCTION
QR factorization is a fundamental computational kernel for
many important scientific, engineering, and big data analytics
applications. It has been applied to solving linear systems,
least-squares problems, linear regression problems, and the
production function modeling, as well as assessing the con-
ditioning of these problems [1]–[3]. The QR factorization of
a matrix A of dimension m × n (m ≥ n) takes the form of
A = QR, where Q is an m × m orthogonal matrix, and R
(=QTA) is an upper triangular matrix with zeros below its
diagonal. The design and implementation of more scalable
QR factorizations will accelerate a wide range of domain
applications.
The most widely used parallel algorithm to solve QR
factorizations is the block QR factorization algorithm used
by the de facto standard ScaLAPACK library [4], [5]. As
displayed in Figure 1, matrix A is divided into a thin panel
(i.e., A11A21 ) of dimension M × NB, a block of rows A12, and
R11
 V21
R12
A22
A11
A21
A12
A22
M
NB
Fig. 1. The classic block QR factorization algorithm used by ScaLAPACK.
a trailing submatrix A22. The block algorithm first applies
level 1 PBLAS subroutines to the panel (A11A21 ), next it forms
the triangular factor from the panel, finally it uses level 3
PBLAS to factor A12 and update A22. However, since the
panel is computed one column after another—resulting in a
large communication overhead and surface-to-volume ratio —
the block algorithm does not scale well for tall and skinny
matrices (i.e., matrices with much more rows than columns).
To reduce the large communication overhead with tall and
skinny matrices, Demmel et al. then designed an algorithm
called Communication-Avoiding QR factorization (CAQR) [6]–
[8]. As explained in Figure 2, instead of computing a sequence
of column-by-column operations, CAQR can perform a set of
level 3 BLAS operations in the panel. Then it merges the
output of the level 3 BLAS operations to get the final factor
R. Not only does the algorithm convert level 1 BLAS to
level 3 BLAS, but also it significantly reduces the number of
communication messages. The original work of Demmel et al.
[6] offered an estimated performance speedup of CAQR. Later,
Song et al. [9] developed the first parallel implementation of
CAQR, which is referred to as distributed tiled CAQR [9].
While the distributed tiled CAQR demonstrates good strong
scalability and weak scalability on different HPC systems, it
only attains good performance for tall and skinny matrices.
The issue is that its performance starts to decrease when the
input matrix goes from skinny to square. The distributed tiled
CAQR algorithm uses a static block distribution method to
map blocks of rows to different processes. Although the block
distribution method can minimize the inter-node communica-
___________________________________________________________________
This is the author's manuscript of the article published in final edited form as:
Zheng, W., Song, F., Lin, L., & Chen, Z. (2016, December). suCAQR: A Simplified Communication-Avoiding QR Factorization Solver Using 
the TBLAS Framework. In Parallel and Distributed Systems (ICPADS), 2016 IEEE 22nd International Conference on (pp. 1092-1099). IEEE.  
http://dx.doi.org/10.1109/ICPADS.2016.0144
tion, it makes certain processes turn into idle processes on non-
skinny matrices (see more details in Section II). Another work
of Hierarchical QR [10] investigates the effect of using a wide
variety of reduction trees to improve the CAQR performance.
However, determining the optimal tree and the optimal number
of CAQR domains is dependent on the actual input and the
number of CPUs, which are not known until executing the
program. In this paper, we target designing a new solver that
has a simpler design and implementation, does not require a lot
of parameter tuning, and scales well on matrices of any shapes,
ranging from extremely tall and skinny to square matrices.
To achieve the goal, we design a QR factorization solver
called “Scalable Universal Communication-Avoiding QR Fac-
torization” (suCAQR). The suCAQR solver consists of two
types of computations: 1) process-local computation, and 2)
global binary-tree reduction among all processes. It also mixes
two types of data layout: 1) physical block data layout, and
2) logical block-cyclic data layout. The suCAQR employs a
logic block cyclic data distribution method to achieve load
balancing while applying a tiled QR factorization method to
the physical data storage directly to avoid communication. It
uses a dynamic-root binary-tree to merge results from different
factorization domains (domain is a group of computational
tasks that is independent of other groups). Also, suCAQR can
maintain a good tradeoff between the degree of parallelism and
the number of faster kernels for matrices of different shapes by
changing the number of factorization domains. An analytical
model is developed to determine the optimal number of
factorization domains without any tuning work. Furthermore,
suCAQR uses a dynamic dataflow-driven TBLAS runtime [11]
to enable efficient synchronization-reducing executions on a
specific communication-avoiding algorithm.
We measure and evaluate the weak scalability for three
different libraries: ScaLAPACK, DPLASMA [10], [12], and
suCAQR. We also test different matrix shapes ranging from
extremely tall and skinny to square using 1,024 cores. Based
on the experimental results, suCAQR can outperform ScaLA-
PACK by 30 times and DPLASMA by 30%. Based on the
runtime efficiency analysis, we also find that suCAQR’s total
wall clock execution time is almost equal to its computation
time (i.e., the CPU time taken by the computational kernels).
This implies that there is no CPU idle time and the communi-
cation cost is totally hidden by computations (further discussed
in Section VII).
To the best of our knowledge, this work makes the following
A0 
A1 
A2 
A3 
Q00 R00 
Q10 R10 
Q20 R20 
Q30 R30 
Q01 R01 
Q11 R11 
Q02 R02 
Fig. 2. Communication-Avoiding QR (CAQR) performs level 3 BLAS on the
panel (i.e., A0, A1, A2, A3) followed by a parallel reduction.
contributions: 1) we design and implement a scalable suCAQR
solver that works efficiently for both tall and skinny and square
matrices; 2) it demonstrates that we can simplify the software
implementation by using a unique reduction tree meanwhile
we can provide scalable performance; 3) we introduce an
analytical model to accurately predict the optimal number of
factorization domains by avoiding any tuning work, which
is required by all existing CAQR implementations; and 4)
detailed algorithm analysis and performance analysis reveal
that the task-based dynamic scheduling approach is efficient
in avoiding CPU idle cycles and hiding expensive communi-
cations, and should be adopted into more parallel software.
The rest of the paper is organized as follows. Next section
introduces the performance issues in the existing distributed
tiled CAQR solver. Section III illustrates the algorithm used
by the suCAQR solver. Section IV analyzes suCAQR with
respect to both computation and communication costs. Section
V presents an analytical model to predict the optimal number
of domains. Section VI describes suCAQR’s dynamic dataflow
execution within TBLAS and Section VII shows the experi-
mental results. The final two sections present the related work
and summarize this work.
II. BACKGROUND
This section briefly introduces the previous work of dis-
tributed tiled CAQR [9] and its existing performance bottle-
neck.
Given a matrix A with mb × nb blocks (mb >> nb),
the distributed tiled CAQR divides the matrix into a number
of groups of rows. Each group is called a domain. For
instance, mb rows will be cut into D (horizontal) domains.
Assuming P MPI processes are created, each process will be
responsible for a number of DP consecutive domains. Each
domain corresponds to a local QR factorization and requires
no communication. The results from all the domains must be
merged (or reduced) to a single result. To implement the global
parallel reduction, two level of trees are deployed: domains-
level reduction tree and inter-process reduction tree.
However, the distributed tiled CAQR algorithm cannot scale
well if the matrix is not tall and skinny. Given a matrix with
mb rows of tiles and P processes, each process only computes
its own mbP rows. As the algorithm visits and computes the
matrix from top left to bottom right, more processes become
idle, which results in load imbalance and poor performance. If
the number of columns nb is greater than mbP , dnb/(mbP )e− 1
processes will become idle at the end of the computation. For
instance, if mb = nb, P − 1 processes will become idle at
the last iteration of the algorithm. In this paper, we design a
more general QR factorization solver that can achieve high
scalability for all matrix shapes.
III. THE ALGORITHM
suCAQR essentially consists of three components: 1) the tile
data storage, 2) the logical/physical data layouts, and 3) the
suCAQR computation. After a matrix is distributed to different
processes, the data can be viewed from two perspectives: a
logical data layout and a physical data layout. The rest of the
section describes the three components.
A. Tile Data Storage
suCAQR uses a tile data storage. We divide a matrix into
b × b square blocks (also known as tiles). Each tile is stored
in a contiguous memory block. If the tile size is tuned
appropriately, we can fit multiple tiles in caches and maximize
the cache hit rate. For instance, a 12 × 12 matrix can be
represented as follows: a1,1 a1,2 a1,3a2,1 a2,2 a2,3
a3,1 a3,2 a3,3
 , where
ai,j represents a tile of size 4 × 4. In the algorithm, every
computational kernel takes as input several individual tiles
(i.e., the basic unit) and computes new output.
Given an input matrix with mb×nb tiles, suCAQR allocates
a 2-D mb × nb array of pointers, each of which points to a
memory block that stores a distinct tile.
B. Usage of a Mixed Logical/Physical Layout
We use a block-cyclic distribution method to distribute mb
rows of tiles to P processes. The method first divides all the
tile rows into groups of size B, then distributes the groups to
different processes. With the block-cyclic distribution method,
the i-th tile row is stored in process iB mod P . After the
distribution, each process has mbP rows of tiles.
As shown in Figure 3. (a), twelve rows of a matrix are
distributed to four processes (P0, P1, P2, and P3) in a block-
cyclic way, where the group size B = 1. We call this data
layout a logical (or “virtual”) data layout. However, physically,
each process stores three virtually interleaved rows together
in its local memory. Figure 3. (b) shows the corresponding
physical data layout where each process owns three rows.
Note that suCAQR uses both physical layout and logical
layout. While most parallel solvers only need to use the
physical data layout, suCAQR must follow the logical cyclic
P0
P0
P0
P1
P0
P1
P1
P1
P2
P3
P3
P3
P2
P3
P2
P3
P3
(a) Logical data layout (b) Physical data layout
P0
P0
P1
P1
P2
P2
P2
Fig. 3. After distributing a matrix with 12 × 4 tiles to four processes in
a block cyclic way, we can view the same matrix from two perspectives: a
logical data layout and a physical data layout. Both the logical data layout
and the physical data layout are utilized by the suCAQR algorithm.
pattern (e.g., P0–P3, P0–P3, . . . ) such that the trailing sub-
matrices consistently involve all the processes to achieve load
balancing. It might appear that only one layout is needed
here, but two types of computations working on two different
layouts require that we deal with both of them.
As detailed in next subsection, suCAQR applies local CAQR
factorizations to the physical data layout and applies inter-
process binary tree reduction to the logical data layout at each
iteration. To keep track of both physical data layout and logical
data layout, we use the subroutine of physical 2 logical (as
shown in Algorithm 1) to translate a row index from a physical
data layout to a logical data layout (similar to the virtual
memory idea used in operating systems).
Algorithm 1 Physical to Logical Index Translation
int physical 2 logical(i, B, P, pid)
Input: physical row number i, group size B, P processes.
Output: logical row number.
cycle size ← B×P; /*#rows per cycle*/
/*number of logical rows before the i-th physical row*/
begin row ← bi/Bc×cycle size;
return (begin row+pid×B+i%B);
C. suCAQR Computation
The parallel implementation of suCAQR is shown in Algo-
rithm 2. Given a matrix with mb×nb tiles, the algorithm goes
through nb iterations. At each iteration, there are two stages
of computations: 1) every process performs a local CAQR
factorization independently, followed by 2) a parallel binary-
tree reduction to merge the partial results from stage 1 to get
the final result.
Stage 1: In the first stage (lines 3–9), every process decides
the beginning row that has not computed the final result in
the physical data layout. As the iteration number increases,
more and more rows in the top portion of each process
get the final result. Next, each process computes a local
CAQR factorization on its local matrix, which spans from
the beginning row to the last local row. The subroutines to
determine the beginning row and perform local CAQR are
briefly introduced as follows, respectively.
• get first row position: In Algorithm 3, given a column
number k, it first checks which process the tile [k, k]
belongs to (i.e., the root process root_pid). Depending
on the relative position of the current process to the
root process (e.g., above, same, below), the first row
that has not computed result may lie in one of the three
different locations (corresponding to the three conditional
statements in Algorithm 3), respectively.
• local caqr: Algorithm 4 calls six basic computa-
tional kernels, which are dgeqrt, dormqr, dtsqrt,
dtsssmqr, dttqrt, and dttssmqr. To simplify our
illustration, we will use the notations of QR1, UP1 (stands
for update), QR2, UP2, Merge, and MergeUpdate
to represent the six kernels correspondingly. The local
CAQR factorization will be applied to the submatrix
whose local rows are between phys_1st_row and the
Algorithm 2 Parallel suCAQR Algorithm
1: suCAQR(A, mb, nb, P, D)
2: for each tile column k ← 0 to nb-1 do
3: . stage 1: local CAQR factorization in each process
4: for each process pid ← 0 to P-1 do
5: phys 1st row ← get first row position(k, B, P, pid);
6: if (phys 1st row < bmb/P c) then
7: local caqr(A,phys 1st row, k, B, mb,nb,P,pid,D);
8: end if
9: end for
10: . stage 2: binary-tree merge among processes
11: root pid ← bk/Bc % P;
12: num active procs ← d(mb − k)/Be;
13: if (num active procs ≥ P) then
14: num active procs ← P;
15: end if
16: for (hgt ← 1 to dlog2 num active procse) do
17: d1 ← 0; d2 ← 0+2hgt−1;
18: while (d2 < num active procs) do
19: p1 ← (d1+root pid)%P;
20: p2 ← (d2+root pid)%P;
21: i1 ← get first row position(k, B, P, p1);
22: i2 ← get first row position(k, B, P, p2);
23: merge two rows(A, i1, i2, p1, p2, k, B ,nb, P);
24: d1+=2hgt; d2+=2hgt;
25: end while
26: end for
27: end for
(mb/P )-th row, and whose columns are between the k-th
column and the nb-th column. The local submatrix can
be regarded as D domains of rows. The parameter D is
an argument passed to the solver, which can vary from
one to the number of rows per process. Local CAQR
first computes a partial result for each domain, and then
summarizes the results by using a local binary-tree merge.
The computations from the D domains can be executed
in an embarrassingly parallel way. Section V introduces
how to determine the optimal number of domains.
Stage 2: In the second stage (lines 10–26 in Algorithm 2),
a parallel binary-tree reduction operation is conducted by a
number of processes. It first decides the root of the parallel
reduction tree, which changes in a block-cyclic manner. Next,
it decides which processes are involved in computing the
trailing submatrix (lines 12–15). We refer to the involved
processes as active processes. The set of active processes
will participate in the parallel reduction operation and each
process contributes one row of tiles. The binary tree has a
height of dlog2(ActiveProcesse)e. It merges the results from
all the leaves up to the proper root identified by root_pid.
Subroutine merge_two_rows is responsible for merging the
partial results from two processes. As shown in Algorithm 5,
it merges the i1-th row in process p1 and the i2-th row in
process p2 to obtain a new result.
Dynamic Trees: The parallel reduction operation is a global
operation that involves all the active processes. Since the
suCAQR factorization proceeds from the top left corner to the
bottom right corner of the matrix, and due to the block cyclic
data layout, the root of the binary tree must change in a block
Algorithm 3 Decide the First Row in Physical Layout
int get first row position(k, B, P, pid)
num cycles ← bbk/Bc/P c;
root pid ← bk/Bc % P;
if (pid < root pid) then
phys 1st row ← B+num cycles×B;
end if
if (pid = root pid) then
phys 1st row ← k%B+num cycles×B;
end if
if (pid > root pid) then
phys 1st row ← num cycles×B;
end if
return phys 1st row
Algorithm 4 Local CAQR Factorization
1: local caqr(A, phys 1st row, k, B, mb, nb, P, pid, D)
2: ds ← bmb/P/Dc /*rows per domain*/
3: . step 1: do local factorization for each domain
4: for each domain d ← 0 to D-1 do
5: 1st row ← phys 1st row + d*rows per domain
6: R[1st row,k], V[1st row,k], T[1st row,k]
7: ← dgeqrt (A[1st row,k]);
8: for j ← k+1 to nb-1 do
9: A[1st row,j] ← dormqr
10: (V[1st row,k],T[1st row,k],A[1st row,j]);
11: end for
12: for i ← 1st row+1 to bmb/P c-1 do
13: R[i,k], V[i,k], T[i,k]
14: ← dtsqrt (A[1st row,k],A[i,k]);
15: end for
16: for i ← 1st row+1 to 1st row + bmb/P/Dc - 1 do
17: for j ← k+1 to nb-1 do
18: R[1st row,j], A[i,j] ←
19: dtsssmqr (V[i,k],T[i,k], R[1st row,j],A[i,j]);
20: end for
21: end for
22: end for
23: . step 2: merge results from the D local domains
24: root domain ← bphys 1st row/dsc ;
25: for (hgt ← 1 to dlog2D − root domaine) do
26: d1 ← root domain; d2 ← d1+2hgt−1;
27: while (d2 < D) do;
28: i1 ← d1 × ds;
29: i2 ← d2 × ds;
30: if (d1 = root domain) then
31: i1 ← phys 1st row;
32: end if
33: merge two rows(A, i1, i2, pid, pid, k, B ,nb, P);
34: d1+=2hgt; d2+=2hgt;
35: end while
36: end for
cyclic way. Figure 4 shows how the root of a parallel reduction
tree may change dynamically. Suppose a matrix is distributed
to four processes (P0, P1, P2, P3), the four processes need
to participate in a global parallel reduction at every iteration.
Each parallel reduction tree has a different root and the root
changes in a round robin manner. If there are more than four
iterations, the same pattern will be repeated for multiple times.
An Example: We use Figure 5 to show a simple example
of the parallel suCAQR algorithm. In Figure 5. (a), the input
Algorithm 5 Merge Partial Results from Two Block Rows
merge two rows(A, i1, i2, p1, p2, k, B ,nb, P)
Input: i1 and i2 are the physical row indices
logic i1 ← physical 2 logical(i1, B, P, p1);
logic i2 ← physical 2 logical(i2, B, P, p2);
R[logic i1,k], V[logic i2,k], T[logic i2,k]
← dttqrt (R[logic i1,k],R[logic i2,k]);
for j ← k+1 to nb-1 do;
A[logic i1,j], A[logic i2,j] ← dttssmqr (V[logic i2,k],
T[logic i2,k],A[logic i1,j],A[logic i2,j]);
end for
P0 P2 P3P1 P0 P2 P3P1 P0 P2 P3P1 P0 P2 P3P1
4th iteration2nd iteration 3rd iteration1st iteration
Fig. 4. An example of the parallel reduction operation. Given a matrix
with four columns and four processes, suCAQR performs the global parallel
reduction operation four times (once per iteration). Each iteration corresponds
to a different parallel tree with a different root (e.g., P0 is the root (shown in
red) in 1st iteration, P1 is the root (shown in red) in 2nd iteration, etc.).
matrix has 12 × 4 tiles and is distributed to P0–P3 using
a block cyclic distribution method, where the group size B
equals one and the number of domains per process equals one.
Figure 5. (b) shows the physical data layout of the matrix,
where each process stores three rows from three separated
rows from subfigure (a). At the first iteration, in Figure 5. (b),
every process performs a local CAQR factorization. The local
factorization applies QR1 and QR2 to the first column, applies
UP1 to all tiles on the first row, and applies UP2 to all tiles on
the trailing submatrix. All processes are performing the same
computation in parallel but on different data.
After local factorizations are completed, the global binary-
tree reduction will start. Since the binary tree reduction is
based on a logical data layout, Figure 5. (c), (d), and (e)
use the logical data layout to display the matrix. After (b),
all processes have obtained their partial R factors which are
located on their first rows. Figure 5. (c) shows the status of the
binary tree at the bottom level (i.e., depth=2), where P0 and
P1 merge results meanwhile P2 and P3 merge results. Figure
5. (d) shows the second level of the binary tree (i.e., depth=1),
where P0 and P2 merge results and store the final result to the
root process P0.
As shown in Figure 5. (e), the second iteration will start with
a trailing submatrix with one less row and one less column.
In the second iteration, P1 becomes the root of the parallel
reduction tree, and the same steps will be repeated on the
smaller trailing submatrix.
IV. PARALLEL ALGORITHM ANALYSIS
This section analyzes the time complexity and communica-
tion cost of suCAQR. In the analysis, we assume the tile size
is b, and the matrix is of mb×nb tiles. There are P processes
and each process has a number of D domains. Note that our
analysis is more general than the analysis of distributed tiled
CAQR, which assumes mb >> nb.
A. Operation Count per Process
First, we compute the number of operations per process as
follows:
QR1 (dgeqrt): 2b3 operations per task. At each iteration,
every domain of the process has one QR1 task. Thus,
Tdgeqrt/proc =
nb∑
k=1
D × (1× 2b3) = 2b2nD.
UP1 (dormqr): 3b3 operations per task. At iteration k, every
domain contains (nb − k) UP1 tasks. Thus,
Tdormqr/proc =
nb∑
k=1
D × (nb − k)× 3b3 ≈ 3
2
n2bD.
QR2 (dtsqrt): 103 b
3 operations per task. At iteration k, every
domain has ((mb − k + 1− P )/P −D) QR2 tasks. Thus,
Tdtsqrt/proc =
nb∑
k=1
[(mb − k + 1− P )/P −D]× 10
3
b3
≈ 10
3P
b(mn− n
2
2
− PnbD).
UP2 (dtsssmqr): 4b3 + sb2 operations per task. s is used as
the size of inner tile in a b×b tile. At iteration k, each domain
has [(mb − k + 1− P )/P −D]× (nb − k) tasks. Thus,
Tdtsssmqr/proc =
nb∑
k=1
[(mb − k + 1− P )/P −D]
× (nb − k)× (4b3 + sb2)
≈ 4 +
s
b
P
(
n2m
2
− n
3
6
− n
2bP
2
D).
Merge (dttqrt): 53b
3 operations per task. At iteration k, each
process executes it for (logP +D − 1) times. Thus,
Tdttqrt/proc =
nb∑
k=1
(logP +D − 1)× (5
3
b3)
=
5
3
b2n(logP +D − 1).
R
R
R
R
R
R
QR2          UP2          P1
QR2          UP2          P1
QR2          UP2          P2
QR2          UP2          P2
QR2          UP2          P0
QR2          UP2          P0
QR2          UP2          P3
QR2          UP2          P3
QR1          UP1          P0
QR1          UP1          P1
QR1          UP1          P2
QR1          UP1          P3
P0
P0
P0
P0
P0
P0
P0
P0
P1
P1
P1 P1
P1
P1
P1
P1
P1
P2
P3
P2
P3
P2
P2
P3
P3
P2
P3
P2
P3
P2
P3
P2
P3
P2
P3
(a) (b) (c) (d) (e)
P0
P1
P2
P3
P0
P1
P2
P3
P0
P1
P2
P3
P0
Fig. 5. An example of the parallel suCAQR algorithm given a matrix with
12× 4 tiles and four processes. Different processes are denoted by different
colors. (a) Input matrix in a logical data layout. (b) Physical data layout and
corresponding local CAQR factorization in each process. (c) Merging in the
third level of the binary tree. (d) Merging in the second level of the binary
tree. (e) Matrix status at the beginning of the second iteration.
TABLE I
COMPARISON WITH OTHER QR FACTORIZATIONS FOR n× n MATRICES
Op Count (parallel) #Messages #Words
Lower bound Θ(n
3
P ) Θ(
√
P ) Θ( n
2√
P
)
Scalapack [4] 4n
3
3P
5n
4 log
2 P (n2 + bn) logP
distributed n
3
3P (6 +
s
2b )
tiled CAQR [9] +n
2b
4 (4 +
s
b ) logP (
n
b )
2 logP n2 logP
−n2b2 (3 + sb )
suCAQR n
3
3P (4 +
s
b )
+n
2b
4 (4 +
s
b ) logP (
n
b )
2 logP n2 logP
−n2b2 (3 + sb )
+(D − 1)n( b23 + sn)
MergeUpdate (dttssmqr): 12 (4b
3 + sb2) operations per
task. At iteration k, each process executes it for (logP +
D − 1)(nb − k) times. Thus,
Tdtsssmqr/proc =
nb∑
k=1
(logP +D − 1)(nb − k)× 1
2
(4b3 + sb2)
≈ n2(b+ s
4
)(logP +D − 1).
B. Communication Cost
We consider the process that has the maximum of messages.
Since our calculation is the same as that of distributed tiled
CAQR, we simply present the conclusion here: #messages =
log2(P )
n2
b2
, and #words = log2(P )n2.
C. Comparison with the Related Work
We compare suCAQR with the theoretical lower bound,
ScaLAPACK, and the distributed tiled CAQR with respect
to the operation count, the number of messages, and the
number of words communicated. As displayed in Table I,
suCAQR has less operations than distributed tiled CAQR if
its number of domains D equals one, but has more operations
than ScaLAPACK by a fraction of s4b . Note that s is typically
much less than b. Both distributed tiled CAQR and suCAQR
have more messages than ScaLAPACK. This is because they
are designed to send and receive small messages of tiles
to minimize a process’s CPU idle time by increasing the
degree of parallelism. As later shown in Section VI, our
implementation of asynchronous execution is able to hide the
communication latency by a great number of fine-grain tasks.
V. DECIDING OPTIMAL NUMBER OF DOMAINS
Algorithm 4 defines that each process can have D domains,
where D is an integer between one and mbP (D =
mb
P means
that every title row is a domain). Based on our experiments,
the number of domains has a significant impact on the program
performance. Figure 6 shows that a different D may lead to
totally different performance on a given matrix.
In order to determine an optimal number of domains without
a lot of tuning and searching efforts, we analyze the algorithm
and make the following observations:
1) Within one domain, the degree of task parallelism is
decided by the number of block columns nb. That is,
each domain can contain up to nb parallel tasks (e.g.,
there are nb − k parallel tasks in the k-th iteration).
2) When nb is much greater than the number of threads
per process (i.e., T ), using one domain can attain the
minimum number of operations, the maximum number
of higher-performance UP2 tasks, and thus the best
performance.
3) Whenever D increases, the degree of task parallelism
within each process increases by D times. However,
when D increases a lot, the number of the fastest UP2
tasks starts to decrease and the overall performance will
degrade instead.
Based on the analysis, we develop a novel model to deter-
mine the optimal number of domains. First, we make sure the
degree of task parallelism within a process must be greater
than the number of threads per process (T ). This relationship
can be expressed as follows:
(degreeparallel = nb ×D) ≥ c× T
=⇒ D ≥ dc× T
nb
e
We choose coefficient c = 4 based on our collected perfor-
mance data. As long as nb ≥ 4T , using one domain can
produce the best performance regardless of the matrix shape
being skinny or square. Also, when nb becomes less and
less, the formula allows D to increase correspondingly to
compensate for the diminishing degree of task parallelism.
When nb becomes too small (e.g., 1, 2, or 3 columns), we
set D = T to let T threads compute T domains in an
embarrassingly parallel way. Further creating more domains
than the number of threads per process will slow down the
thread performance due to increased operations, fewer fast
UP2 tasks, and an increment in the working set from irregular
cross-domain merges.
The complete model is provided as below, where c = 4.
D∗ =
{
d c×Tnb e if nb ≥ c
min{T, mbP } if nb < c
Figure 6. (a) shows the actual performance using different
number of domains, and the performance using the predicted
D∗ based on the analytical model. We execute 4 processes and
each process consists of 16 threads. The input matrix size goes
from 128 × 1 blocks to 128 × 128 blocks. For each matrix,
we try all the possible number of domains and collect their
performance. From the data, we can see that our model-based
prediction matches the actual best number of domains closely.
In the second experiment, we take the same matrix size of
128×16 blocks but change the number of threads per process
from one to 32. As shown in Figure 6. (b), the model-based
prediction again has a good match to the actual best number
of domains. In the following experiments displayed in Section
VII, we have also observed that the actual best number of
domains matches the analytical model.
D*=16
D*=16
D*=16
D*=8
D*=4
D*=2 D*=1 D*=1
0
50
100
150
200
250
300
350
400
1 2 4 32 64 128
G
fl
o
p
s
8 16
Number of block columns
D=1
D=2
D=4
D=8
D=16
D=32
our_prediction
(a) The number of block columns changes
D*=1
D*=1
D*=1
D*=2
D*=4
D*=8
0
100
200
300
400
500
1 2 4 8 16 32
G
fl
o
p
s
Number of threads per process
D=1
D=2
D=4
D=8
D=16
D=32
our_prediction
(b) The number of threads per process changes
Fig. 6. Using an analytical model to predict the optimal number of domains
D∗. In (a), each matrix has 128 block rows but different block columns. Each
process has 16 threads. In (b), the input matrix has 128 block rows and 16
block columns but uses a different number of threads.
VI. THE ASYNCHRONOUS DATAFLOW IMPLEMENTATION
The parallel algorithm of suCAQR has been implemented
with the TBLAS [11] runtime system. The TBLAS runtime
system is able to support distributed and dynamic DAG
scheduling using all CPU cores. It consists of three types of
threads: task-generation thread, compute thread, and commu-
nication thread. The task-generation thread executes a task-
based suCAQR program and generates fine-grain tasks to fill in
multiple priority-based task queues. Whenever becoming idle,
a compute thread picks up a ready task from the ready task
queue and executes it. The communication thread is responsi-
ble for sending and receiving data between a parent task and its
children to satisfy the data dependency requirement. The major
distinction of TBLAS is that it does not require representing
or constructing a task graph by users but dynamically solves
data dependencies among all distributed compute nodes.
With TBLAS, our implementation work is to write a sequen-
tial program to generate tasks that are needed in the suCAQR
algorithm. The rest of the parallel implementation is handled
by the TBLAS runtime system, which can automatically detect
data dependencies by simply matching a task’s output to
another task’s input on the fly, and schedule tasks to CPUs
dynamically. In the suCAQR program, we create six types of
tasks: QR1, QR2, UP1, UP2, Merge, and Merge Update. Each
type of task takes multiple tiles as input and writes new result
to the output tiles. To execute the suCAQR program, we launch
a number of MPI processes on different multicore compute
nodes. Every MPI process is managed by one TBLAS runtime
system. The runtime systems coordinate with each other to
QR1
UP1
UP2
QR2
Merge
Merge 
Update
P0:
P1:
P2:
Fig. 7. The corresponding DAG of the suCAQR algorithm computing a matrix
of 6 × 3 tiles with 3 processes. Six tile rows are distributed to the three
processes in a cyclic way such that each process has two rows of tiles.
solve data dependencies, dispatch tasks, and send the output
of a parent task to its children tasks which are waiting for
their input.
The execution of the task-based suCAQR program is driven
by data availability. When the task-based program is being
executed, the frontier portion of the DAG is unrolled dynam-
ically by the runtime system. The size of the active DAG is
controlled by a task window size. Each runtime system has its
own execution point, which follows the data-availability path
to reach a different place in the DAG. Figure 7 shows the
complete DAG of suCAQR which solves a matrix of 6×3 tiles.
The task graph is executed by three processes, each of which
stores two rows of tiles. From the figure, we can see that three
processes execute from the left to the right and communicate
with each other only when it is necessary. Note that there is no
global synchronization point in the execution, and no artificial
order between tasks except for the real data dependencies.
VII. EXPERIMENTAL RESULTS
We first evaluate the impact of tile size on suCAQR’s
computational kernels, and compare suCAQR with the dis-
tributed tiled CAQR to show the performance improvement.
Next, we present the breakdown of the total execution time
to analyze the overhead of different components in our imple-
mentation. Finally, we compare suCAQR with ScaLAPACK
and DPLASMA [12] in weak scalability. For DPLASMA, we
have tuned and selected the best tree type and the best D.
Both suCAQR and DPLASMA use the tuned tile size b=300.
We conducted experiments on the Big Red II Cray
XE6/XK7 supercomputer [13] at Indiana University. Table
II shows the specifications of the computer system. Each
compute node has 32 cores and 64 GB of memory, and runs a
Cray Linux OS. Since the mathematics library of Cray LibSci
provides a faster performance than Intel MKL (see Figure 8),
we use Cray LibSci to run the experiments.
A. Impact of Tile Size on Sequential Computational Kernels
In our implementation, the tile size plays an important role
in the overall performance. To search for the best tile size, we
test a number of tile sizes on the sequential computational
05
10
15
20
8
5
6
1
0
4
1
5
2
2
0
0
2
4
8
2
9
6
3
4
4
3
9
2
4
4
0
4
8
8
5
3
6
5
8
4
6
3
2
6
8
0
7
2
8
7
7
6
8
2
4
8
7
2
9
2
0
9
6
8
G
fl
o
p
s
Tile Size
DGEMM (MKL) DGEMM(libsci)
UP2(MKL) UP2(libsci)
Merge Update(MKL) Merge Update(libsci)
Fig. 8. Performance of sequential suCAQR kernels with different tile sizes
using Intel MKL and Cray LibSci. DGEMM implies a practical upper bound.
kernels. Figure 8 shows the performance of the UP2 and
MergeUpdate kernels in suCAQR, as well as DGEMM pro-
vided by Cray LibSci and Intel MKL. DGEMM demonstrates
the practical maximum performance on the computer system.
From the figure, we can see that the performance of the kernels
increases as the tile size becomes larger. In general, we want
to have a bigger tile size to maximize the kernel performance.
However, a bigger tile size results in less number of tasks
and degraded load balance. Hence, a good tile size is often
a number between 200 and 400. In our implementation, we
choose 300 as the tile size.
The other important observation is that the kernel of UP2
is much faster than the kernel of MergeUpdate. As the
number of domains per process (D) increases, the number
of MergeUpdate tasks will increase. As a result, the overall
performance will drop. This is one of the reasons that we do
not want D to be large unless the degree of parallelism is too
small.
B. Comparison with Previous Distributed Tiled CAQR
We compare suCAQR with distributed tiled CAQR to evalu-
ate the performance gain of our new solver. In the experiment,
we use 512 cores and fix the number of rows for the input
TABLE II
SPECIFICATIONS OF THE EXPERIMENTS
OS Cray Linux Environment
Nodes 1,020 (a maximum of 2048 cores per job)
Memory per node 64 GB
Processors per node 2
Cores per processor 16
FPU info two cores share one FPU
Processor AMD Opteron 6380 2.5GHz
Peak performance/core 13.6 Gflops (with AMD turbo core)
MPI Cray-MPICH 7.2.5
suCAQR linked with Cray LibSci 13.2
ScaLAPACK Cray LibSci 13.2
DPLASMA’s HQR PaRSEC / DPLASMA 2.0.0, PLASMA 2.7.1,
linked with Cray LibSci 13.2,
Tuned and selected the best tree
and D (between 1 and 4), tile size b=300.
0
100
200
300
400
500
600
700
800
900 1800 3600 7200 14400 28800 57600
G
fl
o
p
s
Number of columns
Distributed tiled CAQR
suCAQR
Fig. 9. Comparison of suCAQR with distributed tiled CAQR. All the input
has a fixed number of 57,600 rows but a varying number of columns.
matrices. Each matrix is of dimension 57,600 × n, where n is
increased from 900 to 57,600. Figure 9 shows the performance
of distributed tiled CAQR and suCAQR. When n is greater than
or equal to 14,400 (i.e., #columns#rows =
1
4 ,
1
2 , and 1), suCAQR
starts to be faster than distributed tiled CAQR.
The reason can be illustrated by a simple example shown in
Figure 10. The example shows how the tiled CAQR algorithm
and the suCAQR algorithm would use four processes to factor
an 8 × 5 matrix, respectively. In Figure 10. (a), P0 becomes
idle after two iterations as shown in (2), then P1 also becomes
idle after another two iterations as shown in (3). Note that half
of the processes have no work to do in (3). In Figure 10. (b),
P0-P3 are kept busy all the time from the first iteration to the
last iteration.
C. Runtime Efficiency Analysis
To analyze the runtime efficiency, we measure the time
spent on each component of the suCAQR implementation.
As presented in Section VI, our implementation uses one
generation thread to i) create new tasks, one communication
thread to ii) send/receive messages, and a number of compute
threads to iii) get ready tasks, iv) execute the ready tasks, and
v) fire new tasks that are waiting for the completed tasks. We
P1
R
RR R R R
RR R R
(1) (3)(2)
P1
P2
P2
P3
P3
P0
P0
P1
P1
P2
P2
P3
P3
P2
P2
P3
P3
RR R
RR
P1
P1
P0
P0
P0
P0
RR R R R
RR R R
(a) Distributed tiled CAQRR
RR R R R
RR R R
(1) (3)(2)
P2
P3
P0
P1
P2
P3
P0
P1
P2
P3
P0
P1
P2
P3
P0
P1
P2
P3
RR R
RR
P0
P1
P2
P3
P0
P1
RR R R R
RR R R
(b) The new suCAQR
Fig. 10. (a) shows that more and more processes become idle as distributed
tiled CAQR proceeds. (b) shows that there is no idle process in the new
suCAQR algorithm. Black color represents the completed result.
0% 20% 40% 60% 80% 100%
Component time / Total time
Generate time
Communicate time
Get task time
Fire time
Compute time
Total time
(a) extremely tall and
skinny
0% 20% 40% 60% 80% 100%
Component time / Total time
Generate time
Communicate time
Get task time
Fire time
Compute time
Total time
(b) #columns#rows =
1
16
0% 20% 40% 60% 80% 100%
Component time / Total time
Generate time
Communicate time
Get task time
Total time
Compute time
Fire time
(c) #columns#rows =
1
4
0% 20% 40% 60% 80% 100%
Component time / Total time
Generate time
Get task time
Communicate time
Fire time
Compute time
Total time
(d) square
Fig. 11. Breakdown of the total execution time. The total wall clock time is
nearly equal to the compute time taken by the computational kernels.
instrument the source code to measure the time spent on each
of the five components.
Figure 11 lists the breakdown of the total execution time
using 512 cores. We consider four different matrix shapes
for which the ratio of the number of columns to the number
of rows is: extremely tall and skinny, 116 ,
1
4 , and square,
respectively. As shown in Figure 11, the percentages of the
generate-task time, get-task time, and fire-task time are small
from (a) to (d). Although the percentage of the communication
time increases, it is hidden by a large number of dynamic
scheduling computational tasks. For different matrices, the
percentage of the compute time relative to the total clock time
is 92.7%, 98.7%, 98.7%, and 98.9%, respectively. It implies
that the wall clock time is almost equal to the CPU time that
is taken by the computational kernels. Also, all the other costs
including runtime overhead and communication only occupy at
most 7% of the total time. Therefore, this result demonstrates
the good efficiency of the suCAQR solver.
D. Performance Evaluation in Weak Scalability
We use weak scalability to measure the capability of su-
CAQR to solve larger problems if a user has access to more
CPUs. In the experiments, whenever we double the number
of cores, we also double the total amount of computation
accordingly. The number of matrix elements per core is thus
kept as a constant in each experiment.
Figure 12 displays the metric of Gflops-per-core for four
different matrix shapes using up to 1,024 cores. From all four
subfigures, we can see that suCAQR maintains good scalability
with a nearly constant Gflops-per-core except for the 40% drop
from one core to two cores. The reason for the performance
drop is that two AMD CPU cores are sharing one floating
point unit (FPU) by using the dual stream mode.
As shown in Figure 12. (a), when the matrix is extremely
tall and skinny (having only 6 block columns but many block
0
2
4
6
8
10
12
1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2
1
0
2
4
G
fl
o
p
s
 p
e
r 
c
o
re
#cores
suCAQR 
DPLASMA
ScaLAPACK
(a) extremely tall and
skinny
0
2
4
6
8
10
12
1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2
1
0
2
4
G
fl
o
p
s
p
e
r 
c
o
re
#cores
suCAQR
DPLASMA 
ScaLAPACK
(b) #columns#rows =
1
16
0
2
4
6
8
10
12
1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2
1
0
2
4
G
fl
o
p
s
p
e
r 
c
o
re
#cores
suCAQR
DPLASMA 
ScaLAPACK
(c) #columns#rows =
1
4
0
2
4
6
8
10
12
1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2
1
0
2
4
G
fl
o
p
s
p
e
r 
c
o
re
#cores
suCAQR
DPLASMA 
ScaLAPACK
(d) square
Fig. 12. Experiments of weak scalability on 1024 cores.
rows), suCAQR can outperform ScaLAPACK by 30 times
and outperform DPLASMA by 13% on 1024 cores. For the
matrix shape of 116 , suCAQR outperforms ScaLAPACK by
78% and outperforms DPLASMA by 23%. For the 14 shape,
suCAQR outperforms ScaLAPACK by 25% and outperforms
DPLASMA by 30%. When the matrix is purely square (in
Figure 12. (d)), suCAQR is as good as ScaLAPACK, and is
6% better than DPLASMA.
We can also see that the performance of ScaLAPACK keeps
rising from tall and skinny matrices to square matrices. This is
because a wider matrix will produce a larger trailing submatrix
which is more abundant with the vendor-optimized DGEMM
kernel which is faster than suCAQR’s kernels (see Figure 8).
Another observation is that the performance of ScaLAPACK
fluctuates a little bit when its 2-D process grid P ×Q is not
in a square shape (i.e., P 6= Q).
Reasons for the Performance Improvement: The suCAQR
is faster than DPLASMA due to the following reasons. First,
we minimize the number of domains to maximize the number
of invocations of the fastest kernel by using the analytical
model of D∗. Also, less domains can result in fewer floating
point operations (refer to Table I). Second, the single-domain’s
consecutive row-by-row update has a smaller working set size
and better locality than multiple domains which work on
multiple different rows. Third, the TBLAS runtime we use
is efficient in hiding communication by computation such that
the wall clock time is simply equal to the pure computation
time, which is almost an ideal case. Furthermore, we modify
DPLASMA to use the same binary tree and same parameters
as suCAQR. The profiling results show that suCAQR is still
faster than DPLASMA owing to TBLAS’s higher data reuse
between tasks and the batched message passing.
VIII. RELATED WORK
Different communication-avoiding algorithms have been
applied to a variety of applications. Yelick et al. developed
a communication avoiding GMRES solver on shared-memory
multicore computers [14]. Grigori et al. designed the com-
munication optimal LU factorization (CALU) algorithm [15].
Khabou et al. also designed a communication avoiding LU
factorization version with panel rank revealing pivoting [16].
Ballard et al. created a communication avoiding algorithm for
Strassen’s matrix multiplication [17]. In this paper, we provide
a faster parallel implementation of the CAQR algorithm with
a simpler design, and is able to handle matrices of any shapes.
There are also other implementations of communication
avoiding QR factorization that target different computing sys-
tems. Anderson et al. implemented the communication avoid-
ing QR factorization entirely on a single GPU to achieve high
performance [18]. Agullo et al. [19] developed a topology-
aware approach that can adapt to a grid environment to com-
pute the CAQR factorization by confining communications to
local sites.
DPLASMA’s HQR [10] studies different tree options such
as flat tree, greedy tree, fibonacci tree, binary tree, and so
on. DPLASMA also has an optional fourth level tree called
domain-level tree. But it is not known when to include and
how to optimize the fourth domain-level tree. Also, what type
of tree is the best is totally dependent on the result of trying
all the different trees at four different levels given a specific
input and a certain number of CPUs. Differently, our work
reveals that these types of complexities can be avoided without
any loss of performance. In addition, we find that the optimal
number of domains per process is only related to the number
of block columns in the matrix and the number of CPU cores
per compute node. We build an accurate analytical model to
calculate the optimal number of domains conveniently. This
knowledge is new and has not been discovered by other work.
IX. CONCLUSION
Communication-avoiding QR factorization is a generic par-
allel algorithm and potentially has many different approaches
to implementing and optimizing it. In this paper, we target the
distributed multicore computer architecture and give a simple,
efficient, and scalable design to handle various matrix shapes.
From the perspective of the algorithm, suCAQR uses a
logical block cyclic data layout on which a dynamic-root
parallel tree reduction method is deployed. It also uses a much
simplified parallel reduction method to leverage the specific
architectural strength of a cluster of multicore compute nodes.
It is for the first time to provide an insight into determining
the important factor of the number of domains without any
tuning and searching. From the perspective of the software
design, a simple unique binary tree can significantly simplify
the software implementation and reduce the cost of parameter
tuning. From the perspective of the implementation, suCAQR
generates a large number of fine-grain tasks to achieve a high
degree of parallelism, and allows for a fully dynamic execution
to completely hide communication by computation using the
dynamic scheduling TBLAS runtime. The experimental results
have shown that suCAQR can perform better than the state-of-
the-art QR factorization libraries on different matrices using
up to 1,024 cores. The runtime efficiency analysis has also
revealed a near optimal circumstance (i.e., wall clock time
equals computation time) for efficient parallel executions.
REFERENCES
[1] E. Anderson, Z. Bai, and J. Dongarra, “Generalized QR factorization
and its applications,” Linear Algebra and its Applications, vol. 162, pp.
243–271, 1992.
[2] A. Bjo¨rck, Numerical methods for least squares problems. SIAM, 1996.
[3] J. W. Demmel, Applied numerical linear algebra. SIAM, 1997.
[4] L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel,
I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet et al.,
ScaLAPACK users’ guide. SIAM, 1997, vol. 4.
[5] J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet,
K. Stanley, D. Walker, and R. C. Whaley, “ScaLAPACK: A portable
linear algebra library for distributed memory computers: Design issues
and performance,” in Applied Parallel Computing Computations in
Physics, Chemistry and Engineering Science. Springer, 1996, pp. 95–
106.
[6] J. W. Demmel, L. Grigori, M. F. Hoemmen, and J. Langou,
“Communication-optimal parallel and sequential QR and LU factoriza-
tions,” UTK, LAPACK Working Note 204, August 2008.
[7] G. Ballard, J. Demmel, L. Grigori, M. Jacquelin, N. Knight, and
H. Nguyen, “Reconstructing householder vectors from tall-skinny QR,”
Journal of Parallel and Distributed Computing, vol. 85, pp. 3 – 31,
2015, IPDPS 2014 Selected Papers on Numerical and Combinatorial
Algorithms.
[8] M. Hoemmen, “A communication-avoiding, hybrid-parallel, rank-
revealing orthogonalization method,” in 2011 IEEE International Paral-
lel Distributed Processing Symposium (IPDPS), May 2011, pp. 966–977.
[9] F. Song, H. Ltaief, B. Hadri, and J. Dongarra, “Scalable tile
communication-avoiding QR factorization on multicore cluster sys-
tems,” in Proceedings of the 2010 ACM/IEEE International Conference
for High Performance Computing, Networking, Storage and Analysis
(SC’10), 2010, pp. 1–11.
[10] J. Dongarra, M. Faverge, T. Herault, M. Jacquelin, J. Langou, and
Y. Robert, “Hierarchical QR factorization algorithms for multi-core
clusters,” Parallel Computing, vol. 39, no. 4, pp. 212–232, 2013.
[11] F. Song, A. YarKhan, and J. Dongarra, “Dynamic task scheduling for
linear algebra algorithms on distributed-memory multicore systems,” in
Proceedings of the 2009 ACM/IEEE International Conference for High
Performance Computing, Networking, Storage and Analysis (SC’09).
IEEE, 2009, pp. 1–11.
[12] PaRSEC, “http://icl.utk.edu/parsec.”
[13] BigRedII, “http://rt.uits.iu.edu/bigred2.”
[14] M. Mohiyuddin, M. Hoemmen, J. Demmel, and K. Yelick, “Minimiz-
ing communication in sparse matrix solvers,” in Proceedings of the
Conference on High Performance Computing Networking, Storage and
Analysis. ACM, 2009, p. 36.
[15] L. Grigori, J. W. Demmel, and H. Xiang, “CALU: A communication
optimal LU factorization algorithm,” SIAM Journal on Matrix Analysis
and Applications, vol. 32, no. 4, pp. 1317–1350, 2011.
[16] A. Khabou, J. W. Demmel, L. Grigori, and M. Gu, “LU factorization
with panel rank revealing pivoting and its communication avoiding
version,” SIAM Journal on Matrix Analysis and Applications, vol. 34,
no. 3, pp. 1401–1429, 2013.
[17] G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz,
“Communication-optimal parallel algorithm for Strassen’s matrix multi-
plication,” in Proceedings of the twenty-fourth annual ACM symposium
on Parallelism in algorithms and architectures, 2012, pp. 193–204.
[18] M. Anderson, G. Ballard, J. Demmel, and K. Keutzer, “Communication-
avoiding QR decomposition for GPUs,” in 2011 IEEE International
Parallel & Distributed Processing Symposium (IPDPS), 2011, pp. 48–
58.
[19] E. Agullo, C. Coti, J. Dongarra, T. Herault, and J. Langou, “QR factor-
ization of tall and skinny matrices in a grid computing environment,” in
24th IEEE International Parallel and Distributed Processing Symposium
(IPDPS 2010). IEEE, 2010.
