A novel method for scaling iterative solvers: avoiding latency overhead of parallel sparse-matrix vector multiplies by Selvitopi, R. O. et al.
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
1
A Novel Method for Scaling Iterative Solvers:
Avoiding Latency Overhead of
Parallel Sparse-Matrix Vector Multiplies
R. Oguz Selvitopi, Mustafa Ozdal, and Cevdet Aykanat
Abstract—In parallel linear iterative solvers, sparse matrix vector multiplication (SpMxV) incurs irregular point-to-point (P2P)
communications, whereas inner product computations incur regular collective communications. These P2P communications cause an
additional synchronization point with relatively high message latency costs due to small message sizes. In these solvers, each SpMxV
is usually followed by an inner product computation that involves the output vector of SpMxV. Here, we exploit this property to propose a
novel parallelization method that avoids the latency costs and synchronization overhead of P2P communications. Our method involves
a computational and a communication rearrangement scheme. The computational rearrangement provides an alternative method for
forming input vector of SpMxV and allows P2P and collective communications to be performed in a single phase. The communication
rearrangement realizes this opportunity by embedding P2P communications into global collective communication operations. The
proposed method grants a certain value on the maximum number of messages communicated regardless of the sparsity pattern of the
matrix. The downside, however, is the increased message volume and the negligible redundant computation. We favor reducing the
message latency costs at the expense of increasing message volume. Yet, we propose two iterative-improvement-based heuristics to
alleviate the increase in the volume through one-to-one task-to-processor mapping. Our experiments on two supercomputers, Cray XE6
and IBM BlueGene/Q, up to 2048 processors show that the proposed parallelization method exhibits superior scalable performance
compared to the conventional parallelization method.
Index Terms—Parallel linear iterative solvers, sparse matrix vector multiplication, point-to-point communication, inner product
computation, collective communication, message latency overhead, avoiding latency, hiding latency, iterative improvement heuristic.
F
1 INTRODUCTION
Iterative solvers are the defacto standard for solving
large, sparse, linear systems of equations on large-scale
parallel architectures. In these solvers, two basic types
of operations are repeatedly performed at each iteration:
sparse matrix vector multiply (SpMxV) of the form q =
Ap and linear vector operations. Linear vector operations
can further be categorized as inner product and DAXPY-
like operations.
In the parallelization of these iterative solvers, linear
vector operations are regular in nature since they operate
on dense vectors and hence, they are easy to parallelize.
On the other hand, SpMxV in general constitutes the
most time consuming operation and it is hard to paral-
lelize due to irregular task-to-task interaction caused by
the irregular sparsity pattern of the coefficient matrix.
Thus, the parallelization of iterative solvers are usually
carried out by performing intelligent partitioning of
matrix A that balances computational loads of the pro-
• R. Oguz Selvitopi and Cevdet Aykanat are with the Department of
Computer Engineering, Bilkent University, Turkey, 06800.
{reha, aykanat}@cs.bilkent.edu.tr
• Muhammet Mustafa Ozdal is with the Strategic CAD Labs of Intel
Corporation, Hillsboro, OR 97124 US.
mustafa.ozdal@intel.com
This work was financially supported by the PRACE-2IP project funded in
part by the EUs 7th Framework Programme (FP7/2007-2013) under grant
agreement RI-283493 and FP7-261557.
cessors while minimizing the communication overhead
that occurs during parallel SpMxV operations. Several
sparse-matrix partitioning models and methods [2], [4],
[14], [27], [28], [30] have been proposed and used in
conjunction with respective parallel SpMxV algorithms.
The matrix partitions obtained by using these models
and methods are also decoded as partitioning linear
vector operations among processors.
With the above-mentioned partitioning and paral-
lelization schemes, parallel SpMxV computations incur
irregular point-to-point (P2P) communication, and in-
ner product operations incur regular global collective
communication, whereas DAXPY-like linear vector op-
erations do not incur any communication. Hence, both
SpMxV and inner product computations cause separate
synchronization points in the parallel solver. In general,
the matrix partitioning schemes proposed and utilized
in the literature mainly aim at minimizing the total com-
munication volume, and this loosely relates to reducing
the total message latency. However, on the current large-
scale high performance computing systems, the mes-
sage latency overhead is also a crucial factor affecting
the performance of the parallel algorithm. Our analy-
sis on two such well-known large scale systems, IBM
BlueGene/Q and Cray XE6, shows that single message
latency (i.e., startup time) is as high as transmitting
four-to-eight kilobytes of data. Specifically, the message
latency overhead caused by the processor that handles
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
2
the maximum number of messages becomes the deciding
factor for scaling the parallel algorithm. For example,
in a row-parallel SpMxV algorithm [28] that utilizes 1D
rowwise matrix partitioning, a dense column in matrix
A necessitates a processor to send a message to almost
all other processors, which significantly degrades the
overall performance due to high latency overhead.
The motivation of this work is based on our obser-
vation that each SpMxV computation is followed by
an inner product computation that involves the output
vector of SpMxV in most of the Krylov subspace meth-
ods, which are among the most important iterative tech-
niques available for solving large-scale linear systems.
These two successive computational phases performed
at each iteration contain write/read dependency due to
the use of the output vector of the SpMxV computa-
tion with the following inner product computation. This
in turn incurs dependency in the communications in-
volved in these two successive phases. This observation
is directly applicable to the following Krylov subspace
methods: basic Arnoldi Method and its variants, basic
GMRES, the Lanczos Algorithm, Conjugate Gradient,
Conjugate Residual Method, Biconjugate Gradient, Bi-
conjugate Gradient Stabilized, CGNR and CGNE. The
reader is referred to [26] for analyzing computational
dependencies in these Krylov subspace methods.
In this work, we exploit the above-mentioned property
of the iterative solvers to propose a novel parallelization
method that contains a computational and communi-
cation rearrangement scheme. The computational rear-
rangement resolves the computational write/read de-
pendency between two successive computational phases
so that the respective communication dependency can
also vanish. This in turn enables combining P2P com-
munications of SpMxV computations with the collective
communications of inner products into a single com-
munication phase. In other words, the computational
rearrangement paves the way for communication rear-
rangement, which is realized with embedding P2P com-
munication into collective communication operations.
Although invaluable in reducing the overhead due
to the synchronization points, the proposed computa-
tional rearrangement causes redundant computations in
DAXPY-like operations. However, we do not alter the
computational structure of the iterative solver, thus, our
method does not cause any numerical instability. In
addition, the redundant computations are confined to
communicated vector elements. Hence, the objective of
minimizing total communication volume utilized in the
existing intelligent partitioning methods also minimizes
the total redundant computation.
The communication rearrangement is achieved by em-
bedding vector elements communicated via P2P commu-
nication into global collective communication on scalars
of local inner product results. This approach completely
eliminates the message latency costs associated with
the P2P communications and reduces the average and
maximum number of messages handled by a single
TABLE 1: Notation used throughout the paper.
Notation Description
Ax=b Sparse linear system being solved.
p Input vector.
q Output vector.
r Residual vector.
pi, κ, α, β, ρ Scalars.
Pk k
th processor.
Ak Rowwise portion of the matrix owned by Pk .
pk(qk, rk,xk) Portion of the vector p (q, r, x) owned by Pk .
p̂k(q̂k) Augmented p-vector (q-vector) owned by Pk .
pik, κk, ρk The partial scalars computed by Pk .
nk Number of matrix row blocks or
vector entries owned by Pk .
n̂k Size of the augmented vectors owned by Pk .
〈p,q〉 Inner product of vectors p and q.
ALL-REDUCE Collective reduction operation.
SendSet(Pk) Set of processors Pk will send vector entries to.
RecvSet(Pk) Set of processors Pk will receive vector entries from.
pk→l, qk→l Set of vector entries sent from Pk to Pl.
processor (both sent and received) to lgK for a system
with K processors (K being a power of 2) regardless
of the matrix used in the parallel solver. However, this
embedding scheme causes extra communication due to
forwarding of certain vector elements. We favor reducing
the message latency costs at the expense of increasing
message volume, which is invaluable for the scalability
of the parallel algorithm, especially on systems with high
message startup costs.
To address the increase in message volume, we pro-
pose two iterative-improvement-based algorithms. The
main motivation of both algorithms is to keep the pro-
cessors that communicate high volume of data close
to each other in terms of communication pattern of
collective operations so that the communicated vector
elements cause less forwarding. The heuristics differ in
their search space definitions. The first heuristic utilizes
full space while the second one restricts it by considering
only the directly communicating processors in collective
communication operations. We show that the restricted
space algorithm is feasible, and on the average, its
running time remains lower than the partitioning time
up to 2048 processors.
We show the validity of the proposed method on
Conjugate Gradient (CG) algorithm, which is one of the
best known iterative techniques used for solving sparse
symmetric positive definite linear systems. Row-parallel
SpMxV is adopted for the parallelization of CG, and
column-net hypergraph model is used for intelligent
partitioning of the sparse matrix [4]. We tested our
parallelization method on two well-known large-scale
systems Cray XE6 and IBM BlueGene/Q up to 2048
processors, comparing it to the conventional paralleliza-
tion of CG using 16 symmetric matrices selected from
University of Florida Sparse Matrix Collection [8]. The
results on these two architectures show that reducing
message latencies is critical for scalable performance, as
our method obtains much better speedup results.
The rest of the paper is organized as follows. Section 2
presents the necessary background and the literature
survey. In Section 3, our computational rearrangement
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
3
Algorithm 1: Basic Conjugate Gradient.
B Choose an initial x vector.
B Let r = p = b−Ax and compute ρ = 〈r, r〉.
while ρ >  do1
q = Ap2
pi = 〈p,q〉3
α = ρ/pi4
x = x+ αp5
r = r− αq6
ρnew = 〈r, r〉7
β = ρnew/ρ8
ρ = ρnew9
p = r+ βp10
scheme as well as the conventional parallelization of
CG algorithm are presented. Section 4 explains the
communication rearrangement scheme where message
embedding is accomplished. Two heuristics for reducing
extra communication volume are given in Section 5. Sec-
tion 6 presents the experimental results. The Appendix
is provided in the supplementary files.
2 BACKGROUND AND RELATED WORK
Algorithm 1 displays the basic CG method [21], [26] used
for solving Ax = b, where A is an n× n symmetric pos-
itive definite sparse matrix. The algorithm contains one
SpMxV computation (line 2), two inner product compu-
tations (lines 3 and 7) and three DAXPY operations (lines
5, 6 and 10). The input vector p of the SpMxV at the
subsequent iteration is obtained from the output vector
q of the SpMxV of the current iteration through DAXPY
operations (lines 6 and 10). Furthermore, the inner prod-
uct at line 3 involves both the input and the output
vector of the SpMxV operation. So, for parallelization,
in order to avoid the communication of vector entries
during linear vector operations, a symmetric partitioning
scheme is usually adopted [4], [29], where all vectors in
the solver are divided conformally with the partitioning
of the sparse matrix. As seen in Algorithm 1, the SpMxV
(line 2) and the two inner product computations (lines
3 and 7) are mutually interdependent. Hence there are
three synchronization points: one due to the P2P com-
munications of the SpMxV operation, and two separate
collective communications for reducing the results of the
local inner product computations at all processors.
The studies that address communication requirements
of parallel CG usually adopt one or a combination of the
approaches below:
• Reducing P2P communication overhead of parallel
SpMxV with alternative partitioning strategies;
• Addressing communication requirements of inner
products by utilizing alternative collective routines;
• Overlapping communication and computation;
• Reformulating CG to reduce the communication
overhead of collective communication operations.
There are many works [4], [5], [15], [16], [20], [27], [28],
[30] addressing communication requirements of paral-
lel SpMxV operations. These studies generally center
around sophisticated combinatorial models and intelli-
gent partitioning methods which try to reduce the com-
munication overhead of SpMxV operations and achieve
scalability. Graph and hypergraph models are commonly
employed in these works. The partitioning methods
utilized in these works usually fall under the category
of 1D and/or 2D sparse matrix partitioning.
In [10], authors argue that the communication over-
head of inner products in CG and GMRES(m) become
more significant and affect the scalability negatively
with increasing number of processors. To this end, they
suggest various methods to reduce this overhead. For
CG, they restructure the parallel algorithm to overlap
computations with inner product communications with-
out affecting numerical stability of the iterative solver.
A thorough performance and scalability analysis of
parallel CG is given in [13] on a variety of parallel
architectures. Authors study block-tridiagonal and un-
structured sparse matrices and analyze the effects of
using a diagonal and a truncated Incomplete Cholesky
preconditioner. They conclude that intelligent partition-
ing techniques are mandatory for scaling unstructured
sparse matrices to improve the efficiency of parallel CG.
The work in [17] uses non-blocking collectives to
reduce the communication requirements of parallel CG
and aims at overlapping communication and computa-
tion by avoiding unnecessary synchronization. Note that
although non-blocking interfaces of collective operations
are included in MPI-3 standard, they are not realized in
the widely adopted MPI-2 standard.
A recent work [11], [12] based on a reformulation of
CG described in [7] propose a pipelined CG where the
latency of global reduction is hidden by overlapping
it with the computations of SpMxV or preconditioner.
The authors use a single reduction in an iteration of
the CG. They conduct extensive experiments to measure
the stability of pipelined CG and test their method on a
medium-scale cluster. The experimental results indicate
that their method achieves better scalability while ob-
taining comparable convergence rates with the standard
CG for the tested matrices. This work differs from our
work in the sense that we aim at hiding latency of the
communication due to the SpMxV computations rather
than the latency of the global reduction operation.
Several other works [1], [9], [22], [24], [25] suggest
a reformulation of the CG method in which the two
distinct inner product computations can be performed in
successive steps. This enables reducing results of inner
product computations with a single global collective
communication phase in a possible parallel implementa-
tion, reducing synchronization overheads. Usually, fur-
ther experimental evaluations are performed for testing
stability of these reformulations.
In this study, we use one of these reformulated ver-
sions [1], [24] for parallelization, which we present in
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
4
Algorithm 2: Reformulated Conjugate Gradient.
B Choose an initial x vector.
B Let r = p = b−Ax and compute ρ = 〈r, r〉.
while ρ >  do1
q = Ap2
pi = 〈p,q〉3
κ = 〈q,q〉4
α = ρ/pi5
β = α · κ/pi − 16
ρ = β · ρ7
x = x+ αp8
r = r− αq9
p = r+ βp10
Algorithm 2. In contrast to the basic CG algorithm, the
inner products in the reformulated version at lines 3
and 4 are independent. Thus, the results of the two
local inner products can be reduced in a single collective
communication phase. However, both of these indepen-
dent inner product computations still depend on the
output vector of the SpMxV computation. Observe that
the property that SpMxV is followed by inner product
computation(s) holds both in the basic and the refor-
mulated CG algorithms given in Algorithms 1 and 2,
respectively. In fact, this property also holds in other
reformulated versions [9], [22] as well. In the rest of
the paper, we focus on this reformulated version, and
whenever we mention the CG algorithm, we will be
referring to Algorithm 2. It is reported in [24] that in this
variant of CG, β can become negative due to rounding
error. So, β should be checked at each iteration and in
the case it is negative, it should be computed again using
the classical formulation.
3 COMPUTATIONAL REARRANGEMENT
This section presents two parallelization methods for
CG, both of which utilize row-parallel SpMxV. The first
one is the conventional parallelization widely adopted
in the literature, where a single iteration consists of two
synchronization points. The other one is the proposed
parallelization with computational rearrangement. The
computational rearrangement provides an opportunity
to perform P2P and collective communications in a
single communication phase, and reduces the number
of synchronization points from two to one. We opted to
explain the conventional parallelization in this section
to facilitate the presentation of the computational rear-
rangement and to make our contribution more clear and
distinctive through direct comparisons.
In the parallel algorithms presented in this section
(Algorithms 3 and 4), a subscript k denotes a local
submatrix or subvector maintained or computed by
processor Pk, whereas a superscript k denotes the result
of a local inner product performed by Pk. A variable
without a subscript or a superscript denotes a local copy
of a global scalar.
3.1 Conventional Parallelization
The row-parallel SpMxV algorithm is based on a given
1D rowwise partition of n × n sparse matrix A of the
form:
A =
[
AT1 . . . A
T
k . . . A
T
K
]T
,
where row stripe Ak is an nk×n matrix for k=1, . . . ,K.
Processor Pk stores row stripe Ak and is held respon-
sible for computing qk = Akp according to the owner
computes rule [19]. The row-parallel algorithm requires
a pre-communication phase in which p-vector entries are
communicated through P2P messages to be used in the
following local SpMxV operations. This communication
phase contains expand-like operations, where individual
p-vector entries are multicast to the processor(s) that
need them. More details about the row-parallel algo-
rithm can be found in [4], [27], [28].
Algorithm 3 presents the conventional parallelization
of the CG method. Note the two distinct communication
phases which are illustrated as the highlighted regions:
P2P communication (lines 2-5) and collective commu-
nication (line 9). At the beginning of each iteration,
processors perform the P2P communications (lines 2-
5) necessary for local SpMxV operations. The sets of
processors which Pk needs to send and receive vector
entries are denoted by SendSet(Pk) and RecvSet(Pk), re-
spectively. Note that SendSet(Pk)=RecvSet(Pk) since A
is symmetric. Pk needs to receive the entries in pl→k from
each Pl ∈ RecvSet(Pk) (RECV(Pl,pl→k)) and send the en-
tries in pk→l to each Pl ∈ SendSet(Pk) (SEND(Pl,pk→l)).
Here, pl→k denotes the set of p-vector entries that are
received by Pk from Pl. After Pk receives all necessary
non-local p-vector entries, it forms its augmented p vector,
which is denoted as p̂k and contains n̂k ≥ nk elements.
After P2P communications, each processor Pk per-
forms its local SpMxV qk = Akp̂k (line 6). Then, Pk
computes the local inner products pik = 〈pk, qk〉 and
κk = 〈qk, qk〉 (lines 7 and 8). Since all processors need a
copy of the global scalars α and β for the local DAXPY
operations, they all need to know the final inner-product
results pi and κ, computed from local inner-product
results as pi =
∑K
k=1 pi
k and κ =
∑K
k=1 κ
k. For this
purpose, a global reduction (ALL-REDUCE) is performed to
compute pi and κ (line 9). After this reduction operation,
each processor Pk computes local copies of the scalars α
and β so that it can update its local xk, rk and pk vectors
through DAXPY operations (lines 13, 14 and 15).
3.2 Proposed Alternative Parallelization
The conventional parallelization that adopts the row-
parallel SpMxV necessitates P2P communications on the
input vector entries prior to the local SpMxV computa-
tions. The main purpose of the computational rearrange-
ment is to embed the P2P communications of SpMxV
computations into the following collective communi-
cations of inner product computations, which involve
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
5
output vector of SpMxV. To enable this, P2P communi-
cations should be performed on the output vector entries
immediately after local SpMxV computations.
Algorithm 4 shows our proposed technique for paral-
lelizing the CG method. This simple yet effective com-
putational rearrangement scheme presents an alternative
way to form the local augmented vector p̂k, which
constitutes the main dependency among iterations. In
contrast to Algorithm 3, the augmented input vector
p̂k is not directly formed by P2P communication at
the beginning of the current iteration, but rather com-
puted using the respective q̂k and r̂k vectors in DAXPY
operations of the previous iteration (lines 14-15). This
is achieved by communicating q-vector entries in P2P
communication (instead of p-vector entries) and forming
the local augmented vector q̂k at each processor (lines 6-
9). Then, Pk simply performs its two DAXPY operations
on augmented vectors (lines 14-15); first updating r̂k by
setting it to r̂k − αq̂k and then updating p̂k by setting
it to r̂k + βp̂k. The computed local augmented vector p̂k
is then used in the local SpMxV computations of the
next iteration. The P2P communication of qk entries in
Algorithm 4 is performed together with the reduction
operations, thus combining two communication phases
of Algorithm 3 into a single communication phase (illus-
trated in the highlighted parts of the algorithm). Note
that the DAXPY operation on x vector need not be
performed using local augmented entries since it is not
used in forming p̂k. One DAXPY operation (line 13)
in Algorithm 4 is performed on local vectors with nk
elements while two remaining DAXPY operations (lines
14-15) are performed on local augmented vectors with
n̂k elements. Note that different from the conventional
parallelization, the p̂k vector needs to be formed once
before the iterations begin. After the first iteration, p̂k is
not formed through communication but through DAXPY
computations.
Compared to conventional parallelization, the draw-
back of our parallelization is the redundant computa-
tion performed by each processor Pk in two DAXPY
operations (lines 14-15) for n̂k − nk elements. Note that
n̂k − nk = |RecvV ol (Pk) |, where RecvV ol (Pk) denotes
the set of vector elements that Pk receives. That is,
each vector element received by Pk through P2P com-
munication will incur two redundant multiply-and-add
operations in the local DAXPY operations. Hence, the
total redundant computation in terms of the number
of multiply-and-add operations is two times the total
message volume in terms of words transmitted.
Since the main computational burden in a single iter-
ation is on the SpMxV operation in the CG method, this
redundant computation in two linear vector operations
is not of much concern. Nevertheless, the intelligent
partitioning schemes utilized in the literature [4], [15] for
partitioning matrix A, aim at minimizing the total mes-
sage volume incurred in P2P communications. Hence,
the partitioning objective of minimizing the total volume
of communication corresponds to minimizing the total
Algorithm 3: Conventional parallelization.
B Choose an initial x vector.
B Let rk = pk = bk −Akx and compute ρk = 〈rk, rk〉.
B Reduce ρk to form ρ.
while ρ >  do1
B Communicate updated pk entries.
for Pl ∈ SendSet(Pk) do2
SEND(Pl,pk→l)3
for Pl ∈ RecvSet(Pk) do4
RECV(Pl,pl→k) and update p̂k entries5
qk = Akp̂k6
pik = 〈pk,qk〉7
κk = 〈qk,qk〉8
B Reduce pik and κk to obtain global coefficients.
(pi, κ) = ALL-REDUCE(pik, κk)9
α = ρ/pi10
β = α · κ/pi − 111
ρ = β · ρ12
xk = xk + αpk B for nk elements.13
rk = rk − αqk B for nk elements.14
pk = rk + βpk B for nk elements.15
Algorithm 4: Proposed parallelization.
B Choose an initial x vector.
B Let rk = pk = bk −Akx and compute ρk = 〈rk, rk〉.
B Reduce ρk and communicate pk to form ρ and p̂k.
while ρ >  do1
qk = Akp̂k2
pik = 〈pk,qk〉3
κk = 〈qk,qk〉4
B Reduce pik and κk to obtain global coefficients.
(pi, κ) = ALL-REDUCE(pik, κk)5
B Communicate qk entries.
for Pl ∈ SendSet(Pk) do6
SEND(Pl,qk→l)7
for Pl ∈ RecvSet(Pk) do8
RECV(Pl,ql→k) and update q̂k entries9
α = ρ/pi10
β = α · κ/pi − 111
ρ = β · ρ12
xk = xk + αpk B for nk elements.13
r̂k = r̂k − αq̂k B for n̂k ≥ nk elements.14
p̂k = r̂k + βp̂k B for n̂k ≥ nk elements.15
redundant computation as well.
Fig. 1 presents a pictorial comparison of the conven-
tional and alternative parallelization methods for K = 4
processors. The gray parts of the A matrix and the vec-
tors visualize the submatrix and the subvectors assigned
to processor Pk (for k = 3) and the computations per-
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
6
(a) Conventional parallelization.
(b) Alternative parallelization.
Fig. 1: Illustration of conventional and alternative parallelization of conjugate gradient method.
formed by Pk on them. Light gray and dark gray blocks
of Ak matrix illustrate the off-diagonal and diagonal
blocks, respectively. In the figure, ×’s denote the nonzero
off-diagonal column segments at row stripe Ak and the
respective vector entries to be received by Pk. The figure
also distinguishes the distinct phases of both algorithms
as indicated at the top of their respective phases. The
gray regions in the figure display the communication
phases. As seen in Fig. 1a, the P2P communications
on the input vector are performed just before the local
SpMxV computations, whereas in Fig. 1b, the P2P com-
munications on the output vector are performed after
the local SpMxV computations. Fig. 1 clearly shows that
the conventional parallelization scheme requires two
communication phases, whereas the proposed scheme
requires only one.
4 EMBEDDING P2P COMMUNICATIONS INTO
COLLECTIVE COMMUNICATION
In this section, we describe how to perform P2P and col-
lective communication operations simultaneously. The
main idea here is to use the underlying communi-
cation pattern of collective communication operations
(ALL-REDUCE) for also communicating output vector en-
tries.
In ALL-REDUCE, each processor Pk has its own buffer
and ends up with receiving the result of an associa-
tive operation on the buffers of all other processors.
The ALL-REDUCE operation can be performed in lgK
communication steps [6], [23] in a K-processor system,
where K is a power of two. This reduction algorithm is
called bidirectional exchange and works by simultane-
ous exchange of data between processors. In step d, each
processor exchanges a message with the processor in its
2d−1-distance and updates the values in its local buffer
with those in the received message using an associative
operator. We adopt this communication pattern for the
reduction operation and assume that K is a power of
two for the simplicity of presentation.
In the ALL-REDUCE algorithm described above, Pk
does not directly communicate with all processors in
SendSet(Pk). Now assume that Pk needs to send a set
of q-vector entries to one such processor, Pl. Since it
is definite that a message from Pk will eventually reach
Pl in the reduction operation, it is possible to embed
the vector elements that Pk needs to send to Pl into
the corresponding messages. In other words, Pk may
need to send some vector elements with the help of the
processors it directly communicates with by embedding
the necessary vector elements into its messages. Then,
these processors would simply forward them to target
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
7
processors in SendSet(Pk) that Pk does not directly
communicate with.
Fig. 2 illustrates the communication steps of the
ALL-REDUCE algorithm. The embedding process of P1
with SendSet(P1) = {P0, P2, P4, P6} is displayed via
solid arrows in the figure. In this example, P1 can
directly send the vector elements required by P0 in Step 1
without any need for embedding. For P1 to send its
vector elements to P2, it needs to embed them into its
message to P0 at Step 1, which are then forwarded from
P0 to P2 at Step 2. For sending vector elements to P4,
P1 also embeds them into its message to P0 at Step 1,
then P0 waits for one step and forwards them to P4 at
Step 3. For P6, P1 embeds them into its message to P0 at
Step 1, which is then forwarded from P0 to P2 at Step 2,
and from P2 to its destination P6 at Step 3. Note that the
vector elements that are sent by P1 to processors P2, P4
and P6 are forwarded in certain steps of the algorithm.
Embedding vector elements into the communication
pattern of ALL-REDUCE avoids startup costs for all mes-
sages due to P2P communications and establishes an
exact value on the average and maximum number of
messages being handled (sent and received) by a pro-
cessor, which is lgK. As will be shown by experiments,
this is a significant advantage, and it is the key factor
that leads to better scalability of the parallel solver. On
the down side however, the message volume is likely to
increase due to store-and-forward overhead associated
with the forwarding of respective vector entries embed-
ded in ALL-REDUCE operations. There exists a trade-off
between avoiding message startup costs and increasing
total volume of communication. We exploit the fact that
if the number of communicated vector elements is not
large, the startup costs can still be the dominating factor
in total communication cost in spite of the increased
volume. Thus, avoiding them will possibly compensate
the increase in the message volume.
Note that the embedding scheme requires buffering
due to the store-and-forward overhead. In the worst
case, where each processor needs to send a message
to every other processor in the system, the buffering
overhead of a processor at a single step of the ALL-REDUCE
algorithm is bounded by O(K).
5 PART TO PROCESSOR MAPPING
Consider a given row partition R = {R1, R2, . . . , RK} of
matrix A and a set of processors P = {P1, P2, . . . , PK},
where the number of row parts is equal to the number
of processors. In row partition R, a column ci is said
to be a coupling column if more than one row parts
contain at least one nonzero in ci. Observe that, in the
conventional parallel algorithm, only the input vector
(i.e., p) entries associated with the coupling columns
necessitate communication, whereas in the proposed
parallel algorithm, only the output vector (i.e., q) entries
associated with such columns necessitate communica-
tion. Let Λ (ci) denote the set of row parts that contain
P1 P3 P4 P5 P6 P7P0 P2
P0 P1 P2 P3 P4 P5 P6 P7
P0 P1 P3 P4 P5 P6 P7P2
Step 1: SendSet(P1) = {P0, P2, P4, P6}
Step 2: SendSet(P1) = {P2, P4, P6}
Step 3: SendSet(P1) = {P4, P6}
Fig. 2: Embedding messages of P1 into ALL-REDUCE for
SendSet(P1) = {P0, P2, P4, P6}.
at least one nonzero in ci. Without loss of generality, let
row ri be assigned to row part Rk ∈ R. Now consider
an identity mapping function M : R → P where the row
block Rk is mapped to processor PM(k)=k, for 1 ≤ k ≤ K.
Then, due to the symmetric partitioning requirement,
qi is assigned to Pk. Besides, since all diagonal entries
are nonzero, we have Rk ∈ Λ (ci) . So, {Pl : Rl ∈
Λ(ci) and Rl 6= Rk} denotes the set of processors to
which qi should be sent (multicast) by processor Pk.
Thus, |Λ(ci)|−1 gives the volume of communication that
is incurred by coupling column ci. We define the set of
processors that participate in the communication of qi
as ProcSet(qi) = {Pl : Rl ∈ Λ(ci)}, which includes the
owner Pk of qi as well, hence, |ProcSet(qi)| = |Λ(ci)|.
For any arbitrary mapping, this definition becomes
ProcSet(qi) = {Pl : ∃Rm ∈ Λ(ci) s.t. M(m) = l}. (1)
For conventional parallelization, the total message
volume is independent of the mapping, i.e., different
part-to-processor mappings incur the same amount of
message volume, which is:
ComV ol(R) =
∑
qi: ci∈cc(R)
(|Λ(qi)| − 1), (2)
where cc(R) denotes the set of coupling columns of the
row partition R. However, in the proposed paralleliza-
tion scheme, the total message volume depends on the
mapping of parts to processors due to forwarding of
vector elements in the embedding process.
As an example, in Fig. 2, assume that two parts Ra and
Rb are mapped to processors P1 and P6, respectively, and
P1 needs to send vector elements to P6. These vector
elements need to be forwarded in two steps, increas-
ing communication volume compared to a single P2P
communication between these two processors. However,
if Rb were mapped to P0 (or P3, or P5), these vector
elements would not be forwarded, and they would incur
no extra communication volume at all.
Based on this observation, the objective of mapping
should be to minimize the extra communication volume
due to forwarding. In other words, we should try to keep
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
8
the pairs of processors that communicate a large number
of vector elements close to each other. The closeness here
is defined in terms of the communication pattern of the
ALL-REDUCE algorithm described in the previous section.
We now introduce assumptions and notations used
to discuss the formulation adopted for computing total
cost of a mapping M for a given row partition R. We
assume that the number of processors is an exact power
of two (i.e., K = 2D) and the processors are organized
as a virtual D-dimensional hypercube topology H as
the utilized ALL-REDUCE algorithm implies. In H , each
processor is represented by a D-bit binary number. A
dimension d is defined as the set of 2D−1 virtual bidi-
rectional communication links connecting pairs of neigh-
boring processors of which only differ in bit position d.
Tearing along dimension d is defined as halving Hd into
two disjoint (d − 1)-dimensional subcubes, H0d and H1d ,
such that their respective processors are connected along
dimension d in a one-to-one manner. In this view, step
d of the ALL-REDUCE algorithm can be considered as K/2
processors exchanging information along the K/2 virtual
links of dimension d for d = 0, 1, . . . , D − 1.
For any coupling column ci, the cost of communicating
vector entry qi is defined to be the number of ALL-REDUCE
steps in which qi is communicated. If qi is communicated
in step d of the ALL-REDUCE operation, we define the
corresponding communication cost of qi in this step as
one, regardless of how many times qi is communicated
in this step because all communications of qi in a single
step are handled concurrently. Thus, in step d, qi incurs a
cost of one if the processors in ProcSet(qi) are scattered
across different subcubes H0d and H
1
d of the tearing
along dimension d. Otherwise, qi does not incur any
communication which corresponds to the case where
all processors in ProcSet(qi) are confined to the same
subcube of the tearing. Note that this latter case can
be identified as all processors having the same value
(either 0 or 1) at bit position d in their D-bit binary
representations. Therefore, the communication cost of
qi∈cc(R) is defined as:
cost(qi) =
D−1∑
d=0
(∧
Pk,d ⊗
∨
Pk,d
)
Pk∈ProcSet(qi)
. (3)
In this equation, Pk,d denotes the dth bit of Pk in its D-bit
binary representation, and ∧, ⊗, and ∨ denote the logical
“AND”, “XOR”, and “OR” operators, respectively. Then,
the total cost of mapping M is simply given by:
cost(M) =
∑
qi: ci∈cc(R)
cost(qi). (4)
We should note here that the cost definition in (4)
captures an objective that is in between the total and
concurrent communication overheads. In fact, it repre-
sents the sum of the number of distinct q-vector entries
communicated in each step of the ALL-REDUCE algorithm.
In other words, (3) corresponds to the total concurrent
cost associated with forwarding qi to the processors that
it should be sent. The total message volume could easily
be captured by counting exactly how many times qi
is communicated in each step of ALL-REDUCE instead of
counting it only once. We preferred this cost definition
in order to capture some form of concurrency in the
optimization objective.
In order to find a good mapping, we propose two
Kernighan-Lin (KL) [18] based heuristics. As typical in
KL-type algorithms, the proposed heuristics start from a
given initial mapping and perform a number of moves in
the search space to improve the given mapping. For both
heuristics, the move operator is defined as the swapping
of the processor mapping of two row blocks. The gain
of a swap operation is given as the reduction in the
total communication cost of the mapping, as defined
in (4). Both heuristics perform a number of passes till
their improvement rate drops below a predetermined
threshold. In each iteration of a single pass, the swap
operation with the highest gain is chosen, tentatively
performed and the respective row blocks are locked to
prevent any further operations on them in the same pass.
Best swaps with negative gains are also allowed to be
selected in order to enable hill-climbing. At the end of
a pass, a prefix of the performed swap operations with
the highest cumulative cost improvement is selected as
the resultant mapping to be used in the following pass.
Although both heuristics utilize the same move opera-
tors, they differ in their move neighborhood definitions.
The first heuristic, KLF, considers the full move neighbor-
hood with all possible K(K − 1)/2 swaps, whereas the
second heuristic, KLR, restricts the neighborhood over the
adjacent processors of the virtual hypercube topology. In
other words, KLR allows swapping only the parts at the
processors that directly communicate in the ALL-REDUCE
algorithm. Restricting the swap neighborhood has the
following advantages over searching the full neighbor-
hood: (i) Initial number of swaps reduces from K(K −
1)/2 to K lgK/2, (ii) gain updates performed after a
swap operation become confined to the swap operations
that are in the same dimension as the performed swap,
and (iii) gain updates performed after a swap operation
can be done in constant time. The obvious disadvantage
of KLR is the possible loss in the quality of the generated
mappings compared to KLF. However, as we show in the
experiments, this loss is very small, only around 10%. In
this sense, there is a tradeoff between running time and
mapping quality, where KLR favors time and KLF favors
quality.
In this paper, we only focus on describing the KLR
heuristic because of its significantly better running time
performance and algorithmic elegance. The detailed al-
gorithms of KLR and a comprehensive complexity anal-
ysis are provided in Section 1 of Appendix.
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
9
TABLE 2: Test matrices and their properties.
Number of Nonzeros per row/col
Matrix rows/cols nonzeros avg min max
bcsstk25 15,439 252,241 16.34 2 59
ncvxbqp1 50,000 349,968 7.00 2 9
tandem-dual 94,069 460,493 4.90 2 5
finan512 74,752 596,992 7.99 3 55
cbuckle 13,681 676,515 49.45 26 600
cyl6 13,681 714,241 52.21 36 721
copter2 55,476 759,952 13.70 4 45
Andrews 60,000 760,154 12.67 9 36
pli 22,695 1,350,309 59.50 11 108
pcrystk03 24,696 1,751,178 70.91 24 81
598a 110,971 1,483,868 13.37 5 26
opt1 15,449 1,930,655 124.97 44 243
wave 156,317 2,118,662 13.55 3 44
pkuskt07 16,860 2,418,804 143.46 39 267
kkt-power 2,063,494 12,771,361 7.28 2 96
crankseg-2 63,838 14,148,858 221.64 48 3423
6 EXPERIMENTS
6.1 Experimental Framework
Four schemes are tested in the experiments: CONV,
EMB, EMB-KLF and EMB-KLR. CONV refers to the
conventional parallelization scheme described in Sec-
tion 3.1 (Algorithm 3). EMB, EMB-KLF and EMB-KLR
refer to the proposed parallelization scheme described
in Section 3.2 (Algorithm 4). Hereafter, we will use
notation EMB* to refer to these three embedded schemes.
In all four schemes, row-parallel SpMxV algorithm is
utilized, where the row partitions are obtained using the
hypergraph partitioning tool PaToH on the column-net
model [4] with default parameters. This model aims at
minimizing total communication volume under the com-
putational load balancing constraint. The load imbalance
for all schemes is set to 10%. CONV and EMB rely on
random row-part-to-processor mapping. EMB-KLF and
EMB-KLR utilize the KLF and KLR row-part-to-processor
mapping heuristics described in Section 5.
The number of passes for KLF and KLR is set to 10
and 20, respectively. Although lower number of passes
could be used for these heuristics, we opted to keep
them high to improve the mapping quality to a greater
extent. In fact, a few number of passes would have
been sufficient for KLF as it searches the full move
neighborhood, whereas lgK passes would have been
sufficient for KLR as it restricts the move neighborhood
to the particular steps of the ALL-REDUCE.
Table 2 displays the properties of 16 structurally
symmetric matrices collected from University of Florida
Sparse Matrix Collection [8]. Matrices are sorted with
respect to their nonzero counts.
We used two parallel systems in the experiments: Cray
XE6 (XE6) and IBM Blue Gene/Q (BG/Q). A node on
XE6 consists of 32 cores (two 16-core AMD processors)
with 2.3 GHz clock frequency and 32 GB memory. The
nodes are connected with a high speed 3D torus net-
work called CRAY Gemini. A node on BG/Q consists
of 16 cores (single PowerPC A2 processor) with 1.6
GHz clock frequency and 16 GB memory. The nodes are
TABLE 3: Performance comparison of mapping heuris-
tics KLF and KLR averaged over 16 matrices.
mapping time % improvement in
normalized wrt mapping cost
partitioning time wrt random map.
K KLF KLR KLF KLR
16 0.02 0.05 39.4 32.7
32 0.11 0.10 44.1 39.2
64 0.42 0.14 46.5 41.9
128 1.45 0.24 45.9 41.0
256 4.71 0.44 47.8 42.9
512 13.24 0.61 46.5 41.1
1024 42.35 0.67 45.1 40.1
2048 129.64 1.21 41.4 37.9
connected with 5D torus chip-to-chip network. We used
K∈16, 32, . . . , 1024 cores on XE6 and K∈16, 32, . . . , 2048
cores on BG/Q for running parallel CG.
6.2 Mapping Performance Analysis
Table 3 compares the KLF and KLR heuristics in terms
of preprocessing time and mapping cost (computed ac-
cording to Equation (4)). The mapping times in the table
are normalized with respect to the partitioning times of
PaToH. For each instance, first, a row partition of the
input matrix is computed using PaToH, and a random
part-to-processor mapping is generated. Then, the KLF
and KLR heuristics are applied separately on this initial
solution to obtain two different mapping results. The
improvement rates obtained using these heuristics are
reported separately as average over all 16 test matrices.
As seen in Table 3, KLR’s lower algorithmic complexity
is reflected on its running time; as K increases, the
average increase in KLR’s mapping time is much lower
than that of KLF’s. Especially for large K values, KLR is
more preferable than KLF because KLR’s mapping time
becomes higher than the partitioning time. The mapping
time of KLR remains well below the partitioning time up
to 2048 processors. KLR’s faster mapping times are due
to its successful move neighborhood restriction.
As Table 3 illustrates, KLF obtains better mappings
than KLR because it uses a broader search space. The
mappings obtained by KLR are marginally worse, only
8%–12% on average. There is a trade-off between run-
ning time and mapping quality. The trade-off here ac-
tually favors KLR since it is orders of magnitude faster
than KLF, but it generates only slightly worse mappings.
6.3 Communication Requirements Assessment
Table 4 compares the performance of four parallel
schemes in terms of their communication requirements
averaged over 16 test matrices. Message counts of CONV
include both P2P and collective communication phases.
For CONV, the maximum message volume value refers
to the maximum volume of communication handled
during P2P operations, whereas for EMB* schemes, it
refers to the sum of the communication volume val-
ues of the processors that handle maximum amount of
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
10
TABLE 4: Communication statistics averaged over 16 matrices.
message count message volume
CONV EMB* total max
K avg max max (=avg) CONV EMB EMB-KLF EMB-KLR CONV EMB EMB-KLF EMB-KLR
16 9.8 12.8 4 0.522 0.780 0.651 0.719 0.102 0.115 0.103 0.108
32 13.2 19.3 5 0.839 1.509 1.186 1.277 0.083 0.115 0.099 0.108
64 16.1 27.4 6 1.304 2.595 1.951 2.158 0.070 0.114 0.096 0.099
128 19.3 34.5 7 1.986 4.550 3.170 3.453 0.055 0.111 0.086 0.092
256 22.2 43.5 8 2.989 7.901 5.012 5.497 0.047 0.112 0.076 0.078
512 25.2 53.8 9 4.522 14.053 7.981 8.673 0.036 0.114 0.064 0.068
1024 28.2 71.1 10 6.831 25.631 13.118 13.650 0.029 0.116 0.058 0.061
2048 31.3 85.0 11 10.669 49.821 25.211 25.130 0.025 0.122 0.054 0.052
In “message count” column, avg and max denote the average and maximum number of messages, respectively, sent by a single processor. In the
“message volume” column, max denotes maximum message volume handled (sent and received) by a single processor. Message volume values
are given in terms of number of floating points words and they are scaled by the number of rows/columns of the respective matrices.
communication in each step of ALL-REDUCE. Since each
processor sends/receives a single message in each step
of ALL-REDUCE, the maximum message volume effectively
represents the concurrent communication volume as
well. The detailed results per matrix basis are given in
Section 2 of Appendix.
As seen in Table 4, for CONV, maximum message
counts are significantly larger than average message
counts for each K. This is due to the irregular sparsity
patterns of the matrices which incur irregular P2P com-
munications in parallel SpMxV computations. On the
other hand, in EMB* schemes, average and maximum
message counts are both equal to lgK for K processors
independent of the sparsity pattern of the matrix.
In a parallel algorithm, the message latency over-
head is actually determined by the processor that
handles maximum number of messages. In that sense,
as seen in Table 4, EMB* schemes perform signifi-
cantly better than CONV for all K values. For exam-
ple, for pkustk07 test matrix, the maximum message
counts are 16, 25, 34, 34, 47, 60, 90, 96 in CONV, while
they are only 4, 5, 6, 7, 8, 9, 10, 11 in embedded schemes,
for K = 16, 32, . . . , 2048 processors, respectively. This
performance gap between CONV and EMB* schemes
increases with increasing number of processors in favor
of embedded schemes. For example, with K increasing
from 16 to 2048 processors, the maximum message count
increases 7.08 times for CONV whereas it only increase
2.75 times for EMB*, on the average.
As expected, EMB* schemes increase both total and
maximum communication volumes compared to CONV.
Even so, this increase remains rather low, especially for
EMB-KLF and EMB-KLR schemes that utilize intelligent
mapping heuristics. Besides, this increase also remains
considerably low compared to the increase in the mes-
sage latency overhead of CONV. The message latency
overhead of CONV compared to those of EMB* schemes
is greater than the communication volume overhead of
EMB* schemes compared to that of CONV. For example,
at K=2048, CONV incurs 7.73 times the message latency
overhead of EMB* while EMB-KLR incurs only 2.34
times the total message volume overhead and 2.63 times
the maximum message volume overhead of CONV.
The mapping quality improvement rates of KLF and
KLR (utilized in EMB-KLF and EMB-KLR) are roughly
reflected in their reduction of message volume in the
actual runs compared to the random mapping (utilized
in EMB), especially for K ≥ 256. For instance, as seen
in Table 3, for K = 1024, the KLF and KLR improve the
cost of the random mapping on the average by 45.1%
and 40.1%, respectively. In the actual runs, although not
presented explicitly (these values can easily be produced
from Tables 1 and 2 in Section 2 of Appendix), compared
to EMB, EMB-KLF obtains 46.5% less total message
volume and 36.7% less maximum message volume, and
EMB-KLR obtains 42.0% less total message volume and
32.8% less maximum message volume on the average
for K = 1024. In that sense, it can be said that the
objective used for mapping heuristics serves the purpose
of reducing both total and maximum message volume
successfully in the actual runs.
The communication cost of parallel SpMxV opera-
tions mainly depends on the communication cost of
the bottleneck processor, which is by large determined
by the maximum message count and maximum mes-
sage volume requirements. As seen in Table 4, for all
schemes, the maximum message volume requirements
tend to decrease with increasing K. On the other hand,
for CONV, maximum message counts tend to increase
sharply with increasing K, whereas for EMB* schemes,
maximum message counts increase very slowly (loga-
rithmic growth) with increasing K. This implies that as
the number processors increases, the message latency
becomes more and more dominant in the overall com-
munication cost. This fact enables embedded schemes to
scale better, which is confirmed by the speedup curves
reported in the next section.
Recall that EMB* schemes perform redundant
computation due to computational rearrangement.
On average, the EMB* schemes perform
0.1%, 0.2%, 0.3%, 0.4%, 0.6%, 0.8%, 1.2%, 1.7% more
computation than CONV per processor for K = 16, 32,
. . . , 2048, respectively. This computational increase is
very low and thus negligible.
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
11
 0
 10
 20
 30
 40
 50
 60
 70
 80
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
bcsstk25
Cray XE6
 0
 20
 40
 60
 80
 100
 120
 140
 160
 180
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
bcsstk25
Blue Gene/Q
 0
 20
 40
 60
 80
 100
 120
 140
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
ncvxbqp1
Cray XE6
 0
 50
 100
 150
 200
 250
 300
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
ncvxbqp1
Blue Gene/Q
 0
 20
 40
 60
 80
 100
 120
 140
 160
 180
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
tandem-dual
Cray XE6
 0
 50
 100
 150
 200
 250
 300
 350
 400
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
tandem-dual
Blue Gene/Q
 0
 20
 40
 60
 80
 100
 120
 140
 160
 180
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
finan512
Cray XE6
 0
 50
 100
 150
 200
 250
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
finan512
Blue Gene/Q
 0
 20
 40
 60
 80
 100
 120
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
cbuckle
Cray XE6
 0
 50
 100
 150
 200
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
cbuckle
Blue Gene/Q
 0
 20
 40
 60
 80
 100
 120
 140
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
cyl6
Cray XE6
 0
 50
 100
 150
 200
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
cyl6
Blue Gene/Q
 0
 20
 40
 60
 80
 100
 120
 140
 160
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
copter2
Cray XE6
 0
 50
 100
 150
 200
 250
 300
 350
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
copter2
Blue Gene/Q
 0
 20
 40
 60
 80
 100
 120
 140
 160
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
Andrews
Cray XE6
 0
 50
 100
 150
 200
 250
 300
 350
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
Andrews
Blue Gene/Q
Fig. 3: Speedup curves for the first 8 of 16 test matrices.
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
12
 0
 20
 40
 60
 80
 100
 120
 140
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
pli
Cray XE6
 0
 50
 100
 150
 200
 250
 300
 350
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
pli
Blue Gene/Q
 0
 50
 100
 150
 200
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
pcrystk03
Cray XE6
 0
 100
 200
 300
 400
 500
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
pcrystk03
Blue Gene/Q
 0
 50
 100
 150
 200
 250
 300
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
598a
Cray XE6
 0
 100
 200
 300
 400
 500
 600
 700
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
598a
Blue Gene/Q
 0
 20
 40
 60
 80
 100
 120
 140
 160
 180
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
opt1
Cray XE6
 0
 50
 100
 150
 200
 250
 300
 350
 400
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
opt1
Blue Gene/Q
 0
 50
 100
 150
 200
 250
 300
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
wave
Cray XE6
 0
 100
 200
 300
 400
 500
 600
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
wave
Blue Gene/Q
 0
 50
 100
 150
 200
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
pkustk07
Cray XE6
 0
 50
 100
 150
 200
 250
 300
 350
 400
 450
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
pkustk07
Blue Gene/Q
 0
 100
 200
 300
 400
 500
 600
 700
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
kkt-power
Cray XE6
 0
 200
 400
 600
 800
 1000
 1200
 1400
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
kkt-power
Blue Gene/Q
 0
 50
 100
 150
 200
 250
 300
 350
 400
16 32 64 128 256 512 1024
sp
ee
du
p
K (number of processors)
crankseg-2
Cray XE6
 0
 100
 200
 300
 400
 500
 600
 700
 800
16 32 64 128 256 512 1024 2048
sp
ee
du
p
K (number of processors)
crankseg-2
Blue Gene/Q
Fig. 4: Speedup curves for the last 8 of 16 test matrices.
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
13
6.4 Speedup Results
Figs. 3 and 4 present the speedup curves of four tested
schemes. The results obtained on XE6 and BG/Q su-
percomputers are illustrated with white and gray plots,
respectively. These plots are grouped by test matrices for
the ease of readability.
In both architectures, all schemes similarly scale up
to K = 64 or K = 128 and then their distinctive char-
acteristics begin to establish themselves with increasing
number of processors. On XE6, all embedded schemes
scale better than CONV, and EMB-KLF and EMB-KLR
scale better than EMB by obtaining roughly the same
speedup values. On BG/Q, EMB-KLF and EMB-KLR
usually scale better than CONV and EMB, while CONV
and EMB can scale better with respect to each other
depending on the test matrix. We can say that the effect
of message latency is more dominant on XE6, which
leads to embedded schemes having better speedup val-
ues despite the increased message volume in general.
Moreover, the embedded schemes start scaling better
at lower K values compared to BG/Q. On the other
hand, on BG/Q, this impact is not as dramatic as on
XE6 and the effect of increased message volume in em-
bedded schemes on speedup values is more prominent.
This is basically due to relatively slow communication
on BG/Q, which overshadows the benefits of reducing
maximum message counts by making the embedded
schemes’ performance more sensitive to the increases
in message volume. As seen in the plots that belong to
BG/Q, EMB-KLF and EMB-KLR are usually able to ob-
tain better speedup values at relatively higher K values
where the message startup costs completely dominate
the message volume costs.
Regarding the plots in Figs. 3 and 4, among 16 ma-
trices, the lowest speedup values and poorest scalabil-
ity characteristics belong to Andrews, cbuckle and
cyl6 matrices on both architectures for CONV scheme.
They exhibit quite poor scaling performance where the
speedup values start deteriorating very early at low
K values compared to other test instances. For these
matrices, the speedup values of CONV scheme are below
60 on XE6 with 1024 processors, and below 100 on
BG/Q with 2048 processors. These three matrices have
the highest communication requirements in terms of
maximum message counts. The corresponding values
are 83, 96, 108, 128 for Andrews matrix, 36, 52, 82, 126
for cbuckle matrix, and 36, 54, 78, 126 for cyl6 matrix
for K = 128, 256, 512, 1024, respectively (see Table 2 in
Appendix). This poor performance is basically because
of the high latency overhead which becomes the decisive
factor in communication and overall execution times
with increasing number of processors. On the other
hand, observe that the embedded schemes have much
better scalability characteristics for these matrices due
to their lower latency overheads. Note that sparsity pat-
terns of the matrices, which depend on the application,
along with the partitioning process as a whole, deter-
mine the communication requirements of the parallel
solver.
Speedups on BG/Q are typically higher than XE6 since
according to our running time analysis, the computa-
tion on XE6 is approximately 8 to 10 times faster than
BG/Q. This enables computation to communication ratio
to remain high and processors to be computationally
intensive even at high K values for BG/Q, thus leading
to higher speedups.
7 CONCLUSIONS AND FUTURE WORK
We presented a novel parallelization scheme for linear
iterative solvers, where point-to-point communications
incurred by sparse-matrix vector multiplies and collec-
tive communications incurred by inner product com-
putations can be performed in a single communication
phase. Our parallelization provides an opportunity to
reduce the synchronization overheads and establishes an
exact value on the number messages communicated. We
realized this opportunity by embedding point-to-point
communications into collective communication opera-
tions. Embedding allows us to avoid all message startup
costs of point-to-point communications at the cost of
increasing message volume. Further, we presented two
iterative-improvement-based heuristics to address this
increase in the volume. The experiments were conducted
on a Cray XE6 machine with up to 1024 processors
and on a IBM BlueGene/Q machine with up to 2048
processors for test matrices from various domains. The
results indicate that the message latencies become the
determinant factor for the scalability of the solver with
increasing number of processors. The results also show
that our method, compared to conventional paralleliza-
tion, yields better scalable performance by providing a
low value on the number of messages communicated.
We plan to investigate applicability of the proposed
embedding and rearrangement scheme to precondi-
tioned iterative solvers. We believe that the proposed
embedding scheme is directly applicable to the ex-
plicit preconditioning techniques such as approximate
inverses or factored approximate inverses [3], [28]. Such
preconditioners introduce one or two more SpMxV com-
putations into the iterative solver. Since each SpMxV
(either with the coefficient matrix or the preconditioner
matrices) is often preceded/followed by global reduc-
tion operation(s), embedding of P2P communications of
SpMxV operations into collective communication primi-
tives should be viable. However, the computational rear-
rangement scheme may need modification according to
the utilized preconditioning technique and the respective
partitioning method used for it.
ACKNOWLEDGMENTS
We acknowledge PRACE for awarding us access to
resources Hermit (Cray XE6) based in Germany at High
Performance Computing Center Stuttgart (HLRS) and
Juqueen (Blue Gene/Q) based in Germany at Ju¨lich
Supercomputing Centre.
1045-9219 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/TPDS.2014.2311804, IEEE Transactions on Parallel and Distributed Systems
14
REFERENCES
[1] C. Aykanat, F. O¨zgu¨ner, F. Ercal, and P. Sadayappan. Iterative
algorithms for solution of large sparse systems of linear equations
on hypercubes. IEEE Trans. Comput., 37(12):1554–1568, December
1988.
[2] Cevdet Aykanat, Ali Pinar, and U¨mit V. C¸atalyu¨rek. Permuting
sparse rectangular matrices into block-diagonal form. SIAM J. Sci.
Comput., 25:1860–1879, June 2004.
[3] Michele Benzi, Jane K. Cullum, and Miroslav Tuma. Robust
approximate inverse preconditioning for the conjugate gradient
method. SIAM J. Sci. Comput., 22(4):1318–1332, April 2000.
[4] Umit Catalyurek and Cevdet Aykanat. Hypergraph-partitioning-
based decomposition for parallel sparse-matrix vector multiplica-
tion. IEEE Trans. Parallel Distrib. Syst., 10:673–693, July 1999.
[5] U¨mit V. C¸atalyu¨rek, Cevdet Aykanat, and Bora Uc¸ar. On two-
dimensional sparse matrix partitioning: Models, methods, and a
recipe. SIAM J. Sci. Comput., 32(2):656–683, February 2010.
[6] Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de
Geijn. Collective communication: theory, practice, and experience:
Research articles. Concurr. Comput. : Pract. Exper., 19(13):1749–
1783, September 2007.
[7] A.T. Chronopoulos and C.W. Gear. s-step iterative methods for
symmetric linear systems. Journal of Computational and Applied
Mathematics, 25(2):153 – 168, 1989.
[8] Timothy A. Davis and Yifan Hu. The university of florida
sparse matrix collection. ACM Trans. Math. Softw., 38(1):1:1–1:25,
December 2011.
[9] E. F. D’Azevedo, V. L. Eijkhout, and C. H. Romine. Conjugate
Gradient Algorithms With Reduced Synchronization Overheads
on Distributed Memory Processors. Technical Report 56, Lapack
Working Note, 1993.
[10] E. de Sturler and H. A. van der Vorst. Reducing the effect of
global communication in gmres(m) and cg on parallel distributed
memory computers. Appl. Numer. Math., 18(4):441–459, October
1995.
[11] P. Ghysels and W. Vanroose. Hiding global synchronization la-
tency in the preconditioned conjugate gradient algorithm. Techni-
cal Report 12.2012.1, ExaScience Lab, Intel Labs Europe, December
2012.
[12] P. Ghysels and W. Vanroose. Hiding global synchronization la-
tency in the preconditioned conjugate gradient algorithm. Parallel
Computing, (0):–, 2013.
[13] A. Gupta, V. Kumar, and A. Sameh. Performance and scalability of
preconditioned conjugate gradient methods on parallel comput-
ers. Parallel and Distributed Systems, IEEE Transactions on, 6(5):455–
469, 1995.
[14] Bruce Hendrickson and Tamara G. Kolda. Partitioning rectan-
gular and structurally unsymmetric sparse matrices for parallel
processing. SIAM J. Sci. Comput., 21(6):2048–2072, December 1999.
[15] Bruce Hendrickson and Tamara G. Kolda. Graph partitioning
models for parallel computing. Parallel Comput., 26(12):1519–1534,
November 2000.
[16] Bruce Hendrickson, Robert Leland, and Steve Plimpton. An
efficient parallel algorithm for matrix-vector multiplication. In-
ternational Journal of High Speed Computing, 7:73–88, 1995.
[17] Torsten Hoefler, Peter Gottschling, Andrew Lumsdaine, and Wolf-
gang Rehm. Optimizing a conjugate gradient solver with non-
blocking collective operations. Parallel Comput., 33(9):624–633,
September 2007.
[18] B.W. Kernighan and S. Lin. An efficient heuristic procedure for
partitioning graphs. Bell System Tech. J., 49:291–307, 1970.
[19] Vipin Kumar. Introduction to Parallel Computing. Addison-Wesley
Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition,
2002.
[20] J. G. Lewis and R. A. van de Geijn. Distributed memory matrix-
vector multiplication and conjugate gradient algorithms. In
Proceedings of the 1993 ACM/IEEE conference on Supercomputing,
Supercomputing ’93, pages 484–492, New York, NY, USA, 1993.
ACM.
[21] David Luenberger and Ye. Linear and Nonlinear Programming.
Springer, third edition, 2008.
[22] Grard Meurant. Multitasking the conjugate gradient method on
the {CRAY} x-mp/48. Parallel Computing, 5(3):267 – 280, 1987.
[23] Sanjay Ranka and Sartaj Sahni. Hypercube algorithms: with appli-
cations to image processing and pattern recognition. Springer-Verlag
New York, Inc., New York, NY, USA, 1990.
[24] Y. Saad. Practical use of polynomial preconditionings for the con-
jugate gradient method. SIAM Journal on Scientific and Statistical
Computing, 6(4):865–881, 1985.
[25] Y. Saad. Krylov subspace methods on supercomputers. SIAM J.
Sci. Stat. Comput., 10(6):1200–1232, November 1989.
[26] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for
Industrial and Applied Mathematics, Philadelphia, PA, USA, 2nd
edition, 2003.
[27] Bora Uc¸ar and Cevdet Aykanat. Encapsulating multiple
communication-cost metrics in partitioning sparse rectangular
matrices for parallel matrix-vector multiplies. SIAM J. Sci. Com-
put., 25(6):1837–1859, 2004.
[28] Bora Uc¸ar and Cevdet Aykanat. Partitioning sparse matrices for
parallel preconditioned iterative methods. SIAM J. Sci. Comput.,
29(4):1683–1709, June 2007.
[29] Bora Uc¸ar and Cevdet Aykanat. Revisiting hypergraph models
for sparse matrix partitioning. SIAM Rev., 49:595–603, November
2007.
[30] Brendan Vastenhouw and Rob H. Bisseling. A two-dimensional
data distribution method for parallel sparse matrix-vector multi-
plication. SIAM Rev., 47:67–95, January 2005.
R. Oguz Selvitopi received his B.S. degree in
Computer Engineering from Marmara University
(2008) and M.S. degree (2010) in Computer En-
gineering from Bilkent University, Turkey where
he is currently a PhD candidate. His research in-
terests are parallel and distributed systems, par-
allel computing, scientific computing and bioin-
formatics.
Muhammet Mustafa Ozdal Muhammet Mustafa
Ozdal received his B.S. degree in electrical engi-
neering (1999), and M.S. degree in computer en-
gineering (2001) from Bilkent University, Turkey.
He obtained the Ph.D. degree in computer sci-
ence from the University of Illinois at Urbana-
Champaign in 2005. He was a recipient of the
IEEE William J. McCalla ICCAD Best Paper
Award in 2011, and the ACM SIGDA Technical
Leadership Award in 2012. He has served as the
program and general chair of IEEE/ACM SLIP,
contest and publicity chair of ACM ISPD, and technical program com-
mittee member of the following IEEE/ACM conferences: ICCAD, DAC,
DATE, ISPD, ISLPED, and SLIP. He is currently a research scientist
in the Strategic CAD Labs of Intel Corporation. His research interests
include heterogeneous computing, hardware/software co-design, high-
performance computing, and algorithms for VLSI CAD.
Cevdet Aykanat received the BS and MS de-
grees from Middle East Technical University,
Ankara, Turkey, both in electrical engineering,
and the PhD degree from Ohio State University,
Columbus, in electrical and computer engineer-
ing. He worked at the Intel Supercomputer Sys-
tems Division, Beaverton, Oregon, as a research
associate. Since 1989, he has been affiliated
with the Department of Computer Engineering,
Bilkent University, Ankara, Turkey, where he
is currently a professor. His research interests
mainly include parallel computing, parallel scientific computing and its
combinatorial aspects. (co)authored about 70 technical papers pub-
lished in academic journals indexed in ISI and his publications received
above 600 citations in ISI indexes. He is the recipient of the 1995 Young
Investigator Award of The Scientific and Technological Research Council
of Turkey and 2007 Parlar Science Award. He has served as a member
of IFIP Working Group 10.3 (Concurrent System Technology) since
2004 and as an Associate Editor of IEEE Transactions of Parallel and
Distributed Systems between 2008 and 2012.
