Improved Load Distribution
in Parallel Sparse Cholesky Factorization
Edward Rothberg
Robert Schreiber
The Research Institute for Advanced Computer Science is operated by Universities Space Research
Association, The American City Building, Suite 212, Columbia, MD 21044, (410) 730-2656
Work reported herein was supported by NASA via Contract NAS 2-13721 between NASA and the Universities
Space Research Association (USRA). Work was performed at the Research Institute for Advanced Computer
Science (RIACS), NASA Ames Research Center, Moffett Field, CA 94035-1000.

Improved Load Distribution in Parallel Sparse Cholesky Factorization
Edward Rothberg Robert Schreiber
Intel Supercomputer Systems Division
14924 N.W. Greenbrier Parkway
Beaverton, OR 97006
Research Institute for
Advanced Computer Science
NASA Ames Research Center
Moffett Field, CA 94035-1000
Abstract
Compared to the customary column-oriented approaches, block-oriented, distributed-memory sparse Cholesky factorization benefits from an asymptotic reduction in interprocessor communication volume and an asymptotic increase in the amount of concurrency that is exposed in the problem. Unfortunately, block-oriented approaches (specifically, the block fan-out method) have suffered from poor balance of the computational load. As a result, achieved performance can be quite low. This paper investigates the reasons for this load imbalance and proposes simple block mapping heuristics that dramatically improve it. The result is a roughly 20% increase in realized parallel factorization performance, as demonstrated by performance results from an Intel Paragon(TM) system. We have achieved performance of nearly 3.2 billion floating point operations per second with this technique on a 196-node Paragon system.
1 Introduction
The Cholesky factorization of sparse symmetric
positive definite matrices is an extremely important
computation, arising in a variety of scientific and engi-
neering applications. Sparse Cholesky factorization is
quite time-consuming and is frequently the computa-
tional bottleneck in these applications. Consequently,
there is significant interest in performing the compu-
tation on large parallel machines.
Parallel sparse Cholesky factorization is typically
performed using one of two alternative high-level ma-
trix mapping strategies. The first and more tradi-
tional approach, a 1-D mapping, distributes rows or
columns of the sparse matrix among processors. Un-
fortunately, this mapping has two major limitations.
The first is that it produces interprocessor communi-
cation volumes that grow linearly in the number of
processors [7]. The communication costs of this ap-
proach makes it nonscalable [14], and they also dom-
inate runtimes on all but the smallest parallel ma-
chines. The second limitation is caused by the parallel
task definition that seems to naturally accompany a
1-D data mapping. A parallel method that breaks the
computation into column scaling and column modifi-
cation tasks suffers from an excessively long critical
path [11]. Since the critical path represents a lower
bound on parallel runtime, parallel speedups are lim-
ited, unnecessarily, by this approach.
The second mapping alternative is a 2-D block map-
ping, where rectangular blocks of the sparse matrix
are mapped to the processors. In such a mapping,
we view the machine as a two-dimensional Pr x Pc
processor grid, whose members we denote P(i, j). All
blocks in a given block row are mapped to the same
row of processors, and all elements of a block column
to a single processor column. The advantages of such
a mapping are substantial. Communication volumes
grow as the square root of the number of processors,
versus linearly for the 1-D mapping. The critical path
length is also significantly reduced. For a k x k grid
problem, it is O(k) for a 2-D block mapping, versus
O(k^2) for a 1-D mapping. These advantages accrue
even when the underlying machine has some intercon-
nection network whose topology is not a grid. Several
researchers have obtained excellent performance us-
ing a block-oriented approach, both on fine-grained,
massively-parallel SIMD machines [3] and on coarse-
grained, highly-parallel MIMD machines [12].
To define a 2-D block mapping, one must spec-
ify the mappings of matrix (block) rows to proces-
sor rows and of columns to processor columns. What
has been used to date is the 2-D cyclic (also called
torus-wrap) mapping: block LIJ resides at proces-
sor P(I mod Pr, J mod Pc). We have observed, how-
ever, that a 2-D cyclic mapping (particularly a coarse-
grained mapping using large blocks) has a serious
limitation: it produces significant load imbalances
that severely limit achieved efficiency. On the Intel
Paragon system, on which interprocessor communication bandwidth is quite high, this load imbalance limits efficiency to a greater degree than communication or want of parallelism.

This paper investigates in some detail the sources of the load imbalance produced by block-oriented methods. After examining the causes of the poor load balance obtained by the cyclic mapping, we propose and investigate a method for improving it. We find that by performing a heuristic remapping of matrix blocks to processors, load imbalance can be mitigated to a point where it is no longer the most serious bottleneck in the computation. Performance results from a block-oriented code run on the Intel Paragon system demonstrate that the block mapping heuristic produces a roughly 20% increase in achieved parallel performance compared with the cyclic mapping. We also find that careful choice of P, the number of processors, can produce substantial performance improvements without remapping; but they are not as large as those produced by the heuristic mapping.
The organization of the paper is as follows. Section
2 provides background on parallel sparse Cholesky,
including a discussion of block-oriented factorization
and the block fan-out method. Section 3 looks at load
balance results from the block fan-out method. Sec-
tion 4 describes our approach to improving the load
balance, and looks at the resulting improvement in
load balance and parallel performance. Finally, Sec-
tion 5 discusses the results.
2 Parallel Sparse Cholesky Factorization
2.1 Computation Structure
The goal of the sparse Cholesky computation is to
factor a sparse symmetric positive definite n x n matrix
A into the form A = LL T, where L is lower triangular.
We consider only the numeric factorization here; our
present codes perform matrix reordering and symbolic
factorization sequentially.
In the block factorization approach considered here,
matrix blocks are formed by dividing the columns of
the n x n matrix into N contiguous subsets, N ≤ n.
The identical partitioning is performed on the rows.
A block LIJ in the sparse matrix is formed from the
elements that fall simultaneously in row subset I and
column subset J.
The block-oriented sparse factorization is per-
formed with the following pseudo-code:
1.  L := A
2.  for K = 1 to N do
3.      LKK := Factor(LKK)
4.      for I = K+1 to N with LIK ≠ 0 do
5.          LIK := LIK * LKK^(-T)
6.      for J = K+1 to N with LJK ≠ 0 do
7.          for I = J to N with LIK ≠ 0 do
8.              LIJ := LIJ - LIK * LJK^T
We will refer to Statement 3, the Cholesky factorization of a diagonal block LKK, as a BFAC(K, K) operation. Similarly, we refer to Statement 5 as a BDIV(I, K) operation, and Statement 8 as a BMOD(I, J, K) operation.
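For concreteness, the following short Python/NumPy sketch (ours, not the authors' code) applies the same blocked loop to a dense symmetric positive definite matrix, treating every block as nonzero; the block size b is assumed to divide the matrix order.

    import numpy as np

    def block_cholesky(A, b):
        # Dense illustration of the BFAC/BDIV/BMOD structure above;
        # assumes the block size b divides the order of A.
        n = A.shape[0]
        N = n // b                                   # number of block rows/columns
        L = np.tril(A).astype(float)
        blk = lambda I: slice(I * b, (I + 1) * b)
        for K in range(N):
            # BFAC(K, K): Cholesky factorization of the diagonal block
            L[blk(K), blk(K)] = np.linalg.cholesky(L[blk(K), blk(K)])
            for I in range(K + 1, N):
                # BDIV(I, K): triangular solve, LIK := LIK * LKK^(-T)
                L[blk(I), blk(K)] = np.linalg.solve(L[blk(K), blk(K)],
                                                    L[blk(I), blk(K)].T).T
            for J in range(K + 1, N):
                for I in range(J, N):
                    # BMOD(I, J, K): LIJ := LIJ - LIK * LJK^T
                    L[blk(I), blk(J)] -= L[blk(I), blk(K)] @ L[blk(J), blk(K)].T
        return L

On a suitable A, block_cholesky(A, 48) returns a lower-triangular L with L @ L.T close to A; in the parallel block fan-out method, each of these block operations is instead performed by the processor that owns the destination block, and zero blocks are skipped.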
Observe that in a BMOD() operation, the row index of the destination block LIJ is the same as the row index of one source block, and that the column index is the same as the row index of the other source block. Equivalently, a block LIK modifies only blocks in block row I or block column I.
2.2 Supernodes
Before discussing parallel sparse factorization, we
must first discuss an important concept in sparse fac-
torization, that of a supernode [2]. A supernode is a
set of adjacent columns in the factor L whose non-zero
structure consists of a dense lower-triangular block on
the diagonal, and an identical set of non-zeroes for
each column below the diagonal block. Supernodes
arise in any sparse factor, and they are typically quite
large. The regularity in the sparse matrix captured by
this supernodal structure can be exploited for a vari-
ety of purposes [2, 8, 11, 13]. We exploit it to simplify
the internal non-zero structure of blocks of the matrix.
Specifically, we create block columns whose member
columns belong to the same supernode. As a result,
we obtain blocks whose rows are either completely
zero or are dense. This regular structure allows the
block factorization primitives (BFAC, BDIV, and
BMOD) to be implemented efficiently. This regular-
ity can be further increased by performing supernode
amalgamation [1] on the factor matrix. Amalgamation is a heuristic that merges supernodes with very similar non-zero structures into larger supernodes. We use amalgamation in our experiments.
2.3 Parallel Block-Oriented Sparse Cholesky Factorization
We now discuss parallel block-oriented sparse fac-
torization. As mentioned earlier, the approach we
use is the block fan-out method. We give a high-
level overview of the method here; detailed descriptions have appeared earlier [12]. Our code is based on the single-program, multiple-data (SPMD), message-passing model of parallel computing. A set of processors executes identical code. All data is private to some processor, and processors communicate via interprocessor messages.
Each block LIJ in the block fan-out method has an owning processor. The owner of LIJ performs all block operations whose destination is LIJ (these include BFAC(I, J) when I = J, BDIV(I, J) when I ≠ J, and all BMOD(I, J, K) operations with LIJ as their destination). A block LIJ is typically the destination for several block operations.
Interprocessor communication is required whenever
a block on one processor modifies a block on an-
other processor. A processor records enough informa-
tion about each block to allow it to determine when
a block has received all block modifications, and to
which other processors that block must be sent once
the last block modification is performed. A processor
also records enough information to allow it to deter-
mine which blocks that it owns are modified by a block
it receives from another processor. The block fan-out
method is entirely "data-driven". A processor acts on
received blocks in the order in which they are received
from other processors, and it sends blocks it owns to
other processors as soon as they are ready to be used
as source blocks.
For a sparse matrix, the block fan-out method does
not actually perform a 2-D mapping on the entire
matrix. Instead, the method splits the matrix into
two portions: a domain portion and a root portion.
The domain portion consists of matrix columns corre-
sponding to disjoint subtrees of the elimination tree.
(See [10] for a discussion of the elimination tree.) Each
subtree is assigned wholly to a single processor so as
to evenly distribute the factorization work associated
with the domain portion of the matrix among the
processors. The domain portion is permuted to the
front of the sparse factor matrix, and it is mapped
to the processors using a 1-D block-column mapping
(although the non-zeroes in the domain portion are
still stored as a set of blocks). The root portion of
the matrix is mapped to processors using a 2-D map-
ping. Details on the use of domains are provided else-
where [12]. The main advantage of using domains is
that they significantly reduce interprocessor commu-
nication volumes.
2.4 Block Mappings
A crucial issue in any block factorization is the map-
ping of blocks to processors. We assume for this dis-
cussion that the processors P can be arranged as a
grid of Pr rows and Pc columns; best performance is
obtained when Pr and Pc are both O(√P). A block
mapping is a function:
map: {0..N-1} × {0..N-1} → {0..Pr-1} × {0..Pc-1}
from blocks LIJ to processors P(r, c) in the processor
grid. In its most general form, the mapping is arbi-
trary: a block can be mapped to any processor in the
grid. We now introduce some special classes of block
mappings, and show that significant benefits can be
obtained from their use.
We define a block mapping to be Cartesian product
(CP) if
map(I, J) = P(mapI(I), mapJ(J)),
where
mapI: {0..N-1} → {0..Pr-1},
mapJ: {0..N-1} → {0..Pc-1}
are arbitrary. In a CP mapping, a row of blocks is
mapped to a row of processors in the processor grid
and a column of blocks is mapped to a column of pro-
cessors. CP mappings are important because they re-
duce interprocessor communication volumes. As we
mentioned earlier, a block LIK can only modify blocks
in row I or column I. Any CP mapping guarantees
that the modified blocks are only mapped to a single
row and a single column of the processor grid. Thus,
the maximum number of processors a block must be
sent to is Pr + Pc, which is O(√P).
We say that map is symmetric Cartesian (SC) if
Pr = Pc and mapI = mapJ. The most commonly
used block mapping is SC, with the cyclic function
(mapI(I) = mapJ(I) = I mod Pr). The resulting
mapping is typically referred to as a 2-D cyclic or
torus-wrap mapping. While a cyclic mapping is ef-
fective at reducing communication (since it is CP), it
is unfortunately not very effective at balancing com-
putational load.
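As a small illustration (ours, not taken from the paper), a CP mapping can be represented by a pair of independent row and column maps; the SC cyclic mapping is the special case mapI = mapJ on a square grid:

    def cyclic_map(Pr, Pc):
        # 2-D cyclic (torus-wrap) mapping: block (I, J) -> P(I mod Pr, J mod Pc)
        return (lambda I: I % Pr), (lambda J: J % Pc)

    def cp_map(mapI, mapJ):
        # General Cartesian-product mapping built from arbitrary row/column maps
        return lambda I, J: (mapI(I), mapJ(J))

    # Example: on a 4 x 4 grid, block L(5, 3) is owned by processor P(1, 3).
    mapI, mapJ = cyclic_map(4, 4)
    owner = cp_map(mapI, mapJ)
    assert owner(5, 3) == (1, 3)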
3 Block Fan-Out Method Load Balance
We now consider the load balance produced by the
cyclic mapping and by SC mappings in general. Ex-
periment and analysis show that the cyclic mapping
produces particularly poor load balance; moreover,
some serious load balance difficulties must occur for
any SC mapping. Improvements obtained by the use
of nonsymmetric CP mappings are discussed in the
following section.
Table 1: Benchmark matrices.

Name         Equations     NZ in L    Ops to factor (Million)
DENSE1024        1,024     523,776        358.4
DENSE2048        2,048   2,096,128      2,865.4
GRID150         22,500     656,027         56.5
GRID300         90,000   3,266,773        482.0
CUBE30          27,000   6,233,404      3,904.3
CUBE35          42,875  12,093,814     10,114.7
BCSSTK15         3,948     647,274        165.0
BCSSTK29        13,992   1,680,804        393.1
BCSSTK31        35,588   5,272,659      2,551.0
BCSSTK33         8,738   2,538,064      1,203.5
3.1 Benchmark Matrices and Experimental Design
Table 1 lists the sparse matrices we use as bench-
mark matrices. The list includes two dense matrices
(DENSE1024 and DENSE2048), two 2-D grid prob-
lems (GRID150 and GRID300), two 3-D grid problems
(CUBE30 and CUBE35), and 4 irregular sparse ma-
trices from the Harwell-Boeing sparse matrix test set
[4]. The 2-D and 3-D grid matrices are pre-ordered us-
ing nested dissection, which gives asymptotically opti-
mal orderings for these problems. The Harwell-Boeing
matrices are pre-ordered using multiple minimum de-
gree [9], which is considered the best for most irregular
sparse matrices with respect to sequential operation
count and fill. Note that the floating-point operation
counts are from the best known sequential sparse fac-
torization algorithm. All Mflops measurements pre-
sented here are computed by dividing these floating-
point operation counts by parallel factorization run-
times.
All of our experiments were performed on an Intel
Paragon system at Intel Supercomputer Systems Di-
vision in Beaverton, Oregon. All nodes in this system
have 32 megabytes of memory, except for two which
have 128 megabytes. The machine is running Release
1.2 of Intel's OSF/1 operating system. In this release,
message latency is 50 microseconds and message passing
bandwidth is at most 75 megabytes per second. For
the messages sizes used in our code, the effective band-
width is roughly 40 megabytes per second.
Our code uses hand-optimized versions of the Level-
3 BLAS for almost all arithmetic. The BLAS are
used to perform the BDIV() (triangular solution with
a matrix of right-hand sides) and BMOD() (matrix
multiplication) block operations. Performance for the
individual BLAS operations is in the range of 20 --
40 Mflops per processor, depending on the sizes of the
arguments.
[Figure 1 here: overall balance and efficiency (vertical axis, roughly 0.1 to 0.7) plotted against matrix number, for P = 64 and P = 100.]
Figure 1: Efficiency and overall balance for the block
fan-out method on the Paragon system (B = 48).
3.2 Empirical Load Balance Results
We now report on the efficiency and load balance of
the block fan-out method. Parallel efficiency is given
by
efficiency = tseq / (P * tparallel),
where tparallel is the parallel runtime, P is the number
of processors, and tseq is the runtime for the same
problem on one processor. For the data we report
here, we measured tseq by factoring the benchmark
matrices using our parallel algorithm on one processor.
Although a true sequential algorithm would be slightly
faster, we use the parallel algorithm as the baseline
because our intent here is to better understand the
behavior of this parallel algorithm. Note that we also
report absolute parallel performance.
We use the load balance measure
overall balance = worktotal / (P * workmax),
where worktotal is the total amount of work performed in the factorization, P is the number of processors, and workmax is the maximum amount of work assigned to
any processor (we describe how factorization work is
measured later in this section). This ratio gives an
upper bound on parallel efficiency:
efficiency ≤ overall balance.
Communication time, critical path lengths, and
scheduling-induced inefficiencies will generally cause
achieved efficiency to be lower than this bound.
Figure 1 shows efficiency and overall balance for
our benchmark matrices when factored on 64 and
100 processors. In all our experiments, we choose
Pr = Pc = √P. (We do not believe, however, that our
proposed heuristic remapping and its benefits are spe-
cific to square processor grids.) In all experiments we
used a block size of 48. In other words, when we create
row and column partitions in order to form blocks, the
subset sizes are chosen to be as close to 48 as possible.
Recall that column subsets are always subsets of su-
pernodes, so some block columns will have fewer than
48 columns. We choose a block size of 48 because it
strikes a reasonable balance between the single-node
efficiency of the factorization, which increases with in-
creasing block size, and the amount of concurrency
available in the computation, which decreases with in-
creasing block size. Our results would apply for other
block size choices as well.
Observe that parallel efficiencies are generally quite
low. The best achieved parallel efficiency is 58%, and
the worst is 16%. The overall balance bounds are low
as well. The best is less than 68% and the worst
is 27%. Given the low overall balance bounds, low
achieved efficiencies are not surprising. It is clear from
the data, however, that the bound is by no means a
perfect predictor of achieved efficiencies. Other factors
limit performance. Examples include interprocessor
communication costs, which we measured at 5% --
20% of total runtime, long critical paths, which can
limit the number of block operations that can be per-
formed concurrently, and poor scheduling, which can
cause processors to wait for block operations on other
processors to complete. Despite these disparities, the
data do indicate that load imbalance is an important
contributor to reduced parallel efficiency.
To better understand the causes of the load imbal-
ance generated by a cyclic mapping, let us now look at
how well the computational load is distributed among
groups of processors. Specifically, we consider load
imbalance among rows of processors, columns of pro-
cessors, and diagonals of processors. To explain these
measures, we first need to define a few terms. We de-
fine work[I, J] to be the amount of work performed
on behalf of block LIJ by its owner. This measure
would ideally reflect the amount of processor runtime
required to perform all operations associated with that
block. Runtime is of course difficult to predict with-
out actually performing the corresponding operations.
To approximate runtime, we use a work measure that
is equal to the number of floating point operations per-
formed on behalf of the block plus one-thousand times
the number of distinct block operations performed on
behalf of the block. The latter term is included to
increase the accuracy of the work approximation for
matrices that have a large number of small blocks (due
to small supernodes). The fixed cost of performing
a block operation using small blocks often dominates
the overall cost of the operation. The one-thousand op
fixed cost was measured from our factorization code.
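As a small sketch (ours, using hypothetical per-block counters rather than the authors' data structures), the work measure just described could be tabulated as follows; the balance statistics defined next consume this work table.

    def block_work(flops, nops, fixed_cost=1000):
        # flops[(I, J)]: floating-point operations performed on behalf of block (I, J)
        # nops[(I, J)]:  number of distinct block operations (BFAC/BDIV/BMOD)
        #                whose destination is block (I, J)
        # fixed_cost:    measured per-block-operation overhead, in equivalent operations
        return {IJ: flops[IJ] + fixed_cost * nops[IJ] for IJ in flops}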
We define workI[I] to be the aggregate work re-
quired by blocks in row I. That is:
workI[I] = Σ_{J=0..N-1} work[I, J].

An analogous definition applies for workJ, the aggregate column work.
We define the row balance by
row balance = worktotal / (P * workrowmax),

where

workrowmax = max_r [ Σ_{I: mapI[I]=r} workI[I] ] / Pc.
This row balance statistic gives the best possible over-
all balance (and hence parallel efficiency), obtained
only if there is no imbalance within a row of proces-
sors. The intent is to isolate load balance due to poor
distribution of work across rows of processors. An
analogous expression gives column balance.
We define diagonal balance as:
diagonal balance = worktotal / (P * workdiagmax),

where

workdiagmax = max_{0 ≤ d < Pr} [ Σ_{(I,J) ∈ D_d} work[I, J] ] / Pr

and

D_d = {(I, J) : (mapI[I] - mapJ[J]) mod Pr = d}.

Note that we use generalized diagonals in this expression. Generalized diagonal d is made up of the set of processors P(i, j) for which (i - j) mod Pr = d.
Note that the row, column, and diagonal balances
are not necessarily independent. In particular, a sin-
gle highly overloaded processor can affect all three.
Note also that due to their coarse nature, these bal-
ance measures do not capture all possible causes of
load imbalance in the computation. In fact, it is pos-
sible for a mapping to have good balances under these
measures while still having a poor overall balance. De-
spite this, the data we present later make it clear that
improving these three measures of balance will in gen-
eral improve the overall load balance.
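The following is a hedged sketch (ours, not the authors' code) of these balance statistics, together with the overall balance defined earlier; work is a dictionary from block index pairs (I, J) to the work measure above, and mapI, mapJ are the row and column mapping functions for a Pr x Pc grid.

    from collections import defaultdict

    def balances(work, mapI, mapJ, Pr, Pc):
        P = Pr * Pc
        total = sum(work.values())
        per_proc = defaultdict(float)   # work per processor (overall balance)
        per_row = defaultdict(float)    # work per processor row
        per_col = defaultdict(float)    # work per processor column
        per_diag = defaultdict(float)   # work per generalized diagonal
        for (I, J), w in work.items():
            r, c = mapI(I), mapJ(J)
            per_proc[(r, c)] += w
            per_row[r] += w
            per_col[c] += w
            per_diag[(r - c) % Pr] += w
        overall = total / (P * max(per_proc.values()))
        row_bal = total / (P * max(per_row.values()) / Pc)
        col_bal = total / (P * max(per_col.values()) / Pr)
        diag_bal = total / (P * max(per_diag.values()) / Pr)
        return overall, row_bal, col_bal, diag_bal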
Table 2: Efficiency bounds for 2-D cyclic mapping due to row, column and diagonal imbalances (P = 64, B = 48).

Matrix       Row bal.  Col. bal.  Diag. bal.  Overall bal.
DENSE1024     0.65      0.95       0.69        0.46
DENSE2048     0.80      0.99       0.82        0.67
GRID150       0.78      0.86       0.62        0.48
GRID300       0.85      0.89       0.71        0.54
CUBE30        0.87      0.94       0.77        0.68
CUBE35        0.86      0.94       0.80        0.66
BCSSTK15      0.70      0.69       0.58        0.38
BCSSTK29      0.68      0.75       0.63        0.39
BCSSTK31      0.75      0.95       0.73        0.54
BCSSTK33      0.76      0.89       0.71        0.53

Table 2 lists the row, column, and diagonal balances with a 2-D cyclic mapping of the benchmark matrices on 64 processors. Recall that a row balance of 0.5 means that if all work assigned to a single row of the processor grid were distributed evenly among the
processors within that row, then the overall balance,
and hence the efficiency, would be at most 0.5. All
three types of imbalance are significant. Diagonal im-
balance is the most severe, followed by row imbalance,
followed by column imbalance.
We believe these data can be better understood by
considering dense matrices as examples (although the
following observations apply to a considerable degree
to sparse matrices as well). Row imbalance is due
mainly to the fact that workI[I], the amount of work
associated with a row of blocks, increases with in-
creasing I. More precisely, since work[I, J] increases
linearly with J and the number of blocks in a row
increases linearly with I, it follows that workI[I] in-
creases quadratically in I. Thus, the processor row
that receives the last block row in the matrix receives
significantly more work than the processor row imme-
diately following it in the cyclic ordering, resulting in
significant row imbalance. Column imbalance is not
nearly as severe as row imbalance. The reason, we be-
lieve, is that while the work associated with blocks in
a column increases linearly with the column number
J, the number of blocks in the column decreases lin-
early with J. As a result, workJ[J] is neither strictly
increasing nor strictly decreasing.
Note that the reason for the row and column im-
balance is not that the 2-D cyclic mapping is an SC
mapping; rather, we have significant imbalance be-
cause the mapping functions mapI and mapJ are each
poorly chosen. These effects would be seen for any
rectangular processor grid, that is for any choice of Pr
and Pc.
To better understand diagonal imbalance, one
should note that blocks on the diagonal of the ma-
trix are mapped exclusively to processors on the main
diagonal of the processor grid. Blocks just below the
diagonal are mapped exclusively to processors just be-
low the main diagonal of the processor grid. These
diagonal and sub-diagonal blocks are among the most
work-intensive blocks in the matrix; they receive the
most block modifications of any block in the corre-
sponding row. Thus, the diagonal and sub-diagonal
processors that receive these blocks receive more work
than the other processors. Note that diagonal imbal-
ance is especially acute for the sparse benchmark ma-
trices. Recall that a given block of L may be zero, or
may be dense, or (most often) it may have some zero
rows and some dense rows. In these problems, the di-
agonal blocks are the only ones that are guaranteed to
be dense. Since we split supernodes into smaller sub-
sets when we form these blocks, the blocks on the first
block subdiagonal are often actually subblocks of the
diagonal blocks of supernodes, so they too are dense.
Moreover, no diagonal block is zero, and relatively few
subdiagonal blocks are zero.
The remarks we make about diagonal blocks and
diagonal processors apply to any SC mapping, and do
not depend on the use of a cyclic function mapI(I) =
I mod Pr.
4 Heuristic Remapping
We now describe and evaluate a heuristic to im-
prove load balance. Recall that the primary benefit of
using a CP mapping for the block factorization is that
it limits communication. Specifically, with a CP map-
ping a block of the matrix is sent to only one row and
one column of processors in the processor grid. The
heuristic we use to improve load balance in a block
method relies on the simple observation that all SC
mappings must suffer from poor diagonal balance, and
that among them the 2-D cyclic mappings also suffer
from poor row and column balance. We therefore look
at nonsymmetric CP mappings, which map rows in-
dependently of columns. We choose the row mapping
mapI to maximize the row balance, and independently
choose mapJ to maximize column balance. Because
the symmetry is broken, we do not expect severe diag-
onal imbalance, since block diagonals are now spread
over the whole machine.
Our approach to choosing row and column map-
pings to improve load balance is to look at the prob-
lem as a number partitioning problem 1 [5]. Our goal
is to choose a function mapI that minimizes
max_r Σ_{I: mapI[I]=r} workI[I].
An analogous goal applies to the choice of mapJ.
1Number partitioning is a well studied NP-complete prob-
lem. The objective is to distribute a set of numbers among a
fixed number of bins so that the maximum sum in any bin is
minimized.
We shall consider several heuristics for optimizing load balance. All obtain a row mapping as follows:

    mapped[r] := 0, ∀r ∈ {0..Pr-1}
    for each block row I, in some order (see below)
        minr := the r for which mapped[r] is minimum
        mapI[I] := minr
        mapped[minr] := mapped[minr] + workI[I]
This familiar algorithm iterates over all block rows,
mapping a block row to the processor row that has
received the least work so far (mapped[r] records the
amount of work mapped to processor row r). The only
difference between the various heuristics we consider
is the sequence in which the block rows are mapped.
We have experimented with four different sequences,
which we now describe.
The Decreasing Work (DW) heuristic considers
rows in order of decreasing work. This is a standard
approach to number partitioning; small values toward the end of the sequence allow the algorithm to lessen any imbalance caused by large values encountered early in the sequence.
The Increasing Number (IN) heuristic considers
rows in order of increasing row number. This straight-
forward approach is included for purposes of compar-
ison; we expect that it will be markedly less effective
than the other heuristics.
The Decreasing Number (DN) heuristic con-
siders rows in order of decreasing row number. While
also straightforward (straightbackward?) we expect
it to provide better load balance than the increasing
number heuristic. Recall that the amount of work as-
sociated with a row generally increases with increasing
row number.
The Increasing Depth (ID) heuristic considers
rows in order of increasing depth in the elimination
tree. This refinement of the decreasing number heuris-
tic takes into account the structure of a sparse prob-
lem. In a sparse problem, the work associated with a
row is more closely related to its depth in the elimi-
nation tree than it is to its row number. We therefore
guessed that it would be the second best of the four
schemes.
We may have intuitive guesses about which of these
heuristics will provide the best results. But we do
not pretend that this is an exact science. Just as as-
tronomers and their theories are the servants of the
sky, we must let the matrices tell us which is best.
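As a compact sketch (ours, not the authors' implementation), the greedy mapping loop and the four orderings can be written as follows; workI[I] is the aggregate row work and depth[I] the elimination-tree depth of block row I, both assumed given.

    import heapq

    def remap_rows(workI, Pr, order):
        # Greedy number partitioning: assign each block row to the processor
        # row that currently has the least mapped work.
        heap = [(0.0, r) for r in range(Pr)]     # (mapped work, processor row)
        heapq.heapify(heap)
        mapI = {}
        for I in order:
            mapped, r = heapq.heappop(heap)      # least-loaded processor row
            mapI[I] = r
            heapq.heappush(heap, (mapped + workI[I], r))
        return mapI

    def ordering(workI, depth, heuristic):
        rows = list(workI)
        if heuristic == "DW":    # Decreasing Work
            return sorted(rows, key=lambda I: -workI[I])
        if heuristic == "IN":    # Increasing Number
            return sorted(rows)
        if heuristic == "DN":    # Decreasing Number
            return sorted(rows, reverse=True)
        if heuristic == "ID":    # Increasing Depth in the elimination tree
            return sorted(rows, key=lambda I: depth[I])
        raise ValueError(heuristic)

The column mapping mapJ is obtained in the same way from the aggregate column work workJ.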
4.1 Results
We begin by looking at how the row and column
mapping heuristics affect the measures of load balance
described above.

Table 3: Row, column, and diagonal balance for problem BCSSTK31 (P = 64, B = 48).

Heuristic       Row bal.  Col. bal.  Diag. bal.  Overall bal.
Cyclic           0.75      0.95       0.73        0.54
Decr. Work       0.99      0.99       0.92        0.76
Inc. Number      0.83      0.96       0.90        0.72
Decr. Number     0.99      0.98       0.93        0.81
Inc. Depth       0.99      0.99       0.96        0.81

Table 3 shows the results of applying the heuristics to a single problem, BCSSTK31, on 64
processors. The five rows correspond to the five differ-
ent mapping heuristics (the row labeled cyclic corre-
sponds to the original cyclic mapping). The numbers
in the table give balance results when the correspond-
ing heuristic is applied to both the row and column
mapping.
As expected, the quality of the row and column
mapping depends on the order in which the rows or
columns are considered. The decreasing work and in-
creasing depth heuristics produce the best row and col-
umn load balances. The increasing row/column num-
ber heuristic produces the worst, but is still much bet-
ter than the symmetric cyclic mapping. Note that all
of the heuristics remove the diagonal imbalance. This
indicates that independent row and column mappings
are extremely effective at removing diagonal imbal-
ance. Note also that the improvements in row, col-
umn, and diagonal imbalance together produce dra-
matic improvements in overall balance.
We now broaden our investigation by including re-
sults from all the matrices in the benchmark set, and
also by varying the row and column mapping heuris-
tic independently. Table 4 presents this load balance
data, giving mean improvement in overall balance over
all ten benchmark matrices on 64 and 100 processors.
The rows of the table correspond to the five different
heuristics for mapping blocks rows to processor rows.
The columns of the table correspond to the same five
heuristics, applied to the block columns of the matrix.
The data reveal that all of our mapping heuristics dra-
matically improve processor load balance.
Of course, the goal of this heuristic is not to improve
load balance, but rather to improve achieved paral-
lel performance. Table 5 presents mean performance
improvement numbers from factorizations on 64 and
100 processors of the Intel Paragon system. Compar-
ing the numbers in Table 5 to those in Table 4, one
can see that the improvements in performance are not
nearly as large as the improvements in load balance.
This indicates that after the heuristics are applied,
load balance is no longer the most constraining factor
limiting performance. The performance improvements
are nevertheless quite significant, and they have been
obtained at little cost in algorithm complexity.
Table 4: Mean improvement in overall balance (B = 48).

                          P = 64                            P = 100
Row remapping     Column remapping heuristic        Column remapping heuristic
heuristic         CY    DW    IN    DN    ID        CY    DW    IN    DN    ID
Cyclic             0%   18%   17%   21%   17%        0%   19%   23%   22%   21%
Decr. Work        37%   34%   41%   47%   42%       39%   38%   56%   52%   50%
Inc. Number       19%   18%   21%   20%   24%       20%   24%   24%   31%   21%
Decr. Number      39%   37%   43%   43%   47%       41%   36%   50%   50%   49%
Inc. Depth        39%   34%   45%   47%   43%       40%   37%   53%   54%   49%
Table 5: Mean improvement in parallel performance on the Intel Paragon system (B = 48).

                          P = 64                            P = 100
Row remapping     Column remapping heuristic        Column remapping heuristic
heuristic         CY    DW    IN    DN    ID        CY    DW    IN    DN    ID
Cyclic             0%   13%   14%   15%   17%        0%   12%   19%   19%   20%
Decr. Work        21%   14%   18%   21%   19%       20%   16%   21%   19%   20%
Inc. Number       16%   13%   13%   15%   15%       20%   17%   11%   19%   19%
Decr. Number      18%   14%   18%   16%   18%       23%   15%   19%   15%   20%
Inc. Depth        20%   14%   19%   19%   18%       24%   16%   20%   21%   18%
Which heuristic provides the best performance?
Several choices provide quite comparable results. In
fact, it appears that the most important point is that
some remapping for load balance must be done; the
particular remapping used is (with a few exceptions)
of secondary importance. Indeed, as measured by the
row and column load balances, the differences between
the various nonsymmetric, heuristically load balanced
mappings are much smaller than the difference between
any of them and the symmetric cyclic mapping. Recall
that the most significant contributor to load imbalance
is the diagonal imbalance, and all the nonsymmetric
mapping strategies remove this imbalance.
4.2 Alternative Heuristics
In addition to the heuristics described so far, we
also experimented with two other approaches to im-
proving factorization load balance. The first is a sub-
tle modification of the original heuristic. It begins
by choosing some column mapping (we use a cyclic
mapping). This approach then iterates over rows of
blocks, mapping a row of blocks to a row of proces-
sors so as to minimize the amount of work assigned
to any one processor. Recall that the earlier heuristic
attempted to minimize the aggregate work assigned
to an entire row of processors. We found that this
alternative heuristic produced further improvements
in load balance (typically 10-15% better than that of
our original heuristic). Unfortunately, realized perfor-
mance did not improve. This result partially confirms
our earlier claim that load balance is not the most
important performance bottleneck once our original
heuristic is applied.
The other approach we considered is significantly
simpler than any of the approaches described so far.
This approach reduces imbalance by performing cyclic
row and column mappings on a processor grid whose
dimensions Pc x Pr are relatively prime. The rela-
tively prime dimensions eliminate diagonal imbalance
(the most significant form of imbalance), since cyclic
row and column mappings now scatter diagonal ma-
trix blocks among all processors in the grid. We found
that we could obtain significantly higher performance
than that reported earlier for symmetric cyclic map-
pings if we used one fewer processor (e.g., 63 proces-
sors instead of 64, and 99 processors instead of 100).
In both cases, one fewer processor produces relatively
prime grid dimensions. The improvement in perfor-
mance was somewhat lower than that achieved with
our earlier remapping heuristic (17% and 18% mean
improvement on 63 and 99 processors versus 20% and
24% on 64 and 100 processors).
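A small illustrative check (ours, not from the paper) of why relatively prime grid dimensions help: with a cyclic mapping, diagonal block I lands on processor (I mod Pr, I mod Pc), and when Pr and Pc are relatively prime these positions cycle through the whole grid (Chinese remainder theorem) rather than along a single grid diagonal.

    def diagonal_owners(Pr, Pc, N):
        # Distinct processors that receive diagonal blocks 0..N-1 under cyclic mapping
        return {(I % Pr, I % Pc) for I in range(N)}

    assert len(diagonal_owners(8, 8, 1000)) == 8    # square grid: only 8 of 64 processors
    assert len(diagonal_owners(9, 7, 1000)) == 63   # relatively prime: all 63 processors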
4.3 Results for Larger Machines
We now present results from larger sparse prob-
lems factored on a larger machine. Table 6 gives in-
formation about the larger sparse problems. Problem
DENSE4096 is a dense matrix. Problem CUBE40 is a
regular 3-d cube. It is ordered using nested-dissection.
Problem COPTER2 is a model of a helicopter rotor
blade, given to us by Horst Simon of NASA Ames. It
is ordered using multiple minimum degree. Problem
10FLEET comes from the linear programming formu-
lation of an airline fleet assignment problem.

Table 6: Large benchmark matrices.

Name         Equations     NZ in L    Ops to factor (Million)
DENSE4096        4,096   8,386,560       22,915
CUBE40          64,000  21,408,189       23,084
COPTER2         55,476  13,501,253       11,377
10FLEET         11,222   4,782,460        7,450

It was
given to us by Roy Marsten of Delta Airlines. It is
also ordered using multiple minimum degree. We also
include performance numbers for two matrices from
our previous set, CUBE35 and BCSSTK31.
Table 7 shows our results for these problems on
144 and 196 processors. The table shows performance
using a cyclic mapping and using our mapping heuris-
tic (increasing depth on rows and cyclic on columns).
Note that our heuristic again produces a roughly 20%
performance improvement over a cyclic mapping.
5 Discussion
We have demonstrated that the performance of par-
allel block-oriented sparse factorization can be affected
significantly by the mapping of blocks to processors.
With the traditional 2-D cyclic mapping, load imbal-
ance limits performance. Our mapping heuristics im-
prove load balance to the point where it appears to
no longer be the most constraining performance bot-
tleneck. Two questions that arise are: (i) what is the
most constraining bottleneck after our heuristic is ap-
plied, and (ii) can this bottleneck be addressed to fur-
ther improve performance?
One potential remaining bottleneck is communica-
tion. Instrumentation of our block factorization code
reveals that on the Paragon system, communication
costs account for less than 20% of total runtime for
all problems, even on 196 processors. The same in-
strumentation reveals that most of the processor time
not spent performing useful factorization work is spent
idle, waiting for the arrival of data.
One possible explanation for this idle time is that
there simply is not enough work to be performed in
parallel. This does not appear to be the case: an anal-
ysis of the critical path of the factorization computa-
tion [11], which gives a coarse bound on the amount
of parallelism in the problem, reveals that while these
problems certainly do not admit a large surplus of con-
currency, there should be enough to keep the proces-
sors occupied. Critical path analysis for problem BC-
SSTK15 on 100 processors, for example, indicates that
it should be possible to obtain nearly 50% higher per-
formance than we are currently obtaining. The same
analysis for problem BCSSTK31 on 100 processors in-
dicates that it should be possible to obtain roughly
30% higher performance.
Another possibility is that poor scheduling of the
block operations limits performance. It is possible
that low-priority block operations delay higher prior-
ity block operations, causing processors that are wait-
ing for the results of the high-priority operations to
sit idle. We hope to investigate the use of dynamic
scheduling techniques that are more sensitive to some
measures of priority of tasks than is the purely "data-
driven" approach used in the block fan-out method.
We should mention that as part of this work we
also explored mappings whose goal was to reduce in-
terprocessor communication volumes. Our approach
was loosely based on the subtree-to-subcube mapping
scheme originally used for column-oriented sparse fac-
torization [6]. Here, the processor set is divided recur-
sively among the subtrees in the elimination tree. Ma-
trix columns belonging to these subtrees are mapped
exclusively to the appropriate submachines. In our
block-oriented code, rather than dividing processors
among subtrees, we instead divided processor-columns
of the processor grid among subtrees. We found that
communication volumes were reduced by as much as
30%. Unfortunately, load balancing became more dif-
ficult. The best load balance we were able to obtain
using a rough subtree-to-subcube mapping on columns
and a heuristic mapping on rows was roughly the
same as the load balance with a pure cyclic mapping.
Since communication costs were not a significant per-
formance bottleneck on the Paragon system, the re-
sulting factorization performance was lower than that
produced by our heuristic mappings.
One other approach we considered for improving
load balance was to vary the block size. Intuitively,
it would appear that the factorization computation
can tolerate large blocks towards the beginning of
the factorization, when there is a significant amount
of concurrent work available to hide load imbalances
caused by the large blocks. Small blocks would be
more appropriate towards the end of the computation.
We discovered that this intuition is actually incorrect.
Varying the block size between the early stages of the
computation and the later ones has no effect on load
imbalance; and it reduces the amount of parallelism
available in the problem. It is possible, however, to
improve the load balance by choosing the block size
based on the processor row/column to which the block
is mapped. We succeeded in improving performance
using this approach, but the improvement was not
nearly as large as it was using the block remapping
strategies we have described.
Table 7: Performance (Mflops) for larger benchmark problems on 144 and 196 nodes, using a cyclic mapping, and using an increasing depth row remapping and cyclic column mapping.

                        P = 144                               P = 196
Matrix       cyclic  heuristic  improvement        cyclic  heuristic  improvement
CUBE35         1788       2207          23%          2019       2456          22%
CUBE40         2093       2384          14%          2515       3187          27%
DENSE4096      3587       4156          16%          4489       5237          17%
BCSSTK31       1161       1322          14%          1361       1709          26%
COPTER2        1693       1779           5%          1959       2312          18%
10FLEET        2027       2246          11%          2488       2722           9%

One final issue that merits further investigation is whether the sparse factorization approach proposed here may actually provide higher performance for
dense problems than is currently obtained by special-
ized dense factorization methods [15] that use cyclic
mappings.
References

[1] Ashcraft, C.C., and Grimes, R.G., "The influence of relaxed supernode partitions on the multifrontal method", ACM Transactions on Mathematical Software, 15(4): 291-309, 1989.

[2] Ashcraft, C.C., Grimes, R.G., Lewis, J.G., Peyton, B.W., and Simon, H.D., "Recent progress in sparse matrix methods for large linear systems", International Journal of Supercomputer Applications, 1(4): 10-30, 1987.

[3] Conroy, J., Kratzer, S., and Lucas, R., "Data parallel sparse LU factorization", Proceedings, Seventh SIAM Conference on Parallel Processing for Scientific Computing, to appear.

[4] Duff, I.S., Grimes, R.G., and Lewis, J.G., "Sparse Matrix Test Problems", ACM Transactions on Mathematical Software, 15(1): 1-14, 1989.

[5] Garey, M., and Johnson, D., Computers and Intractability: A Guide to the Theory of NP-Completeness, 1982.

[6] George, A., Heath, M., Liu, J., and Ng, E., "Solution of sparse positive definite systems on a hypercube", Journal of Computational and Applied Mathematics, 27(1): 129-156, 1989.

[7] George, A., Liu, J., and Ng, E., "Communication results for parallel sparse Cholesky factorization on a hypercube", Parallel Computing, 10: 287-298, 1989.

[8] Gilbert, J., and Schreiber, R., "Highly parallel sparse Cholesky factorization", SIAM Journal on Scientific and Statistical Computing, 13: 1151-1172, 1992.

[9] Liu, J., "Modification of the minimum degree algorithm by multiple elimination", ACM Transactions on Mathematical Software, 11: 141-153, 1985.

[10] Liu, J., "The role of elimination trees in sparse factorization", SIAM Journal on Matrix Analysis and Applications, 11: 134-172, 1990.

[11] Rothberg, E., Exploiting the memory hierarchy in sequential and parallel sparse Cholesky factorization, Ph.D. thesis, Stanford University, January, 1993.

[12] Rothberg, E., and Gupta, A., "An efficient block-oriented approach to parallel sparse Cholesky factorization", Supercomputing '93, p. 503-512, November, 1993.

[13] Rothberg, E., and Gupta, A., "An evaluation of left-looking, right-looking, and multifrontal approaches to sparse Cholesky factorization on hierarchical-memory machines", Technical Report STAN-CS-91-1377, Stanford University, 1991.

[14] Schreiber, R., "Scalability of sparse direct solvers". In Alan George, John R. Gilbert, and Joseph W.H. Liu, editors, Graph Theory and Sparse Matrix Computation, pp. 191-209. The IMA Volumes in Mathematics and Its Applications, Volume 56, Springer-Verlag, New York, 1993.

[15] Van De Geijn, R., Massively parallel LINPACK benchmark on the Intel Touchstone Delta and iPSC/860 systems, Technical Report CS-91-28, University of Texas at Austin, August, 1991.
