Processor-efficient sparse matrix-vector multiplication  by Heath, L.S. et al.
An lntemational Journal 
Available online at www.sciencedirect.com computers & 
.¢ , . .¢ .  mathematics 
with applications 
Computers and Mathematics with Applications 48 (2004) 589-608 
www.elsevier.com/locate/camwa 
Processor-Efficient Sparse 
Matrix-Vector Multiplication 
L .  S .  HEATH AND C .  J .  R IBBENS 
Depar tment  of  Computer  Science, V i rg in ia  Po lytechnic  Ins t i tu te  and  Sta te  Un ivers i ty  
B lacksburg,  VA 24061-0106, U.S.A.  
<heath><ribbens>©cs. vt. edu 
S. V .  PEMMARAJU  
Department of Computer Science, The University of Iowa 
Iowa City, IA 52242-1419, U.S.A. 
sriram©cs, uiowa, edu 
(Received September 2002," revised and accepted June PO03) 
Abst rac t - -The  matrix-vector multiplication operation is the kernel of most numerical algorithms. 
Typically, matrix-vector multiplication algorithms exploit the sparsity of a matrix, either to reduce 
the t ime taken or the memory used. In the case of parallel implementations of numerical algorithms, 
the sparsity of a matr ix can also be used to reduce the number of processing elements (PEs) required. 
For example, in Leiserson's systolic array algorithm for matrix-vector multiplication the nonzeros in 
the matr ix  are partit ioned into bands and each band is fed into a different PE. Similarly, in Melhem's 
algorithm, the nonzeros are partit ioned into stripes and each stripe is fed into a distinct PE. However, 
in many practical applications, such as the numerical solution of partial differential equations, a 
matr ix is much sparser than is indicated by its bandwidth or by its stripe width. 
A new measure of sparsity of a matr ix called staircase width is introduced. A staircase in a 
matr ix M ---- (mi , / )~x~ is a set of matr ix positions S ---- {(i,j) I rn~,j ~ 0}, such that  no position in S 
is strictly to the northeast of another position in S. The staircase cover of a matr ix is a partit ion of 
the positions of nonzero entries in the matrix into staircases and the staircase width of a matr ix is the 
size of a smallest staircase cover of the matrix. The staircase width of a matr ix is never larger than 
the bandwidth or the stripe width of a matrix. Moreover, it is significantly smaller than  either for 
some commonly occurring types of matrices uch as arrowhead matrices. Experiments with sparse 
matrices derived from a variety of engineering problems uggest that,  in practice, the staircase width 
of a matr ix is about half the stripe width of the matrix. An efficient algorithm to compute a min imum 
width staircase cover of a matr ix is presented. Staircase covers are used as the basis of a new parallel 
algorithm to compute a matrix-vector product on a linear systolic array of PEs. This algorithm uses 
as many PEs as the staircase width of the matrix, and therefore, saves on the number of PEs relative 
to algorithms of Leiserson and Melhem. It is shown that  despite using fewer PEs, the algorithm 
uses no more t ime steps than Leiserson's or Melhem's algorithm. Q 2004 Elsevier Ltd. All rights 
reserved. 
Keywords - -Bandwidth ,  Matrix-vector multiplication, Partial orders, Sparse matrices, Staircase 
width, Stripe width, Systolic arrays. 
The authors gratefully acknowledge the support of National Science Foundation Grant CCR-9009953. This 
re,~earch was done while the third author was with the Department of Computer Science, Virginia Polytechnic 
Ir~stitute and State University. 
We acknowledge the detailed comments of anonymous referees that has helped make this a better paper. 
0898-1221/04/$ - see front matter  ~) 2004 Elsevier Ltd. All rights reserved. Typeset by ,4A4S-TE X
doi: 10.1016/j. camwa. 2003.06.009 
590 L.S. HEATH et al. 
1. INTRODUCTION 
Sparse matrix-vector multiplication is an important computational kernel in most scientific om- 
putations. Im and Yelick [1,2] mention signal and image processing, document retrieval, and data 
mining as some of the applications in which efficient sparse matrix-vector multiplication is crit- 
ical. Matrix-vector multiplication is often used in iterative solvers for linear systems, in explicit 
methods for ordinary differential equations, and in eigenvalue computations [1,3]. In general, 
the performance of sparse matrix operations i much lower than their dense matrix counterparts. 
This is mainly because the memory access patterns are irregular and there is an overhead for 
manipulating the sparse matrix representation. Sparsity of a matrix is a problem for parallel 
implementations a well and implementations that work well for dense matrices often expend a 
large fraction of their computational resources performing unnecessary operations that involve 
zeros. 
Given that the matrices used in most practical applications are sparse, attempts to exploit 
the sparsity of matrices to speed up matrix computations continue. Exploitation of sparsity 
typically involves symbolic preprocessing of a matrix based on its sparsity structure [4]. The 
preprocessing may yield a decomposition of the sparse matrix that can then be processed with a 
specialized algorithm. Sometimes the decomposition leads to a number of smaller dense matrix 
subproblems, which can then be solved with a system for dense linear algebra, such as BLAS [5] 
or LAPACK [5,6]. Some of the important measures or techniques associated with sparsity are 
matrix bandwidth [7,8]; matrix stripe width [3,9]; block matrices [10,11]; the jagged diagonal 
data structure [12]; frontal and multifrontal methods [13,14]; and panel clustering [15,16]. 
Melhem [3,9] introduced a systolic array network for performing matrix-vector multiplications 
and solving triangular systems of equations. In Melhem's cheme, the processing elements (PEs) 
in the systolic array avoid many unnecessary operations, while the number of PEs utilized is 
small. The basis of Melhem's cheme is a decomposition of the nonzero elements of the matrix 
into stripes. Each stripe is fed into a different PE, and therefore, Melhem's cheme utilizes only 
as many PEs as the number of stripes required to cover the matrix. Melhem [3] mentions that, in 
matrices that arise in finite-element analysis, the bandwidth of an n x n matrix is typically O(n 1/2) 
(for 2-D problems) or e (n  2/a) (for 3-D problems), while the number of stripes required to cover 
the matrix is bounded above by a constant (independent of n). Therefore, as compared with 
the scheme described by Leiserson [17], which uses as many PEs as the bandwidth of a matrix, 
Melhem's cheme uses far fewer PEs in application domains uch as finite-element analysis. 
In this paper, we introduce a new way of imposing structure on sparse matrices, called staircase 
covers. The notion of staircase covers generalizes the notion of stripe covers, used by Melhem. 
We then present a new matrix-vector multiplication algorithm on systolic arrays that uses as 
many PEs as the number of staircases required to cover the matrix. We show that the minimum 
number of staircases required to cover a matrix is never greater than the minimum number of 
stripes required. Furthermore, in matrices uch as arrowhead matrices that arise in various engi- 
neering applications, the number of staircases required to cover the matrix is bounded above by a 
constant, whereas the number of stripes required is e(n)  for an n x n matrix. Section 4 describes 
experiments in which the sizes of minimum staircase covers and stripe covers are computed for a 
number of sparse matrices that arise in engineering applications. Experiments indicate that the 
average number of stripes required to cover a matrix is about two times the number of staircases 
required. Finally, we show that this saving in the number of PEs required by our scheme is 
achieved without using any additional time steps as compared to Melhem's algorithm. 
As in Leiserson's and Melhem's algorithms, the amount of hardware required by our algorithm 
depends on the sparsity structure of the matrix and this is only known at runtime. There 
are two reasons why this is not a significant concern. First, reconfigurable VLSI technology [18], 
especially in the form of field-programmable gate arrays (FPGAs), is capable of being reconfigured 
dynamically to provide the resources necessary to implement a systolic array having any number 
Matrix-Vector Multiplication 591 
of virtual processors [19-22]. FPGAs can even be connected with buses in parallel, so there are 
few limits on how much hardware can be made available [23,24]. Second, our scheme is sufficiently 
flexible that a smaller number of PEs can be adapted through partitioning the set of staircases 
in a staircase cover, feeding each subset into the PEs, and accumulating the results. 
2. PREL IMINARIES  
Throughout he paper, we use M = (mi,j)n×n to denote an n x n matrix with entries mi,j and 
use x --- (xj)nxl to denote an n-vector with entries xj. Let Jn -- {1, 2 , . . . ,  n} be the set of row 
(,0r column) indices of M. Each ordered pair (i,j) E j2 is a position index of M. The position 
index (i, j) is said to cover the corresponding entry m~,j of M. Let F C Jn 2 be the set of position 
indices that cover nonzero entries of M; that is, F = {(i, j) 6 J~ [ m~,j ¢ 0}. Diagonal k of M is 
the set of position indices Dk ---- { (i, j) 6 Jn 2 I J - i  = k}. The band of M is the smallest contiguous 
set of diagonals {Dp, Dp+~,..., Dq} that covers F; the bandwidth BAND(M) of M is q - p + 1, 
the cardinality of its band. Figure 1 illustrates a pattern of nonzero entries (black squares) in a 
5 x 5 matrix. The matrix has bandwidth 5, and the leftmost drawing shows its nonzero entries 
covered by five bands. Leiserson [17] proposes a systolic array for multiplying a matrix M of 
bandwidth b = BAND(M) and a vector x using a linear array of b PEs, requiring 2n + b clock 
beats to accomplish the multiplication. 
\ 3 4 5 4 5 
",,,, 
4 5 • .3  
\ \  \ k  
\ k  \ \ 
\ 
\ 
1 2 3 
- 
l |  
Figure i. A matrix with bandwidth 5, stripe width 4, and staircase width 2. 
In many practical applications, such as finite-element analysis, M is much sparser than is 
indicated by its bandwidth. Melhem [3] proposed a systolic network called MAT/VEC to take 
advantage of such sparsity. To describe MAT/VEC and our scheme in the same framework, we 
introduce some terminology related to partial orders (see [25]). A partial order is a binary relation 
l~hat is reflexive, antisymmetric, and transitive. A partially ordered set (poset) P = (S, <_) is a 
set S with a partial order <_ defined on it. For a pair of elements, s, s t 6 S, we say s < s t if s _< s t 
and s ~ s'. A set of elements {sl, s2 , . . . ,  sk} from S, such that s~ < s~+l, where 1 < i < k - 1, is 
,called a chain in P. The length of a chain is simply the number of elements in it. An antichain in 
a poser P = (S, <__) is a subset S ' c_ S, such that no two elements in S t are related by _<. Suppose 
that S' C_ S and s E S t. Then s is a maximal element of S t if, for every s t E S t, either s ~ <_ s 
or s and s t are unrelated by _<. Similarly, s is a minimal element of S t if, for every s' e S t, 
either s _< s t or s and s ~ are unrelated by <_. 
Define the strictly southeast poset (F, ~sse) by ( i l , j l )  <--sse (i2,j2) if either ( i l , j l )  = (i2,j2) 
or ix < i2 and j l  < j2- A stripe T _C F in M is a nonempty chain in the poset (F, <_~e); stripes 
are generalizations of diagonals. A stripe cover of M is a set of stripes that covers the set of 
nonzero entries F. The stripe width STRIPE(M)  is the minimum cardinality of any stripe cover 
of M. For example, the matrix in Figure 1 has stripe width 4, and the middle drawing shows 
its nonzeros covered by four stripes. The nonzero entries (1, 3), (2, 3), (3, 2), and (4, 2) have to 
belong to distinct stripes, and hence, four is the fewest number of stripes required. The optimal 
592 L.S. HEATH et al. 
stripe cover shown in Figure 1 is 
({(1,3),(4,4),(5,5)},{(1,2),(2,3)},{(1,1),(3,2),(4,3)},{(4,2)}}. 
It is easy to see that a band of M is also a stripe cover of M, and hence, the stripe width of M 
is no greater than its bandwidth. 
Define the southeast poser (F,-<se) by ( i l , j l )  gse (i2,j2) if and only if il <_ i2 and j l  <: J2. A 
staircase • C F in M is a nonempty chain in the poset ( F, <-se). A staircase cover of M is a set 
of staircases that covers F. The staircase width STAIR(M) is the minimum cardinality of any 
staircase cover of M. For example, the matrix in Figure 1 has staircase width 2, and the rightmost 
drawing shows its nonzeros covered by two staircases. The nonzero entries (3, 2) and (2, 3) have 
to belong to different staircases, and therefore, two is the fewest number of staircases required. 
The optimal staircase cover shown in Figure 1 is 
{{(1, 1), (1, 2), (1, 3), (2, 3), (4, 3), (4, 4), (5, 5)}, {(3, 2), (4, 2)}}. 
A stripe is also a staircase, and hence, the staircase width of a matrix is no greater than its stripe 
width. To multiply the matrix in Figure 1 with any 5-vector, Leiserson's ystolic array requires 
five PEs, Melhem's MAT/VEC requires four PEs, while our scheme requires two PEs. 
This paper's contributions can be summarized as follows. 
1. In Section 3, we show how to efficiently compute an optimal staircase cover of a matrix. 
Our algorithm takes a list of the position indices of m nonzero elements in an n x n matrix 
as input and has time complexity O(m log log n). 
2. In Sections 5 and 6, we present a new systolic array algorithm for matrix-vector mul- 
tiplication called the generalized MAT/VEC.  As MAT/VEC does with stripes, gener- 
alized MAT/VEC feeds the nonzero entries covered by each staircase to a correspond- 
ing PE. Thus, generalized MAT/VEC can perform the multiplication with as many PEs 
as STAIR(M).  
3. Since the staircase width of a matrix is never greater than, and in many applications 
significantly smaller than, the stripe width of the matrix, generalized MAT/VEC saves on 
PEs relative to MAT/VEC. In Section 4, we describe xperiments in which we have com- 
puted the bandwidth, stripe width, and staircase width of sparse matrices that arise from 
a variety of applications, for example, fluid mechanics, financial portfolio optimization, 
structural engineering, and linear programming. For these matrices, the average value 
of STRIPE(M) is more than two times the average value of STAIR(M). 
4. In Section 7, we show that, even though generalized MAT/VEC saves on PEs as com- 
pared to MAT/VEC, it does not require additional time steps. Using Melhem's notion 
of global cycles, we show that generalized MAT/VEC and MAT/VEC perform the same 
computations in each global cycle, and thus, take exactly the same number of global cy- 
cles to perform a matrix-vector multiplication (Theorem 8). Thus, generalized MAT/VEC 
achieves a savings in hardware without requiring additional time steps. 
3. COMPUTING A STA IRCASE COVER 
In Section 5, we describe generalized MAT/VEC and show that its correctness depends on a 
staircase cover of matrix M in which the staircases are totally ordered in a sense that will be 
made precise in the following. In this section, we sketch an algorithm that computes an optimal 
staircase cover of M that is totally ordered as required by generalized MAT/VEC. First, we 
present an alternate characterization f the staircase width of a matrix. Then we define a partial 
order _<~ on the set of all staircases of M. 
Define the northeast poset (F,-<he) by ( i l , j l )  --<he (i2,j2) if and only if il _> i2 and j l  -< J2. 
Define the strictly northeast poset (F,-<she) by ( i l , j l )  ~-~sne (i2,J2) if and only if either ( i l , j l )  = 
Matrix-Vector Multiplication 593 
i 3 - -  
i2 - -  
i 1 - -  
j3 31 J2 
Figure 2. Illustrating the posers <:ne and ~se- Here (il,jl) ~ne 
(il,jl) ~sne (i2,j2). Also, (i3,j3) ~_se (i2,j2) and (i3,j3) ~sse (i2,j2)- 
(i2,j2) and 
So :=F;  
~:=1;  
while (S~-1 # 0) do 
let kg~ be the set of maximal elements in (S~-l, _<s~e); 
~:=/~+1; 
endwhile;  
re turn  ~1, ~2,. • •, ~-1  
Figure 3. Staircase Cover Algorithm (SCA). 
1',i2,j2) or il > is and Jl < j2- Figure 3 illustrates the various posets we have defined. An 
antistaircase in M is a chain in the poset (F, _<She). Using the following well-known theorem of 
Dilworth, we obtain an alternate characterization f the staircase width of M. 
THEOREM 1. (See [26].) Let P = (S, <_) be a finite poser. The minimum number of chains 
required to cover all the elements of S equals the size of the largest antichain in P. 
The following corollary to Dilworth's theorem provides a characterization f the staircase width 
of a matrix. 
LEMMA 2. The size of the largest antistaircase in F equals STAIR(M). 
:PROOF. We first claim that a subset of F is an antistaircase in M if and only if it is an antichain 
in the poset (F, <_se). To see this, consider an arbitrary antistalrcase in M, {(i l , j l ) ,  ( i2, j2),. . . ,  
(ik, Jk)}. By definition, this is a chain in -<sne, and therefore, without loss of generality we have 
il > i2 > ... > ik and Jl < J2 < ...Jk. 
From this it follows that for any pair of position indices (ip, jp) and (iq, jq), (ip, Jp) ~se (iq, jq) and 
(iq,jq) ~se (ip,jp). Therefore, {(i l , j l ) ,  ( i2, j2),. . . ,  (ik,Jk)} is an antichain in the poset (F, ~-se). 
The argument that proves the converse is similar. 
Thus, a largest antistaircase in F is a largest antichain in the poser (F, --~se). STAIR(M) is the 
minimum number of chains in the poset (F, <-se) required to cover F. By Theorem 1 the lemma 
follows. | 
For the matrix shown in Figure 1 there are several argest antistaircases; {(3, 2), (2, 3)} and 
{(3, 2), (1, 3)} are two examples. From the lemma, it follows that the staircase width of the matrix 
is 2. 
We now define binary relations ~ and <u on the set of all staircases in M. For a pair of stair- 
cases ~1 and ~2 in M, ~2 <e ~1 if both ~1 N ~2 = ~ and, for every position index (i2, J2) E ~2, 
594 L.S. HEATH et al. 
there is a position index ( i l , j l )  E O1, such that (i2,j2) <she (il,j l). Define O2 _<~ O1, if ei- 
ther 02 <~ O1 or ~/2 ---~ O1. As we show in the following !emma, <~ defines a partial order on 
the set of all staircases in M. 
LEMMA 3. _<v is a partial order on the set of all staircases of M. 
PROOF. By definition, _<~ is reflexive. It suffices to show that <~ is antisymmetric and transi- 
tive. To show that <v is antisymmetric, onsider staircases (~1, 02, such that @2 <~ 01. Sup- 
pose (i2,j2) e 02. By definition of <~, there exists (il, j l) e g~l, such that (i2,j2) <she (il,jl). 
Now suppose @1 <v ~22. Then, there exists (i3,j3) C k~2, such that ( i l , j l )  <she (is,is). By tran- 
sitivity of <sne, we have (i2,j2) <sne (i3,ja). But, this is a contradiction to (i2,j2) and (i3,j3) 
belonging to the same staircase. 
To show that <-~ is transitive, consider staircases O1, 02, and @a, such that ~3 <'r @2 
and ~2 <~ @1. Suppose (i3,j3) E @3. By definition of <~, there exists (i2,j2) E @2, such 
that (i3,j3) <sne (i2,j2). Also by definition, there exists (il, j l) c 01, such that (i2,j2) <sne 
(il,jl). Transitivity of <sne implies that (in,j3) <sne (il,jl). Since (i3,j3) was arbitrary, we 
conclude that far every (i3,j3) E ~3 there exists (il, j l) E ~1, such that (i3,ja) <~ne (i l , j l ) .  
Hence, O3 <~ ~1, as desired. | 
For the purposes of generalized MAT/VEC, we are interested in a staircase cover of M that is 
totally ordered by <~. The proof of the following theorem includes an algorithm for constructing 
such a staircase cover of M. 
THEOREM 4. Every matrix M has a staircase cover consisting of STAIi~( M) staircases that are 
totally ordered by <_~. 
PROOF. The proof uses an algorithm that we call the staircase cover algorithm (SCA) (see 
Figure 3). SCA takes a matrix M as input and returns a set of nonempty sets of position indices 
H = {¢1,~2, - . . ,%}.  
We show that H constitutes an optimal staircase cover of M and that g2~ <~ k~-i for all l, 
2 < g < k. These properties are established in the following three steps. 
1. H is a staircase cover of M. By definition of maximal elements, if (i~,jl), (i2,j2) E O~ are 
distinct position indices, then neither ( i l , j t)  <-she (i2, j2) nor (i2,j2) "~sne (Q, J1). Hence, 
either (il, J0  ~se (i2, j2) or (i2, j2) ~se (il, j l) .  AS every pair of elements of ~e are retated 
by <~, we have that 0~ is a chain in (F, _<se). Thus, ~ is a staircase in M. After ~ times 
through the while-do loop in SCA, 
When SCA terminates, S~ = ~. Hence, {k~l, ~2, . . . ,  O~} is a staircase cover of M. 
2. H is an optimal staircase cover. For an arbitrary element (ik,j~) E ~,  there exists at 
least one element (i~-l,j~-~) e q2~-l, such that (ik,j~) <She (ik-~,j~-l). Otherwise, 
(ik, jk) would have been included in O~-1. Extending this argument inductively, it is easy 
to see that there exists a sequence of position indices 
(ik,jk) <~ne (i~-l, jk,1) <she "'" <s,e (il,j l), 
such that (Q, j~) ~ ~ C r for all ~, 1 < g < k. This implies that there is an antistaircase of
size k in F. Applying Lemma 2, we conclude that {ff21, 02 , . . . ,  ~}  is an optimal staircase 
cover of M. 
3. <_~ is a total order on H. Consider an arbitrary pair of staircases O~ and O~+~ in H, 
where I < r < k - 1. As mentioned in the proof of Property 2, for any position index 
(i,j) ~ ~l / r+ l  , there exists a position index (i',j') ~ q~, such that (i,j) <.he (i',j'). Hence, 
~I/r+l <3~ Or. | 
Matrix-Vector Multiplication 595 
The application of SCA to the nonzeros of Figure 1 yields these two staircases 
91 = {(1 ,1 ) , (1 ,2 ) , (1 ,3 ) , (2 ,3 ) , (4 ,3 ) , (4 ,4 ) , (5 ,5 )} ,  
= {(3 ,2 ) , (a ,2 )} ,  
also illustrated in the rightmost picture in Figure 1. Note that 91 consists of the maximal 
elements of (F, <She), as for each (i,j) e i91, there is no (i',j') • F that is strictly north and east 
of it. Note that there are other optimal staircase covers of the matrix in Figure 1 that are not 
totally ordered by <_~. One example of such a staircase cover is 
{{(1,1),(1,2),(1,3),(2,3)},{(3,2),(4,2),(4,3),(4,4),(5,5)}}. 
The notion of a staircase in a matrix is equivalent to the notion of a queue layout of a graph [27- 
31]. In particular, the queue layout of a graph corresponds to a staircase cover of the adjacency 
matrix of the graph and finding the minimum number of queues for a particular layout of a graph 
is equivalent to computing the staircase width of the corresponding adjacency matrix. Reference 
[31] presents an algorithm for determining the fewest queues required for a graph layout that 
runs in O(mloglogn) time on a graph with n vertices and m edges. With minor modifications, 
this algorithm can also be used to implement SCA in O(]F[ loglogn) time on an n × n matrix, 
where F is the set of positions of nonzero entries in the matrix. The assumption here is that F is 
the input to the algorithm. 
In the context of our application of staircase covers to matrix-vector multiplication on systolic 
arrays, a systolic algorithm to compute a minimum width staircase cover of a matrix may be 
useful. This is beyond the scope of the current paper, but we suspect hat techniques traditionally 
used for sorting on systolic arrays (see [32]. for an example) will be useful in computing staircase 
covers as well. 
4. COMPARING BANDWIDTH,  
STRIPE WIDTH, AND STAIRCASE WIDTH 
As mentioned in Section 2, for any matrix M, the following inequality holds: 
BAND(M) >_ STRIPE(M) >__ STAIR(M). 
k is of course good news for generalized MAT/VEC that STRIPE(M) >_ STAIR(M) for all 
!matrices M, but the important question is how much smaller is the staircase width as compared 
to stripe width for the kinds of sparse matrices that arise in real applications. In this section, we 
describe xperiments hat show that for sparse matrices derived for a variety of applications, we 
,can expect STRIPE(M) to be about twice the size of STAIR(M). Furthermore, these experiments 
show there are matrices that arise in real applications for which the staircase width is dramatically 
:smaller than the stripe width. 
For our experiments we downloaded 25 real, nonsymmetric matrices from the University of 
:Florida sparse matrix collection, which currently contains 926 sparse matrices that have arisen 
in a variety of real applications 1. These matrices are all stored in the popular Harwell/Boeing 
format. To complete our experiments in a reasonable amount of time, we restricted our sample 
to matrices with 5000 or fewer rows and columns. The matrices in the University of Florida 
collection are organized into directories, each directory containing matrices ubmitted by a re- 
search group. Roughly speaking, this organization partitions matrices into directories by prob- 
lem domain. Therefore, to prevent any bias and to get matrices from a wide variety of prob- 
lem domains, we considered the matrix directories in alphabetical order and selected up to two 
ITh£s collection can be found at http://w~m, cise. all. edu/rasearch/sparse/matrices/. 
596 L.S. HEATH et al. 
matr ices satisfying our criteria (real, nonsymmetr ic ,  number  of rows, columns < 5000) from 
each directory. Wi th in  each matr ix  directory, we selected the first two matr ices in a/phabeti -  
ca] order that  satisfy our criteria. (While our programs to compute  bandwidth,  str ipe width, 
and staircase width  work for nonsquare matr ices also, all the matr ices selected in this manner,  
turned out  to be square.) Our programs, data  files, and documentat ion  are all available at 
htZp ://www. cs. uiowa, edu/~sr i ram/sparseMatr i ces /exper iment  s. html/. 
The  results of our experiments are shown in the table in Figure 4. The  second co lumn gives 
the directory and file name of the matrix, as found at f tp: / / f tp,  cise. ufl. edu/pub/ facu l ty /  
dav is /mat r i ces / .  The  average sparsity of a matr ix  is more than  96~,  so these matr ices are 
quite sparse. The last three columns show the staircase width  (SCW),  str ipe width  (SW), and 
bandwidth  (BW) computed by the programs available at the  URL  given above. 
1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 
File name 
Bai/bfwa398. rua 
Bai/bfwa62. rua 
Brethour /coater l .  rua 1348 x 
DRIVCAV/cavityOl. rua 317 x 
DRIVCAV/cavity02. rua 317 x 
FIDAP/exl. rua 216 x 
FIDAP/exI0. rua 2410 x 
Garon/garonl. rua 3175 x 
Grund/bl_ss. rua 7 x 
Grund/b2_ss. rua 1089 x 
HB/arc130. rua 130 x 
HB/bp_0. rua 822 x 
Harem/add20. rua 2395 × 
Hamm/add32. rua 4960 × 
Mallya/lhr01. rua 1477 x 
Mallya/lhr02. rua 2954 x 
Shyy/shyy41. rua 4720 × 
Simon/raefskyl .rua 3242 x 
Simon/raefsky2. rua 3242 x 
TOKAMAK/utmlTOOb. rua 1700 x 
TOKAMAK/utm300. rua 300 x 
Wang/swangl .rua 3169 x 
Wang/swang2. rua 3169 × 
Zitney/extrl. rua 2837 × 
Zil;ney/radfr i. rua 1048 × 
Dimensions 
398 x 398 
62 x 62 
1348 
317 
317 
216 
2410 
3175 
1089 
130 
822 
2395 
4960 
1477 
2954 
4720 
3242 
3242 
1700 
300 
3169 
3169 
2837 
1048 
Nonzeros 
3678 
450 
19457 
7327 
7327 
4352 
54840 
88927 
15 
4228 
1282 
3276 
17319 
23884 
18592 
37206 
20042 
294276 
294276 
21509 
3155 
20841 
20841 
11407 
13299 
Means 
Medians 
Sparsity SCW SW BW 
97.68 20 32 609 
88.29 15 28 99 
98.93 57 137 869 
92.71 33 105 205 
92.71 33 108 205 
90.67 34 79 129 
99.06 33 53 227 
99.12 60 127 6343 
69.39 3 4 10 
99.64 260 387 2017 
92.41 23 206 251 
99.52 41 267 1337 
99.70 86 232 4627 
99.90 32 84 8059 
99.15 72 223 2537 
99.58 72 223 2537 
99.91 4 6 81 
97.20 i l l  320 871 
i 
97.20 111 320 871 
99.26 24 43 512 
96.50 23 41 141 
99.79 58 108 6127 
99.79 58 108 6127 
99.86 913 1136 5318 
98.79 15 33 1075 
96.27 87.64 187.92 2047.36 
99.06 34 108 871 
Figure 4. The staircase width (SCW), stripe width (SW), and the bandwidth (BW) 
of 25 real, nonsymmetric matrices is shown here. These matrices were downloaded 
from the University of Florida sparse matrix collection. 
It  is easy to construct  examples of n x n matr ices whose bandwidth  is O(n)  and whose str ipe 
width is 0(1) .  For  example,  consider the n x n matr ix  M,  wi th even n, such that  
F = {(1, 1), (2, 3), (3, 5 ) , . . . ,  (n/2, n - 1)}. 
Then,  STR IPE(M)  = 1 but  the bandwidth  of M is n/2. More to the point, Me lhem mentions 
that  in matr ices that  arise in f inite-element analysis, the bandwidth  of an n x n mat r ix  is typ- 
Matrix-Vector Multiplication 597 
ically O(n ~/2) (for 2-D problems) or O(n 2/3) (for 3-D problems), while the number of stripes 
required to cover the matrix is bounded above by a constant (independent of n). These obser- 
vations are supported by our experiments. In the table in Figure 4, the average stripe width 
is 187.92, while the average bandwidth is 2047.46. For several matrices in the table, the differ- 
ence between the stripe width and bandwidth is even more dramatic. For example, the average 
bandwidth of the three matrices garon l . rua  (Row 8), add32.rua (Row 14), and swangl .rua 
(Row 22) is 6843, more than 60 times the average stripe width, which is 106.3. The matrices 
garonl ,  rua and swangl, rua arise from 2-D and 3-D finite-element discretization of the Navier- 
Stokes equation, whereas add32.rua arises in digital circuit simulation. Figure 5 shows the 
patterns of nonzeros in these matrices, and these patterns explain why it requires many bands 
to cover the nonzeros but very few stripes. 
The difference in the average values of stripe width and staircase width is not as dramatic as 
the difference between the stripe width and bandwidth. Nevertheless, the average stripe width 
is about twice the average staircase width. The matrix a rc l30 . rua  is an example in which 
t]he stripe width is dramatically larger than the staircase width. Figure 6 shows the pattern of 
nonzeros in this matrix and from the picture it is clear that the nonzeros have an arrowhead 
pattern. Matrices with this nonzero pattern occur frequently in the numerical solution to partial 
differential equations and there are papers that focus on efficient matrix operations for arrowhead 
matrices [33]. 
Figure 5. The pattern of nonzeros inthe matrices ga_ronl, rua (3175 x3175), add32. 
rua (4960 × 4960), and swangl .rua (3169 × 3169). 
i 
i 
Figure 6. The pattern of nonzeros in the matrix axcl30 .rua (130 x 130), revealing 
an arrowhead pattern. 
5. GENERAL IZED MAT/VEC 
The problem at hand is to multiply an nxn matrix M -- (rni,j),~×n and an n-vector x -- (xj)n×l 
1;o produce as result, an n-vector y -- (Yi)nxl- Let {~l ,k~2, . . . ,~k} be the staircase cover 
of M constructed by SCA (described in Section 3). From the proof of Theorem 4, it follows 
that @e <~ ~-1 ,  for all f, 2 < f < k. As generalized MAT/VEC is described, it will become 
clear that for our algorithm to work the staircases have to be totally ordered by <:~. Generalized 
598 L.S.  HEATH et hi. 
MAT/VEC uses k PEs, each PE having six input ports, labeled I1, 72, 73, /4, Is, and/6, and 
two output ports, labeled O1 and 02. A typical PE in generalized MAT/VEC is shown in 
Figure 7(a). The k PEs are labeled 1 through k and are connected in a linear array as shown in 
Figure 7(b). The components of the x-vector are fed from the right of generalized MAT/VEC 
through input port I1 of the PE labeled 1, in the order xl ,x2, . . .  ,x~. The components of the 
y-vector, initialized to 0, are fed from the left of generalized MAT/VEC, through input port /2 
of the PE labeled k, in the order Yl,Y2 . . . .  ~ y~. The product M-  x is accumulated in y, that is, 
n when yi exits PE 1, it contains the value ~j=l  m~,jxj. The elements of each staircase k~, where, 
1 < g < k, are fed into the PE labeled g in increasing order of -<se. In particular, if (i, j) E ~,  
then m~,j is fed into input port/3,  i is fed into input port I5, and j is fed into input port /4 of 
the PE labeled g. Thus, along with an entry mi,j of M, the row index i and the column index j
are also fed to the PE. Since a staircase is fed into a PE in increasing order of -<se, the sequence 
of values fed into port I4 (the j values) and the sequence of values fed into port /5 (the i values) 
are both nondecreasing. Once PEg  has m~,j on its input port/3,  it then waits for xj and Yi to 
appear on input ports I1 and/2, respectively, and when they do appear, it updates Yi according 
to yi := Yi + mi,j • xj. The value fed into a PE via input port /6  depends on the relationship 
between the current input on/3  and the next input on Ia. If the next input on Ia is a matrix 
element that is in the same row as the current input, then/6 has the value h (horizontal). If the 
next input on/3 is a matrix element that is in the same column as the current input, then/6 has 
the value v (vertical). Otherwise/6 has the value d (diagonal). 
I6151413 
(a) (b) 
Figure 7. Sparse matrix-vector multiplication systolic network. 
Step 1: Rm :--[/3]; Rj :-- [/4]; Ri := [/6]; 
Step 2: Whi le (Cz < Rj) or (Cy < R,) do 
Step 2.1 if (not empty(I1)) and (C~ < Rj) then 
{R~ := [I1]; C~ := C~ + 1; 
if (C~ < Rj) then  O1 := Rz;} 
Step 2.2 if (not empty(/2)) and (Cy < Ri) then 
{R~ := It2]; C~ := C~ + 1; 
if (C v < R{) then 02 := Ry;} 
Step 3: Ry :~--- Ry -~- R m - Rx 
Step 4: if ([/6] = h) then  {O1 : :  Rz} 
else if ([/6] = v) then  {02 := Ry} 
else {O1 :-- R~; O2 := Ry} 
Figure 8. Program for each PE. 
Each PE in the network executes the program shown in Figure 8 repeatedly until M • x has 
been computed. Each PE has five internM registers, Rm, R~, Rj, R~, and Ry and two counters C~ 
and Cy. The counters C~ and C v are initialized to 0 and represent the index of the most recently 
received components of the x-vector and y-vector, respectively. Consider a PE labeled g for 
some g, 1 < ~ < k. After receiving m~,j, i, and j into registers Rm, Ri, and Rj, respectively, 
(Step 1), the PE waits for the appropriate components of the x-vector (xj) and the y-vector (y~) 
to appear on input ports/1 and/2, respectively, (Step 2). More precisely, the PE first checks 
Matrix-Vector Multiplication 599 
input port /1 (Step 2.1) and then checks input port /2 (Step 2.2) for input. Step 2.1 can be 
explained as follows. If the PE has not already received the correct component of the x-vector, 
that is, if C~ < Rj, and if there is some input, on input por t / t ,  then the input is copied into 
register Rx and the corresponding counter Cx is incremented. If C~ is still smaller than Rj, then 
the PE has no use for the component of the x-vector just received, and therefore, this component 
is moved to the output port O1. Step 2.2 is similar to Step 2.1, except hat a component of the y- 
vector is received as input. Step 2 terminates when C~ = Rj and Cy = R~. Thus, for any PE, C~ 
(respectively, Cy) keep track of the number of components ofx (respectively, ) that have reached 
the PE. Hence, when Step 2 is completed the correct components of the x-vector and y-vector 
have been received. On receiving xj and Yi, the PE updates Yi according to Yi := Y~ + m~,j. xj 
(Step 3). In Step 4, the PE decides whether to retain either xj or y~ for future use. In particular, 
if the next element in the staircase is in the same row as the current element, that is, [/6] = h, 
then y~ is retained and xj is passed on to the left neighbor (PE ~ + 1). If the next element in 
the staircase is in the same column as the current element, that is, [I6] = v, then xj is retained 
and y~ is passed on to the right neighbor (PE / - 1). If neither is true ([/6] = d), then both xj 
and y~ are passed on to the left and right neighbor, respectively. 
The main difference between the program executed by each PE in generalized MAT/VEC and 
the program executed by each PE in MAT/VEC is in Step 4. In MAT/VEC, each stripe in a 
stripe cover is fed into a distinct PE and after y~ is updated (as in Step 3), both xj and Yi are 
passed on to the neighboring PEs, while in generalized MAT/VEC whether xj or y~ (or both) is 
(are) passed on to the neighboring PE(s) depends on whether the next matrix element to be fed 
into the PE is in the same row or column as the current one. Note that generalized MAT/VEC 
is a generalization ofMAT/VEC in the sense that a stripe may be fed into a PE and if so/6 will 
always have the value d. 
Each connection between a pair of adjacent PEs can be thought of as a queue with capacity n. 
The reader may be concerned about this requirement; specifically, that the queue capacity grows 
hnsarly in the array size. Section 8 addresses this concern by explaining how a staircase cover can 
be constructed that accommodates a constant upper bound on queue capacities. For example, 
the connection between port O1 of PEg and port I1 of PEt  + 1 is a queue into which components 
of the x-vector are enqueued by PEg and dequeued by PEg + i. The net result is that each PE 
has access to four queues: two queues (at input ports I1 and/2) that it can dequeue lements 
from and two queues (at the output ports O1 and O2) that it can enqueue into. For the purpose 
of our analysis, we give these queues names. The input queue for PE ~ that contains components 
of the x-vector (respectively, -vector) will be called IQ~(l) (respectively, IQy(£)). Accordingly, 
the output queues will be called OQ~(~) and OQy(~). Note that IQz(g -t- 1) and OQx(~) are 
Mentical and likewise [Qv(g) and OQy(g + 1) are identical for all g, 1 < g < k -  1. For notational 
convenience, we will use IQx(k + 1) as an alternate name for OQ~(k) and IQ~(O) as an alternate 
name for OQv(1 ). 
Since generalized MAT/VEC is systolic, the PEs execute their programs at different speeds 
depending upon the availability of data. To analyze the number of time steps required by gener- 
alized MAT/VEC, we focus on the slowest PE. To be precise, we introduce a hypothetical g obal 
synchronization scheme as suggested by Melhem. Replace Step 3 in Figure 8 by the following: 
Step 3: Wait for a synchronization signal SYNC;  
R~ := Ry + Rm- Rx; 
The purpose of the synchronization signal SYNC is to force the execution of generalized MAT/ 
VEC into two alternating phases: a communications phase and a processing phase. During 
the commur/ications phase data flows through the network until each PE is either waiting for 
data (in Step 2), or waiting for SYNC (in Step 3). We assume that all the PEs are connected 
to a hypothetical controller that issues the signal SYNC after detecting the termination of the 
600 L.S. HEATH et al. 
communications phase. At the instant SYNC is received, all PEs that are waiting in Step 3 
perform the computation Ry := Ry + Rm • Rx, while the other PEs remain idle. This is the 
processing phase. A communications phase followed by a processing phase is called a global cycle 
of the network. We measure the time taken by generalized MAT/VEC to compute M-  x by the 
number of global cycles required. We denote by CP, and PPt, the communications phase and 
the processing phase, respectively, in global cycle t. Note that the above synchronization scheme 
is hypothetical and simply serves as a way of defining a measure on the amount of time required 
for generalized MAT/VEC to compute the matrix-vector product. Furthermore, since this is the 
same measure of time used by Melhem [3], the comparison between the two schemes is made on 
the same basis. 
1 1 
2 2 2 1 
3 1 
Figure 9. A matrix with staircase width 3. 
Ys, Y4, Ya, Y2, Yl 
77~3,2 ~2,1 [ 777"1,3 
' 
~1 3 2 1 
Xl~ X2~ 3:37 X4~ X5 
Figure 10. Initial status of generalized MAT/VEC. 
A::2 _=, A= ,3 
y~ ~ y4 , ~13 
T/1 :=  1/1 + ml ,3  ' =3 
Y2 ;=  Y2 + rrL2,1 • x l  
Wl3,2 trl2,2 V/;1,4 
) 
~1 := ~1 -[- ;'hi,4 • x4 
) ~ :=  ~2 + rr~,= - :=  
A2' A2  ' 
~3 := y3 + TRY3,2" z2 
• 1 ~ m 4 ' 2  f/t,3 ~2,4 
P x5 
Y5 
P2 :=  ~2 + t~s,4 ":g4 
~'4 :=  lt4 "{" rn4,= "X2 
Y4 :=//4 + m4,s • z3 
m4,5 
g1,~2,~3, g4 ) ~ q  X5 
Y3, y2 ~"Yl 
Figure 11, The computations performed by generalized MAT/VEC. 
Matrix-Vector Multiplication 601 
Figures 9-11 illustrate how generalized MAT/VEC performs a matrix-vector multiplication. 
Figure 9 illustrates a 5 x 5 matrix whose staircase width is 3. The nonzero elements covered 
by ~I are labeled 1, those covered by k02 are labeled 2, and those covered by k03 are labeled 3. 
Note that ~3 <~ ~2 <v ~1. Figure 10 shows the status of generalized MAT/VEC initially and 
Figure 11 shows the six global cycles that generalized MAT/VEC goes through in order to perform 
the matrix-vector multiplication. For each global cycle, the status of generalized MAT/VEC after 
the communications phase and the computations performed uring the corresponding processing 
p:hase are shown. Notice that during global cycle 1, components xa and x2 are both placed in 
queue IQ~(2). It is easy to see that the stripe width of the matrix shown in Figure 9 is six. Thus, 
MAT/VEC uses six PEs and the reader is invited to verify that MAT/VEC performs the same 
computations a  generalized MAT/VEC, in each processing phase. 
6. CORRECTNESS OF GENERAL IZED MAT/VEC 
The proof of correctness ofgeneralized MAT/VEC depends fundamentally ontwo properties of 
our scheme that are established below (Lemmas 5 and 6). The first property is that, for all i, yi 
does exit generalized MAT/VEC. In other words, generalized MAT/VEC does not deadlock. In 
order to prove this, we first need to formalize the notion of a deadlock. Generalized MAT/VEC 
is said to be in a deadlocl~ if for some PE hi, [/4] = Jl, [/5] = il, and for some PE k2, [I4] = J2, 
and [/5] = i2, such that kl > k2, il _< i2, and Jl > J2. The implication of the above condition is 
that PE kl is waiting for xj~ which is blocked by PE k2, which in turn is waiting for Yi2, which is 
blocked by PE kl. A deadlock state is illustrated in Figure 12. The fact that staircases are totally 
ordered according to <~ is crucial in ensuring that generalized MAT/VEC does not deadlock. 
Y/2 
i l  j l  
~ • 
i2 j2 
IlII 
_. I I. x j2 
"l 
x jl 
Figure 12. Generalized MAT/VEC in a deadlock state. 
LEMMA 5. Generalized MAT/VEC does not reach a deadlock. 
PROOF. To show this, we assume that generalized MAT/VEC reaches adeadlock state and obtain 
a contradiction. Because kgk 1 <~ ~k2, it follows from the definition of <~ that (Q, j l)  <she (i3, j3) 
for some (i3,j3) C ~k2. Since, il _< i2 and j l  >_ j2, (i2,J2) --.she (il, j l)- By, transitivity of ---She 
we have (i2,j2) <s,e (i3,j3), which is a contradiction because (i2,j2) and (i3,j3) belong to the 
same staircase. | 
We now know that all yi exit the systolic array; we do not know that each Yi has the correct 
value. To establish this we show a second property of our algorithm, which is that, when m~,j 
reaches a PE, xj and Yi have not traveled past the PE, and therefore, will participate in the 
computation Yi := Yi + mi j  - xj eventually. This implies that when y~ exits the system, it 
n contains the value ~-~j=l m~,j • xj. 
LEMMA 6. For any PE g, 1 < g < k, just before the execution of Step 2 in the program shown 
in Figure 8, the following holds: 
Cx <_ R~ and Cy < R~. 
PROOF. The proof is by induction on the number of times Step 2 has already been executed by 
PE g. 
602 L.S. HEATH et al. 
BASE CASE. Suppose that Step 2 has been executed 0 times by PE /. Due to initiiization, 
Cx = 0 and Cv = 0. After Step 1 has been executed, Ri and Rj acquire values in Jn. Thus, 
after Step 1 has been executed and just before Step 2 is executed the first time, it is the case 
that C~ < Rj and Cy < Ri. 
INDUCTION CASE. Assume that for some integer p _> 0, the lemma holds after Step 2 has been 
executed p times and just before it is executed for the (p + 1) th time. We now show if PE 
executes Step 2 for the (p ÷ 2) nd time, then just before this execution of Step 2, Lemma 2 holds. 
Two cases are possible depending on whether the condition in Step 2 evaluates to true or false 
when Step 2 is executed for the (p + 1) th time. 
Case 1. Suppose that the condition (Cx < Rj) or (C v < R~) evaluates to true when Step 2 
is executed the (p + 1) th time. In the body of the loop (Steps 2.1 and 2.2) C~ is 
incremented by at most one, if Cx < Rj and Cy is incremented by at most one, 
if Cy < Ri. In any case, just before Step 2 is executed for the (p + 2) nd time, we have 
Cx <_ Rj and C v _< Ri. 
Case 2. Suppose that the condition (Cx < Rj) or (Cy < R~) eviuates to false when Step 2 
is executed the (p + 1) th time. By the induction hypothesis, we have Cx = Rj 
and Cy = Ri. The steps executed immediately after Step 2 are Steps 3 and 4 and 
these steps do not change the value of Cx and Cy. When the PE finishes the execution 
of Step 4, it begins again at Step 1. If there is more input on/3,  then in Step 1, 
R~ and Rj get new values, but these new values are greater than or equal to the 
corresponding old values. This is because the elements fed into PE l belong to ~I,~ 
and are fed in increasing order according to <se. Hence, just before Step 2 is executed 
for the (p + 2) nd time C~ < Rj and Cy _< Ri. 
By induction, the lemma follows. | 
It is easy to see that for any PE g, 1 < g < k, C~ is the number of components of the vector x 
that have reached the PE. Similarly, it is easy to see that Cy is the number of components of 
the vector y that have reached the PE. Thus, Lemma 6 implies that when mi,j reached PE t, 
the components xj and yi have not already traveled past the PE. Since, Lemma 5 shows that 
generalized MAT/VEC does not deadlock, eventually the components xj and yi will reach PE 
and the computation yi := yi + mis" xj is performed. This implies the correctness ofgeneralized 
MAT/VEC, as stated in the following theorem. 
THEOREM 7. For each i, 1 < i < n, Yi exits at output port 0 20l c the rightmost PE in the array 
?% 
with value ~-~j=l mi,j • xj. 
7. PERFORMANCE OF  GENERAL IZED MAT/VEC 
In this section, we analyze the running time of generiized MAT/VEC relative to the running 
time of MAT/VEC. As mentioned before, to compute M.  x, generalized MAT/VEC requires 
STAIR(M) PEs, while MAT/VEC requires STRIPE(M) PEs. For any M we have STAIR(M) __ 
STRIPE(M) and the experiments in Section 4 indicate that generalized MAT/VEC often has 
a significantly smiler hardware requirement. In the main result in this section, we show that, 
despite having a smiler hardware requirement, generalized MAT/VEC requires the same time 
(number of global cycles) as does MAT/VEC. We use the notion of a g lobi  cycle as our unit 
of time for ease of analysis and for appropriateness of comparison. At the end of this section 
we mention that even if we had used another, possibly more rei ist ic measure of time, we would 
have concluded that generlized MAT/VEC is at least as fast as MAT/VEC. 
7.1. Count ing Global  Cycles 
Define inductively a sequence of position index sets CF1, CF~, CF3, . . . ,  as follows: 
F0 =F ,  
Matrix-Vector Multiplication 603 
a~ad, for all i >_ 1, 
CFi = set of minimal elements of (Fi-1, _<~¢), 
Pi = Fi-~ - CFi, . 
Each CFi is called a computational front. For the matrix in Figure 9, 
CF1 = {(1,3), (2, 1)}, CF4 = {(4,2), (2,4)], 
CF2 : {(1,4), (2,2)}, CF5 = {(3,4),(4,3)}, 
CF3 : {(2,3), (3, 2)}, CF6 = {(4,5)}. 
Note that a computational front is a chain in the poser (F,-<sne)- Melhem [3] shows that ele- 
ments in each computational front participate in a computation i each processing phase. In other 
words, Melhem shows that MAT/VEC performs the computation y{ := yi + mi,j.xj in processing 
phase PPt if and only if (i, j) e CFt. Theorem 8 shows that the same is true for the computa- 
tions performed by generalized MAT/VEC. The implication, of course, is that MAT/VEC and 
generalized MAT/VEC perform the same computations in each global cycle. 
THEOREM 8. Eor all t, (i,j) E CFt if and only if the computation y{ := yi + m{,j .xj is performed 
in processing phase P Pt by generalized MAT/VEC. 
PROOF. For the convenience of argument, let CFo = ~ and let PPo be a dummy processing 
phase in which no computations are performed. The proof of the theorem is by induction on t. 
]=IASE CASE. The claim is trivially true when t = 0 because of the definitions of CFo and PPo. 
INDUCTION CASE. Let t _> 1 and assume that, for all ff _< t - l ,  we have (i~,j ') E CFt, if and only 
if generalized MAT/VEC performs the computation Yi, := Yi, + mi,,j, • x f  in PPt,. Using this 
induction hypothesis, we show that (i, j) e CFt if and only if generalized MAT/VEC performs 
t:he computation Yl := Yi + mi,j • xj in PPt. Let 
r~ = r - (CF~ u CF2  u . . .  u CF~) ,  
for all g _> 0. Let us focus on the status of generalized MAT/-VEC just after PPt-1. Let 
(i,j) E CFt ~ ~v for some p, 1 _< p < k. By the induction hypothesis, mi,j has not yet been 
used in a computation. By definition of CFt, (i,j) is a minimal element with respect o _<se 
in Ft-1. Hence, mi,j is the/3-input to the PE labeled p in global cycle t. We will now show that 
during CPt, the components xj and Yi reach PE p, and thus, the PE computes Yi := y~ +m~,y .xj 
in PPt. 
By Lemma 6, we know that y~ is in queue IQv(pn), where PL >-- P, and xj is in queue IQz(pR), 
where PR <- P- We first show that for all PEs to the left of PE p, [I5] > i, implying that y~ is not 
blocked by any PE to the left of PEp  and reaches PEp  in CPt. The proof is by contradiction. 
Suppose that there is a PE labeled p' > p (that is, to the left of PE p) for which [Is] "~ _-- i. 
Specifically, assume that for PE p~ > p, 
[I51 = i', [/4] = j', i '< i .  
Then, j '  > j, because otherwise i f j  ~ _< j, then (i',j') <-s~ (i,j) and (i,j) is not a minimal element 
of Ft-1 with respect o -<se. Thus~ we have 
i' _< i and j '  > j. (1) 
Since p~ > p, we know that ~p, -<'r ~p. Hence, by the definition of _<~, there is a position 
index (i", j")  e ~p, such that (i', j~) <~n, (i", j'~). In other words, 
i' > i" and j '  < j" .  (2) 
604 L.S. HEATH et al. 
But, (1) and (2) together imply that (i,j) <she (i",j"). This is a contradiction because (i,j) 
and (i',j") both belong to the same staircase ~p. Hence, for every PE to the left of PE m, 
[/5] > i. This allows yi to reach PE p in CPt. A similar argument suffices to show that for all 
PEs to the right of PE p, [/4] > j, thus allowing xj to reach PEp  in CPt. Hence, Yi := Yi +mi,j.xj 
is computed in PPt. 
To complete the proof we need to show that if (i, j) ff CFt then yi := yi + mi,j • xj is not 
computed in PPt. Suppose (i, j) E CFt, for some t ~ < t; by the induction hypothesis rni,j has 
already participated in a computation and will not participate in any further computations. So 
assume that (i,j) E CFt, for some t' > t. This implies that (i,j) E Ft-1. Now let us be more 
specific and assume that (i,j) E F,- I  N ff/p for some p, 1 < p < k. There are two cases depending 
on whether (i,j) is a minimal position index in Ft-1 rl qp with respect o <_se- 
CASE 1. If (i,j) is not the minimal position index in Ft-1 N qp, then mi,j is not the I3-input 
to PEp  in global cycle t, and hence, the computation yi := Yi + rni,j • xj is not performed in PPt. 
CASE 2. Suppose that (i, j) is the minimal position index in Ft-1 N ~p. Since (i,j) ~ CFt, there 
is a position index (i',j') E Ft-1, such that (i',j') <_s~ (i,j). Now suppose that (i',j') e q~p, for 
some p~ ¢ p. If p~ > p, then PE p~ is to the left of PEp .  By Lemma 6, component Yi, is to 
the left of PE p~ and since i >_ i t, component Yi is also to the left of PE pq Thus, Yi is blocked 
by PE p~ and cannot reach PEp  in CPt. Similarly, it can be shown that if p~ < p, then xj is 
blocked by some PE to the right of PEp  and cannot reach PE p in CPt. This implies that as 
claimed, yi := y~ + mi,j • xj is not computed in PPt. 
By induction, the theorem follows. | 
Melhem provides a lower bound (see Proposition 3.1 in [3]) and an upper bound (see Proposi- 
tion 3.2 in [3]) on the number of global cycles that MAT/VEC requires to complete M.x.  Neither 
of the bounds is tight, and he leaves the precise characterization f the number of global cycles 
required by MAT/VEC an open question. The following theorem provides the precise number 
of global cycles required by generalized MAT/VEC (and MAT/VEC) to compute M • x, as a 
function of how the nonzeros are distributed in matrix M. 
THEOREM 9. Let • be a largest staircase in M. Then, the number of global cycles required by 
generalized MAT/VEC or MAT/VEC to compute M.  x is [~[. 
PROOF. Let CF1, CF2,..., CFT be the nonempty computational fronts of M. From the defini- 
tion of computational fronts, it is easy to see that there is a sequence 
(il,jl) -<se (i2,j2) -<se "'" -<se (iT,iT), 
such that (ip, jp) e CFp, for 1 < p < T. The above sequence is an antichain in (F, -<~ne). This 
implies that CF1, CF2,..., CFT is a smallest set of chains in (F, -<sne) covering F. Therefore, by 
Theorem 1, T is equal to the size of a largest antichain in (r, -<~ne)- An antichaln in (F, <~n~) is
a staircase, all of whose elements belong to F. The theorem follows. | 
In the matrix shown in Figure 9, the size of the largest staircase is six. This is equal to the 
number of computational fronts in the matrix, and is therefore, equal to the number of global 
cycles required by generalized MAT/VEC and MAT/VEC. 
7.2. Counting Clock Pulses 
The reader may be concerned that using a global cycle as a unit of time ignores the amount of 
communication among PEs that has to go on between consecutive processing phases. An alter- 
native approach is to count the number of clock pulses required to complete the task, assuming 
that at each clock pulse, each PE performs a read-compute-write cycle. More precisely, at each 
clock pulse, each PE reads the input ports that it is expecting input on, performs a computation 
if it has the requisite data, and then writes data that it no longer needs on output ports. It 
Matrix-Vector Multiplication 605 
is possible to show that for any matrix M the number of clock pulses required by generalized 
MAT/VEC is no greater than the number of clock pulses required by MAT/VEC. The proof 
of this claim is more tedious than the proof of the corresponding claim for global cycles and 
is not included here. We merely present an example to provide insights into the situation. In 
Figure 13 a 3 x 3 matrix is shown with a staircase cover of width 2 on the left and a stripe 
cover of width 4 on the right. The blank slots correspond to zeroes; entry i belongs to the ira 
staircase or stripe. Figure 14 shows the matrix-vector multiplication performed on this matrix 
by generalized MAT/VEC and organized by clock pulses. Since the matrix is covered by two 
staircases, we have two PEs. Each row in the picture corresponds to the array of two PEs at a 
particular clock pulse. Each cell corresponds to a PE. The entry at the top of each cell denotes 
t:he position indices of the next matrix element, the entry at the left denotes the index of the 
next element of the y-vector, and the entry at the right denotes the index of the next element of 
t:he x-vector. An empty input port is denoted by *. Shaded cells represent PEs and clock pulses 
where actual computation (as opposed to mere communication) took place. For example, in clock 
pulse 4, PE 1 had data m~,a, y~, and xa and performed the computation y~ +-- y~ + ml,3 • x3. 
At that clock pulse, PE 2 is unable to perform any computation because it is missing Y3. It 
takes seven clock pulses for all the yi to exit the array. Figure 15 shows the same matrix-vector 
multiplication performed by MAT/VEC. Four PEs are used because four stripes were required to 
cover the matrix (see Figure 13). Despite the additional PEs, MAT/VEC takes additional clock 
pulses (11 in all) to complete the computation. It can be shown by induction that at any clock 
2 2 1 4 3 1 
1 2 
2 2 4 3 
Figure 13. A 3 x 3 matrix. On the left a staircase cover with width 2 is shown and 
on the right a stripe cover with width 4 is shown. 
m 
Clock Pulse 
11 13 
1 1 , * 1 
13 
2 ~ iN  * 2 
~!i~ * 3 
32 "~l~ lii 
Y-----~ 4 2 2 ~!~ ~72 
7 * * 3 * 
X 
Figure 14. Computing a matrix-vector product on generalized MAT/VEC with two 
PEs. 
606 
Y 
L .  S. HEATH et  al. 
m 
Pulse 1 
11 12 23 13 
1 
1 , * * * , * 1 
11 12 23 13 
2 1 * * * * I * 2 
11 12 '23 13 
3 1 * • 1 * 2 * 3 
~ ~ ~ 12 23 13 X 
~ *  2 * * *3  4 
~N:gN~ 23 13 32  
5 2 * p~ !;i~ , , . 3 
I ~  33 23 13 
6 2 * 1 * * 3 
• 33 23 ~ i ~  
7 
• 33 ~ i~ * 
8 • * 3 * i~ 
9 , , , ~ , • 2 * 
10 
• 3 * * 3 * * • 
11 
• * * * * * 3 • 
• * * * * * * * 
Figure 15. Computing a matrix-vector p oduct on MAT/VEC with four PEs. 
l i t  7"0, 
] 2,3,...,n- 2, n -  1 (1,n) n '  (n, 1) 1 ' 
PE 2 PE 1 
,n  n -  1 ,n -  2,...,2 1 
Figure 16. Each of the queue between the two PEs contains (n - 2) elements. 
pulse, the computations completed by MAT/VEC is a subset of the computations completed by 
generalized MAT/VEC. The intuition is that because of a larger array of PEs, elements have 
further to travel to reach the appropriate PE. This increases communication time and decreases 
parallelism. 
8. FUTURE D IRECT IONS 
In this paper, we have presented a natural generalization of the systolic array scheme pro- 
posed by Melhem [3] for matrix-vector multiplication. Our scheme, called generalized MAT/VEC 
outperforms MAT/VEC in that it has, typically a significantly smaller hardware requirement, 
without paying in extra time. 
Matrix-Vector Multiplication 607 
The reader may be concerned about the fact that generalized MAT/VEC assumes that PEs 
have access to unbounded queues. Since PEs are relatively simple pieces of hardware, this 
may seem unreasonable. For example, consider the "border matrix" M = (ml,j)~×~ matrix, 
in which m~,j ~ 0 if and only if i E {1, n} or j E {1,n}. When M is processed by generalized 
MAT/VEC, we eventually reach the state shown in Figure 16 in which each of the two queues 
between PEs has (n - 2) elements in it. 
This illustrates worst case behavior and in general queues need to be as large as the largest 
"step"--the number of elements belonging to a single row (or column)--in a staircase. If we are 
given a bound, say a constant bound k, on queue size, then our strategy would be to partition the 
positions of nonzero entries in the matrix into k-staircases, which are staircases which contains at 
most k elements from a row or a column. This will ensure that all queues need to hold at most k 
elements at any time during the algorithm. Finding a minimum width covering of a matrix by 
k-staircases i as easy as finding a minimum width staircase cover. As k increases, larger queues 
axe possible, but the width of a minimum k-staircase cover decreases. The extremes are familiar 
to us: when k = 1 we have an instance of Melhem's tripe covers and when k -- n, we have an 
instance of the staircase covers as described in this paper. The precise nature of the trade-off 
between the value k and the width of k-staircases i  not clear and might be an interesting future 
direction of research. 
][n Section 3, we presented SCA an algorithm for computing the staircase width of a matrix. 
We did not consider the issue of rearranging the rows or columns of the matrix to minimize 
staircase width. This problem, similar to the problem of minimizing bandwidth of a matrix, is 
NP-complete [31], but nothing is known about obtaining approximate solutions to this problem. 
T~ds is another future direction of research that interests us. 
:Finally, as mentioned earlier, reconfigurable VLSI technology, especially FPGAs, provides the 
flexibility to implement a variable number of processing elements and queues of varying sizes 
in fast hardware. Mapping staircase-based algorithms to pipelined FPGA bus systems is an 
interesting challenge for the engineer. 
REFERENCES 
I. E.-J. Im and K. Yelick, Optimizing sparse matrix kernels for data mining, SIAM Internaional Conference on 
Data Mining, ht tp  : / /~ .  cs. berke ley,  edu/~e j  im/pub l i ca t  ion, (submitted). 
2. E.-J. Im and K. Yelick, Optimizing sparse matrix vector multiplication for register euse, International Jour- 
nal of High Performance Computing Applications (SCI), ht tp  : / /~ 'w.  cs. berke ley ,  edu/~e j  im/ pub- 
l i ca t ion / ,  (submitted). 
3. R. Melhem, Parallel solution of linear systems with striped sparse matrices, Parallel Computing 6, 165-184, 
(19ss). 
4:. I.S. Duff, Sparse numerical linear algebra: direct methods and preconditioning, In The State of the Art in 
Numerical Analysis, (York, 1996), pp. 27-62, Oxford Univ. Press, New York, (1997). 
5. P.E. Bjcrstad and T. Screvik, Data-parallel BLAS as a basis for LAPACK on massively parallel computers, 
In Linear Algebra for Large Scale and Rcal-Time Applications, (Leuven, 1992), pp. 13-20, Kluwer Acad. 
Publ., Dordrecht, (1993). 
6. 3. Dongarra, Performance ofLAPACK, In Large-Scale Matrix Problems and the Numerical Solution of Partial 
Differential Equations, (Lancaster, 1992), pp. 55-67, Oxford Univ. Press, New York, (1994). 
7. K.Y. Chen, Minimizing the bandwidth of sparse symmetric matrices, Computing 11,103-110, (1973). 
8. J.-C. Luo, Algorithms for reducing the bandwidth and profile of a sparse matrix, Computers gA Structures. 
An International Journal 44 (3), 535-548, (1992). 
9. R. Melhem, Determination of stripe structures for finite element matrices, SIAM Journal on Numerical 
Analysis 24 (6), 1419-1433, (1987). 
10. D.J. Evans, Block matrix multiplication and LU factorisation systolic arrays, Int. J. Comput. Math. 76 (1), 
45-57, (2000). 
11. P. R6zsa, The Role of Mathematics in Modern Engineering, (Melbourne, 1994), pp. 85-103, Studentlitteratur, 
Lund, (1996). 
12. Y. Sand, ILUM: A multi-elimination ILU preconditioner for general sparse matrices, SIAM J. Sci. Comput. 
1~ (a), g30-s47, (1996). 
13. T.A. Davis and I.S. Duff, A combined unifrontal/multifrontal method for unsymmetric sparse matrices, ACM 
Trans. Math. Software 25 (1), 1-20, (1999). 
608 L.S. HEATH et al. 
14. J.W.H. Liu, The multifrontal method for sparse matrix solution: Theory and practice, SIAM Rev. 34 (1), 
82-109, (1992). 
15. W. Hackbusch, A sparse matrix arithmetic based on 7-/-matrices. I Introduction to ?/-matrices, Computing 
62 (2), 89-108, (1999). 
16. S.A. Sauter, Variable order panel clustering, Computing 64, 223-261, (2000). 
17. C.E. Leiserson, Area-Ejficient VLSI Computation, MIT Press, Cambridge, MA, (1982). 
18. A. De Hon, The density advantage of configurable computing, Computer 33, 41-50, (2000). 
19. K. Compton and S. Hanck, Reconfigurable computing: A survey of systems and software, ACM Computing 
Surveys 34, 171-210, (2002). 
20. M. Eisenring and M. Platzner, A framework for run-time reconfigurable systems, Journal of Supercomputing 
21, 145-159, (2002). 
21. M. Platzner, Reconfigurable accelerators for combinatorial problems, Computer 33, 58-61, (2000). 
22. X.J. Zhang and K.W. Ng, A Review of high-level synthesis for dynamically reconfigurable FPGAs, Micro- 
processors and Microsystems 24, 199-211, (2000). 
23. K.Q. Li, Constant ime Boolean matrix multiplication on a linear array with a reconfigurable pipelined bus 
system, Journal of Supercomputing 11, 391-403, (1997). 
24. K.Q. Li, Y. Pan and S.Q. Zheng, Fast and processor efficient parallel matrix multiplication algorithms on 
a linear array with a reconfigurable pipelined bus system, IEEE Transactions on Parallel and Distributed 
Systems 9, 705-720, (1998). 
25. W.T. Trotter, Combinatorics and Partially Ordered Sets: Dimension Theory, The Johns Hopkins University 
Press, Baltimore, MD, (1992). 
26. R.P. Dilworth, A decomposition theorem for partially ordered sets, Annals of Mathematics 51, 161-166, 
(1950). 
27. L.S. Heath, F.T. Leighton and A.L. Rosenberg, Comparing queues and stacks as mechanisms for laying out 
graphs, SIAM Journal of Discrete Mathematics 5 (3), 398-412, (1992). 
28. L.S. Heath and S.V. Pemmaraju, Stack and queue layouts of posers, SIAM J. Discrete Math. 10 (4), 599~25, 
(1997). 
29. L.S. Heath and S.V. Pemmaraju and A.N. Trenk, Stack and queue layouts of directed acyclic graphs. I, SIAM 
J. Compu~. 28 (4), 1510-1539, (1999). 
30. L.S. Heath and S.V. Pemmaraju, Stack and queue layouts of directed acyclic graphs. II, SIAM J. Comput. 
28 (5), 1588-1626, (1999). 
31. L.S. Heath and A.L. Rosenberg, Laying out graphs using queues, SIAM Journal of Computing 21 (5), 
927-958, (1992). 
32. H.W. Lang, M. Schimmler, H. Schmeck and H. Schroder, Systolic sorting on a mesh-connected network, 
IEEE Transactions on Computers 34 (7), 652-658, (1985). 
33. S. Oliveira, A new parallel chasing algorithm for transforming arrowhead matrices to tridiagonal form, Math. 
Comp. 67 (221), 221-235, (1998). 
