Matrix-vector multiplication: Parallel algorithms and architectures  by Codenotti, B. & Puglisi, C.
Comput. Math. Applic. Vol. 16, No. 12, pp. 1057-1063, 1988 00974943/88 $3.00+0.00 
Printed in Great Britain. All rights reserved Copyright © 1988 Pergamon Press pie 
MATRIX-VECTOR MULT IPL ICAT ION:  
PARALLEL  ALGORITHMS AND ARCHITECTURES 
B. CODENOTTI and C. PUGLISI 
Istituto di Elaborazione del CNR, Via S.Maria 46, 56100-Pisa, Italia 
(Received 5 March 1988) 
Communicated by E. Y. Rodin 
Abstract--We present an efficient parallel implementation f matrix-vector multiplication on a binary 
tree, whose leaves are connected to local memories, each containing one column of the matrix. The 
performance attained can be favourably compared with the one of the mesh of trees and the linear array. 
Further, we analyze the ease of architectures with a fixed number of processors. Finally, some results 
concerning the parallel implementation f iterative methods for the solution of linear systems and for 
eigenvalue computations are described. 
1. INTRODUCTION 
There is a wide body of literature concerning parallel algorithms, also in the area of numerical 
linear algebra. Parallel algorithms have been developed since the 1960s, although no parallel 
architectures had been constructed atthat time (see, for example [1, 4, 6]). Today parallel computers 
are available as commercial products. Thus research can go from design and analysis to 
implementation. 
This paper presents the description of several ways to carry out matrix-vector multiplication, 
and some related problems using parallel machines. More precisely, we recall the performance of 
the linear array and the mesh of trees, and we propose a column-oriented arrangement of the 
components of the matrix on a binary tree architecture. The performance of this method can be 
favourably compared to the one of the known designs. 
The matrix-vector multiplication problem can be efficiently parallelized, since the naive 
fan-in algorithm attains the logarithmic lower bound to the computation time. On the other 
hand, the problem arises of distributing data among the local memories of the processors. 
In fact, note that matrix-vector multiplication consists of a set of scalar products, of the type 
(Ai, b ), where A; denotes the i-th column of the matrix, b is the vector, and ( , )  denotes the scalar 
product. It is easy to see that one of the primary concerns of the problem is to distribute the 
elements of vector b among different modules. The solution adopted in this paper consists 
of a column-oriented allocation of the components of the matrix among the leaves of the binary 
tree. 
The computational model here considered is a slight modification of a theoretical model widely 
used in the literature [1]. Namely, we assume that: 
(i) Each processor can perform one arithmetic operation in a unit of time; 
(ii) No penalties occur for accessing memory; 
(iii) One interprocessor communication (send or receive) takes one unit of time; 
(iv) The computation is carried out on a given network of processors. 
The measures of complexity we consider are the computation time T (i.e. the number of parallel 
steps), the speed-up S (i.e. the ratio between the time-cost of the best sequential lgorithm and the 
time-cost of the given parallel algorithm), and the efficiency E (i.e. the ratio between the speed-up 
and the number of processors used). 
In this paper we have two primary goals: first, we show that matrix-vector multiplication on 
a binary tree has optimal speed-up, then we give some criteria to choose a convenient setting for 
a configurable network of few processors, in order to compute the matrix-vector product. 
I057 
1058 B. CODENOTTI and C. PUGLISI 
The description of the parallel computations will be given in terms of algorithms expressed 
using a Pascal-like notation. A sequence of statements of the type "Command~ NEXT Command2" 
means that Command2 must be executed after the completion of Commandj. A sequence of 
statements of the type "Command~, Command2" means that Commandj and Command2 can be 
executed simultaneously. The symbol Pg denotes the i-th processors in any network (in the 
natural order), while otherwise specified. The processors in the binary tree are numbered starting 
from the root. A comment between parentheses near each statement specifies the processors 
executing simultaneously the statement. The statement SEND (a,b) means that data a is 
sent to processor b. Similarly, the statement RECEIVE (a, b) means that data a is received from 
processor b. 
In the case of hierarchical structures (such as the binary tree and the mesh of trees) symbols p(i), 
d(i), and s(i) denote the father, the right and the left son of node i, respectively. In the paper all 
logarithms are to the base 2. 
2. L INEAR ARRAY AND MESH OF TREES 
In this section two known architectures suited for the computation of matrix-vector multipli- 
cation are described, namely the linear array [2] and the mesh of trees [3]. A linear array is formed 
by identical processors linearly connected (see Fig. 1). Matrix by vector multiplication can be 
computed on the linear array according to the following algorithm. 
Algorithm 2. I 
begin 
RECEIVE (bi, i - 1) 
Si:=ai lb i
SEND (b a, i + 1) 
fo rk=2 to n 
begin 
RECEIVE (bk, i -- 1) 
m+k:=ambk 
SEND (b~, i + 1) 
Si:=mik -}- S i 
end 
end 
NEXT 
NEXT 
NEXT 
NEXT 
NEXT 
NEXT 
{Pi, i=2,3 . . . . .  n} 
{e.i = 1,2 . . . . .  n} 
{e~,i = 1,2 . . . . .  n - 1} 
{Pi, i =2,3  . . . . .  n} 
{Pi, i= 1,2 . . . . .  n} 
{P,.i = 1,2 . . . . .  n - 1} 
{Pi, i = 1,2 . . . . .  n} 
Note that each processor receives one row of the matrix. With the aid of algorithm 2.1, it is easy 
to see that the time-performance attained by the linear array with n processors i  
T(n)  = 7n - 6 
A faster architecture is the mesh of trees, which consists of a square arrangement of processors, 
where each row and column of the mesh is connected by a tree (see Fig. 2). 
A mesh of trees with 3n2-2n nodes can compute matrix by vector multiplication in time 
T(n) = 5 log n + 1, according to the following algorithm. 
Algorithm 2.2 
{Each processor of the mesh is denoted by the pair (k, r); processors belonging to the r-th tree 
of the columns are denoted by index j ;  processors belonging to the k-th tree of the rows are denoted 
by index i. This algorithm is referred to processors of the r-th column-tree, processors (k, r) of the 
H H H t'--- ..... q l  
Fig. 1. A linear array. 
Matrix-vector multiplication 
( 
Fig. 2. A mesh of trees. 
1059 
mesh, and processors of  the k-th row-tree. Symbols po(k, r) and pr(k, r) denote the father of  node 
(k, r) belonging to the column-tree and to the row-tree, respectively} 
begin {processors of  the r-th column-tree} 
RECEIVE  (br,p(j)) NEXT {Pj , j= 1,2 . . . . .  n -2}  
SEND (b,, d(j)), {Pj,j = 0, 1, 2 . . . . .  n - 2} 
SEND (b,, s(j)) {Pj,j = 0, l, 2 . . . . .  n - 2} 
end 
begin {processor (k, r) of  the mesh} 
RECEIVE  (b,,pc(k, r)) NEXT 
Sk,==akrbr NEXT 
SEND (skr, pr(k, r)) 
end 
begin {processors of  the k-th row-tree} 
RECEIVE  (s~ (')), s(i)), {P,, i = 0, 1, 2 . . . . .  n - 2} 
RECE IVE  (s~ '~°), d(i)) NEXT {Pi, i = 0, 1, 2 . . . . .  n - 2} 
s~)==s~ ~;)) + s~ d~;)) NEXT {P~, i = 0, 1, 2 . . . .  , n - 2} 
SEND (s~°, p(i )) {P~, i = 1, 2 . . . . .  n - 2} 
end 
In Fig. 3, the performance of  the two architectures here analyzed is summarized. 
3. B INARY TREE 
In this section we analyze some possible choices for data allocation, related to the matrix-vector 
multiplication problem. First note that a "row oriented" allocation of  the matrix leads to the need 
Architecture T 5' E n Proc 
Linear array 7n-  6 n/  7 1 /7  5n 2_  2n 
Mesh of frees 51.ogn+l n2/SLogn 1/St.ogn n 
Fig. 3. Performance of the linear array and of the mesh of trees. T is the computation time; S is the 
speed-up; E is the efficiency; n Proc denotes the number of processors. 
C.A.M.W.A. 16/12--F 
1060 B. CODENOTTI and C. PUGLISI 
of computing n scalar products, all involving the same vector; therefore the problem arises of 
distributing the entries of the vector among the different modules devoted to the computation of 
each scalar product. This problem can be handled in various ways (for instance, by replying data), 
but it seems difficult to derive efficient parallel strategies. 
On the contrary, a "column-oriented" allocation naturally leads to an efficient parallel 
implementation of matrix-vector multiplication. In fact, consider a set of 2n -  1 processors 
connected to form a binary tree, whose leaves are provided with a local memory containing a
column of the matrix, as shown in Fig. 4. This network of processing elements can perform the 
computation by implementing a pipelined version of the fan-in algorithm, according to the 
following algorithm. 
Algor i thm 3.1 
begin {c(i) denotes the column allocated in the memory of the i-th leaf} 
for j = 1 to n 
begin 
s~i),=aj,~i)bc~i) NEXT {Pi ,  i = n - 1, n . . . . .  2n - 2} 
SEND (S~J ) ,p ( i ) )  NEXT {Pi ,  i =n  - 1 ,n  . . . . .  2n -2}  
end 
end 
begin 
RECEIVE (s~ '~°), s (  i )), { P , ,  i = O, 1, 2 . . . . .  n - 2} 
RECEIVE (s~ '~')), d ( i ) ) ,  NEXT {P,, i = 0, 1, 2 . . . . .  n - 2} 
s~°:=s~ )) + s~ °) NEXT {P, i = 0, 1, 2 . . . . .  n - 2} 
for j = 2 to n . 
begin 
SEND ( s )~) l ,p ( i ) ) ,  {P~, i = 1,2 . . . . .  n -2}  
RECEIVE (s~ "°)), s ( i ) ) ,  {P ,  i = O, 1, 2 . . . . .  n - 2} 
RECEIVE (s~ d°)), d( i ) )  NEXT {P~, i = 0, l, 2 . . . . .  n - 2} 
s)°'.=s~ "~)) + s~ '~°) NEXT { P, ,  i = O, 1, 2 . . . . .  n - 2} 
end 
SEND (s~i) ,p( i ) )  {P , ,  i = 1, 2 . . . . .  n - 2} 
end 
The performance attained by this algorithm is described in Fig. 5. It is worth pointing out that 
the time-cost of algorithm 3.1 slightly improves that of algorithm 2.1. 
Fig. 4. The binary tree. 
Matrix-vector multiplication 1061 
Architecture 7" S E n Proc 
Binory tree 2n+3Logn-1  n/2  I /4  2n-1  
Fig. 5. Performance of the binary tree. T is the computation time; S is the speed-up; E is the efficiency; 
n Proc denotes the number of processors. 
4. SMALL NUMBER OF PROCESSORS 
Parallel computers are today available as commercial products. Thus the designer of parallel 
algorithms is confronted with the problem of efficiently developing algorithms running on these 
machines. In the case of configurable parallel architectures, such as the CHiP machine [7], one has 
to embed the computation graph of a given parallel algorithm on the lattice of processing elements 
forming the machine. 
In this section, we take into account he problem of connecting a given number of processors 
in order to compute matrix-vector multiplication. For this purpose, we modify the three 
algorithms described in Sections 2 and 3, to treat the case for which the number of processors 
is independent of the size of the problem. More precisely, we have to slow down the compu- 
tation, because of the small number of processors, and to maintain the "same" structure of the 
algorithms. 
Consider a linear array of p processors, performing n x n matrix by n-vector multiplication, as 
suggested by the following algorithm. 
Algorithm 4.1 
{This algorithm can be derived by suitably modifying algorithm 2.1. Details can be found in [5]}. 
begin 
Each processor eceives a contiguous group of rows, and processes them a square block at 
a time. 
end 
This algorithm has running time 
Tp(n) -- 4nZ/p - 2nZ/p 2 + 3n - 5n/p + p - 1. 
Similarly, algorithm 4.2 describes the same computation on a mesh of tress of p processors. 
Algorithm 4.2 
{This algorithm can be derived by suitably modifying algorithm 2.2 Details can be found in [5]}. 
begin 
Each processor of the mesh receives a square submatrix, and processes it a row at a time. 
end 
Finally, algorithm 4.3 presents the computation on a binary tree. 
Algorithm 4.3 
{This algorithm can be derived by suitably modifying algorithm 3.1 Details can be found in [5]}. 
begin 
Each leaf of the binary tree receives a group of contiguous columns, and processes it a row 
at a time. 
end 
Figure 6 illustrates the performance of the three algorithms. It turns out that the binary tree and 
the linear array attain a speed-up slightly better than that of the mesh of trees. 
1062 B. CODENOTTI and C. PUGLISI 
Architecture T S E Range 
4nZ/p - 2 n2/p 2 + 
Linear array p /4  1/4 3=;p~n/2 
3n-Sn  /p  +p-1  
Mesh of trees 
Binary tree 
6n2/p+ n J '~p  + 
5 log4p/  3 - -2  
4n2/  ( p+l  ) + 
3tog(  p+11-4  
p/6  
p+l  
4 
116 
p+l  
4p 
8<.p<.3n2--2n 
3~p~ 2n-  1 
Fig. 6. Performance of the linear array, the mesh of trees, and the binary tree of p processors. T is the 
computation time; S is the speed-up and E is the efficiency, both computed assuming p to be constant 
respect o the size of the problem; range is the range of p for which the algorithms work. 
5. APPL ICATIONS AND CONCLUSIONS 
Architectures suited to compute matrix-vector multiplication are also naturally amenable for 
supporting iterative methods for the solution of linear systems and for eigenvalue computations. 
Here we study the implementation of Jacobi method for the solution of linear systems and of 
Lanczos algorithm for tridiagonalization of a matrix on a binary tree. 
Jacobi method can be applied to linear sytems whose coefficient matrix is diagonally dominant. 
Let A e R "×" be a diagonally dominant matrix, and b e R". 
The linear system 
Ax = b 
can be rewritten as 
where 
P =I - -D- JA ,  
leading to the iterative method 
x = Px + q, (1) 
q =D- lb  and D =Diag (alha22 . . . . .  a,,), 
x i+ l=Px i+q,  i=0,1,2 . . . .  
x0 = 0. (2) 
Assume that the local memory of each leaf of the binary tree contains a column of the coefficient 
matrix A. Since we need the columns of the matrix P, it seems convenient to rewrite equation (2) 
as 
i.e. 
Xi+ 1 = D -t[(D - A)xi + b], (3) 
xi+~ = D-~$i, ~ i=(D -A )x i+b.  (4) 
Note that the entries of the matrix D - A are stored columnwise in the local memories of the 
leaves of the binary tree. Thus the computation of x~÷ ~ can be performed in three stages: 
(i) Compute (D -A)x~, as in Section 3; 
(ii) Compute (D -A)x~ + b, in the leaves of the binary tree; 
(iii) Compute xg+~ by performing an arithmetic division in the leaves. 
Matrix-vector multiplication 1063 
It turns out that this algorithm uses only one parallel step more than the direct application of 
equation (1), starting from matrix P. This extra step consists of the computation of the last 
component of xi+l. One can see that the overall number of steps to perform one iteration of Jacobi 
method is 2n + 5 log n + 1. 
The binary tree is suited to execute Lanczos algorithm, too. We adopt the following version of 
the algorithm. 
Algor i thm 5.1 (Lanczos  i terat ion) 
Let r0 be an n-vector, arbitrarily chosen; 
let ~0 = 1, qo = O, ql = r0; 
for any step i( i  = l, 2 , . . . ,  n): 
(1) If fli-~ =~ O, then compute qi = r~_l/fl i_ 1 
else construct qi orthonormal to q0, ql, • • •, qi- i. 
(2) Compute 
(a) ~t i = qrAq i ,  
(b) r i = (A - g,,)q, - f l i -  i q , -  l, 
r, ll2. 
Each step of Lanczos algorithm can be easily paraUelized, since it consists mainly of scalar 
products, and norm computations. With some efforts, one can show that the running time of the 
parallel version of algorithm 5.1 (one iteration of Lanczos method) is 2b + 15 log n + 7 [5]. 
REFERENCES 
1. D. Heller, A survey of parallel algorithms in numerical linear algebra. SIAM Rev. 20, 740-777 (1978). 
2. H. T. Kung and C. E. Leiserson, Systolic arrays (for VLSI). In Introduction toVLSI Systems (Eds Mead and Conway). 
Addison-Wesley, Reading, Mass. (1979). 
3. F. T. Leighton, New lower bound techniques for VLSI. Proc. 1EEE FOCS, pp. 1-12 (1981). 
4. W. L. Miranker, A survey of parallelism in numerical analysis. SIAM Rev. 13, 524-547 (1971). 
5. C. Puglisi, Algoritmi dell'Algebra Linear¢ in Parallelo, Tesi di Laurea, Dipartimento di Informatica, Universita' diPisa 
(1988). 
6. A. Sameh, Numerical parallel algorithms--A survey. In High Speed Computer and Algorithm Organization, pp. 207-228. 
Academic Press, New York (1977). 
7. L. Snyder, Supercomputers and VLSI: the effect of large scale integration on computer architecture. Adv. Comput. 23, 
2-35 (1984). 
