Matrix multiplication is a computation and communication intensive problem Six parallel algorithms for matrix multiplication on the Connection Machine are presented and compared with respect to their performance and processor usage. For n by n matrices, the algorithms have theoretical running times of O(n log n), O(n log n), O(n), and O(1og n), and require n, n , n , and n processors, respectively. With careful attention to communication patterns, the theoretically predicted runtimes can indeed be achieved in practice. The parallel algorithms illustrate the tradeoffs between performance, communication cost, and processor usage. 
Introduction
to save this padding, the maximum size of arrays that can be run on the Chf would not increase. Furthermore, the algorithms do not introduce dummy processors except proportional to the padding just discussed. Thus, the algorithms waste neither space nor processors frivolously.
By leaving out special-casing for array sizes, we exclude some of the faster algorithms. For example, a fast O ( n ) systolic algorithm requires square matrices. However, by insisting on generality we obtain programs that run efficiently and without waste, independent of whether the matrices are square, skinny, or fat.
For brevity, however, section headings quote performance characteristics for square arrays of size n x n; we leave it as an exercise to the reader to derive the precise formulas for general, rectangular arrays.
A further aspect regarding generality is that all input and output matrices for a given algorithm are allocated in the same manner, for example row-major. We do not assume that matrices A or B are transposed beforehand t o speed up communication. If a matrix must be transposed or brought into some other special alignment, then our algorithms include the necessary steps, and our measurements include the time consumed by these steps.
All algorithms use double precision floating point. All algorithms are implemented in C*, except the sequential one, which is written in C. Reprogramming the C* programs in *Lisp would probably make them run faster.
All algorithms use the general CM communication; they do not take advantage of the grid communication or reduction operations in CM microcode. Most algorithms would run faster by using these features. Programming with some of these features is difficult in C*.
The O(n3) (sequential) algorithm
The sequential algorithm is the standard textbook version with three nested loops, highly optimized. In particular, code motion and strength reduction (replacing multiplication with addition) make address computations fast.
Registers hold indices, offsets, and temporary results. This algorithm does not use the CM at all, because it runs completely on the frontend. I t is included for comparison purposes. 
2.3
In this algorithm, each processor contains a row of each of the matrices A, B, and C. Thus, we need max(1, m) processors. The algorithm computes 1 x n inner products in sequence, one for each element of C. An inner product executes m multiplications in parallel in constant time, and then a parallel sum reduction to produce the sum in O(1og m) time The O(n2 log n) algorithm
2.4
This algorithm is a straight-forward generalization of the previous one. Instead of using m processors to perform a single, parallel inner product step at a time, we use I x m processors to produce an entire column of C at once.
Each processor contains one element of A, B , and C. Assume that matrix elements are assigned in row-major order to the processors (see Fig 2) .
For computing a column of C at once, we first need to broadcast the relevant column of B to all rows of A, then execute I x n multiplications in parallel, and then perform I sum reductions in parallel. The multiplications take constant time and the sum reductions O(1ogm) time, including communication. The broadcast of columns of B must be arranged carefully as follows.
There are several alternatives for implementing the broadcast on the CM. First fanout tree has the same structure as a tree for reduction, except that data flows from the root to the leaves instead of vice versa. Since we need a tree for each column element, we actually need to construct a parallel fanout forest. The fanout forest broadcasts B's column in logarithmic time.
In detail, the fanout forest operates as follows. First, we "seed" the column of B into the first row of A , in parallel for all elements. Next, we instruct the first row of A to duplicate the seeded coefficients one row down, also in parallel. Next, the first 2 rows of A to duplicate their elements 2 rows down, then the first 4 rows duplicate 4 rows down, and so on. In each step, the number of copies of the column of B doubles, until the entire matrix A is filled. This process takes O(log I ) communication steps. We programmed the broadcast explicitly, although the CM actually provides a primitive for it. Using the primitive instead would significantly speed up that portion of the program.
Another detail concerns the sum-reductions. The I sum-reductions run along the rows of A , orthogonal to the broadcast. Again, we programmed this process directly rather than using the corresponding CM primitive. Using this primitive also speeds up the program.
The fast O(n1ogn) algorithm
When analyzing the broadcast operation in the previous algorithm, one notices that it uses processors and communication bandwidths poorly. In the first step of the broadcast, only m of the I x m processors operate, and in the last one, barely half of the processors operate. The algorithm in this section eliminates the slow broadcast altogether.
As before, we lay out matrices one element per processor, in row-major order. Figure 3 illustrates. First, we transpose matrix B and overlay it onto A; the router performs this operation in constant time. The transposed overlay has the effect that row i of A is lined up with column i of B. Next, we perform 1 parallel inner product steps, producing the main diagonal of
As the next step, we rotate the transposed matrix B up one row, with the topmost row reentering at the bottom. NOW row i of A is lined up with column (i + 1) mod n, and we compute the upper main diagonal of C, along with element Ci-l,o. After n steps of inner product computation and rotation, C is complete.
This algorithm uses the bandwidth of the communication network and the processors effectively. Again, all communication is implemented directly, rather than using CM primitives. A detail involves the relative sizes of A and B : In general, the number of rows of A does not match the number of columns of B , so the transposition and rotations must be done with care. Our approach was to keep the smaller of the two arrays in place and rotate the larger, although the opposite might improve performance somewhat.
The slow O(n) algorithm
A problem with the previous algorithms is that they all perform logarithmic sum reduction. The algorithm in this section avoids the corresponding factor of O(1og n) by distributing the cost of the addition over the communication, and thus achieves linear runtime.
The algorithm is called "systolic", because it alternates between two distinct phases, a communication phase and a computation phase. Assume we have I x n processors, each computing one element of the result matrix Line up in the right order, the rows and columns enter C in skewed order, i.e., they are delayed from entering by one step per row or column. Processors without coefficients in a particular step are simply disabled. Figure 5 shows the resulting configuration after 3 shifts. For more details on the systolic algorithms, see reference lead to the most efficient implementation. Since the Connection Machine is SIMD, the injection and the shifts cannot occur simultaneously: First, the interior elements retrieve from the north and west, and then the leading row and column of C send for their next elements. Thus, the communication phase has two subphases, each of which idles a significant portion of the processors. We therefore modified the algorithm, realizing that the shifts are only an artifact of the topology of systolic processor arrays. With general communication on the CM, any processor can retrieve data from any other processor in essentially the same time. Thus, during each communication phase, every element of C retrieves the required coefficients from A and B directly. The address computation is identical for all elements and splitting the communication phase into subphases is avoided. Since no two elements of C retrieve the same coefficients, there are no problems with fanout as in Section 2.4. In section 3, we report only on the faster of the two variants.
2.7-The fast O ( n ) algorithm
A flaw of the previous algorithm is that even the faster variant underutilizes the processors. The activation of processors spreads from the north-west corner towards the south-east corner, with never more than half the processors busy. On average, processor utilization is only 1/3. How can we keep for the elements of C need not be computed in the same order. Consider
The answer to this question derives from the fact that the inner products
Because of the commutativity of the addition, there is no need to accumulate the sum starting with IC = 0. Instead, we could start with any ki,j in [O.. . m -11, sum to m -1, then "wrap around" and add the the terms from 0 to ki,j -1. Observe furthermore, that in each of the I X n multiplications we must make sure that no two use the same coefficients from A or B , because we would have a slowdown caused by fanout otherwise. We can exploit the commutativity of the addition to achieve this separation. One way to use different coefficients everywhere is to let ki,j = ( i t j ) mod m. In other words, the starting index for building the inner product is skewed for each row of A and column of B , guaranteeing that no element of A or B is used twice in a single inner product step. Figure 6 illustrates the starting assignment of k;,j in a 4 x 4 array. For square matrices, this approach is equivalent to overlaying A , B , and C, skewing the rows of A and the columns of B with wrap-around, and then rotating the rows of A and the columns of B during each step. The rotation does not work well for non-square matrices, and since we did not use the general grid addressing on the CM, rotation has no advantage over 
The O(1ogn) algorithm
The fastest algorithm is one that uses n3 processors to compute all n3 products .simultaneously, and then performs n2 sum reductions in parallel to produce C. The multiplication takes constant and the sum reduction logarithmic time. We also must take into account the duplication and alignment of data prior to the multiplication. Each row of A and each column of B must be dupIicated n and I times, respectively, and paired properly one with the &her. Two fanout forests, one for A and one for B , broadcast and align the data in logarithmic time. Figure 7 illustrates the pairings of rows and columns for 3 x 3 matrices.
Performance Results
We implemented d algorithms in C* and timed them on a Connection Machine model 2 with 32K processors (system version 5.0, field test), without floating point chips. As stated, all arithmetic (except address calculations) used double-precision floating point. The frontend controlling the CM was a DEC VAX under the ULTRIX operating system.
Since the speed of the VAX is often no match for the CM, we report timings measured only on the CM. Although this may be unrealistic for the given configuration, we believe this choice is justified for the following reasons. When the virtual to physical processor ratio is 1, the frontend time is typically more than twice as high as the CM time, with a CM utilization of less than 50 per cent. This indicates that the VAX cannot keep up with the CM, mainly because of its raw MIPS rating, but also because the frontend is timeshared among other users. When the virtual to physical processor ratio is high, such as 8, then the times on the frontend and the CM are nearly identical (for both elapsed times and the combined system and user times). This is an indication that the simulation of virtual processors is slowing the CM enough to match the speed of the VAX. ' We are therefore convinced that in this situation, the timings on the CM are more accurate. For a well-matched, faster frontend, such as a Symbolics Lisp machine or a SUN/4, timings would best be taken on the frontend.
For simplicity, all measurements were run with square matrices. The results are summarized in Figures 8 and 9 . Figure 8 shows all 7 curves for array sizes of up to 250 x 250. We shall discuss the curves clockwise, starting form the top left. The leftmost curve represents the O(n2 log n) algorithm, using n processors. This algorithm appears slower than even the sequential, 0(n3) algorithm (second curve). This is not surprising, since the difference between logn and n is not enough to offset the difference between a 1-bit processor and a 32-bit processor for the small values of n shown. Note, however, that the two curves will eventually cross, since the first is of a lower order. We estimate that the crossover point is n zi 450.
'Note that the virtual processor simulation is performed by the microcode of the CJI: Each instruction issued by the frontend is repeated implicitly by the CM's instruction decoder for the number of virtual processors assigned to the physical processors. Using n2 processors boosts performance far beyond that of the sequential processor. The crossover points occur early, clearly demonstrating that slow, but numerous processors can outperform a single, fast sequential processor. The fast O ( n log n) algorithm is almost twice as fast as the slower variant, demonstrating the significance of communication costs. The faster algorithm is almost as fast as the slow linear algorithm. The second linear algorithm is another factor of 2 faster, demonstrating the effect of full utilization of processor and communication bandwidth.
12

ORIGINAL PAGE IS
OF POOR QUALITY
A curious effect is the jump in these 4 curves, occurring for a problem size of about 180. Note that at this point, 32,400 processors are in use, which is the capacity of the CM available (32,768 processors). An increase of the problem size beyond 181 requires a virtual to physical processor ratio of 2. Thus, execution times double, and the gradient of the curve doubles also. The next such doubling would occur for n > 256, and then again for n > 362. On a full connection machine with 65,536 processors, the first jump would not occur until n > 256.
By doubling the virtual processor ratio, the times do not quite double, since communication between virtual processors simulated on the same physical processor is more efficient than communication among separate physical processors. However, the savings observable are minor. In the case of the fastest linear algorithm, increasing the virtual to physical processor ratio from 1 to 2 increased the time by a factor of about 1.93 rather than 2. Thus, for matrix multiplication, the savings are only 7 per cent.
The last curve is for the O(1og n) algorithm, which requires n3 processors. The effect of the virtual to physical processor ratio for some of the less processor-intensive problems is shown in Figure 9 . For larger problem sizes, the cubic behavior of the matrix multiplication cannot be denied. The performance of the "linear" algorithm is still cubic, once the number of processors is exhausted. Essentially, the performance curve is a cubic parabola, divided by the large constant factor of 32,000. The straight line a t the bottom is the virtual time of each processor. This is the time we would see if we had as many processors as memory words.
Note that the absolute performance achieved is hardly overwhelming. Counting only double-precision floating point additions and multiplications, the fast sequential algorithm achieves only about 4 Mflops for 1SO x 180 arrays; this number would double for a full (64K processor) CM-2. Including arithmetic instructions for address calculations, we come up with about 20 Mips, or 40Mips for a full CM-2. (This number excludes the instructions for stack manipulation and communication). Thus, we achieved only about 1/60 of the "typical application performance" quoted by Thinking Machines Corporation for general computing. Apparently, our algorithms are communication bound, and using the special features of the communication network would pay significant dividends. The numbers also demonstrate how difficult it is with the present programming languages to harness the power of the CM.
0
200 400 600 800 1000 The measurements in Figure 8 were taken in increments of 10, so the 'curves are quite accurate. Each individual point was computed by repeating the same problem long enough to have a cumulative runtime of between 2 and 4 minutes, and then dividing by the number of actual runs. The variance in time for the individual runs was so small as to be invisible on the diagram. For Figure 9 , we took measurements in increments of 50, with increments of 10 around the points were virtual to physical processor ratios change. Some of the higher points in this diagram represent the average of only a few runs. rethought. CM programmers have already discovered some new and interesting, totally parallel solutions for many problems, from multi-grid methods to document retrieval to ray tracing. Furthermore, we predict that many of our sequential algorithms will turn out to be special cases of parallel ones.
A second important insight is that with the right choice of algorithm and communication pattern, the speedup attainable is indeed proportional to the number of processors used. With few exceptions, all previous experiments with multiprocessors showed a point of diminishing and even reversing returns, when the addition of processors did not speed up a program proportionally or even slowed it down. At no time did we observe these effects on the CM; performance was always within a constant factor of the theoretically predicted, asymptotic performance. We suspect that earlier multiprocessors simply had insufficient communication bandwidth and high synchronization overhead. Because of the SIMD nature of the CM, there is no synchronization overhead, and the bandwidth of the hypercube is well matched to the demands that the processors can generate.
We can also confirm that the concept of the virtual processor is a great simplification for parallel programming. Not having to write twisted code for mapping a given problem onto a particular set of processors makes for easily written, easily understood, and easily ported programs. Further study is required to make this concept applicable when programs need to change the number of virtual processors dynamically.
There are also a number of negative conclusions. First, using a superlinear polynomial of processors severely limits the problem size, and the resulting program may not run efficiently because of the overhead of virtualization. In our example, using one processor per data element yielded the best overall performance. However, for small problem sizes, a superlinear number of processors is the best way t o bring the entire available hardware to bear on a problem.
Second, it became quite clear that automatically transforming "dusty deck" sequential problems to large-scale parallel ones is a pipedream. Considering matrix multiplication, it is easy to see how a compiler would detect the inner loop of the sequential program and transform it into a vector operation. However, we severly doubt whether a general compiler could be built that could generate all six variants we discussed from a single, sequential program. If automatic transformation can be done at all, it would have to start with the problem specification and not with a sequential implementation. In a sequential program, too many opportunities for parallelism have been hidden or eliminated.
A number of further studies should be done to get a better grasp of the idiosyncrasies of the Connection Machine. First, all programs should be rewritten in *Lisp, to compare the quality of the two two language implementations and the effect of the frontend. Second, to quantify the potential gains from the special features of the router, all programs should be modified t o use them. Preliminary experiments have shown that by using just the reduction operators, the O(n1ogn) algorithms run almost as fast as the corresponding linear algorithms. Of course, the linear algorithms could also be improved by using grid addressing. Finally, the ratio of communication time t o computation time should be determined by simply leaving out the floating point operations. It appears that all our implementations are communication bound and that floating point operations actually consume a negligible percentage of the time. Matrix multiplication shares this property with many other problems. Perhaps communication cost will turn out t o be the dominant cost for all large-scale parallel algorithms. 
18
A General remarks about the programs
All programs have a macro called DEBUG. When this macro is defined, either in the program directly, or via the -D option on the cc or cs command line, then detailed tracing information about the matrices will be printed. With the exception of the sequential algorithm, array dimensions are compiled into the programs. By using constants rather than variables, the programs run about 10 percent faster on the CM. There is no noticeable difference for the sequential algorithm. The matrices a r e allocated such t h a t each processor has one row of each matrix. The algorithm perforas n*n inner products i n sequence. All communication is done by t h e router. number of processors: n; performance: n*n*logn . r e g i s t e r i n t mono run. /* e l l h i n a t i n g temp causes a collision bug*/ The matrices are allocated such t h a t each processor has one elnent of o t each matrix. Each c o l u u~ of the second matrix is broadcast over t h e rows of the f i r s t matrix. then the products are a l l formed i n p a r a l l e l , and the rows are sum~ed i n parallel. This is repeated f o r every column * of the second matrix. A l l extern unsigned ~~virtual,to,ph~sic~,processor~ratio; /* (V*W>/(p*q) */ int mono rows. int mono cols) :
void nain(int argc, char *argvO) € r e g i s t e r int mono s t r i d e ; r e g i s t e r int mono E-col; /* runs through column numbers of B */ r e g i s t e r int mono 1~11. nun-of-runs; CH-timeval-t mono timer-results; 
