Abstract. This paper examines common implementations of linear algebra algorithms, such as matrixvector multiplication, matrix-matrix multiplication and the solution of linear equations. The different versions are examined for efficiency on a computer architecture which uses vector processing and has pipelined instruction execution. By using the advanced architectural features of such machines, one can usually achieve maximum performance, and tremendous improvements in terms of execution speed can be seen over conventional computers.
1. Introduction. In this paper we describe why existing algorithms for linear algebra are not usually suited for computers that employ advanced concepts such as pipelining and vector constructs to achieve enhanced performance. We examine the process of refitting or reorganizing an underlying algorithm to conform to the computer architecture, thereby gaining tremendous improvements in execution speeds while sacrificing neither accuracy nor algorithm clarity. This reorganization, where it can be done, is usually conceptually simple at the algorithm level. This paper will not address the issues involved with parallel processing. For a survey of parallel algorithms in linear algebra see the review paper by Heller [8] .
We will not concern ourselves here with an actual implementation on a specific architecture: To do so, one must understand all the subtlety and nuances of that architecture and risk obscuring the fundamental ideas. Rather, we use the features of a vector pipeline machine to understand how various aspects interrelate and how they can be put together to achieve very high execution rates.
We use the term architecture in reference to the organization of the computer as seen by the programmer or algorithm designer. Within the architecture we focus on the instruction set and memory references, and their interaction in terms of performance.
We will concentrate our examination on the behavior of linear algebra algorithms for dense problems that can be accommodated in the main memory of a computer. The solutions proposed here do not, in general, carry over to sparse matrices because of the short vector lengths and the indirect addressing schemes that are so prevalent in sparse matrix calculations. For a discussion of methods for handling the sparse matrix case, see [5] , [7] .
We will focus in particular on algorithms written in Fortran and assembly language. Fortran is an appropriate language, given the scientific nature of the application; occasional use of assembly language enables us to gain the maximum speed possible. The use of Fortran implies some storage organization for array elements. By definition Fortran must have matrix elements stored sequentially by column. As one accesses consecutive elements in a column of an array, the next element is found in the adjacent location. If references are made by rows of the array, accesses to the next element of a row must be offset by the number of elements in a column of the array. This organization will be important in the actual implementation. 2. Vector pipeline concepts. As background we will describe some of the basic features found in supercomputers, concentrating on those features that are particularly relevant to implementing linear algebra algorithms. For a more thorough discussion of supercomputers, see [9] , 11 ], 13]. We will concentrate our attention on an architecture that is "Cray-like" in structure, i.e., one that performs vector operations in a vector register configuration with concurrency provided by pipelining and independent instruction execution. We choose a configuration like the Cray-16] for a number of reasons: It performs well on short vectors, it has a simple instruction set, and it has an extremely fast execution rate for dense, in-core linear algebra problems. Nevertheless, the underlying concepts can be applied to other machines in its class, e.g., the These six instructions would correspond to perhaps 6k + instructions on a conventional computer, where k instructions are necessary for loop branching. Clearly, then, the time to interpret the instructions has been reduced by almost a factor of k, resulting in a significant savings in overhead.
Cray-like hardware typically provides for "simultaneous" execution of a number of elementwise operations through pipelining. Pipelining generally takes the approach of splitting the function to be performed into smaller pieces or stages and allocating separate hardware to each of these stages. Pipelining is analogous to an industrial assembly line where a product moves through a sequence of stations. Each station carries out one step in the manufacturing process, and each of the stations works simultaneously on different units in different phases of completion. With pipelining, the functional units (floating point adder, floating point multiplier, etc.) can be divided into several suboperations that must be carried out in sequence. A pipelined functional unit, then, is divided into stages, each of which does a portion of the work in, say, one clock period. At the end of each clock period, partial results are passed to the next stage and partial results are accepted from the previous stage.
The goal of pipelined functional units is clearly performance. After some initial startup time, which depends on the number of stages (called the length of the pipeline, or pipe length), the functional unit can turn out one result per clock period as long as a new pair of operands is supplied to the first stage every clock period. Thus, the rate is independent of the length of the pipeline and depends only on the rate at which operands are fed into the pipeline. Therefore, if two vectors of length k are to be added, and if the floating point adder requires 3 clock periods to complete, it would take 3 + k clock periods to add the two vectors together, as opposed to 3 * k clock periods in a conventional computer.
Another feature that is used to achieve high rates of execution is chaining. Chaining is a technique whereby the output register of one vector instruction is the same as one of the input registers for the next vector instruction. If the instructions use separate functional units, the hardware will start the second vector operation during the clock period when the first result from the first operation is just leaving its functional unit. A copy of the result is forwarded directly to the second functional unit and the first execution of the second vector is started. The net result is that the execution of both vector operations takes only the second functional unit startup time longer than the first vector operation. The effect is that of having a new instruction which performs the same operation as that of the two functional units that have been chained together. On the Cray in addition to the arithmetic operations, vector loads from memory to vector registers can be chained with other arithmetic operations.
For example, let us consider a case involving a scalar-vector multiplication, followed by a vector-vector addition, where the addition operation depends on the results of the multiplication. Without chaining, but with pipelined functional units, the operation would take a + k + m + k clock periods, where a is the time to start the vector addition (length of the vector addition pipeline) and m is the time to start a vector multiplication (length of the vector multiplication pipeline). With chaining, as soon as a result is produced from the adder, it is fed directly into the multiplication unit, so the total time is a + m + k. We may represent this process graphically as in Fig. 1 . It is also possible to overlap operations if the two operations are independent. If a vector addition and an independent vector multiplication are to be processed, the resulting timing graph might look like Fig. 2 .
To describe the time to complete a vector operation, we use the concept of a chime [6] . A chime (for chaining time) is a measure of the time needed to complete a sequence of vector operations. To compute the number of chimes necessary for a sequence of operations, one divides the total time to complete the operations by the vector length. Overhead of startup and scalar work are usually ignored in counting chimes, and only the integer part is reported. For example, in the graph for unchained operations above there are two chimes, whereas in the graph for the chained operation there is one chime. As Fong and Jordan [6] [6] , [10] . This level is achieved when vectors are retained in registers, operations are performed using chaining, and the results are stored in registers.
Dramatic improvements in rates of execution are realized in going from scalar to vector and from vector to supervector speeds. We show in Fig. 3 3. Matrix-vector example. We are now ready to examine a simple but important operation that pervades scientific computations: the matrix-vector product y ,---A * x, where y is a vector with m elements, A is a matrix with m rows and n columns, and x is a vector with n elements. We will write the matrix-vector product as follows: In some situations it may be necessary to calculate a matrix vector product with additional precision, e.g., when a residual calculation is needed. One immediately thinks of using accumulation of inner product; but from our discussion above, we see that a GAXPYcan be used to accomplish the same task in a purely vector fashion, provided that vector operations can use extended precision arithmetic.
4. Matrix multiplication. We will now look at another fundamental operation in linear algebra: the process of multiplying two matrices together. The process is conceptually simple, but the number of operations to carry out the procedure is large. In comparison to other processes such as solving systems of equations (2/3 n operations) or performing the QR factorization of a matrix (4/3 n operations 17]), matrix multiplication requires more operations (2n operations). (Here we have used matrices of order n for the comparisons. The operation counts reflect floating point multiplication as well as floating point addition. Since addition and multiplication take roughly the same amount of time to execute and each unit can deliver one result per cycle, the standard practice is to count separately each floating point multiplication and floating point addition.)
We wish to find the product of A and B and store the result in C. From relationships in linear algebra we know that if A and B are of dimension m n and n p, respectively, then the matrix C is of dimension m p. We will write the matrix multiplication algorithm as Generic matrix multiplication algorithm. We have intentionally left blank the loop indices and termination points. The loop indices will have variable names i, j, and k, and the termination points for the indices will be m, p, and n, respectively.
Six permutations are possible for arranging the three loop indices. The generic algorithm will give rise to six forms of matrix multiplication. Each implementation will have quite different memory access patterns, which will have an important impact on the performance of the algorithm on a "Cray-like" processor. For an alternative derivation of these six variants, see ].
We summarize the algorithms below. We have placed the initialization of the array in natural locations within each of the algorithms. All the algorithms displayed above perform the same arithmetic operations but in a different sequence; even the roundoff errors are the same. Their performance when implemented in Fortran can vary greatly because of the way information is accessed. What we have done is simply "interchanged the loops" [14] . This rearrangement dramatically affects performance on a "Cray-like" machine. [2] .) In terms of the six algorithms being discussed for matrix multiplication, only the forms that use a GAXPY (forms ikj and jki) can achieve a supervector speed; and of them only the form jki performs well in a Fortran environment because of its column orientation.
Up to this point we have not been concerned with the lengths of vectors. We will now assume that vector registers have some length, say zl (in the case of the Cray-1, the vector length is 64). When matrices have row and/or column dimensions greater than zd, an additional level of structure must be imposed to handle this situation. (A machine like the Cyber 205 does not have this problem since it is a memory to memory architecture; data is streamed from memory to the functional units and back to memory.) Specifically with algorithm jki, columns of the matrix can be segmented to look like the following:
where each segment is of length vl except for the last segment which will contain the remaining elements. The two ways to organize the algorithm are 1) to sequence column segments across the matrix A, developing a single complete segment of C, or 2) to sequence column segments down a column of A, developing a partial segment of C.
These algorithms have an additional looping structure to take care of the vector segmentation needed to process n elements in a vector register of length vl The aims are to keep information in vector registers and to store the results only after all operations are sequenced on that vector. The block row form accomplishes this goal: A vector register is used to accumulate the sum of vector segments over all columns of a matrix. In the block column form, however, after each multiple of a vector is added in a register, it is stored and the next segment is processed.
Graphically we have the following situation:
Block column
Block row
Clearly we want to maintain the vector segment of C in a register to minimize data traffic to and from memory, but mainly we want to avoid storing the information. The block column algorithm does not allow the repeated accumulation of a column of C, since a register cannot hold an entire column of the matrix C. It is interesting to note that while sequential access is essential in a paged environment, a "Cray-like" architecture does not suffer from paging problems. Therefore, the block row provides the best possible situation and will lead to a very high rate of execution.
5. Linear equations. We now turn to one of the procedures that probably uses the most computer time in a scientific environment: solving systems of equations. We will concentrate our analysis on solving systems of equations by Gaussian elimination or, more precisely, the reduction to upper triangular form by means of elementary elimination. To retain clarity, we initially will omit the partial pivoting step; nevertheless, pivoting is vital to ensure numerical stability and will be dealt with later.
Gaussian elimination usually transforms the original square matrix A into the product of two matrices L and U, so that A LU.
The matrices L and U have the same dimension as A; L is unit lower triangular (i.e., zeros above the diagonal and the value one on the diagonal), and U is upper triangular (i.e., zero below the diagonal). The algorithm that produces L and U from A, in general, overwrites the information in the space that A occupied, thereby saving storage. The algorithm, when given the matrix in an array A, will produce in the upper triangular portion of the array A the information describing U and in the lower triangular portion below the diagonal the information describing L.
Like the matrix multiplication algorithm, the algorithm for Gaussian elimination can be described as follows:
Generic Gaussian elimination algorithm. for for for aij aij (aik * akj)/akk. end end end
As before, we have intentionally left blank the information describing the loops. The loop indices will have variable names i, j, and k, but their ranges will differ. Six permutations are possible for arranging these loop indices.
If we fill in the blanks appropriately, we will derive six algorithmic forms of Gaussian elimination. Each form produces exactly the same matrices L and U from A; even the roundoff errors are the same. (We have omitted some numerically critical features, like pivoting, in order to keep the structure simple.) These six algorithms have radically different performance characteristics on a "Cray-like" machine. As in the case of matrix multiplication, we can best investigate these differences by analyzing the patterns of data access.
Forms ijk and jik are variants of the Crout algorithm of Gaussian elimination 17]. (To be more precise, these are variants of the Doolittle algorithm, but for simplicity we refer to them as Crout.) The Crout algorithm can be characterized by its use of inner products to accomplish the decomposition. At the ith step of the algorithm the matrix has the form shown in Fig. 4 .
FIG. 4.
For form ijk, inner products are formed with the ith column of U; and rows through n of the formed L to create the new elements of L;. A similar procedure is followed for form jik, but with the roles of U and L interchanged. Notice that since inner products are performed, if one accumulates the result in extended precision, a more accurate factorization will result.
Form kij is the form most often taught students in a first course in linear algebra. In brief, a multiple of a given row is subtracted from all successive rows to introduce zeros between the diagonal elements of the given row k to the last row (see Fig. 5 ). After zeroing out an element, the zeroing transformation is applied to the remainder of the matrix. This algorithm references elements by rows and, as a result, is not really suited for Fortran. FIG. 5. Form kji is the column variant of the kij row algorithm as described in [15] and is used in the LINPACK collection [3] . This form is organized so that sweeps are made down a column instead of across a row of the matrix. As with form kji, a zeroing transformation is performed, and then that transformation is applied to the remainder of the matrix. Since the operations occur within columns of the matrix, this algorithm is more attractive in Fortran than either of the previous approaches. The basic operation is a SAXPY. Since updates are made to all remaining columns of the matrix, there is no opportunity to accumulate the intermediate result in extended precision.
The final two algorithms, forms ikj and jki, differ from forms kij and kji primarily in how transformations are applied. Here, before zeros are introduced, all previous transformations are applied; and only afterwards is the zeroing transformation applied. Specifically, in form jki, columns through i-are applied to column before zeros are introduced in column i. This is the algorithm described in Fong and Jordan [6] ; see The basic operation of this algorithm is a GAXPY. As the previous transformations are applied, a column of the matrix can remain in a register and need not be stored in memory. Thus, the algorithm has the possibility of accumulating the intermediate vector results in vector extended precision.
Since column orientation is preferable in a Fortran environment, we will concentrate our attention on only three forms: jik, kji, jki which we will refer to as SDOT, SAXPY, and GAXPY respectively. If we were dealing with some other language where row orientation was desirable, then the other three forms would be appropriate for discussion.
Notice that in choosing these forms, only in the Crout variants do we have to consider a "stride" or worry about memory bank conflicts.
Appendix A lists the three forms implemented in Fortran 77.
We turn now to implementing the three forms in a pseudovector assembly language (PVAL). This language is idealized to the extent that we assume that vector segmentation and length of vector registers are not a problem. Segmentation is, nevertheless, an important detail on some vector computers, and we will discuss it later. We also do not label vector registers, but refer instead to their content.
In PVAL, then A more important quantity to measure with respect to vector machines is the memory traffic, i.e., the loads and stores. In most situations the arithmetic will actually be free; just the time spent in loading and storing results is actually observed, since the arithmetic is often hidden under the cost of the loads and stores. We will use the word "touch" to describe a load or a store operation. By examining the PVAL code found in Appendix B we can count the vector load and store instructions used in the various versions of LU decomposition; see Table 2 .
In counting vector and scalar touches, some optimization has been performed. Vectors are retained as long as possible in registers before storing. In sequences of instructions where a scalar and vector are referred to and the scalar is located adjacent to the vector, a vector load is performed of length one greater than the required vector; this procedure saves the scalar load operation, thereby hiding the cost in the vector load. This T + 2 6 optimization assumes that we can extract scalars from the vector registers; some machines may require additional work to do the extraction.
We notice that the SAXPY approach has almost twice as many touches as the GAXPY or SDOT approach. Most of this additional activity comes from vector store operations, which will have a dramatic effect in machines where vector stores disrupt the flow of computations. Even on a machine like the Cray X-MP, which has hardware enhancements to avoid this bottleneck found on the Cray-1, the GAXPY version never hinders performance. Experience on the Cray X-MP has shown that performance is actually increased when a GAXPY is used; this results from fewer bank conflicts and poor code generation by the compiler for the SAXPYversion.
An important distinction to remember when comparing GAXPY and SDOT is that the fundamental operation is different; GAXPY operates on vectors and produces a vector, while SDOT operates.on vectors and produces a scalar. The total number of touches is roughly the same in the two algorithms.
We will next examine the execution times of the three algorithms. [4] for details.)
The timing diagram for the inner loop is given in Fig. 8 .
If the segmentation is done outside the loop on j instead, the vector store can be kept outside the inner loop. Supervector speed can then be achieved. Just as with the matrix multiplication, we want to hold the result in the register as long as possible. Figure 9 shows the optimal access pattern.
The PVAL code in Appendix C has been modified to allow for segmentation. In each case, the segmentation has been chosen to minimize the overhead. We hope in the future that compilers will routinely carry out the process of restructuring.
Looking back, we have been pleasantly relieved to see that algorithms restructured for a vector pipeline architecture perform as well or better than their predecessors on conventional machines.
Looking forward, we admit that we may again have to go through this exercise for multiprocessor machines such as the Cray X-MP, Cray-2, Cyber 2xx and the Denelcor HEP. 
