The Connection Machine Scientific Software Library (CMSSL) is a library of scientific routines designed for distributed memory architectures. The basic linear algebra subroutines (BLAS) of the CMSSL have been implemented as a two-level structure to exploit optimizations local to nodes and across nodes. This paper presents the implementation considerations and performance of the local BLAS, or BLAS local to each node of the system. A wide variety of loop structures and unrollings have been implemented in order to achieve a uniform and high performance, irrespective of the data layout in node memory. The CMSSL is the only existing high performance library capable of supporting both the data parallel and message-passing modes of programming a distributed memory computer. The implications of implementing BLAS on distributed memory computers are considered in this light.
Introduction
The connection machine systems provide high-performance computation on large, data-intensive problems and have found successful application in a diverse range of floating-point intensive applications. The Connection Machine Scientific Software Library (CMSSL) (Thinking Machines Corporation, 1993) is a software library designed for efficient use of distributed memory architectures, in particular the connection machine systems CM-2/200 and CM-5/5E. The CMSSL includes many algorithms commonly used in scientific computation, such as routines for dense and sparse matrix manipulation, linear system solvers, eigensystem analysis, fast Fourier transforms, and differential equation solvers. The foundation of many of these algorithms is a set of relatively simple linear algebra operations known as the basic linear algebra subroutines (BLAS) (Lawson et al., 1979a, 1979b). In order to achieve high overall performance, it is necessary to optimize these fundamental operations.
Highly optimized BLAS implementations are available on most high performance computers. Initially, these routines comprised operations on rank-1 operands (vectors). These routines, termed the level-1 BLAS, found widespread acceptance among developers and were incorporated into a wide array of software applications. The rise of pipelined computer architectures, particularly vector computers with a limited bandwidth to memory (one operand per cycle), revealed a need for codes that could retain a vector of data in the register set. The level-2 BLAS were introduced (Dongarra et al., 1988a, 1988b) to operate on rank-2 operands (matrices). More recently, the widespread introduction of architectures with hierarchical memory structures, particularly data caches, led to the design of the level-3 BLAS (Dongarra et al., 1990a, 1990b). These codes operate on pairs of matrices and are able to exploit a blocked memory access scheme. Blocked memory access schemes are also becoming increasingly important in noncached memory systems due to the organization of memory chips (Comerford and Watson, 1992; Farmwald and Mooring, 1992).
Distributed memory architectures introduce an added level of complexity into numerical operations such as the BLAS. Matrices may be distributed across individual processor memories in many ways. Furthermore, the matrix operands local to each processor may be laid out in a wide variety of ways. In order to achieve high performance, matrix-matrix multiplication may be performed using one of several systolic algorithms. Other algorithms, e.g., those based on a broadcast of one or more operands, can also be chosen. The selection is based on the shapes and sizes of the operands, their distribution across the memory units (often referred to as layout), the overall availability of memory on the machine, and estimates of the performance for the various alternatives. The global software layer for the BLAS on the CM-5/5E is referred to as the distributed BLAS (DBLAS) (Johnsson and Mathur, 1992). The ScaLAPACK library (Choi et al., 1992) was introduced as a library for performing distributed BLAS computations based on specific data layouts and generic level-3 BLAS routines.
The DBLAS control how data are moved between processing nodes and which operations are to be performed on each node at each step. Irrespective of which algorithm is selected by the DBLAS, it is still necessary to perform operations on the segments of matrices local to each node. In order to optimize these operations, a second level of BLAS, the local BLAS (LBLAS), was designed and implemented. The LBLAS for the connection machine system CM-2/200 were described by Johnsson and Ortiz (1992). This paper describes the implementation and performance of the LBLAS on the CM-5/5E. The CMSSL is designed to support languages with array syntax. Various standards for such languages have been proposed, such as Fortran 90 (Metcalf and Reid, 1991) and High Performance Fortran (HPF) (High Performance Fortran Forum, 1993). Currently, the CMSSL supports two such languages: CM Fortran (CMF), which heavily influenced the design of HPF, and C*, which is being used as a strawman language for the design of a standard for a parallel C language. The languages determine the data structures and their representations, as well as the distribution of the operands upon which the library routines operate.
The implementation of the languages determines what information is available to library routines and heavily influences the design of interfaces to the library. Library interfaces must be consistent with the definition of the language, but should also be consistent with the spirit of the language (libraries are in effect extensions to a language) and provide sufficient information for a high degree of optimization. High resource utilization is one important justification for software libraries. In the CMSSL, array operands are passed in place, and all parallel computation occurs in a data parallel or "SPMD" fashion.

Table 1 Functionality of the LBLAS on the CM-5/5E
The LBLAS have found application far beyond the scope of the DBLAS. Many applications have been written that exploit the locally optimized LBLAS. Within the CMSSL, the LBLAS are used in a number of algorithms, such as linear system solvers and banded system solvers.
Although the LBLAS are accessible for "embarrassingly parallel" computations through the DBLAS interfaces, such calls incur an unnecessary overhead when the calling program knows that the distribution of the operands is such that the computations are in fact embarrassingly parallel. The two-tiered approach to the BLAS on distributed memory architectures is thus justified both from an implementation point of view and from a user point of view. The latter is supported, at the moment, through public DBLAS interfaces and LBLAS interfaces internal to the CMSSL. Section 2 of this paper outlines the types of operations that have been implemented in the LBLAS and how these differ from traditional BLAS operations. In order to explain the performance and optimizations of the LBLAS, section 3 briefly outlines the architecture of the CM-5/5E, placing particular emphasis on the features of the CM-5/5E nodes that affect LBLAS performance. Section 4 describes the architecture of the LBLAS. Implementation issues and measured performance of the level-1 LBLAS are described in section 5, while the level-2 LBLAS are presented in section 6. A brief discussion of the level-3 LBLAS is given in section 7. Many of the decisions on loop partitioning and loop ordering for high performance cannot be made until runtime; the mechanism for runtime selection of LBLAS kernels is discussed in section 8. The paper concludes with a summary and discussion of results.
2 The Scope of the LBLAS

2.1 FUNCTIONALITY

The functionality currently implemented in the LBLAS is summarized in Table 1. The functionality of the LBLAS differs from the corresponding standard BLAS in a few ways. One general difference is the lack of scaling variables. The reason for this difference is that the DBLAS, which the LBLAS support, were designed without scaling variables for efficiency. In the DBLAS, scaling variables take the form of arrays of rank one less than the arrays representing vectors in the standard BLAS. On the connection machine systems, the runtime system distributes arrays evenly over the nodes, with the default subgrid shape having axes of approximately equal extents. With this data distribution scheme, a large communication overhead may be incurred when scaling parameters are introduced in the DBLAS, since the array of scaling parameters is, in general, not aligned with any other operand. Explicitly aligning the array of scaling parameters with another operand array through the use of compiler directives may increase the storage requirements substantially. Therefore, scaling is left to the user in the current DBLAS and LBLAS implementations.
MULTIPLE-INSTANCE OPERATIONS
One distinguishing characteristic of the CMSSL is that it supports a multiple-instance interface. The need to perform the same operation on multiple independent operands arises in a wide variety of scientific problems and in divide-and-conquer algorithms. An example is the solution of the Navier-Stokes equations in fluid dynamics, which requires a matrix-vector multiplication at each grid point (Olsson and Johnsson, 1990). In the traditional formulation, a loop would serially invoke a call to the matrix-vector product BLAS at each grid point. In the multiple-instance formulation, the operations at all grid points are treated as a single call to a larger problem. Figure 1 illustrates an inner-product operation on arrays with a single instance-axis of extent n_instances. In Figure 1a, a number of calls to the inner-product routine are required. In Figure 1b, a single function call is required. The benefit of this formulation is that within the library routine an additional degree of freedom is available for optimization. Moreover, this degree of freedom represents embarrassingly parallel computations. We refer to the axis (axes) defining the elements of a single operand as the problem-axis (axes) for that operand, and to the other axes as instance-axes. The problem-axes are always present, while instance-axes may be absent, as for a single-instance computation. It is entirely possible, in fact very likely because of the data distribution rules used by CMF and the connection machine runtime system (Thinking Machines Corporation, 1994), that the axis with stride one in an array is an instance-axis. As it is often desirable to vectorize operations along the axis with stride one, the LBLAS are highly optimized for vectorization along any available axis. Also, the multiple-instance feature generally lowers the overhead associated with looping and temporary storage. Most routines in the CMSSL use the same interfaces for single-instance and multiple-instance computations.
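The difference between the two calling styles of Figure 1 can be sketched as follows (a hypothetical C rendition; the function names are ours, and the two multiple-instance routines assume different layouts, as noted in the comments):

```c
/* Hypothetical sketch of the two calling styles in Figure 1.
   dot_1 is a single-instance inner product.  dots_by_loop stores
   instances contiguously one after another, while dots_multi assumes
   the instance-axis has stride 1, so that the library can use it as
   the axis of vectorization. */
#include <stddef.h>

/* Single-instance inner product: returns sum_i a[i]*b[i]. */
static double dot_1(const double *a, const double *b, size_t n) {
    double c = 0.0;
    for (size_t i = 0; i < n; i++)
        c += a[i] * b[i];
    return c;
}

/* Figure 1a style: the caller loops over instances. */
void dots_by_loop(const double *a, const double *b, double *c,
                  size_t n, size_t n_instances) {
    for (size_t m = 0; m < n_instances; m++)
        c[m] = dot_1(a + m * n, b + m * n, n);
}

/* Figure 1b style: one call; the library sees the instance-axis and
   can vectorize along it (here it is innermost, with stride 1). */
void dots_multi(const double *a, const double *b, double *c,
                size_t n, size_t n_instances) {
    for (size_t m = 0; m < n_instances; m++)
        c[m] = 0.0;
    for (size_t i = 0; i < n; i++)                /* problem-axis  */
        for (size_t m = 0; m < n_instances; m++)  /* instance-axis */
            c[m] += a[i * n_instances + m] * b[i * n_instances + m];
}
```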
Since any operand for the BLAS has either one or two problem-axes, an operand may have one, two, or three contiguous sets of instance-axes. In the DBLAS, these three sets are identified, and within each set the instance-axes are folded into a single axis whenever there is more than one axis in the set. The problem-axes, together with the folded instance-axis that is predicted to offer the best opportunities for performance optimization, are passed to the LBLAS. The LBLAS are designed to handle the problem-axis (axes) and one instance-axis. The ordering and partitioning of the other instance-axes, if any, are handled in the DBLAS calling the LBLAS. These tasks remain with the user if the LBLAS are called directly.
DEFINITIONS AND NOTATION
The functionality of the currently implemented LBLAS can be cast as a generalized matrix multiplication, op_C(C) <- op_C(C) + op_A(A) op_B(B), where C, A, and B are arrays with up to two problem-axes and one instance-axis each. op_A and op_C are of type N or T, for normal or transpose, respectively. op_B is of type N, T, C, or H, where H is Hermitian (complex conjugate transpose) and C is complex conjugate. The combination that is implemented depends on which of six mode parameters is specified. The mode parameter defines how the operands are combined. The six modes are presented in Table 2, where B with an overbar denotes the complex conjugate of B.

Table 2 LBLAS Modes
The LBLAS directly implement level-1 and level-2 functions, while level-3 functions are implemented in terms of level-1 and level-2 LBLAS (as discussed in section 7). The level-3 _GEMM operation has been implemented using the M3 method (Higham, 1992) to achieve maximum performance on complex data.

Fig. 2 The shapes of the operands A, B, and C and the meaning of the parameters p, q, r, and m.
The LBLAS support the four common floating-point data types, namely single (32-bit real), double (64-bit real), single complex (64-bit complex), and double complex (128-bit complex). The LBLAS contain specific optimizations for 32-bit and 64-bit operations. However, there is currently no specific optimization for complex data types: complex LBLAS calls are handled as a sequence of calls to kernels that operate on real data. The shapes of the local arrays for any LBLAS operation can be described in terms of four parameters, p, q, r, and m (see Figure 2). p is the number of rows in each instance in A and C. q is the number of columns in each instance in A and of rows in each instance in B. r is the number of columns in B and C. m is the number of instances in each of the arrays. The arrays defined by the instance-axes must be conforming, and the instances must be ordered in the same way for all operands. Table 3 shows the problem-axes ranks of the operands C, A, and B. Each of these operands has an instance-axis in addition to the ranks shown in the table. The permissible values of p, q, and r are also shown, with arbitrary extents indicated symbolically.
Finally, Table 3 also shows the number of floating-point operations per memory reference for the supported LBLAS functions.
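As a reference point, the generalized multiple-instance operation can be written as the following deliberately unoptimized C loop nest; the instance-major, column-major layout and the function name are our assumptions made for brevity, whereas the real LBLAS accept arbitrary strides on every axis:

```c
/* Unoptimized reference for the generalized operation of this section
   with op = N on all operands: for each instance, C <- C + A*B, with
   A p-by-q, B q-by-r, and C p-by-r. */
#include <stddef.h>

void lblas_ref(double *C, const double *A, const double *B,
               int p, int q, int r, int m) {
    for (int inst = 0; inst < m; inst++) {
        const double *a = A + (size_t)inst * p * q;
        const double *b = B + (size_t)inst * q * r;
        double       *c = C + (size_t)inst * p * r;
        for (int j = 0; j < r; j++)         /* columns of B and C */
            for (int k = 0; k < q; k++)     /* contraction axis   */
                for (int i = 0; i < p; i++) /* rows of A and C    */
                    c[i + j * p] += a[i + k * p] * b[k + j * q];
    }
}
```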
3 Architecture of the CM-5/5E
To understand the design of and optimizations in the LBLAS described in this paper, some familiarity with the CM-5/5E architecture (Thinking Machines Corporation, 1991) is required. The CM-5/5E system was designed for up to 16,384 processing nodes, each with its own memory. A collection of nodes is supervised by a control processor (CP), although the nodes may operate independently in MIMD mode. Each node is connected to two low-latency, high-bandwidth interconnection networks: the data network and the control network. The data network is generally used for point-to-point interprocessor communication, and the control network is used for operations such as synchronization and broadcasting. A third network, the diagnostics network, which is not available to the user, is used for ensuring the status of the system. As the LBLAS performance depends only on nodal performance and not on that of the networks, the networks will not be further discussed in this paper. Each node consists of a node processor, memory, and, optionally, four vector units (VUs). The LBLAS exist for all node configurations, but only the version for the CM-5/5E with VUs is highly optimized. The non-VU version of the LBLAS is not discussed in this paper. The four VUs are memory mapped into the node processor address space.
The node processor serves as a controller and coordinator for the VUs. Each VU consists of interfacing and control units as well as a register file. The register file can be viewed as 64 64-bit words, or as 128 32-bit words. The register file in each VU is reconfigurable into sets of vector registers. For example, in 64-bit mode the register file can be viewed as a single vector register of depth 64, two vector registers of depth 32 each, four vector registers of depth 16 each, and so on. The VU instruction set supports a four-stage pipeline. The maximum vector length is 16. One significant restriction in the VU instruction set is the lack of support for indirect addressing of registers. This results in a need to explicitly code loop bounds into the kernels. Kernels with similar structures that differ in loop extents or register allocations need to be individually coded and optimized. This results in a large number of distinct kernels.
The clock frequency of the node processor of the CM-5/5E is 40 MHz, and the VUs operate at half this frequency, which matches the clock rate of the memory. Each VU contains an ALU that can perform one multiply/add or multiply/subtract operation on 64-bit floating-point or integer operands per clock cycle. Thus the peak performance of each VU is 40 Mflop/s, and the peak performance of each node is 160 Mflop/s. The VUs also support operations on 32-bit operands, but at the same performance level as for 64-bit operands.
Each VU is designed to address up to 128 Mbytes. The memory size per node is tied to the density of the memory chips used. With 4-Mbit memory chips, each VU has 8 Mbytes of memory, while 16-Mbit memory technology yields a VU memory of 32 Mbytes. Each VU has a single 64-bit-wide data path to memory, giving a peak memory bandwidth per node of 640 Mbytes/s.
In order to maintain a small package, memory chips are organized as a matrix, with row and column addresses multiplexed onto the same pins. Row accesses are buffered, allowing single-cycle access to columns within the current row, while a new row access requires several cycles. The VU memory is dynamic RAM, with a DRAM page size of either 8 or 16 Kbytes. If the VU accesses two successive memory locations that are not on the same DRAM page, a DRAM page fault occurs, and the pipeline is stalled for five VU cycles. Hence it is desirable to implement looping structures within the LBLAS that minimize DRAM page faults.
Another performance hazard arising from the node architecture occurs because the VU and its associated memory are mapped into the memory space of the node processor. The node processor has a direct mapped translation lookaside buffer (TLB) that generates the physical memory locations of the data. With current generations of CMOST (the connection machine operating system), each of 64 TLB pages addresses 4 Kbytes of data. If the data being addressed are on a page not in the TLB, the pipeline is stalled for approximately 25 VU cycles. To compound this problem, TLB thrashing can occur. Consider a code that accesses a number of successive rows of data in a loop. If the memory stride between rows is greater than 4 Kbytes, then a TLB miss occurs with each row access on the first iteration through the loop. If the number of rows is greater than the number of available TLB pages, then the TLB is not able to hold all of the required pages concurrently. In this case, a TLB miss may occur as often as every vector instruction, or even every memory access. This severely degrades performance, and a looping structure that minimizes the risk of TLB thrashing is highly desirable.
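The following C sketch (ours, with illustrative constants; the real kernels apply this blocking inside the LBLAS loop structure) shows the kind of blocking that bounds the number of rows touched in the inner loops so that their pages remain TLB resident:

```c
/* Process rows in blocks small enough for all touched pages to stay
   resident, instead of sweeping all rows in the inner loop (which
   thrashes a 64-entry TLB when the row count exceeds the entry count
   and the row stride exceeds a page). */
#include <stddef.h>

#define TLB_ENTRIES 64

void scale_rows_blocked(double *a, size_t nrows, size_t ncols,
                        size_t row_stride, double s) {
    size_t block = TLB_ENTRIES / 2;  /* leave entries for other operands */
    for (size_t r0 = 0; r0 < nrows; r0 += block) {
        size_t r1 = (r0 + block < nrows) ? r0 + block : nrows;
        for (size_t j = 0; j < ncols; j++)    /* step along columns */
            for (size_t r = r0; r < r1; r++)  /* rows within block  */
                a[r * row_stride + j] *= s;
    }
}
```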
Operations on 32-bit data require special attention for optimum performance on the CM-5/5E. The memory units are 64 bits wide, and 32-bit data are stored in halfwords.
However, each load always brings in a full 64-bit word, and every store is a 64-bit store. 32-bit stores are therefore carried out as read-modify-write operations, and a read-modify-write instruction requires 3.5 cycles. Thus, if two 32-bit results can be stored in the same 64-bit word, the store can be made in one cycle, or half a cycle per 32-bit word, a speedup by a factor of seven compared to storing the 32-bit words individually. Using both 32-bit operands in a 64-bit word loaded into the register file reduces the effective cost of a 32-bit load to half a cycle.
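The following sketch illustrates the pairing idea; it is not CMSSL code, and on the VUs the packing happens in the register file, with memcpy used here only as a portable stand-in for the 64-bit move. The destination is assumed to be 64-bit aligned.

```c
/* Compute two adjacent 32-bit results, pack them, and issue one
   full-word store instead of two read-modify-write halfword stores. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void saxpy_paired(float *c, const float *b, float a, size_t n) {
    size_t i = 0;
    for (; i + 1 < n; i += 2) {   /* paired results, one 64-bit store */
        float lo = c[i]     + a * b[i];
        float hi = c[i + 1] + a * b[i + 1];
        uint64_t word;
        memcpy(&word, &lo, 4);
        memcpy((char *)&word + 4, &hi, 4);
        memcpy(&c[i], &word, 8);
    }
    if (i < n)                    /* odd tail: a lone halfword store  */
        c[i] += a * b[i];
}
```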
4 The LBLAS Architecture

EXECUTION CONTROL
The control of the execution of the LBLAS is divided among the control processor, the node processor, and the vector units. The LBLAS support two data parallel languages: CMF and C*. The LBLAS also support three programming models: CMF or C* with a global address space, CMF or C* with a nodal address space using message passing for internode communication, and CMF or C* in global/local mode. The global/local mode corresponds to the extrinsic procedures option in High Performance Fortran (High Performance Fortran Forum, 1993).
The LBLAS code resides in the memory system of the CP. Array data reside in the memory of the VUs, while scalar data reside in the memory of the CP. This rule implies that, for instance, the result of a single inner product resides in CP memory while the input arrays reside in VU memory. But the result of a multiple-instance inner product is an array, which resides in VU memory.
Note, however, that the result array is of rank one less than that of the input operands and thus, in general, will have a distribution different from that of the input operands.
Upon a call to a routine in the LBLAS, the CP invokes the execution of the code blocks on the node processors it manages. All node processors execute the same code block in an "SPMD" fashion. Synchronization of the node processors takes place upon completion of the execution of each code block. The node processors simultaneously perform a number of tasks, such as the determination of array addresses, strides, and loop bounds. The node processors also determine which particular set of kernels will be executed and the order of execution. As the "best" kernel to use in any situation depends on the shape of the operands and their allocation in memory, which are known only at runtime, the decision of which collection of kernels to use can only be made at runtime.
Selection of a suite of nonoptimal kernels for given operands can result in a performance degradation of more than an order of magnitude; factors of three to five are by no means rare. The runtime selection mechanism must therefore be accurate and fast, so as not to constitute an undue overhead on the total running time. The decision is made on the basis of the lengths and strides of the problem- and instance-axes (see section 8).
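As a flavor of such a decision procedure, consider the following hypothetical sketch; the actual CMSSL selection logic (section 8) also weighs DRAM page behavior, TLB pressure, and per-kernel overheads:

```c
/* Pick the axis of vectorization from the strides and extents of the
   local operands.  Names and thresholds are ours, for illustration. */
typedef enum {
    CLASS_I,   /* vectorize along the instance-axis */
    CLASS_II   /* vectorize along the problem-axis  */
} kernel_class;

kernel_class select_kernel(long problem_extent, long problem_stride,
                           long instance_extent, long instance_stride) {
    /* Prefer a unit-stride axis; break ties toward the longer axis. */
    if (instance_stride == 1 && instance_extent >= 4)
        return CLASS_I;
    if (problem_stride == 1)
        return CLASS_II;
    return (problem_extent >= instance_extent) ? CLASS_II : CLASS_I;
}
```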
The node processor receives information from the CP that includes addresses, strides, and loop bounds. Much of this information is cached at the node processor, which then performs kernel selection and loop control. The VUs execute the arithmetic instructions contained in the loop body, under control of the node processors. Data are loaded from VU memory into VU registers and operated upon, and then results are stored back to VU memory.
SCHEDULING
The scheduling of operations takes three levels of the memory hierarchy into account: registers, DRAM pages, and the TLB. In addition, for 32-bit operands the computations are organized to take advantage of both operands in a 64-bit word during load operations whenever possible; in particular, schedules are sought that produce in succession both operands for a 64-bit word store. Each level of the memory hierarchy corresponds to a loop nest, so three loop nests are needed to cover the three levels of the memory hierarchy. The number of iterations in a given loop corresponds to a partitioning of the corresponding loop in the next outer loop nest. The loop nests correspond to partitionings of the index space, as illustrated in Figure 5. The extents of the axes in loops in the innermost loop nest define subdomains of the index space often called register tiles, or simply tiles. In the LBLAS, a tile is two- or three-dimensional. The optimization consists of determining the size and shape of a tile, determining the order in which tile axes are treated within a tile, and determining the order in which tiles are treated.
As a rule, maximizing the size of a tile is a good idea. However, there are exceptions when more than one tile is required to cover the index space. In general, tiles of several sizes and shapes are required to cover the index domain. Tiles are constrained to have a size at most equal to the size of the register file. Moreover, in order to reduce looping overhead, using tile axes extents that are powers of two is beneficial.
All LBLAS codes have a choice of axis for the innermost loop: the axis of vectorization. The LBLAS interface supports two potential problem-axes and an instance-axis for each operand. Kernels have been written that are optimized to vectorize over any of these axes. In the case of the level-1 BLAS, the choice is between vectorization over the single problem-axis or the instance-axis. The level-2 BLAS have several choices as to the ordering of the loops over the problem-axes, as well as the instance-axis. The ideal loop order for a tile with respect to performance can be derived from its shape and knowledge of the loop overheads and pipeline losses. Much of the loop overhead is determined by the node processor's ability to implement the required loop control and loop index updates concurrently with the VUs' handling of the loop body. Thus it is better to partition a loop of length 9 into a loop of length 5 and one of length 4 than into one of length 8 and one of length 1, as illustrated in the sketch below. The optimal order in which tiles are treated is determined based on the load and store operations associated with each tile, DRAM page faults, and TLB misses.
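As an illustration of this balanced partitioning rule (our own sketch; partition_loop and its signature are not part of the CMSSL), the following C fragment splits a loop into pieces of near-equal length rather than one full tile plus a short remainder:

```c
/* Split a loop of length n into pieces no longer than max_len whose
   lengths differ by at most one, so that no piece falls below the
   pipeline's efficient length.  partition_loop(9, 8, ...) yields
   {5, 4} rather than {8, 1}.  The caller provides a pieces array of
   at least ceil(n/max_len) slots. */
void partition_loop(int n, int max_len, int *pieces, int *npieces) {
    int k = (n + max_len - 1) / max_len;  /* number of pieces       */
    int base = n / k, extra = n % k;      /* lengths differ by <= 1 */
    for (int i = 0; i < k; i++)
        pieces[i] = base + (i < extra ? 1 : 0);
    *npieces = k;
}
```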
In the design and implementation of the LBLAS, in addition to striving for high peak performance, considerable attention was devoted to achieving low overheads, that is, minimizing the problem size required for half of peak performance, n_1/2 (Hockney and Jesshope, 1981).
In addition to loop partitioning and selecting loop orders for high performance, loop unrolling is also used to enhance performance. The following sections discuss the functionality, design, implementation, and performance of the individual LBLAS functions.

5 Level-1 LBLAS

5.1 FUNCTIONALITY

Six level-1 LBLAS functions have been implemented on the CM-5/5E: _SCAL, _AXPY, _DOT, _SIP, _NRM2, and _NRM2SQ. The first three functions differ from the corresponding standard BLAS by the lack of a scaling variable and the inclusion of a multiple-instance capability. The same differences exist for _NRM2SQ compared to the standard BLAS L2-norm. _NRM2 is identical to _NRM2SQ, except that it does not include a square root computation. _SIP computes a self-inner-product. The functionality of _NRM2 and _SIP is required to support standard BLAS functionality in the DBLAS.
Because the LBLAS support six modes, the definition of _NRM2, namely C <- C + AA^H {C in R}, has been extended to Re{C} <- Re{C} + AA^op_A. _SIP performs the inner product of a vector with itself, that is, C <- C + AA^op_A {C in C}. This function differs from _NRM2 for complex data in that _NRM2 computes only the real part of the result, while _SIP computes both the real and imaginary components. We have found this to be a useful optimization when a self-inner-product of complex data is desired. For a DBLAS _NRM2 operation, each instance of C may be distributed across many nodes. The square root operation on the result therefore cannot be performed by each of the participating VUs, that is, within each LBLAS call. The partial results of each instance of C must first be reduced into a single result, and then the square root of the result is taken. _NRM2SQ does compute the square root in parallel on all VUs, in addition to the functionality of _NRM2. This is useful when the user performs a multiple-instance _NRM2 where the individual instances are completely local to each VU.
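The distinction between the _NRM2 and _SIP accumulations for complex data can be made concrete with the following sketch (ours; complex values are spelled out as separate real and imaginary arrays for clarity):

```c
/* The _NRM2-style accumulation forms only the real part of
   sum(a * conj(a)), whose imaginary part vanishes identically, while
   the _SIP-style accumulation forms the full complex sum(a * a). */
#include <stddef.h>

double nrm2_accum(const double *a_re, const double *a_im, size_t n) {
    double c = 0.0;
    for (size_t i = 0; i < n; i++)          /* a[i] * conj(a[i]) is real */
        c += a_re[i] * a_re[i] + a_im[i] * a_im[i];
    return c;
}

void sip_accum(const double *a_re, const double *a_im, size_t n,
               double *c_re, double *c_im) {
    for (size_t i = 0; i < n; i++) {        /* a[i] * a[i], no conjugate */
        *c_re += a_re[i] * a_re[i] - a_im[i] * a_im[i];
        *c_im += 2.0 * a_re[i] * a_im[i];
    }
}
```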
OPTIMIZATIONS
All level-1 LBLAS routines have been separately optimized for 32-bit (halfword) and 64-bit (word) data. The primary differences between the two cases arise from the size of the register file and from the memory being organized as 64-bit words accessed via a 64-bit data path. Level-1 LBLAS routines limited by the memory bandwidth for input operands (loads) can achieve close to twice the 64-bit performance in 32-bit precision. For example, the peak efficiency for an inner product on real data in 64-bit precision is 50%, while in 32-bit precision the peak efficiency is 100%. For operations with a relatively large number of output operands, optimization for 32-bit operands may yield a sevenfold speedup. In order to realize this speedup, care has to be taken to ensure that the stored data are aligned with word boundaries. The loop orderings of the level-1 LBLAS kernels are divided into two classes. Referring to Figure 6, class-I kernels are vectorized with the instance-axis innermost.
After loading a segment of the instance-axis into the register file, the algorithm steps along the problem-axis. Class-II kernels are vectorized with the problem-axis innermost. One entire instance is computed before the next instance is addressed.
Before describing the optimizations in detail, we summarize the peak performance and half-performance vector lengths achieved for the level-1 LBLAS on the CM-5/5E in Tables 4 and 5. The peak performances were achieved using either class-I or class-II kernels, whichever gave better performance with its optimal layout. Thus the layout for class-I kernels has unit stride along the instance-axis.

Table 4 CM-5/5E Peak Measured Node Performance on Level-1 LBLAS

Table 5 Level-1 LBLAS CM-5/5E Half-Performance Vector Length n_1/2

Table 6 The Set of Tile Shapes, Vector Lengths, and Loop Unrollings Used for the Level-1 LBLAS
Referring to Table 4, the asymptotic efficiency realized by the LBLAS kernels for 64-bit data is seen to be in the range of 85% to 98%. For 32-bit data, the efficiency can sometimes be more than doubled by using both operands in a 64-bit load and computing in succession both operands for a 64-bit store (see sections 5.2.1 and 5.2.2). It is clear from Table 5 that n_1/2 for 64-bit data decreases rapidly as a function of the number of instances. The performance of the node processor is sufficiently high compared to the VUs that the looping overhead is very low in executing the kernels, resulting in high efficiency. The n_1/2 is in the range of 1 to 4, even for relatively few instances per VU.
The 32-bit data show a somewhat different pattern. The peak performance for 32-bit data is realized when the stores to memory are performed as full 64-bit words (see section 3). In order to realize the added performance from this feature, a minimum vector length on the order of the tile size is required (see Table 6). The n_1/2 remains relatively constant at the tile length unless there is a sufficient number of instances to fill a tile along the instance-axis, at which point n_1/2 drops significantly. The n_1/2 of the kernels utilizing 32-bit stores is approximately equal to that of the corresponding 64-bit kernels.
The set of tile shapes, vector lengths, and loop unrollings used by the level-1 LBLAS kernels is summarized in Table 6. As a rule, class-I kernels are fully unrolled, while class-II kernels are not necessarily fully unrolled. For fully unrolled kernels, the unrolling depth is determined by the vector length being used (maximum 16) and the number of available registers. The tile shapes are specified with the problem-axis length first and the instance-axis length second.
For all reported timings, the layout of the operands is optimal for the kernel being considered. Thus, for a class-I kernel, the stride along the instance-axis is assumed to be one, and the stride along the problem-axis is equal to the length of the instance-axis, that is, m. For class-II kernels, the stride along the problem-axis is one, and the stride along the instance-axis is equal to the length of the problem-axis.
In the plots of kernel performance that follow, there are generally periodic ripples in the performance curves. The ripples are generally a function of the tile shape and unrolling. There is a distinct performance gain available from decomposing the problem exactly into an integer number of tiles. If this is not possible, then a penalty for underutilizing the tile is accrued. On small problems, underutilizing tiles can significantly affect performance. To minimize this effect, certain kernels alter the tile shape for small problems. For example, consider a problem with an axis length of 9, where the tile extent along that axis is 8. The kernel will decompose the problem into two tiles of size 5 and 4, respectively, rather than tiles of 8 and 1. Due to pipeline start-up costs, the minimum instruction execution time is four cycles, even for vectors shorter than length 4. Hence decomposing into tiles of extent 5 and 4 will take 5 + 4 = 9 cycles, while a decomposition into extents of 8 and 1 will take 8 + 4 = 12 cycles. By changing the tile shape, three cycles, or 25% of the execution time, can be saved.
5.2.1 _SCAL and _AXPY
The _SCAL operation computes C^T <- AB^T, where C and B are two-dimensional arrays (one axis of which is an instance-axis) and A is a one-dimensional array. The function is performed by loading B into the register file and using a chained vector multiplication. As the CM-5/5E cannot simultaneously support chained stores and loads, the result is stored with a separate vector instruction. Similarly, the _AXPY operation performs the function C^T <- C^T + AB^T. Figure 7 shows the efficiency as a function of the number of instances for the DSCAL-I, DAXPY-I, DSCAL-II, and DAXPY-II kernels. The extent of the problem-axis for the timing of the class-I kernels is 128, while the extent of the instance-axis for the class-II kernels is 8. Since DSCAL performs one multiplication per two memory references, the maximum efficiency is 25%. DSCAL-II achieves a peak efficiency of approximately 22%, and DSCAL-I achieves a peak of approximately 23%. DAXPY performs a multiply/add operation for every three memory references, at best, so the maximum attainable efficiency is 33%. Both the DAXPY-I and DAXPY-II kernels achieve a peak efficiency of approximately 31%.
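The two loop orders behind the class-I and class-II kernels can be sketched in C as follows (our illustration, not the VU kernels themselves; tiling, unrolling, and chaining are omitted, and each routine assumes the layout that is optimal for its class):

```c
/* Multiple-instance AXPY, C <- C + A*B, with one scalar A per
   instance. */
#include <stddef.h>

/* Class-I: instance-axis innermost (unit stride); element (i, m) of
   B and C is at index i*n_inst + m. */
void daxpy_class1(double *c, const double *a, const double *b,
                  size_t n, size_t n_inst) {
    for (size_t i = 0; i < n; i++)
        for (size_t m = 0; m < n_inst; m++)
            c[i * n_inst + m] += a[m] * b[i * n_inst + m];
}

/* Class-II: problem-axis innermost (unit stride); one entire instance
   is completed before the next is addressed. */
void daxpy_class2(double *c, const double *a, const double *b,
                  size_t n, size_t n_inst) {
    for (size_t m = 0; m < n_inst; m++)
        for (size_t i = 0; i < n; i++)
            c[m * n + i] += a[m] * b[m * n + i];
}
```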
The maximum vector length used in the class-I kernels is eight, which is visible as a ripple in the corresponding plot in Figure 7 . For the class-II kernels the maximum vector length of 16 is used together with an unrolling depth of 2, resulting in performance peaks separated by 32 elements.
Due to the loop ordering of the DAXPY-I kernel, it is susceptible to TLB thrashing. This is visible in Figure 7 as a sharp drop-off in performance at 256 instances. An extra loop has been placed around the kernel in order to alleviate this problem, as can be seen from the curve for DAXPY-I with TLB protection.
The _SCAL and _AXPY functions are the ones most highly dependent on memory bandwidth of any of the currently implemented LBLAS. The factor of seven speedup available by implementing two 32-bit stores as a single 64-bit store tends to dominate the performance of the 32-bit kernels. Both class-I and class-II kernels have been written that exploit this fact. The combination of axis of vectorization and unit stride results in a large number of kernels. A total of 27 32-bit kernels have been implemented.
The SSCAL kernels have characteristics very similar to the SAXPY kernels; we only discuss the latter here. The SAXPY-I and -II kernels have been optimized for unit stride access on either B or C, but not both. The nonoptimized kernels, for a general (nonunit) stride on both B and C, achieve a peak efficiency of approximately 16%. The theoretical peak in this case is 18%, because the inner loop of the kernel requires two 32-bit loads (1 cycle each) and one 32-bit store (3.5 cycles) to perform two floating-point operations. A small speedup can be achieved when B has unit stride, since loading an element of B then effectively requires only one-half of a cycle. A far greater speedup can be achieved when C has unit stride: a load and a store of an element of C each take one-half of a cycle, yielding a theoretical maximum efficiency of 50%. A further optimization is possible when both B and C have unit strides; however, this optimization has not been implemented. The high-frequency ripple observed in Figure 8 for the SAXPY-I unit stride kernels is an even/odd stride phenomenon. When the instance stride is one or even and the instances are 64-bit word aligned, they can be treated as a single block. If the instances are misaligned, then the first element is handled separately and the rest of the instances are treated as a single block. In contrast, for an odd stride other than one, alternating instances have alternating alignments (see Figure 9), and we break the problem into three distinct blocks. The first block is the instances that are 64-bit word aligned (half of the instances), which are computed by a single kernel call with a double stride on the problem-axis. Next, the first (misaligned) elements of the odd instances are handled with a call to a general stride SAXPY-II kernel (invoked with one instance). Finally, the rest of the elements of the odd instances are computed by a call to the unit stride SAXPY-I kernel, with a problem-axis length of one less than the "true" problem-axis length. In the odd stride cases, the overhead associated with the extra kernel calls, as well as the reduced efficiency of operating on a size 127 rather than size 128 problem, is apparent in Figure 8. In the case of the C unit stride kernel, this appears as an approximately 4% reduction in efficiency.
For the SAXPY-II kernels in Figure 8, ripple occurs in the performance curves with a frequency of 8, 16, and 32 elements. (This ripple is also present in the SAXPY-I curves, but it is masked by the high-frequency ripple.) The source of the ripple is the high cost of non-64-bit loads and stores. Only even, fairly large blocks of stores can be implemented as 64-bit operations. Eight (32-bit) stores is the minimum number that can be implemented in this way. Blocks of 32 32-bit stores are most efficient (for physical implementation reasons). For the C unit stride kernel, the cost of 32-bit stores is particularly high, which appears as a ripple with significant amplitude.

Figure 10 shows the performance of the SAXPY kernels for small problem sizes. The general stride and B unit stride SAXPY-I kernels have an n_1/2 of approximately 2. As 64-bit stores cannot be exploited with fewer than eight instances, the n_1/2 of the C unit stride SAXPY-I kernel is 8. The SAXPY-II kernels have similar characteristics, with the general and B unit stride kernels having an n_1/2 of approximately 3 and the C unit stride kernel an n_1/2 of approximately 32.

Fig. 10 Performance of the SAXPY-I and SAXPY-II kernels on small problems.

5.2.2 _DOT

The inner product C <- C + AB^T reduces the product of two vectors to a single element. In the multiple-instance case, the product of a collection of pairs of vectors reduces to an array of rank one lower than that of the input vectors. Figure 11 shows the performance of the DDOT-I kernel for a problem-axis length of 128. This kernel performs approximately two memory accesses per multiply/add operation, so the efficiency can at best be 50%. Figure 11 also shows the performance of the DDOT-II kernel for eight instances. The kernel uses a maximum vector length of 8 with an unrolling depth of 8. This combination of vector length and loop unrolling results in performance peaks for problem-axis lengths that are multiples of 64. The performance decreases slowly between multiples of 64 elements due to the increasing significance of lower performing kernels handling the excess elements. The ripple between each 64th element has a frequency of eight; it is due to the efficiency of operating on axes with lengths that are integer multiples of the vector length 8. The peak efficiency in the example shown is approximately 46.5%.

Fig. 11 Performance of the DDOT-I and DDOT-II kernels.

Figure 11 shows that the class-I kernels are very efficient already for very few instances and a problem-axis length of 128. If the operands have both a short problem-axis and a short instance-axis, then class-I kernels are generally more efficient than class-II kernels when there are four or more instances. For larger problems, the layout of the arrays with respect to DRAM page boundaries determines which kernel is more efficient. Figure 12 shows the efficiencies of the SDOT-I and -II kernels. If the vectors of halfword data are laid out in memory with unit stride, then it is possible to use halfword loads and stores. With all operands accessed with unit stride, the measured peak efficiency for the SDOT kernels reaches 92% before it drops due to DRAM page faults and TLB entry misses. With an odd axis stride other than one, successive halfwords along an axis alternate between the first and second halves of different full words in memory. The current set of kernels makes no attempt to exploit halfword loads and stores for odd strides other than one.
When there are few instances, the SDOT-I kernel (both vectors unit stride) is particularly sensitive to vector lengths that do not fit exactly into the registers. The extra elements are loaded with halfword operations, which is relatively expensive. This results in the sharp peaks observed in Figure 12.
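The reduction overhead of kernels that vectorize along the axis being reduced can be seen in the following sketch of a class-II inner product (ours; the VL partial sums stand in for a vector register, and the depth-8 unrolling is omitted):

```c
/* Strip-mined inner product: VL partial sums are accumulated and
   collapsed once at the end, which is the reduction overhead cited
   for class-II kernels.  VL = 8 mirrors the DDOT-II vector length. */
#include <stddef.h>

#define VL 8

double ddot_class2(const double *a, const double *b, size_t n) {
    double part[VL] = {0.0};
    size_t i = 0;
    for (; i + VL <= n; i += VL)              /* strip-mined main loop */
        for (size_t v = 0; v < VL; v++)
            part[v] += a[i + v] * b[i + v];
    double c = 0.0;
    for (; i < n; i++)                        /* short tail            */
        c += a[i] * b[i];
    for (size_t v = 0; v < VL; v++)           /* final reduction       */
        c += part[v];
    return c;
}
```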
The performance of the general stride case is similar to that of the DDOT-I kernel, and the case with one vector having unit stride falls between the two. Although only a single element is stored for each inner product, the relatively high expense of a halfword store makes it important to also address halfword stores for inner products on short problem-axes. The need to perform many small inner products arises in explicit codes for the solution of partial differential equations on a regular grid. For instance, the common five-point centered difference stencil in two dimensions results in inner products of length 5 at each grid point. The effect of taking advantage of halfword stores for small inner products can be significant; however, we have not currently implemented this optimization.

In order to support all the modes and precisions, 51 kernels comprising 17 different vectorization schemes are required. Of these schemes, 3 are 64-bit and 14 are 32-bit. The 32-bit kernels are optimized to take advantage of unit strides along either the problem- or instance-axis. This large number of distinct kernels is necessary because of the possible combinations of 64-bit misalignments associated with performing 64-bit memory operations and 32-bit arithmetic.

The efficiency of the SDOT kernels for small problems is illustrated in Figure 13. The kernels optimized for unit stride require a vector length of at least 16 in order to exploit the benefits of 64-bit loads and hence have an n_1/2 of approximately 16. Local performance maxima occur at vector lengths that are multiples of 16. The class-II kernels show a far less steep performance curve (due to the overhead associated with reduction along the axis of vectorization). The general stride kernels have an n_1/2 of approximately 20, and the unit stride kernel has an n_1/2 of approximately 60.

5.2.3 _NRM2, _NRM2SQ, and _SIP

The _NRM2, _NRM2SQ, and _SIP LBLAS functions all use the same kernels. The only difference is that _NRM2SQ performs a square root operation; for the purposes of reporting the overall performance, this is considered negligible. The _NRM2 routine performs the operation C <- C + AA^op_A and has only one input and one output operand. One memory reference per floating-point multiply/add suffices, giving a theoretical efficiency of close to 100%. Figure 14 shows the performance of the DNRM2 and SNRM2 kernels. The kernels are not generally memory bandwidth limited, so an optimization for word loads and stores pays off only when the problem consists of a large number of relatively small instances. At present, no optimizations have been implemented to exploit this situation.

Figure 14 shows the performances of the NRM2-I kernels with a problem-axis of length 128 and the performance of the NRM2-II kernels for eight instances. With the DNRM2-I kernel, the effect of TLB thrashing becomes apparent when the addresses of all of the instances do not fit into the 64 TLB entries. In the case of SNRM2-I, the effect is first apparent in the region of 512 instances. In order to avoid TLB thrashing, an extra outer loop has been added to the kernel, such that the inner loop operates only on instances for which the addresses fit into the TLB at one time. The DNRM2-I kernel, for which the performance is shown, also includes this added loop.

Also apparent from Figure 14 is the effect of DRAM page faults and TLB misses on the SNRM2-I kernel. The performance degradation without blocking is smaller than for the DNRM2 kernel due to the smaller data size: for a given stride, page faults are relatively less frequent. The DNRM2-II and SNRM2-II kernels both achieve approximately 99% efficiency for very large problem sizes. The measured peak efficiency of both the DNRM2 and SNRM2 kernels is approximately 97%. The performance drops off with very large problem sizes due to DRAM page faults and TLB entry misses without blocking beyond the register tiles.

Another significant concern in developing the kernels was low overhead and as smooth a performance behavior as possible as a function of the extents of the problem- and instance-axes. For short axes, the effects of the minimum vector length of 4 in the VU instruction set and the node processor overhead for loop control are significant concerns. Figure 15 illustrates the performance of the kernels for small problems and clearly shows the smoothness of the performance curves in the very short axis region. It is apparent that the n_1/2 for the class-I kernels is approximately 3, and that of the class-II kernels is approximately 40. The lower efficiency of the class-II kernels arises because of the overhead incurred in the reduction along the direction of vectorization.

6 Level-2 LBLAS

Two level-2 LBLAS functions have been implemented, namely _GEMV and _GER. As for the level-1 LBLAS, the main difference compared to the standard BLAS is the exclusion of scaling parameters and the inclusion of a multiple-instance capability. The LBLAS interface also supports _GEVM, which is implemented (without performance penalty) as a call to _GEMV with the strides and axis extents appropriately modified. _GEVM is not discussed further, as it is considered identical to _GEMV for the purposes of this paper. Table 7 summarizes the peak efficiencies and Table 8 summarizes the half-performance axis lengths for the level-2 LBLAS. The half-performance figures are computed for square matrices. As for the peak performance tables for the level-1 LBLAS, the peak performances in Tables 7 and 8 were achieved using the loop ordering that gave the best performance (see the next section) with its optimal array layout.

Table 7 Peak Efficiencies for Level-2 LBLAS with Square Matrices

Table 8 Half-Performance Axis Length for Level-2 LBLAS with Square Matrices
OPTIMIZATIONS
For the level-2 LBLAS, three-dimensional register tiles are used for some loop orders. The tile shapes, vector lengths, and loop unrollings used for the level-2 LBLAS are summarized in Table 9.

6.2.1 _GER

The LBLAS outer-product operation is performed on two two-dimensional arrays A and B^T, yielding a three-dimensional array C of shape P x R x M. The number of instances is M, as before. The input operands each contain one problem-axis, while the output operand has two problem-axes. As for other LBLAS functions, all operands have one instance-axis. The loop orderings available to outer-product kernels can be discussed in terms of the axis labels of the destination operand C. Some potential loop orderings are shown in Figure 16. In discussing loop orderings, we refer to an axis by its extent, and to a loop ordering as a sequence of axis extents, ordered from left to right corresponding to loops from the innermost to the outermost. Thus PRPRM refers to a loop order in which elements within a column are accessed in the innermost (vectorized) loop, and columns are accessed in the next outer loop. The two innermost loops together form the register tile. The next outer loop handles all tiles along a column, followed by a loop over all tiles along a row. Multiple instances are handled in the outermost loop. Reducing the number of DRAM page faults and TLB misses may require additional loop partitioning, resulting in strings of up to nine letters defining the loop order. For all reported timings for level-2 LBLAS, the layout of the operands is optimal for the kernel being considered; that is, the strides of the axes increase from the innermost to the outermost loop. Thus, for instance, for loop ordering PRPRM, the P-axis has unit stride, the stride along the R-axis equals the extent of the P-axis, and the stride along the M-axis equals the product of the extents of the P- and R-axes.
The outer product requires that tiles of A and B be loaded into the register file. The loading of C is chained with the multiplication and addition of the appropriate pieces of A and B; the tile of C is then stored. One of the tiles of A or B can be kept in registers while successive tiles of the other operand are loaded. The process is repeated until the other vector is exhausted; then the next tile of the kept vector is loaded. Given a sufficiently large register file, approximately M(PR + P + R) loads and MPR stores are required, while 2MPR floating-point operations are performed, leading to a maximum efficiency of approximately 2MPR/(2 x 2MPR) = 50%.

Table 9 The Set of Tile Shapes, Vector Lengths, and Loop Unrollings Used for the Level-2 LBLAS

Fig. 16 Loop orderings for _GER.

Fig. 17 Performance of the PRPRM and PRRPM DGER kernels.
Kernels that utilize loop orderings PRPRM, PRRPM, MPRRPM, and MPRPRM have been implemented. By manipulating the strides and axes order in the routines' argument list, the kernels can emulate the loop orderings RPRPM, RPPRM, MRPPRM, and MRPRPM, respectively. The remaining loop orderings can be shown to yield lower efficiency than these orderings, or, when implemented, were found to yield no significant advantage. The choice of which kernel to invoke is made at runtime and is dependent on the strides and lengths of the result matrix C (see section 8).
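A PRPRM-ordered outer product can be sketched in C as follows (our illustration; the tile extents TP and TR stand in for the register-file-sized tiles of Table 9, and register allocation and chaining are omitted):

```c
/* C <- C + A*B^T per instance: the two innermost loops (p0, r0) form
   the register tile, the next two step over tiles along a column and
   then along a row, and instances are outermost. */
#include <stddef.h>

#define TP 16
#define TR 16

void dger_prprm(double *C, const double *A, const double *B,
                size_t P, size_t R, size_t M) {
    for (size_t m = 0; m < M; m++)              /* M: instances       */
        for (size_t r = 0; r < R; r += TR)      /* R: tiles along row */
            for (size_t p = 0; p < P; p += TP)  /* P: tiles along col */
                for (size_t r0 = r; r0 < r + TR && r0 < R; r0++)
                    for (size_t p0 = p; p0 < p + TP && p0 < P; p0++)
                        /* unit stride along P: element (p0, r0, m) */
                        C[(m * R + r0) * P + p0] +=
                            A[m * P + p0] * B[m * R + r0];
}
```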
Generally, kernels with P-axis vectorization are used when the P-axis of C has the smallest stride. If the R-axis has the smallest stride, then the same kernels are used with the A and B operand arguments switched. For example, the PRPRM kernel then effectively becomes an RPRPM kernel. Figure 17 illustrates the performance of the PRPRM and PRRPM kernels with eight instances. The minimum number of memory cycles for the PRPRM kernel is approximately 2PRM + RM, and that of the PRRPM kernel is approximately 2PRM + PM. For oddly shaped matrices, it is beneficial to use the PRRPM kernel when R >> P. In all other cases, the PRPRM kernel is marginally to significantly faster, since it incurs fewer page faults. Also, the PRPRM kernel does not suffer from the TLB thrashing problem, due to its direction of vectorization. Thus, in most circumstances, the PRPRM kernel is preferred over the PRRPM kernel. Figure 18 illustrates the performance of the PRPRM kernel for small problem sizes; the half-performance axis lengths are given in Table 8. When the problem-axes are short, or the instance-axis is the innermost axis of C, vectorization along the instance-axis often yields the best performance. Thus kernels with the loop orderings MPRPRM and MPRRPM have been implemented. The performance of these kernels is shown in Figure 19. It is apparent from this figure that the peak performance of the instance-axis vectorized kernels is significantly less than that of the problem-axis vectorized kernels (Figure 17). The reason for this disparity is that the instance-axis vectorized kernels require two- and three-dimensional tiles to be stored in the register file, while the problem-axis vectorized kernels require only one- and two-dimensional tiles. For example, the PRPRM kernel has a tile with P- and R-axis extents of 16, but the MPRPRM kernel has a tile with extents 8, 4, and 2 along the M-, P-, and R-axes, respectively. Three-dimensional tiles have small extents, resulting in the reduction in efficiency evident in Figure 19. The instance-axis kernels also suffer from TLB thrashing; a loop nest has been added in order to minimize the impact of this phenomenon.
The looping orders that have been implemented for the 32-bit _GER kernels are similar to those for the 64-bit kernels. The performance of the SGER kernels is illustrated in Figure 20. Using only 32-bit loads and stores, the asymptotic performance level of the kernels is somewhat less than two floating-point operations for each 4.5 memory cycles (due to the 3.5-cycle cost of a 32-bit store), or an efficiency of 22% (not counting the cycles for loading A and B, etc.). The performance of the _GER kernels is dominated by the time to load and store C. The 32-bit kernels have therefore been optimized to exploit 64-bit memory operations on C. The efficiency of the optimized versions of the PRPRM and MPRPRM kernels is shown in Figure 21; the asymptotic efficiency of these kernels is 100%. The effects of blocking are particularly apparent in the performance curves of the kernels optimized for unit strides. This is due to the relatively large overhead of the nonunit stride stores, which must be performed on the tail end of the axis when the axis length is not evenly divisible by the block size (in Figure 21 the block size is 16). As the size of the operands increases, the relative cost of these additional stores diminishes, so the amplitude of the ripples decreases as the axis length increases.

Fig. 21 Performance of the PRPRM and MPRPRM SGER kernels on unit stride matrices.

6.2.2 _GEMV

There are 36 choices of loop orderings for matrix-vector multiplication, C <- AB, where A is of shape P x Q, B is Q x 1, and C is P x 1. One of the interesting loop orderings for matrix-vector multiplication corresponds to the level-1 LBLAS routine _AXPY: a sequence of operations Z(:) <- X(i)A(:, i) + Z(:) is performed. But in _GEMV the accumulation vector Z(:) is allocated to the register file, and elements of the vector X are preloaded. The columns A(:, i) are loaded from memory as needed. The order of the two innermost loops is PQ. Using this loop order, operations on several columns are chained together.
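The AXPY-like PQ loop order can be sketched as follows (ours; the local array zt stands in for the register-resident tile of Z, and ALPHA mirrors the P-extent of the 64-bit PQPQM register tile):

```c
/* C <- C + A*X with the accumulation tile held locally: elements of
   X are reloaded once per column, and columns of A stream through.
   A is assumed column major with unit stride along P. */
#include <stddef.h>

#define ALPHA 16

void dgemv_pq(double *C, const double *A, const double *X,
              size_t P, size_t Q) {
    for (size_t p = 0; p < P; p += ALPHA) {
        size_t len = (p + ALPHA < P) ? ALPHA : P - p;
        double zt[ALPHA] = {0.0};            /* register tile of Z    */
        for (size_t q = 0; q < Q; q++)       /* Q: outer loop of tile */
            for (size_t i = 0; i < len; i++) /* P: vectorized         */
                zt[i] += X[q] * A[q * P + p + i];
        for (size_t i = 0; i < len; i++)     /* accumulate into C     */
            C[p + i] += zt[i];
    }
}
```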
Interchanging the P and Q loops results in a matrix-vector multiplication based on inner products. But, used in the context of _GEMV, one of the input operands, X, is shared among all instances of the other operand; thus the level-1 LBLAS routine _DOT is not used for matrix-vector multiplication. For the QP loop ordering, vectorization is along the Q-axis, but the reduction is performed on several rows simultaneously. The two loop orderings of interest in treating a succession of tiles are PQ and QP, creating loop orderings QPPQ and QPQP, respectively. With a vector length of VL along the Q-axis, every VLth element along this axis is added together, and a final reduction of VL partial sums is required.
For matrix-vector multiplication with multiple instances, one can simply put the instance-axis outermost.
But, for large strides along the problem-axes, vectorizing along the instance-axis (Figure 22) may yield better performance, even though no reduction takes place along the M-axis. This involves the sequence of operations z(i, :) <- x(j, :)A(i, j, :) + z(i, :) (with the freedom of switching the i and j loops).

Fig. 22 Loop orderings MQPPQM and MPQPQM for matrix-vector multiplication.
Interesting loop orderings, without a need to consider DRAM page faults and TLB misses, are any ordering of {P, Q, M} within a register tile (the inner loops) followed by any ordering of {P, Q, M} for looping over tiles (the outer loops). Fortunately, five loop orderings (the ones marked with asterisks in Tables 10 and 11) suffice to guarantee optimal or close to optimal performance for all shapes and allocations of the operands A, B, and C. These loop orderings are PQPQM (AXPY-like), PQQPM (AXPY-like), QPQPM (inner-product-like), MPQQPM (instance-axis vectorization), and MQPQPM (instance-axis vectorization). Figure 23 shows the performance of the kernels with loop orderings PQPQM and PQQPM. The register tile shapes for the two kernels are {alpha, beta, gamma} = {16, 32, 1} and {32, 16, 1}, respectively, where {alpha, beta, gamma} are the extents of the tile in the P, Q, and M dimensions. Because these kernels vectorize along the P-axis of A, the P-axis has been chosen as the independent variable in Figure 23. The load/store time complexity (ignoring DRAM page faults, etc.) of the PQPQM loop order is approximately 2P + PQ + Q*ceil(P/alpha), and that of the PQQPM loop order is approximately Q + PQ + 2P*ceil(Q/beta).
When both P and Q are large, both of these are dominated by the PQ term, and hence their performances are approximately equal. With unit stride on the P-axis, the PQPQM order incurs fewer page faults and, in general, is faster than the PQQPM order, particularly when P is large. The PQQPM order is therefore used fairly rarely, and the PQPQM order is used when the stride along the P-axis is small or the extent of this axis is large.
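The blocking structure of the PQPQM order can be sketched as follows (a simplified illustration under the assumption that P and Q are evenly divisible by the tile extents; edge cleanup and register allocation are omitted, and the names are hypothetical):

#define ALPHA 16    /* tile extent along P */
#define BETA  32    /* tile extent along Q */

/* PQPQM blocking: outer loops walk over register tiles with P varying
 * fastest and M outermost; inner loops perform the AXPY-like tile
 * update, vectorizing along the P-axis. In the real kernel, the
 * ALPHA-element tile of C stays in vector registers for the duration
 * of the inner Q loop. */
void gemv_pqpqm(int P, int Q, int M, const double *A, int lda,
                const double *B, double *C)
{
    for (int m = 0; m < M; m++)                      /* outer M */
        for (int q0 = 0; q0 < Q; q0 += BETA)         /* outer Q */
            for (int p0 = 0; p0 < P; p0 += ALPHA)    /* outer P */
                for (int i = q0; i < q0 + BETA; i++)       /* inner Q */
                    for (int p = p0; p < p0 + ALPHA; p++)  /* inner P */
                        C[m*P + p] += B[m*Q + i] * A[(m*Q + i)*lda + p];
}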
Vectorizing along the P-axis is generally the most efficient of the three options. Kernels that vectorize along the Q-axis require a reduction in the direction of vectorization, and this reduction can result in a significant decrease in efficiency. Figure 24 shows the performance of the kernel with the loop order QPQPM. In Figure 24, the number of columns is chosen as the independent axis because the kernel vectorizes along this axis. The peak performance of the QPQPM loop order is significantly worse than that of the loop order PQPQM. However, when the stride along the Q-axis is small and the stride along the P-axis is large, the QPQPM loop order still performs better than the PQPQM loop order. This fact is illustrated in Figure 24, where A has unit stride along the Q-axis. When the Q-axis is innermost and has a large extent, the cost of page faults reduces the efficiency of the PQPQM kernel, even for large values of P.

If A has been laid out with the instance-axis innermost, or with a relatively long instance-axis, it is often beneficial to vectorize over the instance-axis. However, because instance-axis vectorized kernels require a larger register set than problem-axis vectorized kernels, their efficiency is significantly lower. Two instance-axis vectorized _GEMV loop orders have been included in the CMSSL: MPQQPM and MQPQPM. The register tile that has been implemented is {8, 4, 2}. Several other register tile shapes were investigated, but none were found to offer significant benefit. The two chosen kernels are identical with respect to register usage, but the inner looping order can be adapted to the layout of A. For this reason, only the performance of the MPQQPM kernel is discussed. The performance of this kernel on square matrices is shown in Figure 25. The peak performance is significantly worse than that of the PQPQM kernel when the stride along the P-axis is small. Thus the MPQQPM kernel is only used when a significant number of page faults would be incurred vectorizing along the P-axis. These kernels have a small value of n1/2, on the order of two or three (this can be read off the right-hand set of curves in Figure 25). The instance-axis vectorized kernels suffer from TLB thrashing, and an extra level of looping has been added to minimize this effect.
The only difference in the current LBLAS between 64-bit and 32-bit _GEMV kernels is that the 32-bit kernels use twice as many registers. This allows greater flexibility in the selection of {α, β, γ}. In 32 bits, the PQPQM kernel has the register tile {32, 64, 1}, the PQQPM kernel has a {32, 16, 1} register tile, the QPQPM kernel has an {8, 8, 1} tile, and the MPQQPM and MQPQPM kernels have an {8, 8, 2} tile. The performance of these kernels is similar to that of the corresponding 64-bit kernels.
The memory architecture of the CM-5/5E has some of the features that prompted the design of level-2 and level-3 BLAS. The primary motivation for the design of the level-3 BLAS was the appearance of cache-based memory systems. By appropriately blocking the access pattern of the memory references, the application can exploit the speed of the cache. The CM-5/5E vector units do not utilize a data cache. However, there is a need for blocking in order to minimize DRAM page faults and TLB thrashing. This blocking is already built into our level-2 LBLAS.
The level-2 _GEMV kernels are balanced between arithmetic load and memory load for a single data path to memory, except for the vector loads and stores. Thus, on the CM-5/5E, only a relatively small gain can be expected from level-3 LBLAS for most matrix shapes. Based on memory bandwidth requirements, we estimated the gain for most matrix shapes at 10%-15%.
Although the estimated gain for level-3 LBLAS kernels was small, we implemented several DGEMM kernels using level-3 blockings. However, the performance of these kernels never exceeded that of the corresponding level-2 kernels, and it was often significantly worse. The reason is that level-3 kernels require a larger working data set to be held in the registers, which forces the lengths of individual vector instructions to be short and in turn may increase memory traffic and overhead. Thus no level-3 LBLAS are included in the CMSSL, and the LBLAS _GEMM routines are implemented by calls to level-2 kernels. _GEMM for real data is performed by a sequence of calls to a single kernel: one of the matrices is decomposed into a set of vectors, and the operation is performed as a sequence of either matrix-vector or vector-matrix multiplications.
Complex multiplication is commonly decomposed into four real multiplications and four real additions. The same decomposition applied to matrix-matrix multiplication results in a complexity of 4n³ real multiplications plus lower-order terms on matrices of size n × n. For complex operands, the M3 method (Higham, 1992) instead decomposes each complex multiplication into three real multiplications and five additions, resulting in an overall complexity of 3n³ real multiplications plus lower-order terms. In the CMSSL implementation of _GEMM, the M3 method yields a speedup of 10%-15% over the four-multiplication method for sufficiently large matrices. When the matrices are small, it is not beneficial to use the M3 method, and the traditional approach is retained.
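A sketch of the M3 decomposition for a single complex multiplication is shown below (illustrative only; applied blockwise to matrices, the three scalar multiplications become three real matrix-matrix products):

#include <complex.h>

/* M3 (three-multiplication) complex product: with t1 = ar*br,
 * t2 = ai*bi, and t3 = (ar + ai)*(br + bi), the real part is t1 - t2
 * and the imaginary part is t3 - t1 - t2, using three real
 * multiplications and five real additions instead of four and four. */
double complex mul_m3(double complex a, double complex b)
{
    double ar = creal(a), ai = cimag(a);
    double br = creal(b), bi = cimag(b);
    double t1 = ar * br;
    double t2 = ai * bi;
    double t3 = (ar + ai) * (br + bi);
    return (t1 - t2) + (t3 - t1 - t2) * I;
}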
Kernel Selection
The kernel invocation is a runtime decision that may have a significant performance impact. Entire applications often depend heavily on one or more LBLAS calls inside an inner loop, and the difference in performance of two kernels on a given layout can be as much as an order of magnitude. It is therefore important to give careful consideration to how the kernels are selected. On the other hand, kernel calls for small problems execute very rapidly, and any overhead incurred in deciding which kernel to invoke can result in a significant degradation in performance. Several strategies have been adopted to minimize the cost of kernel selection. First, a low-overhead interface to the LBLAS has been implemented. This interface utilizes a set of cached parameters, so that only the parameters that change between calls need to be supplied when invoking the LBLAS. Second, the selection logic can determine that a small problem is at hand and that the overhead of selecting the optimal kernel may well exceed the cost of an incorrect kernel selection; in this case, a less rigorous but quick selection is made. Furthermore, the kernel selection mechanism is only invoked if certain parameters, such as axis extents, change. Therefore, if the kernel call is embedded in a loop where, for instance, only the addresses of the operands change, the kernel selection is bypassed after the first iteration. When kernel selection has been done, the addresses of all of the appropriate add, sub, and noadd kernels (see section 5.2.2) are stored, and each can be repeatedly invoked. This is useful, for example, when executing successive kernel calls in order to implement complex arithmetic: only a single kernel selection is made, while four kernel calls are invoked. If a "large" problem is to be executed, it is important that the selected kernel is close to optimal in performance. Table 12 shows the choices that the selection mechanism must make for the available LBLAS. The key factors in this selection are the extents and strides of the axes of the operands. Several mechanisms were tested to determine which provided the most accurate and fastest evaluation. In almost all cases, the quickest and most accurate method was to determine the order of the axes in increasing stride. If the axis with the smallest stride has a sufficiently long extent (typically on the order of 8 to 16), then the kernel that vectorizes over this axis is selected. When more than one kernel exists with the same vectorization (e.g., DDOT and DGER), it is necessary to examine the other characteristics of the arrays. The kernel selection of the level-2 LBLAS is typically based on the operand that is a three-dimensional array rather than those that are two-dimensional, because the three-dimensional array typically dominates the execution time.
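The core of the stride-based heuristic can be sketched as follows (the structure and names are hypothetical, not the CMSSL interface; the threshold corresponds to the extent of 8 to 16 mentioned above):

typedef struct { int extent; int stride; int id; } axis_t;

enum { MIN_VECTOR_EXTENT = 8 };   /* assumed threshold */

/* Returns the id of the axis to vectorize over: the axis with the
 * smallest stride, provided its extent is long enough; otherwise -1,
 * indicating that other characteristics of the arrays must be
 * examined. */
int select_vector_axis(const axis_t *axes, int naxes)
{
    int best = 0;
    for (int k = 1; k < naxes; k++)      /* find smallest stride */
        if (axes[k].stride < axes[best].stride)
            best = k;
    if (axes[best].extent >= MIN_VECTOR_EXTENT)
        return axes[best].id;
    return -1;
}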
Conclusions
This paper discusses the implementation and performance details of the LBLAS for the CM-5/5E. The LBLAS are implemented in SPMD mode and are optimized for the vector units of the CM-5/5E. In order to achieve a maximum degree of freedom in exploiting parallelism, the CMSSL allows for multiple-instance function calls. To accommodate this, the LBLAS have been optimized for either problem-axis or instance-axis vectorization. To ensure as robust and uniformly smooth a performance as possible, a runtime selection mechanism is invoked that chooses among the available kernels. This selection mechanism must be both fast and accurate in order to ensure high performance for small as well as large problems. Several mechanisms have been implemented to improve performance on problems that are small or have unusual aspect ratios. These include a lightweight calling sequence, a high-speed path through the selection mechanism, and careful consideration of the vectorization and loop unrolling within the kernels.
The LBLAS support the data parallel languages CMF and C*, used either in global mode, in local mode with message passing, or in the mixed global/local mode corresponding to HPF with extrinsic procedures. In order to implement BLAS within such an environment, we have implemented a two-level structure: the DBLAS perform any necessary data motion and some reduction operations, while the LBLAS perform the actual computation. The performance optimizations at the LBLAS level can thus be exploited both locally and globally. The LBLAS interfaces have been designed to enable both local and global calls transparently.
The latest version of the CMSSL contains 238 distinct LBLAS kernels, totaling approximately 200,000 lines of C and assembly code. A wide variety of additional codes were written, analyzed, and tested for potential inclusion. The result is a uniformly high performance set of codes that have found widespread use in user applications.
Appendix: The Time Complexity for Matrix-Vector Loop Unrollings
The time complexity of an operation depends on the CPU utilization and memory traffic (both bus cycles and page fault overheads). Tables 10 and 11 list a number of loop orders and their time complexity. Those marked with an asterisk (*) in Tables 10 and 11 have been chosen for inclusion in the CMSSL.
For outer loops, only the orderings {PQM, QPM} are of interest, since if M is made an inner loop, either B or C or both are loaded for every iteration of the outer loops and never get reused. This cuts the ordering choices down to 12, as listed in Tables 10 and 11. Tables 10 and 11 also list the load and store cycles needed for 64-bit kernels, the optimal tile extents {α, β, γ} for 64 registers, and the number of page faults. We give exact formulas for the first four orderings while leaving out the low-order terms for the other orderings. The loop orderings QPPQM and QMPPQM need P × Q × M cycles for the reduction of the running sum vector to one element, and are slower than the other orderings most of the time. The orderings PMQPQM and PQPQM are both good for small strides on the P-axis. Making the P-axis the innermost of the three outer loops results in relatively few page faults. When the M-axis has the second smallest stride, the ordering PMQPQM has a better chance of incurring the fewest page faults. Consider as an example the tiles {16, 32, 1} for the loop order PQPQM, and tiles with shape {8, 2, 4} or {8, 4, 2} for the loop order PMQPQM. The two orderings will have roughly the same number of page faults because of the longer loop extent of P in the PQPQM ordering. For this reason, the loop orderings PMQPQM, PMQQPM, and QMPQPM are not implemented. The orderings MPQPQM and MPQQPM give performance similar to the orderings MQPPQM and MQPQPM when M has the smallest stride, but lose to other orderings on other layouts, and are not implemented.

There are two interesting choices for the outer loops, namely PQM and QPM, as explained above. No performance benefit can be derived from having the M-axis loop anything but outermost in the outer loop nest. The outer loop ordering determines the load/store cycles of B and C. For the ordering PQM, elements of B are reused for whole columns of A, and the total number of loads is Q × M. Elements of C are loaded and stored for every iteration of the outer loops, for a total of 2 × P × M × Q/β. Similarly, the load/store cycles for the QPM outer loop ordering are Q × M × P/α for B and 2 × P × M for C. Since the shape of the inner loop region is roughly square, the choice of outer loop ordering is mainly based on the extents of the P- and Q-axes. The rule of thumb is to put the shorter axis outside the longer one; the difference becomes apparent only when P and Q are very different. Any loop ordering with the Q-axis innermost is based on inner-product formulations of the matrix-vector multiplication and thus needs "cleanup" for the vectorized inner products. Examples of such loop orderings are QPPQM, QPQPM, QMPPQM, and QMPQPM.
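As a worked illustration of this rule of thumb, the B and C traffic of the two outer orderings can be compared directly (a sketch under the assumptions that the extents divide the tile sizes evenly and that the A traffic, P × Q × M, is common to both orderings; the function names are hypothetical):

/* B and C load/store totals for the PQM outer ordering:
 * B is loaded Q*M times, and the C tiles are loaded and stored once
 * per outer iteration, for 2*P*M*(Q/beta) cycles. */
long bc_traffic_pqm(long P, long Q, long M, long beta)
{
    return Q * M + 2 * P * M * (Q / beta);
}

/* B and C load/store totals for the QPM outer ordering:
 * B is reloaded for every P tile, Q*M*(P/alpha) cycles, while C is
 * loaded and stored only 2*P*M times. */
long bc_traffic_qpm(long P, long Q, long M, long alpha)
{
    return Q * M * (P / alpha) + 2 * P * M;
}

For example, with P = 1024, Q = 32, α = 16, and β = 32, the PQM ordering moves 32M + 2048M cycles of B and C traffic, whereas QPM moves 2048M + 2048M, so the shorter Q-axis is best placed outside the longer P-axis.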
In the formulas for the number of page faults for each ordering, the highest-order terms are always on the innermost axis, that is, the axis being vectorized. The rule of thumb in picking optimal kernels is always to pick the ordering that puts the axis with the smallest stride innermost.
BIOGRAPHIES
David Kramer received a B.Sc. degree (in engineering) with distinction from the University of the Witwatersrand, South Africa, in 1987. He received an M.A. and a Ph.D. in computer engineering from Princeton University in 1990 and 1993, respectively. From 1992 through 1995 he was a scientist at Thinking Machines Corporation, where he participated in the development of the CMSSL, runtime systems, and high performance message-passing libraries. Since October 1995, he has been with the Oracle Corporation, where his focus is on commercial applications of high performance computing systems.

S. Lennart Johnsson received an M.S. and a Ph.D. from Chalmers Institute of Technology, Gothenburg, Sweden, in 1967 and 1970, respectively. From 1970 to 1979 he was affiliated with the Central Research and Development Laboratories of ASEA AB, Sweden, where he initiated and led the development of large computer-based real-time systems for electric utilities, intelligent controllers, and mathematical software. He wrote one of the first sparse matrix software packages in industry in 1970-1972 and developed parallel algorithms for scientific applications in the mid-1970s. From 1979 to 1983 he was a senior research associate in computer science at the California Institute of Technology, where he taught VLSI design and started a course in scientific computing on parallel architectures. He joined the faculty of Yale University in 1983, where he introduced courses on parallel algorithms and architectures, and continued his research on architectures, algorithms, and software for high performance parallel computers for the computational sciences. Since 1986 he has been the director of computational sciences at Thinking Machines Corporation, where he initiated the development of a refined instruction set for the connection machine models CM-2 and CM-200, and is leading the development of efficient communication routines and of the Connection Machine Scientific Software Library (CMSSL). Since 1990, he has been Gordon McKay professor of the practice of computer science at Harvard University, where he again introduced education in scientific computation on parallel scalable architectures.
Dr. Johnsson is an editor of the Journal of Parallel and Distributed Computing, the Journal of High Speed Computing, the Journal of Concurrency: Practice and Experience, the Journal of Numerical Linear Algebra with Applications, the International Journal of Supercomputer Applications, and the Journal of Scientific Programming. He is the author or coauthor of over 90 articles and conference papers, and numerous technical reports. He is the corecipient of the 1986 Outstanding Paper Award at the International Conference on Parallel Processing and a recipient of the John Ericsson Medal. He has served as a board member of the Computing Research Association and serves on the USRA Science Council for CESDIS. He has served on the program and organizing committees for several conferences on parallel computing.
Yu Hu received a B.S. degree from the University of Science and Technology of China in 1988 and an M.S. degree in computer science from Yale University in 1993. He is now a doctoral candidate in computer science at Harvard University. His research interests are in the fields of languages, compilers, and architectures for parallel computing, with a special interest in scientific applications. He is a member of the ACM and IEEE.
