This paper discusses design features in currently available RISC microprocessors that sometimes result in less-than-optimal sustained performance on largescale scientific calculations. Recommendations for future designs are suggested.
Int roduct ion
Scientists accustomed to running large-scale, computationally intensive applications have traditionally utilized conventional vector supercomputers, such as those manufactured by Cray Research, Inc., Fujitsu or NEC. However, many scientists are now using RISC workstations, not just to edit their source codes and display their results, but to perform their computations as well. Further, RISC processors are now being incorporated into highly parallel supercomputers, and some observers predict that such systems will ultimately dominate high-end computing.
The reason for this interest is of course the recent sharp rise in the peak performance of RISC processors, and the expect ation of continued increases in the future.
Indeed, on many scientific applications, particularly those that enjoy high levels of data locality, RISC-based computers already feature superior sustained performance per dollar, when compared with conventional vector computers. However, not all applications run equally well on the RISC systems.
This phenomenon is analogous to the situation in vector computing, where some scientific applications run quite well, but others do not. Given the important role that RISC processors are likely to play in the future of scientific computing, it is clear that users and designers of these systems need to better understand the factors that affect delivered performance.
RISC is an acronym for '(reduced instructed set computer," a concept popularized a few years ago by researchers at Stanford, Berkeley and IBM. In this paper, however, "RISC processor" will be used more loosely to designate any of the recently developed high-performance floating-point microprocessors.
Indeed, the issues raised herein have very little, if anything, to do with the basic concept of reduced instruction set computing. Rather we will deal with other features common to many of these processors.
In the following, 64-bit data is assumed in all discussion of performance rates. Thus, for instance, Mflop/s denotes millions of 64-bit floating point operations per second, and Mword/s denotes millions of 64-bit words per second.
Division
Several features common to many current RISC processors result in less-than-optimal performance when systems incorporating these processors are used for large-scale scientific computations.
One example of such a feature is the design of the divide operation.
The IBM RS6000 processor has an IEEE-compliant floating point divide operation, which employs a Newton iteration scheme and requires 20 clock periods.
The Hewlett-Packard PA-RISC floating point divide also requires 20 clock periods. The MIPS R4400, which is used in the new Challenge series workstations from Silicon Graphics, Inc., requires 36 clock periods. Of currently popular RISC processors, it appears that only the Sun SuperSPARC can perform a floating point divide in under ten clock periods (see Table 1 ).
On the Intel i860, divisions are performed in software by means of a Newton iteration scheme. This sequence requires 38 clock periods.
If the inner loop, n of these are floating-point divisions.
Such considerations are not entirely academic.
Colleagues of the author at NASA Ames Research Center, who developed and implemented the NAS Parallel Benchmarks [6, 7] , have found that the slow performance of the divide operation on the Intel i860 processor significantly reduces the sustained performance of the iPSC/860 system on the block tridiagonal and scalar pentadiagonal benchmarks. Figure 1 give performance rates with the Mvect option, which utilizes an instruction that bypasses cache. As can be seen, this results in improved long vector performance, but poorer short vector performance. Even with this option, long vector performance is only about 60% of the short vector, cached data performance.
For other types of Fortran loops typical of scientific application codes, the results are similar. For instance, on the NAS Parallel Benchmarks [7] , the 128-node Intel iPSC/860 system typically achieves only about five percent of its peak multiprocessor performance. This is partly due to contention in the network, but the other principal factor is low single node performance, typically only four to six Mflop/s from Fortran or C.
While further improvements in the compiler can be expected to marginally improve these rates, there seems little hope that sustained rates of more than eight to ten Mflop/s will ever be achieved on the main body of real scientific codes [11] . IBM RS6000 cache produces some rather odd behavior with respect to strides.
Programmers of cachebased systems are prepared for slowdown anytime the strides of memory accesses are not unity, since these result in poor cache utilization. But few programmers are aware that additional slowdowns result for certain strides such as 73 or 102. It turns out that such strides result in poor performance because they are very nearly simple fractions of 512 [5] .
The Hewlett-Packard PA-RISC processor runs at 99 Mhertz. It has a large cache, typically one Mbyte, and it can fetch or store one operand every clock period (after the first ). However, data in main memory data can be fetched or stored at a rate of only one word every three clock periods. As with several of the other processors, the PA-RISC can perform up to two floating point operations every clock period.
The R4400 processor, which is used in some new Silicon Graphics workstations, runs at 100 Mhertz. It can fetch or store a operand from the internal cache every clock period. It has an external cache, typically one Mbyte, from which it can access a word every other clock period, which is the same rate for data in main memory.
It can perform a floating point operation every other clock period, for a peak rate of 50 Mflop/s. Performance Characteristics has been added.
The DEC "Alpha" RISC processor features a full 64-bit design and-a 150 to 200 Mhertz clock, depending on model.
Since it can perform a floating point operation in every clock period, its theoretical peak performance is cited as 150 to 200 Mflop/s. However, in spite of its much faster peak speed, the size of the Alpha's primary cache is the same as that of the i860: only eight Kbytes.
Like other recently announced RISC processors, the bandwidth (and latency) to and from main memory for the DEC Alpha is significantly lower than that of the cache. The Alpha can load or store a word from main memory only about every four clock periods (peak rates). Thus the ratio between achievable internal cache and main memory bandwidths is at least four.
The latency for main memory accesses on the Alpha is about 20 clock periods, which is more than six times that of cache accesses (three clock periods).
In an attempt to deal with this disparity, the DEC Alpha, like some others, features support for a secondlevel external cache, typically consisting of about one Mbyte of SRAM. The peak rate for loads and stores from second-level cache is one word every two to three clock periods, and the latency is roughly eight clock periods.
This information is summarized in Table 1 . It is interesting to note that the Linpack 100 x 100 performance of an individual system appears to be well estimated as one-half of the minimum of the memory bandwidth and peak performance figures. The only systems that achieve significantly greater rates than this estimate are the DEC Alpha and HP PA-RISC processors.
But these systems benefit from an external cache, which is sufficient to completely contain the Linpack 100 x 100 benchmark.
Indeed Note that this figure is much closer to half of its main memory bandwidth figure (37.5 Mword/s).
No Linpack 100 x 100 figure was available for the SuperSPARC; however, its Linpack 1000 x 1000 figure of 22 Mflop/s is in general agreement with this rule.
Performance per Dollar
One advantage of RISC processors over vector processors is that they obtain respectable performance on loops with apparent or real recursions, or with short loop lengths.
On the other hand, with some tuning the vector systems generally achieve rather high performance.
On Other important scientific applications are not so fortunate.
The most common reasons for less-thanoptimal performance are: (1) the application was previously implemented on another system (such as a vector system) where data locality was not a performance issue; (2) the code employs numerical algorithms that do not feature high levels of data locality.
We have already discussed the impact of the divide operation on large-scale scientific applications. Now let us investigate the impact of the cache design on these codes. To that end, four types of numerical algorithms will be examined to see whether they can be expected to run well on cache-based RISC processor systems (1) as is, (2) after processing by intelligent software tools that may be available in the future, (3) after significant revision by expert programmers, and (4) after the wholesale substitution of advanced, cacheefficient algorithms in important compute-intensive routines.
In the following, by an "expert programmer", we will mean a programmer who is highly skilled in tuning scientific codes on cache-based computers and is generally knowledgeable in numeric techniques, but who does not necessarily have specialized expertise in state-of-the-art numerical algorithms.
Loop Ordering
The order in which loops are ordered can significantly affect the sustained performance of this code on a cache-based system. Consider, for example, the following code fragment: DIMENSION A(15, 1000), C(15) . . . This code design features unit stride inner loops, which are definitely better for cache systems. The fact that the inner loop vector length is only 15 has no adverse consequence on a RISC system.
On the contrary, this limited vector length is an advantage on the RISC system -thecvector is cache-resident and hasavery high level ofdatare-use. Indeed, while the first variant runs nearly twice as fast than the second on a Cray Y-MP, the opposite is true on a node of the Intel iPSC/860.
In this csse, "smart" compilers are now available that can automatically perform this loop interchange. But other csses require changes over multiple subroutine boundaries, rather than merely within the context of a single nested loop as above (consider the case where the loop lengths are not constants as above but instead are subroutine arguments). . Thus it appears it will be a while before highly effective, fully automatic software tools of this sort are available.
Transpose Operations
At the heart of many two and three-dimensional scientific applications, similar operations must be performed in each dimension. On vector computers these are often implemented as a sequence of calls to subroutines that are virtually identical, except for the dimension in which multidimensional arrays are accessed.
It is clear that a Fortran routine which accesses arrays by other than the first dimension will not perform well on a cache-based system, since such accesses have nonunit stride. One solution is to revise the design of such programs to perform array transpositions between the computational steps, so that computations can always be done with unit stride data accesses, The resulting code may even be simpler than before, since often the same unit stride computational routine can be employed in each dimension.
It should be pointed out, however, that in some cases the performance of the resulting code is not much better than the original, since there is insufficient computation to offset the cost of the transpose operation.
But in other csses it is profitable to make such a change. Even if an application is recoded with a design that interleaves computation and array transposition steps, there remains the problem of performing array transpositions efficiently on a cache-based system.
Straightforward
Fortran loops to transpose a large matrix are not effective since they involve typically involve large memory strides.
It is possible to perform transpositions in a cacheefficient manner by "blocking" the transpose: one fetches two opposing blocks from main memory to cache (note that each column of these blocks can be fetched with unit stride), transposes each block in cache, and then returns each of the transposed blocks to the opposite location. This is best described as follows, where & and Bij denote blocks small enough that both can simultaneously fit in cache:
Another efficient scheme for transposing arrays in a hierarchical memory system is based on Frsser's algorithm. This scheme is described in [3] .
It is certainly possible now for expert programmers to make these transformations, although it is laborious to do so for a very large program.
It is probable that future compilers and other software tools will be able to automatically transform such code into more cacheefficient designs. But as before, it seems unlikely that highly effective tools of sort will be available for two or three more years.
Fast Fourier Transforms
The next example to be examined is the computation of the one-dimensional fast Fourier transform (FFT).
Unfortunately, many FFT algorithms in the literature, if implemented in a straightforward manner, include such undesirable features as large, powerof-two memory strides. These strides also hamper vector computer performance.
Fortunately, there are variant FFT algorithms that do not involve power-oftwo memory strides, and which in fact can be implemented with exclusively stride one memory accesses
[4]. These algorithms have been used on vector computers for some while.
However, even these algorithms typically have one feature that is undesirable for cache systems: the principal main memory arrays are both accessed roughly t times to perform a 2t-point FFT. Is it possible to reduce the number of times these data arrays are accessed? Yes, as it turns out. However, no amount of loop restructuring or other superficial manipulation of the code will accomplish this. Instead, a completely different technique for performing the FFT must be employed. One such algorithm can be sketched as follows, where the size of the input complex vector is n = nl n2 words, and where the cache is assumed to hold at least 4b max(nl, nz) words. See [3] for details.
1. Consider the data in main memory aa a nl x nz complex matrix in column-major (Fortran) order.
Fetch the data b rows at a time into the cache.
For each batch of b rows, perform b individual n2-point FFTs on the b x n2 complex array in cache.
2. Multiply the resulting data in each batch by certain roots of unity: the element fet ched from location (j, k) in the original complex matrix is multiplied by e-2=ijkln.
3. Transpose each of the resulting bx nz complex matrices into a n2 x b matrix, using the cache, and store the resulting data in main memory. Store successive bat ches of data in successive contiguous sections of main memory. rows to the same locations in main memory from which they were fetched. The result is a correctly ordered discrete Fourier transform.
With this algorithm, the principal main memory data arrays need only be accessed two times, no matter what the size of the FFT. Note that all of the individual nl-point and n2-point FFTs are performed entirely in the cache.
It is true that the FFT is an operation that is often available in vendor-supplied libraries, and thus one could argue that individual scientists do not need to be concerned these issues -they can merely use library routines. However, many scientists regrettably have incorporated their own "home-grown" FFT routines into their program files, in order to make their codes more easily port able between systems. Also, it is important to note that vendor-supplied FFT routines in many cases do not employ an advanced algorithm. For instance, the FFT routine supplied by Intel on the iPSC/860 employs a conventional algorithm that is efficient only for transforms small enough to fit into cache (see Figure 2) . Of the major RISC vendors, only IBM appears to have implemented an advanced FFT algorithm
[1] similar to the one described above.
Iterative Methods
Iterative methods are increasingly important in solving large, sparse systems of linear equations. They are typically found at the center of economically important, large-scale computations. The most important methods in use today include the multigrid and Krylov subspace methods, of which the conjugate gradient method is the prototype.
The book of Golub and
Van Loan [10] is a good reference for these methods. Unfortunately, these methods cannot be restructured to work well with memory hierarchies.
The problem is to solve the equation Az = b where A is a given sparse nonsingular matrix and b a given vector. The set of nonzero elements of A is generally much too large for cache, since it typically requires many Mbytes of storage. The conjugate gradient and related methods compute a sequence {x~} of vector iterates that converges to the solution z.
The nature of these algorithms is that all of A must be accessed from memory, once per iteration, before any of it is accessed again, so the computation proceeds at main memory rather than cache speeds. In the case of the Krylov subspace methods, a dot product involving the vector Axk has to be computed before $k+l can be. Thus, the matrix-vector product at one iteration cannot be pipelined in any way with the matrix-vector product at the preceding or following iteration.
With the multigrid methods, data on fine grids is accessed only one or two times before computation passes to coarser grids. These grids thus represent a bottleneck analogous to the dot products in Krylov subspace methods.
At the present time, there is a significant amount of research activity in this field, and there is some hope for improved algorithms in the future. Improved preconditioners, for example, may reduce the number of iterations required for the conjugate gradient method to converge. Domain decomposition methods, which work by solving a sequence of smaller systems of equations for subsets of the unknowns, use much smaller sparse matrices that may fit in cache. There is also some hope for "block iterative methods," which may provide the power of the iterative schemes in a more cache-efficient design. Some other iterative methods, such as the Jacobi, Gauss-Seidel, and successive overrelaxation methods, may be organized to work well with memory hierarchies.
In general, however, these latter methods are much slower and less reliable than multigrid and Krylov subspace methods. In any event, none of these alternate schemes can be automatically derived from the codes for other iterative methods.
Future Outlook
The problem of quantifying data locality in various types of scientific applications is a very important and timely one. This problem is closely related to the question of whether an application can be efficiently implemented on a distributed memory multiprocessor.
Clearly this problem deserves a great deal of study by the research community, and the answers will only be evident after many scientists have studied many different applications.
We have seen that in each of the four cases studied above, there are either known algorithms and implementation techniques that are better suited to cache memories, or else there is some hope that advanced algorithms
and tools now being developed will provide scientists with a cache-efficient alternative in the future. It might turn out that it is always possible to find algorithms and implementation techniques that are reasonably cache-efficient, in the sense that data once fetched into the cache can be profitably accessed numerous times before being returned to main memory. In that case the RISC vendors' assumption that caches are broadly effective in sustaining high performance will be upheld.
On the other hand, it might be that there remain a significant number of important scientific applications which feature an unavoidably low degree of cachelevel data locality. For these problems, RISC systems with large ratios in performance between data in cache and main memory will have much less of an edge in sustained performance per dollar, compared with other architectures, and may even be surpassed in this statistic by other designs.
While there is some uncertainty on the long-term outcome on this question, in the near term (within the next two or three years), the answer seems considerably clearer: a significant segment of important scientific applications will not perform at optimal rates on current cache-based RISC systems. Thus even at this point in time, it seems essential that designers of cache systems be reasonably modest in their designs.
An analogy with the history of vector computing is instructive in this regard. One of the early vector systems, the CDC 205, featured superior performance on codes with long, unit stride loops.
Other vector systems, not ably the Cray-1 and Cray X-MP, featured a more flexible design, allowing nonunit strides and making more modest assumptions about typical vector lengths.
These latter systems, while they could not compete with the CDC 205 on specially selected, highly-tuned codes, exhibited respectable per-formance across a much wider spectrum of real-world applications. This was a principal reason that they ultimately prevailed.
Conclusions and Recommendations
Many in the scientific computing community have observed that both the workstation market and the high-end supercomputer market have the same design goal: a high level of sustained performance on real scientific and engineering applications. These same persons have further argued that the massive investment being made in RISC microprocessors will force supercomputer vendors to employ these "commodity"
processors in their systems.
However, this may not necessarily happen -it may turn out that the workstation market has sufficiently distinct requirements from high-end supercomputing that each will pursue its own path in processor de- If this divergence does occur, then much of this discussion may prove moot. On the other hand, if it becomes clear that the requirements of these two markets are converging, then it is essential that both designers and the users of RISC systems consider these issues. Otherwise both the high-end supercomputing market and even the workstation market itself may suffer from less-than-optimal sustained performance on many important applications.
Two features of current RISC processors have been discussed in length above: (1) the large cost of divide operations, relative to the cost of adds and multiplies, and (2) the limited bandwidth between processors and main memory, relative to the bandwidth between processors and cache, or in other words overly optimistic ratios between cache and main memory performance.
Both are problems of degree rather than fundamental concept.
What does the author recommend? First of all, I suggest that the cost of an integer divide operation be no more than about ten times the cost of an integer add or multiply, and that the cost of a floating point divide operation be no more than about ten times the cost of a floating point add or multiply (based on a series of divides with non-constant divisors). This ratio is higher than the ratio on Cray systems (four), but it seems acceptable even for applications that make heavy use of divisions.
With respect to main memory bandwidth, I certainly urge designers of future systems to make this as large as possible, recognizing that there will always be design costs and trade-offs that will limit this. What about the balance between bandwidth and latency to main memory, as compared with bandwidth and latency to cache? A factor of two appears tolerable, and a factor of four is not fatal. But I believe that ratios beyond this level will not add significant value for many large scientific codes, and whatever resources were devoted to such caches could more profitably be employed in seeking ways to obtain higher bandwidth and lower latency to main memory.
One additional useful design point can be deduced as follows. Just as programmers of vector systems can be expected to tune their codes so that inner loops are reasonably long and so that strides are not powers of two, it is reasonable to ask some tuning effort by programmers of cache-based RISC systems. For example, it is reasonable to assume that with some effort many scientific application can be revised to feature mainly unit stride accesses (for instance by employing computational steps interleaved with array transpositions).
In return, it seems reasonable to ask that future RISC-based systems be able to process contiguous data without significant slowdown. For inst ante, it seems reasonable to insist that a unit stride DAXPY loop run equally fast for data in cache as for long vectors that cannot be held in cache.
It may be that changes can be made to the design of the cache systems in RISC processors to make them more broadly effective for large-scale scientific computation.
Some have suggested special instructions that load data from main memory, at a stride, directly into the registers (the Intel i860 has an instruction along these lines). Another possibility is to make changes to the basic cache organization, such as the number associativit y classes. Parallel computing using RISC processors would receive a boost if processors included support' for high speed, low latency interprocessor communication (such as "virtual shared memory" ). At the present time, manufacturers of parallel systems that employ RISC processors must utilize separate custom-designed devices to provide this functionality, and the cost of this custom circuitry is emerging as a dominant factor in the total cost of the system. Even in a workstation system, such features could prove valuable as a basis for implementing multiworkstation, networked parallel computing.
In any event, it is a pity that there is not a greater degree of mutual communication between the RISC processor community and the scientific computing community.
Clearly both communities have much to gain from such interaction. This article is written with the hope of fostering this dialogue.
Acknowledgments
The author wishes to acknowledge a significant con- 
