We have developed a performance bounding methodology that explains the performance of loop-dominated scientific applications on particular systems. We model the throughput of key hardware units that are common bottlenecks in concurrent machines. The four units currently used are: memory interface, floating-point, instruction issue, and a "dependence unit" which is used to model the effects of performance-limiting recurrences. We propose a workload characterization, and derive upper bounds on the performance of specific machine-workload pairs. Comparing delivered performance with bounds focuses attention on areas for improvement and indicates how much improvement might be attainable.
Introduction
Computer scientists and engineers use performance evaluation as a tool to achieve several different goals. Computer architects are interested in understanding existing and proposed machines in order to improve the design of new machines. The developers of libraries, compilers and operating systems focus on effective utilization of machine resources. Application developers are concerned with optimizing specific programs by understanding performance bottlenecks. End users may only be interested in choosing the fastest or most cost-effective machines and application packages.
An effective performance evaluation technique can provide insight for each of these groups. Many researchers have evaluated scientific computers by focusing on the expected performance. These studies may involve detailed measurement on implemented machines (e.g. [1, 2]), the timing of large applications (e.g. [3, 4]), or analytic performance models (e.g. [5]). We believe that a more appropriate approach to improving performance for scientific applications is to bound the best achievable performance that a machine could deliver on a particular code and then try to approach this bound in delivered performance.
This approach more accurately reflects the manner in which the scientific computing community views performance and the extent to which they are willing to work to optimize their codes.
We present a technique for determining and approaching performance bounds for scientific loop-dominated codes, using the Livermore Fortran Kernels on the IBM RS/6000 high performance workstation as a running example to illustrate the method. These bounds focus on the latency and bandwidth of specific machine components, particularly memory and floating-point units, since these units are common bottlenecks that are critical to the design of today's highly concurrent scientific computers. The remainder of this section discusses the relationships between the memory and function unit bandwidths and overall performance, as well as between their latencies and register storage requirements. Section 2 presents a performance bound model for the RS/6000 and uses it to evaluate and improve delivered performance. Section 3 develops a theory for evaluating the register requirements of an optimally scheduled scientific loop as a function of the operation and memory access latencies of a machine.
Optimum Performance for a Scientific Machine-Application Pair
The term "scientific application" refers here to programs that are dominated by loop constructs (e.g. FORTRAN DO-loops) iterating over large arrays of floating-point data.
Examples include LU decomposition, fast Fourier transforms, and finite element techniques.
Because the vast majority of the execution time for these programs is spent in relatively small and well structured code segments, programmers can afford to invest focused effort in code optimization, discovering techniques that will hopefully be included in future compilers, and hardware quirks or imbalances that hopefully will be eliminated in future systems.
Previous attempts to discuss optimum performance for scientific computation have generally been dismissed because the typical measures were far from achievable. For example, peak MFLOPS (million floating-point operations per second) figures that are based solely on hardware factors are frequently quoted, but rarely useful. Instead, we propose a performance bound that combines machine and application characteristics and represents an asymptotically achievable measure of optimum performance, at least for loop-dominated scientific code.
Bandwidth and Latency Limits on Optimum Performance
Scientific loop-dominated codes typically present significant opportunities for effective optimization. Consider the following nested Fortran loop, implementing a matrix multiplication:
DO [...]
Each iteration of the inner loop is logically independent of all other iterations. In a fully pipelined CPU, the inputs for one iteration can be read from memory while the results from previous iterations are being calculated. Two logical processes are operating in this loop: the Access process moves data between the CPU and the memory, while the Execute process operates on that data [6].
While this example code is vectorizable because of the independence between the iterations of the inner loop, non-vectorizable code may also exhibit such access-execute parallelism. The typical inner product method of writing a matrix multiplication iterates over K in the inner loop. The program then has data dependencies (through the value C(I,J)) that prevent successive floating-point multiply-add operations from executing concurrently (unless the compiler is clever enough to recover and restructure the underlying algorithm).
However, since A(I,K) and B(K,J) are never written to, copies may be loaded concurrently with the execution of an earlier iteration of the inner loop, and results may be stored later.
Access-execute parallelism is inherent in both versions of the program.
The original version of the matrix-multiplication example executes three memory and two floating-point operations for each inner loop iteration. The key to achieving high performance for this loop code, on most competitive machines, is to utilize 100% of the available memory bandwidth. For the second version of the code, two memory and two floating-point operations are required for each inner loop iteration and performance may be limited by the latency of the floating-point add (the minimum possible time between a floating-point add and a later dependent operation).
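The operation counts quoted above can be checked with a small Python sketch. The loop listing for the first version is not available here, so the ordering used in `matmul_inner_i` (inner loop over I, with B(K,J) held in a register) is an assumption chosen only because it is consistent with the stated three memory and two floating-point operations per iteration. The function names and the counting scheme are illustrative.

```python
def matmul_inner_i(a, b, c):
    """C = C + A*B with the inner loop over I (assumed ordering).

    Inner iterations are independent of one another.
    Returns (memory ops, floating-point ops) per inner-loop iteration.
    """
    n = len(a)
    mem_ops = flops = iterations = 0
    for j in range(n):
        for k in range(n):
            b_kj = b[k][j]  # invariant in the inner loop: kept in a register
            for i in range(n):
                # load C(I,J), load A(I,K), store C(I,J); one multiply, one add
                c[i][j] = c[i][j] + a[i][k] * b_kj
                mem_ops += 3
                flops += 2
                iterations += 1
    return mem_ops / iterations, flops / iterations


def matmul_inner_k(a, b, c):
    """Inner product form: the inner loop runs over K.

    C(I,J) is accumulated in a register, so each add depends on the last.
    """
    n = len(a)
    mem_ops = flops = iterations = 0
    for i in range(n):
        for j in range(n):
            acc = c[i][j]  # loaded once, outside the inner loop
            for k in range(n):
                # load A(I,K), load B(K,J); one multiply, one dependent add
                acc = acc + a[i][k] * b[k][j]
                mem_ops += 2
                flops += 2
                iterations += 1
            c[i][j] = acc  # stored once, after the inner loop
    return mem_ops / iterations, flops / iterations
```

Both orderings compute the same product; they differ in the memory traffic per inner iteration and in whether successive additions form a dependence chain.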
Register Requirements for Tolerating Latency
Most high speed scientific computers are designed to provide high peak performance for codes with large amounts of access-execute parallelism, and some execute parallelism as well. This is accomplished by providing high bandwidth floating-point and memory units and very fast system clocks. High bandwidth and fast clocks can be expensive to provide, and frequently result in deep pipelines with increased latency. For example, memory reads in the Cray-2 can be issued at the rate of one per 4.1 nanosecond clock, but their latency is typically 45 clocks. Application parallelism requirements are less severe for superscalar machines with caches.
Long latency operations do not reduce performance if enough parallelism is available in the application, sufficient effort is spent in program optimization, and enough buffer space is present in the processor. Buffer space, usually implemented as processor registers, is reserved to store the results of pipelined operations as soon as they are issued. If performance is not sacrificed as operation latencies increase, then more operations will be executing concurrently in the pipelined function units and the memory, and more buffer space will be required. Section 3 addresses this issue by considering the relationship between operation latencies and register requirements.
Bounding Optimum Performance: IBM RS/6000
This section develops a method of finding an upper bound on optimum performance for a scientific application on a highly concurrent processor [7, 8]. The methodology is illustrated by deriving and using these bounds to evaluate and improve the performance of the IBM RS/6000 (figure 1) on a set of scientific code kernels.
The run time of an application is bounded from below by the product of the minimum amount of time (number of clocks) that must be spent per inner loop iteration, t_l, and the number of iterations. This bound can be made fairly tight for loop-dominated scientific applications. For a highly concurrent machine, t_l is determined by first computing the minimum time that must be spent in each of several potential bottleneck units of the machine and then taking the maximum of these times. The bounds formulation for the IBM RS/6000 (figure 2) identifies these times for four common bottleneck units:
t_f for the floating-point unit, which is evaluated by counting the number of floating-point [...] which the CPF bound would have to be recalculated.
When the t_l bound is achieved, the unit(s) selected by the max function are kept continuously busy by the operations that are counted. Other units, other operations, and other factors that may limit concurrency within and among these units are simply ignored.
When these do inherently limit performance, the t_l bound will not be achievable in practice. The goal of this research and this case study is to evaluate performance relative to this simple bound, to find mechanisms for approaching the bound performance, and to explain the gaps between achievable and bound performance when they do occur. The first 12 Livermore Fortran Kernels [13] are used as the application workload for the experimental study in this paper. Figure 3 shows that actual measured performance for compiled code averages only 52.4% of the architectural performance bound. Only LFK3 and 4 achieve 90% or more of their performance bounds; the worst loop (LFK11) achieves only about one fourth of its bound.
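The max-of-unit-times bound described in this section can be sketched in a few lines of Python. The symbols t_f and t_d come from the text; the names `t_m` and `t_i` for the memory and instruction-issue times, the function names, and every number used below are illustrative assumptions, not values taken from the study.

```python
def loop_bound(t_f, t_m, t_i, t_d):
    """Return (t_l, limiting units) for one inner-loop iteration.

    Each argument is the minimum number of clocks one iteration must spend
    in that unit; the bound t_l is the largest of them.
    """
    times = {
        "floating-point": t_f,
        "memory": t_m,       # t_m is a hypothetical name for the memory time
        "issue": t_i,        # t_i is a hypothetical name for the issue time
        "dependence": t_d,
    }
    t_l = max(times.values())
    limiting = [unit for unit, clocks in times.items() if clocks == t_l]
    return t_l, limiting


def percent_of_bound(measured_clocks_per_iteration, t_l):
    """Delivered performance as a percentage of the bound performance."""
    return 100.0 * t_l / measured_clocks_per_iteration


# A made-up kernel: 3 FPU clocks, 4 memory clocks, 2 issue clocks, no recurrence.
bound, units = loop_bound(t_f=3, t_m=4, t_i=2, t_d=0)
efficiency = percent_of_bound(measured_clocks_per_iteration=8, t_l=bound)
```

For this made-up kernel the memory unit sets the bound at 4 clocks per iteration, and a measured 8 clocks per iteration corresponds to 50% of the bound.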
Improving Delivered Performance
The gap between delivered and architectural (bound) performance has two causes: the compiler may fail to optimize the code well (resulting in poor performance) or the model may [...] in the compiled inner loop body code of 9 of the 12 LFKs (Table 2). To assess their effect on performance, we applied two methods to remove them.
In the first method, the registers are re-allocated so that the content of the registers that store reusable data will not be changed until the data is used. Arithmetic operations are then recoded to use the data in these registers, and the redundant loads are discarded.
Some initial loads may need to be inserted before the loop. The second method is used only in LFK12 where the first method is difficult to apply. The loop is unrolled twice and different registers are used for odd and even iterations. Redundant stores are simply removed and final stores are inserted after the loop. Since there is no official disassembler for the RS/6000, nor an executable assembly code output from the compiler, we used Goblin [15] to disassemble the compiled code, and performed the necessary modifications on that code.
To illustrate the first method, consider the following example.
Example 1: Excise the redundant loads in the compiled code of LFK1
From the source code of LFK1 (figure 4), it is evident that ZX(K+11) could be used as ZX(K+10) in the next iteration. However, the current compiler loads both of them in every iteration. The modified code reserves fp0 to hold the reusable data, ZX(K+11), for the first instruction of the following iteration, and eliminates the redundant load. Notice that the code execution sequence had to be reordered and the initial value of ZX(K+10) had to be loaded before the loop begins.
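The effect of Example 1 on memory traffic can be illustrated at source level with the Python sketch below. It assumes the standard form of LFK1, X(K) = Q + Y(K)*(R*ZX(K+10) + T*ZX(K+11)), uses 0-based indices, and introduces a hypothetical `CountingArray` helper to count loads. It models the idea of carrying ZX(K+11) into the next iteration; it is not the hand-modified RS/6000 code.

```python
class CountingArray:
    """A read-only array that counts how many loads are issued against it."""

    def __init__(self, data):
        self.data = list(data)
        self.loads = 0

    def load(self, index):
        self.loads += 1
        return self.data[index]


def lfk1_compiled_style(n, q, r, t, y, zx):
    """Loads Y(K), ZX(K+10) and ZX(K+11) in every iteration."""
    x = [0.0] * n
    for k in range(n):
        x[k] = q + y.load(k) * (r * zx.load(k + 10) + t * zx.load(k + 11))
    return x


def lfk1_carried(n, q, r, t, y, zx):
    """Keeps ZX(K+11) in a variable (a register) for the next iteration."""
    x = [0.0] * n
    if n == 0:
        return x
    z_low = zx.load(10)  # initial load inserted before the loop
    for k in range(n):
        z_high = zx.load(k + 11)
        x[k] = q + y.load(k) * (r * z_low + t * z_high)
        z_low = z_high  # reused as ZX(K+10) in the next iteration
    return x
```

In this model the first version issues three loads per iteration and the second issues two per iteration plus one before the loop, while both produce identical results.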
Usually, removing redundant memory operations increases performance. However, for loops such as LFK7 and 8, which have more floating-point arithmetic instructions than essential memory operations, eliminating all redundant memory operations does not guarantee higher performance. Moreover, when register reallocation is difficult, necessitating a large degree of unrolling or register moves through the FPU pipeline [16], the burden on the FPU can increase. Consequently, only one memory reference was removed in LFK7, and three in LFK8. All the redundant memory operations were removed in the other loops.
The resulting performance improvement is shown in figure 5.
Unroll Twice and Interleave
The next step is to reduce the stalls caused by RAW dependence among FPU arithmetic instructions. (Footnote 2: In the IBM RS/6000, when an lfd (floating-point load double) and a dependent FPU arithmetic instruction are issued in the same cycle and the datum to be loaded is in cache, the operand will arrive before the arithmetic instruction needs it, and no stall cycle occurs. Thus we only consider dependent FPU arithmetic instructions.) In the RS/6000, the second instruction of two consecutive dependent FPU arithmetic instructions will typically be stalled for one clock before entering the FPU pipeline (FPU pipeline latency for multiply-add is usually 2, sometimes more when data exceptions are possible) [14]. These stall clocks are added to the time, t_f, required in the FPU. Table 2 shows the number of such RAW stall cycles per iteration for the compiled code. For example, LFK9 with no redundant memory operations suffers at least 8 RAW stall clocks due to this effect, which aggravates the FPU bottleneck. One independent FPU instruction placed between two consecutive dependent FPU instructions will fill an otherwise wasted stall cycle in the pipeline. It may be possible to achieve this by reordering the instructions [...] have at least one such conflict. These conflicts were eliminated by unrolling and interleaving three loop iterations, or by a similar cyclic scheduling technique, which can always place a dependent store or a nonstore instruction three clocks after an FPU arithmetic instruction.
Although LFK10 also has this problem, it was not eliminated since overall performance was not improved.
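How interleaving two unrolled iterations fills the one-clock RAW stall can be seen with a toy issue model in Python. The model is an assumption made for illustration: a single in-order FPU issue slot per clock, and a result usable two clocks after its producer issues (the usual multiply-add latency quoted above). It ignores every other RS/6000 constraint, and the instruction sequences are invented.

```python
def fpu_clocks(deps, latency=2):
    """Clocks needed to issue a sequence of FPU instructions in order.

    deps[i] is the index of the earlier instruction whose result
    instruction i reads, or None if it reads no pending FPU result.
    """
    issue = []
    for dep in deps:
        earliest = issue[-1] + 1 if issue else 0  # one issue slot per clock
        if dep is not None:
            earliest = max(earliest, issue[dep] + latency)  # RAW stall
        issue.append(earliest)
    return issue[-1] + 1 if issue else 0


# Two iterations, each a chain of four dependent FPU operations.
# Back to back: a0 a1 a2 a3 b0 b1 b2 b3
sequential = [None, 0, 1, 2, None, 4, 5, 6]
# Unrolled twice and interleaved: a0 b0 a1 b1 a2 b2 a3 b3
interleaved = [None, None, 0, 1, 2, 3, 4, 5]

sequential_clocks = fpu_clocks(sequential)
interleaved_clocks = fpu_clocks(interleaved)
```

Under these assumptions the back-to-back order takes 14 clocks, with a stall between every dependent pair, while the interleaved order takes 8 clocks with no stalls.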
Fine-Grain Code Scheduling
In [...]
Performance Improvement
Figure 5 shows the step by step performance gains from these code modifications of LFK1-12. The average performance was increased from 52.4% to 77.6% of the performance bound by eliminating redundant memory operations. This simple procedure (applied initially) contributed most to the average performance improvement. The largest gains are in LFK5 and 11, where the first lfd instruction of each iteration in the compiled code reloads the data value that was stored by the last instruction in the previous iteration. These load operations are stalled until the data are written into the data cache from the Pending Store Queue. When these redundant loads were removed, performance was significantly improved. The second step elevated average performance to 87% of the performance bound.
Although the net contribution of this procedure is less than that of the first step, several loops posted significant improvement gains. The final average performance achieved after the third step is 93.7% of the performance bound, representing a 1.79 times speedup over the original compiled code (table 3). These hand coded improvements of the disassembled code were achieved by applying fairly general techniques to approach the performance bound, indicating that the RS/6000 compiler (release 2.01) could improve delivered performance significantly.
Measuring Steady-State Inner Loop Performance
In figure 5, the final performance achieved for some loops (LFK2 and 6, as well as 4, 9, and 10) is still noticeably less than the bound. To better bound their performance, we now consider effects previously ignored in the model. The model ignored cache miss effects, but these loops fit entirely in cache and no cache miss penalty was observed in our measurements. Figure 6 displays the final achieved performance as in figure 5, except that the measured CPF for LFK2, 4, 6, 9 and 10 are taken to be the values of c in Table 4. With this adjustment, all loops are seen to achieve at least 94% of the performance bound in steady state, and the average is 97.6%.
Thus for the RS/6000, this simple bound model is believed to characterize extremely well the achievable (architectural) steady-state performance for scientific inner loops, and c and k n ?h are believed to characterize, respectively, the achieved steady-state inner loop performance and the run-time overhead of the complete scientific kernels, except for cache miss effects and register spilling, which have not occurred here.
Table 4: The achieved and bound performance for LFK 2, 4, 6, 9, and 10 using the derived k, h and c values of formula 2.1.
Preprocessor and New Compiler
We have used the preprocessor (V2.01) of the RS/6000 to see how much it can improve delivered performance. By manually inserting directives and setting various optimization switches, the best harmonic mean performance we obtained is 13.4 MFLOPS over the first 12 LFKs. The performance gain is less than half of the hand code improvement primarily because the compiler does not effectively take advantage of the preprocessor's source code restructuring. It still fails to detect the redundant memory operations, and is not able to remove most of the RAW hazard stalls even when the basic block is enlarged by unrolling.
The new RS/6000 V2.02 compiler, even without the preprocessor, improves the harmonic mean performance to 14.3 MFLOPS (1.39x improvement over V2.01) for the same workload. Most of this performance gain was achieved by removing the redundant load in each of LFK5 and 11. Since these loads were stalled due to the still pending stores to the same locations, performance of the loops increased by 2.25x and 2.69x, respectively. Even though the V2.02 compiler eliminated 6 redundant stores and one RAW hazard in LFK10, its performance improved by only 10%. The other loops were not affected by more than 5% and some actually ran slower. Adding the V2.02 preprocessor gains another 1.10x improvement in the harmonic mean to 15.7 MFLOPS due to restructuring such as loop interchange and unrolling. However, unless the compiler can take advantage of the preprocessor by employing more sophisticated scheduling to remove stalls, the performance improvement that can be obtained by the preprocessor is limited. Moreover, although the V2.02 compiler is much improved over V2.01, it can still be further improved by dealing in a systematic and goal-directed manner with the problems cited in previous sections.
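The harmonic mean used for the MFLOPS figures above, and the arithmetic linking the quoted ratios, can be sketched in Python as follows. The per-loop rates in the usage lines are invented for illustration; only 14.3 MFLOPS, 1.39x and 1.10x come from the text, and the V2.01 baseline is merely implied by them.

```python
def harmonic_mean(rates):
    """Harmonic mean of per-loop rates (e.g. MFLOPS), equal work per loop."""
    return len(rates) / sum(1.0 / rate for rate in rates)


# Consistency check on the reported figures.
v202_mflops = 14.3
implied_v201_mflops = v202_mflops / 1.39        # about 10.3 MFLOPS
with_preprocessor_mflops = v202_mflops * 1.10   # about 15.7 MFLOPS

# Invented rates: one slow loop pulls the harmonic mean well below the
# arithmetic mean, so fixing the slowest loops moves it the most.
slow_and_fast = harmonic_mean([5.0, 50.0])
after_fixing_slow_loop = harmonic_mean([12.5, 50.0])
```

This is one way to see why removing the stalled loads in just two loops (LFK5 and 11) could account for most of the change in the harmonic mean.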
Cache Characterization
In our LFK experiments, all data arrays of the programs fit in the IBM RS/6000 cache, but the data arrays in other applications or on other machines may not. Due to the progress of VLSI technology, cache system designs are becoming more complex and the memory subsystem descriptions provided by vendors are usually too simple to represent the real performance. In this section, we present a simple methodology that can be developed further to predict real memory performance. This method uses the idea of load/store kernels to learn about the cache structure and calibrate its performance [21]. The basic program used here is:
In this study, we ran the kernel many times to get steady state performance. One of the load kernels is
      DO 1 I = 1, ARRAY_SIZE, STRIDE*4
        P = P + X(I)*X(I+STRIDE)
        Q = Q + X(I+STRIDE*2)*X(I+STRIDE*3)
    1 CONTINUE
Two parameters, ARRAY_SIZE and STRIDE, were varied in order to obtain enough data for analyzing the cache characteristics. Figure 7 shows the cache performance measured by this method. It is evident that the average time required to load an 8-byte data word depends on the array size and stride. [...] fully serviced constitute the trailing-edge effect. Neither the leading- nor trailing-edge effects are fully documented in the available literature.
The performance of store references can be modeled similarly by replacing the load kernel with a store kernel. Mixtures of loads and stores and other stride patterns can also be characterized with corresponding kernels. This method is directly portable to other machines and leads to a straightforward experiment-based derivation of cache access bandwidth, latency, and inferred structure in the absence of full documentation. By characterizing an application's reference sequence, the results of this method can be used to add appropriate memory access latency effects to the bounds discussed above.
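The structure of such a load-kernel experiment can be sketched in Python as below. This is a transliteration of the strided access pattern of the load kernel shown earlier, with 0-based indexing; the padding of the array, the `time_per_load` helper and the `repeats` parameter are assumptions added for illustration. Interpreter overhead in Python swamps cache effects, so the sketch shows how the sweep over array size and stride is organized, not a usable measurement tool.

```python
import time


def load_kernel(x, array_size, stride):
    """Four strided loads per iteration, accumulated into two sums."""
    p = q = 0.0
    for i in range(0, array_size, stride * 4):
        p += x[i] * x[i + stride]
        q += x[i + 2 * stride] * x[i + 3 * stride]
    return p, q


def time_per_load(array_size, stride, repeats=5):
    """Best observed seconds per load for one (array size, stride) point."""
    # Pad the array so the highest index touched stays in range (assumption).
    x = [1.0] * (array_size + 3 * stride)
    loads = 4 * len(range(0, array_size, stride * 4))
    best = float("inf")
    for _ in range(repeats):  # repeat to approach steady state
        start = time.perf_counter()
        load_kernel(x, array_size, stride)
        best = min(best, time.perf_counter() - start)
    return best / loads


def sweep(array_sizes, strides):
    """Collect one timing per (array size, stride) pair."""
    return {(n, s): time_per_load(n, s) for n in array_sizes for s in strides}
```

A call such as `sweep([1024, 4096], [1, 2, 4])` returns one timing per parameter pair, which is the kind of table from which a plot of load time against array size and stride would be drawn.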
3 Register Requirements as a Function of Latency
As described in the previous section, a performance bound based almost solely on the bandwidth of various function units is nearly achievable for the LFKs. Except for those LFKs limited by t_d, the latencies of the function units have little effect on performance. The approachability of these performance bounds is primarily due to the ability of good code scheduling techniques to exploit a sufficient degree of the available parallelism within and among the loop iterations to mask memory and function unit latency, and to avoid adding redundant operations.
When a machine is targeted for such scientific workloads, computer architects tend to trade off increased latency (in processor clocks) for increased function unit bandwidth.
However, application parallelism can be used to mask increased latencies only if sufficient buffer space (e.g. CPU registers) is available. While the RS/6000 has enough registers for the optimum code schedules discussed above, it is important to understand how register needs grow as a function of latency in order to understand performance issues for other applications and future machines.
In this section, the relationship between pipeline depth (latency) and register requirements is determined for loops when t_d = 0 and no datum is used more than once, i.e. the dependence graph is a forest of trees. Register requirements are considered only for optimum performance schedules.
Cyclic Scheduling
Compilers that use cyclic instruction scheduling (CS) [17] algorithms for loop code are able to generate code schedules that reach optimum performance for a range of programs.
We will assume that CS is used to obtain optimum performance code schedules with the minimum register requirements of any such schedule, even though this problem is known to be intractable [23, 24]. In order to find an optimum performance schedule with the minimum register requirements, exhaustive search techniques were used [25]. [...] instruction is defined by a resource reservation template and a latency (or several latencies to different classes of readers of its result).
CS is derived from
As an example of Cyclic Scheduling, consider the following code from LFK12: 
Instruction Templates
Definition 1: Instruction Template. The template for a machine instruction indicates which hardware resources are needed to execute that instruction, and when (relative to instruction issue) those resources are reserved exclusively for this instruction.
Definition 2: Depth Independent Templates. An instruction has a depth independent template (DIT) if the template is invariant with respect to pipeline depth (latency), e.g. no ports where conflicts may occur are used when writing results either to memory or to the register file. Resources with possible conflicts can be used at any time that remains at some constant number of clocks after issue time.
The RS/6000 has depth-independent templates for the instructions modeled here under normal operating conditions. The problem of packing instruction templates from a kernel into an MRT of a given size is identical to the bin-packing problem, which is known to be NP-complete [23]. Since the RS/6000 is modeled as a DIT machine, the instruction templates do not need to be modified to model a machine with longer pipelines, and so there exists one bin packing problem for each kernel to be scheduled, independent of the pipeline depths.
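Packing instruction templates into an MRT can be illustrated with the Python sketch below. It reads MRT as a modulo reservation table, in which a resource reserved at clock c after issue occupies column (slot + c) mod II of that resource's row. The placement rule is a greedy first-fit heuristic, swapped in here for brevity in place of the exhaustive search mentioned earlier; it ignores dependences, and the templates, resource names and kernel are invented.

```python
def resource_mii(templates, kernel):
    """Smallest table size allowed by the busiest resource."""
    usage = {}
    for op in kernel:
        for resource, _offset in templates[op]:
            usage[resource] = usage.get(resource, 0) + 1
    return max(usage.values())


def first_fit_mrt(templates, kernel, ii):
    """Place each instruction in the first issue slot where its template fits.

    Returns the list of issue slots, or None if some instruction cannot be
    placed in a table of size ii by this heuristic.
    """
    table = {}  # (resource, column) -> index of the instruction holding it
    placement = []
    for index, op in enumerate(kernel):
        for slot in range(ii):
            cells = [(res, (slot + off) % ii) for res, off in templates[op]]
            fits = len(set(cells)) == len(cells) and all(
                cell not in table for cell in cells
            )
            if fits:
                for cell in cells:
                    table[cell] = index
                placement.append(slot)
                break
        else:
            return None  # no free slot for this instruction
    return placement


# Hypothetical templates: each reserves one resource in its issue clock.
templates = {
    "load": [("mem", 0)],
    "store": [("mem", 0)],
    "fma": [("fpu", 0)],
}
kernel = ["load", "load", "fma", "store"]
mii = resource_mii(templates, kernel)
slots = first_fit_mrt(templates, kernel, mii)
```

Because the templates do not mention pipeline depth, the same filled table serves every latency setting, which is the property the DIT assumption provides.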
Register Requirements vs. Increasing Latency
As in the previous two subsections, when applying equation (3.1) we assume that all kernels have dependence graphs that are forests of trees. To consider the relationship between machine architecture and register requirements, it is useful to introduce the concept of a base machine, which will be used to separate out the effects of pipeline depth. [...] the result of instruction i can be issued at any time one or more clocks after instruction i, provided that there are no hardware resource conflicts.
The notation R_k(S) will be used to refer to the minimum register requirement for kernel k when S is a U-element vector. Element s_u is the pipeline depth of function unit u. When only the depth of unit u is being varied, the notation will be shortened to R_k(s_u). "Pipeline depth" and "latency" are used interchangeably in the following discussion. [...]
Proof: Because each instruction that uses u has a depth independent template there will be one set of filled MRTs that apply to all values of s_u. Let z be any integer constant. Consider two identical machines except that M_1 has s_u = z and M_2 has s_u = z + MII.
Since the kernel's dependence graph is a forest of trees, equation (3.1) can be applied to determine the register requirement. Increasing s_u by exactly MII will increase the register requirement by one for each use of function unit u. Exactly this increase will occur for each of the MRTs examined. Since the register requirement for each MRT is increased by the same amount, and the same set of MRTs may be used for both machines, the MRT that resulted in the minimum register requirement for M_1 will also do so for M_2. In other words, the optimum MRT for R_k(z) will also be optimum for R_k(z + MII).
First, increasing either latency or instruction issue bandwidth tends to increase register requirements. Increasing the pipeline depth for a function unit will increase the lifetime of each register that the function unit writes to. However, as long as sufficient parallelism and registers are available, increasing the pipeline depth will not affect performance. On the other hand, greater instruction bandwidth will tend to improve performance by reducing MII. However, more operations will be active concurrently, and thus more registers will be needed to store pending results.
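The step in the proof where deepening unit u by MII adds one register per use of u can be illustrated with the Python sketch below. Equation (3.1) is not reproduced in this text, so the sketch substitutes a stand-in model that is purely an assumption: in a schedule repeating every MII clocks, a value that stays live for L clocks occupies ceil(L / MII) registers, and L is the producing unit's pipeline depth plus a fixed schedule-dependent wait. The kernel, unit names and wait values are invented.

```python
def registers_needed(values, depth, mii):
    """Total registers for one fixed schedule under the stand-in model.

    values: one (unit, wait) pair per produced value, where wait is the
            number of clocks the value sits in its register before use.
    depth:  pipeline depth (latency) of each function unit.
    """
    total = 0
    for unit, wait in values:
        lifetime = depth[unit] + wait
        total += -(-lifetime // mii)  # integer ceiling of lifetime / mii
    return total


# Invented kernel: two values produced by the memory unit, one by the FPU.
values = [("mem", 0), ("mem", 1), ("fpu", 0)]
mii = 3

base = registers_needed(values, {"mem": 1, "fpu": 2}, mii)
deeper = registers_needed(values, {"mem": 1 + mii, "fpu": 2}, mii)
increase = deeper - base  # one extra register per use of the memory unit
```

Since ceil((L + MII) / MII) = ceil(L / MII) + 1 for every lifetime L, the increase in this model is exactly the number of values written by the deepened unit, whatever the starting depth z.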
The second observation is that the theory developed here can be used to generate code schedules that are tolerant of unpredictable runtime behavior. For example, the RS/6000 Fortran compiler assumes that a memory read hits in the cache and the data is available on the cycle after the load is issued.
Concluding Remarks
This work is based on the premise that the proper focus of performance evaluation studies for scientific computers is on optimum rather than expected performance. Optimum performance studies and metrics have earned a bad reputation because it is easy to derive trivial or unachievable performance bounds. However, realistic bounds for actual codes are quite useful for designing future compilers, architectures and machine implementations.
Achievable bounds also help applications programmers determine when, where, and how to invest e ort to optimize code, and indicate what fraction of the available performance has already been achieved.
We have developed an effective technique for specifying such performance bounds by focusing on the bandwidth provided by key machine units, as well as the specific bandwidth requirements of particular applications. These bounds have proven useful for optimizing scientific loop codes on the IBM RS/6000. The RS/6000 is generally credited with good performance on many scientific codes, including the Livermore Fortran Kernels. The application-specific performance bounds developed here suggested that the RS/6000 actually loses a great deal of its potential performance, even though the achieved performance is often impressive. Guided by the performance bounds, a few straightforward code optimizations were able to raise the average efficiency on these codes from 52.4% to 93.7% of the bound, which represents a performance speedup of 1.79. Without the performance bounds, it would have been very difficult to decide which kernels to focus effort on, what optimizations to apply, and when to cease optimization work.
It is apparent that the heuristic approaches used in the RS/6000 Fortran compilers often unroll more or less than needed to mask pipeline latency, insert redundant memory accesses, and frequently fail to deliver the best possible performance. The efficiency of both the compilation process and the compiled code would be significantly improved if the compiler [...] In general, performance is also affected by cache misses and register spills. Load/store kernel experiments are recommended for characterizing cache structure and performance.
The effect of function unit and memory latency on register requirements for optimum performance code has also been investigated. The key result here is that register requirements for many scientific kernels closely approximate a linearly increasing function of both pipeline depth and achieved performance. This relationship has proven useful for evaluating architectural tradeoffs, and will also help in compiling high performance codes.
