Modern high-performance computer platforms are capable of achieving very high code execution speeds. One way they increase performance is by taking advantage of parallelism found in algorithms. To this end, many of these systems offer multiprocessor parallelism. Furthermore, many also offer software pipelining to take full advantage of low-level, or code-level, parallelism [1]. This is parallelism present in the way machine instructions are dispatched. Also of paramount importance is that these machines take full advantage of their complicated memory systems. Most of the standard optimization techniques provide maximum performance only if the computer's memory system is being used in an efficient manner.
Most shared-memory architectures have some type of memory hierarchy. The main reason memories are implemented in this fashion is to optimize the price-performance ratio, given the widening gap between central processing unit (CPU) speed and main memory performance. CPU speeds are currently doubling about every 2 or 3 years, while the speed of main memory has historically doubled only about every decade. These tiered memories, with their nondeterministic behavior, are hard to manage and predict. This makes the job of the compiler's code generator that much more difficult. Memory systems have become so complicated on some architectures that slight changes to the memory references in a code may speed up or slow down execution by an order of magnitude.
Since registers provide fast data access, one goal of the compiler back end is to allocate and assign registers in an effective manner. The register allocator tries to assign a register to each register candidate. Since register access is very fast, the compiler should generate code that reuses these assigned registers as much as possible. To do this, the register allocator and scheduler should work closely together [2]. This is a very complicated phase-ordering problem. At the least, the scheduler should order code in a way that allows instruction-level parallelism to be exploited, and the register allocator should give top priority to assigning registers to frequently used variables. Several source-level optimizations can be performed with the goal of increasing memory locality and instruction-level parallelism and thus assisting the code generator.
The problem, however, is that when these optimizations are pursued too aggressively, they can reach a point of diminishing returns. When the compiler starts to run out of available registers, register pressure is said to be high. At the point where registers are no longer available, the register allocator must "spill" a register's content to memory to free it for other uses [3]. On tiered memory machines, such an action can be detrimental to varying degrees. If the value is written to cache, the access time is very small, but the cache manager may still have to invalidate the cache line. This operation can cause the cache line to be rewritten to main memory. Actual main memory access can be very expensive. On the Silicon Graphics (SGI) Power Challenge architecture, this delay, though it seldom reaches this point, can be as high as 90 cycles. Accordingly, the governing hypothesis for this study is that transformations that enhance memory locality and code-level parallelism are beneficial only to the point where register pressure becomes very high.
Experimental Methodology
The example codes listed in this paper were all run on the SGI Power Challenge architecture. This is a 64-bit architecture using 75-MHz MIPS R8000 processors. Thirty-two 64-bit floating-point registers are available to the assembler. The architecture is superscalar and can dispatch up to four instructions per cycle. Prefetching is not implemented in the Power Challenge architecture. This machine uses a hierarchical memory structure like the one described previously.
Loop unrolling and loop fusion were the two transformations studied in this experiment. These are common transformations, and loop unrolling especially is the most heavily used transformation for increasing instruction-level parallelism. All programs were written in C and were compiled with the SGI MIPSpro compiler version 6.0.2. Two compile options were used. The first was -O2, which turns on extensive optimization. These optimizations are conservative in that they almost always provide some speedup and maintain floating-point accuracy. The second was -O3, which is aggressive optimization. The main consequence of -O3 optimization is that it turns on software pipelining. The code scheduler attempts to pipeline innermost loops whenever possible.
A discovery made halfway through these trials led to a small change in the analysis of the results. In version 6.0.1 of the compiler system (running at the University of Delaware), the pipeline scheduler would give up if it could not generate a schedule without register spilling. The newer version of the compiler (running at the Army Research Laboratory), however, will still schedule pipelined loops with spill code introduced. The general hypothesis remains the same. The new twist is that spilling will limit pipelining usefulness.
Results

Loop Unrolling
Loop unrolling replicates the body of a loop some number of times, known as the unrolling factor. Loop unrolling can increase performance in two ways. First, it reduces loop overhead by performing fewer compare and branch instructions. Second, it increases the work performed in the resulting larger loop body by allowing more opportunity for optimization and register usage. Most of the speed increase on the SGI arises because multiplication and addition instructions may be overlapped in the multiple instruction cycle.
A simple 2-D matrix multiply code fragment was used to test unrolling effects on the R8000 processor. This code is listed in Appendix A. Four unrolled versions of the matrix multiply were implemented in different functions. There is a caveat. The author does not claim this code to be the best version of matrix multiply possible. The base version is simply straightforward and provides a good example of unrolling for memory locality. Other C codes with loop reordering and splitting will undoubtedly come closer to reaching near-theoretical peak on the SGI architecture than these versions.
The function MM_basic is the basic matrix multiply loop. The optimizer unrolled the inner loop four times when this was compiled. The hand-coded unrolling of the other functions was performed on the outer and middle loop nests. The exact unrolling can be seen in the code listing in Appendix A. The optimizer did not unroll the inner loop in these cases.
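The register-blocking effect of unrolling the outer and middle loops can be sketched as follows. This is a minimal illustration and not the Appendix A code: the function names mm_basic_sketch and mm_unroll_2x2, the matrix dimension N, and the unrolling factor of 2 are all assumptions chosen for brevity.

```c
#include <assert.h>

#define N 4 /* hypothetical matrix dimension; must be even for the 2x2 version */

/* Baseline triple loop: each multiply-add needs two loads (a[i][k], b[k][j]). */
void mm_basic_sketch(double c[N][N], double a[N][N], double b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Outer (i) and middle (j) loops unrolled by 2. Four loads now feed four
 * multiply-adds, and the four partial sums stay in scalars that the
 * compiler can keep in registers for the whole inner loop. */
void mm_unroll_2x2(double c[N][N], double a[N][N], double b[N][N])
{
    assert(N % 2 == 0); /* this sketch has no cleanup loop for odd sizes */
    for (int i = 0; i < N; i += 2) {
        for (int j = 0; j < N; j += 2) {
            double c00 = c[i][j],     c01 = c[i][j + 1];
            double c10 = c[i + 1][j], c11 = c[i + 1][j + 1];
            for (int k = 0; k < N; k++) {
                double a0 = a[i][k], a1 = a[i + 1][k];
                double b0 = b[k][j], b1 = b[k][j + 1];
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            c[i][j] = c00;     c[i][j + 1] = c01;
            c[i + 1][j] = c10; c[i + 1][j + 1] = c11;
        }
    }
}
```

Larger unrolling factors increase reuse further but also increase the number of simultaneously live scalars, which is the source of the register pressure discussed in this study.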
Various data were collected during program execution. The results are displayed in Table 1. Column one lists the function name. Columns "-O2" and "-O3" list the run times of the code compiled with the two flags, respectively. The rest of the columns pertain only to the executable compiled with -O3 optimization. "Cycles/Iteration" lists how many computer cycles were required to perform one complete iteration of the inner loop.
As evident from the timing profiles, the function MM_basic is perhaps the worst way of performing a matrix multiply. This poor performance results from the inefficient way in which memory is being utilized. The best way to check on memory performance is through profiling. Two profiling mechanisms are available on the SGI operating system: prof and pixie. Comparison of their outputs tells, on a procedure-by-procedure basis, how well the memory system is performing. In contrast to prof, pixie instruments the code with counters at the beginning and end of basic blocks. It counts only the number of cycles the program executes and does not account for cache misses, bank conflicts, etc. The abbreviated pixie output is given as follows:
[pixie output listing]
Optimizers can affect the accuracy of profiling. Therefore, the profiled executables were created with optimizations disabled. In the best case, pixie reports that MM_basic should complete in about 198 seconds. Prof shows that it is taking about 280 seconds. This shows that the current structure of the code is not working well with the memory system.
The unrolled code fragments dramatically illustrate the advantages of loop unrolling. In these cases, loop unrolling was the means to achieve register (or loop) blocking. By unrolling the various loops, the loads and stores for several array elements were greatly reduced. The memory system performed much better, as is evident from the run time as well as from the closely matched times given in the prof and pixie profiles.
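As a rough worked figure for the basic matrix multiply, assume that the entire gap between the prof and pixie times quoted above is attributable to memory-system effects that pixie does not count. This is an assumption made here for illustration, not a measurement from the study; the symbols t_prof and t_pixie are introduced only for this calculation.

```latex
\[
\frac{t_{\mathrm{prof}}}{t_{\mathrm{pixie}}}
  \approx \frac{280\ \mathrm{s}}{198\ \mathrm{s}} \approx 1.41,
\qquad
\frac{t_{\mathrm{prof}} - t_{\mathrm{pixie}}}{t_{\mathrm{prof}}}
  \approx \frac{82\ \mathrm{s}}{280\ \mathrm{s}} \approx 0.29
\]
```

Under that assumption, roughly 29% of the measured run time would be spent waiting on the memory system rather than executing instructions.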
Getting the maximum benefit from a compiler usually requires detailed knowledge of the many optional flags that control the fine points of the compiling process. The MIPSpro compiler is no different. With standard options, the compiler could not pipeline the loop body for MM_unroll_3 because the loop body was too long. The compile option -SWP:body_ins=250 was used to increase the maximum size of a loop body that would be considered for software pipelining.
Loop unrolling led to great speed increases. Unrolling with pipelining allowed the basic matrix multiply to execute at 33% efficiency. Without unrolling, efficiency is only around 10% of the maximum throughput. Loop unrolling with the goal of register blocking achieved even greater results. The software pipeliner, which allows differing loop iterations to overlap, was able to achieve speedup over standard -O2 optimization in almost every case.
Unrolling does reach a point of maximum usefulness in these test cases. With each function, more and more unrolling was done in order to promote register reuse and instruction-level parallelism. MM_unroll_2 has extensive unrolling but does not produce any spill code. The implementation of MM_unroll_3, however, produces extensive spilling. A quick check of the statistics reported in Table 1 confirms this. As expected, register spilling does hurt the speed of execution in this case. A massive number of spills and restores has been added by the scheduler, and even pipelining cannot hide the resultant delays.
In the pipeline message, there is a statement saying that 14 possible stall cycles may exist. Most stalls on this processor occur because the floating-point unit and the integer unit of the CPU become unsynchronized. Several factors may lead to this occurrence. Indirect addressing, such as a[b[i]], requires the lookup of b[i] to complete before the load/store can begin. Multidimensional arrays, prevalent in these examples, lead to similar problems. These integer unit operations, paired with the many floating-point multiplication operations, may be the reason the compiler is warning of worst-case synchronization stalls. How much of the degradation in MM_unroll_3 is attributable to stall cycles and how much is attributable to spilling is hard to determine.
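The difference between direct and indirect addressing mentioned above can be sketched in a few lines of C. The function names sum_direct and sum_indirect and the use of a summation loop are hypothetical and chosen only for illustration; they do not come from the test codes in this study.

```c
#include <stddef.h>

/* Direct access: the address of a[i] depends only on the loop index,
 * so the floating-point load can be issued without waiting on another load. */
double sum_direct(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Indirect access: the integer load of b[i] must complete before the
 * address of a[b[i]] is known, so the floating-point load waits on it. */
double sum_indirect(const double *a, const size_t *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[b[i]];
    return s;
}
```

Both functions perform the same floating-point work per iteration; only the address computation differs, which is where the integer unit becomes involved.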
Loop Fusion
Loop fusion is a process in which two or more adjacent loops are merged into a single loop. Loop fusion has the potential to increase performance by reducing loop overhead and increasing instruction-level parallelism. A somewhat contrived example was used to test fusion on the R8000 processor. The loops were deliberately designed to give variables long live ranges and hence to make it as difficult as possible for the scheduler to achieve scheduling, not to mention pipelining, without introducing some spill code. The code is listed in Appendix B. The loops in the NotFused function are named loop1, loop2, and loop3.
Table 2 lists the execution results. If the loops are not pipelined, the fused loop does indeed outperform the three separate loops. The multiple compare and branch instructions executed in the three loops can be extremely costly because they often interfere with maximum instruction issue per cycle. In this case, the reduction in loop overhead increased code speed by a factor of 1.4. The fused loops did create a small amount of spill code, but the effects seem to be negligible compared to the cycles lost on compare and branch instructions.
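The transformation itself can be sketched as follows. This is a minimal illustration and not the Appendix B code: the function names not_fused_sketch and fused_sketch, the array names, and the loop bodies are assumptions, and the sketch assumes the arrays do not overlap in memory.

```c
#include <stddef.h>

/* Three adjacent loops: three sets of compare and branch instructions,
 * and a[] and b[] are reloaded from memory in the later loops. */
void not_fused_sketch(size_t n, const double *x,
                      double *a, double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        a[i] = x[i] * 2.0;
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] + 1.0;
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* Fused version: one compare and branch per element, and the
 * intermediate values are reused from scalars instead of reloaded. */
void fused_sketch(size_t n, const double *x,
                  double *a, double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        double ai = x[i] * 2.0;
        double bi = ai + 1.0;
        a[i] = ai;
        b[i] = bi;
        c[i] = ai * bi;
    }
}
```

Fusion is legal here because iteration i of each later loop depends only on iteration i of the earlier ones. The fused body keeps more values live at once, which is how fusion can raise register pressure.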
For pipelining, however, the spill code seemed to cause a greater problem. The pipeliner reported numerous potential problems. Stalls are once again present, and this time there is a warning about the number of cycles required to deal with resources and recurrences with memory references. All of these problems seem to have a very bad cumulative effect on the final performance. This code was not written with any regard to memory locality. The three loops taken separately and pipelined performed fairly well but still could not perform as well as the fused, nonpipelined loop. The spill code in the pipelined fused loop has put extreme burdens on the memory system and has caused a severe loss of performance.
Conclusion
To be of maximum usefulness, the scheduler of a compiler must be able to take full account of the extremely complicated memory systems in most of today's shared-memory, high-performance computers. As these examples have shown, transformations that increase memory locality, reduce loop overhead, and promote instruction-level parallelism can be extremely advantageous. They are, however, closely interrelated, and promoting one often comes at the expense of another. Is there a best choice for ordering these transformations or some way of knowing how much of one to perform? Building such knowledge into the compiler will be very difficult. Optimal scheduling is itself an NP-complete problem, and predicting memory system behavior is difficult. Building extensive information about the memory system into the compiler will undoubtedly greatly increase compile time and, given the nondeterministic behavior of the memory system, may still not be totally accurate. Those codes worthy of extensive analysis and optimization will probably be best served by compilers that generate detailed messages about the actions they took, allowing the programmer to make more informed choices about optimizing source-level code structure. It seems that only through profiling and modifying code by hand can maximum performance be achieved on a per-architecture basis. Some general conclusions are noteworthy, however:
Loop unrolling is very efficient at promoting instruction-level parallelism.
Loop fusion is very efficient at removing costly compare and branch instructions and may be more efficient than pipelining in some cases.
Large loop bodies with somewhat random or erratic memory access patterns will seldom benefit from pipelining. These loops will be better off either not pipelined or distributed and then pipelined if possible.
Codes written to take the memory system into account should in most cases benefit from pipelining.
Loop unrolling to promote register reuse is efficient only up to just before the point where spill code must be introduced.
A 
