We introduce a new experimental C compiler in 
Introduction
The design of computers in general and microprocessors in particular has shown a steady increase in both performance and complexity. Advanced techniques such as pipelining and out-of-order execution have increased the design and verification effort required to create a viable product. To overcome some of these problems, hardware designers have been exploring ways to move functionality into the compiler. From RISC to current designs such as Intel's IA64, the compiler has played a greater role in simplifying the hardware while maintaining the current trend of performance improvement.
The MIRV compiler is designed to analyze trade-offs between compile-time and run-time knowledge of program behavior. MIRV enables research into this area in four ways. First, the compiler is built with a modular filter architecture. This allows the researcher to easily write optimizations and explore their placement in the phase ordering. Second, the retargetable code generator and low-level optimizer support both commercially available microprocessors and the popular SimpleScalar simulation environment. This allows both realistic performance evaluation and exploration of next-generation instruction set architectures. Third, MIRV provides an interface for program instrumentation and profile back-annotation. This allows studies of runtime behavior as well as profile-guided optimizations. Fourth, the compiler environment that we have developed around MIRV provides easy regression testing, debugging, and extraction of performance characteristics of both the compiler and the compiled code.
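As an illustration of what a modular filter architecture makes possible, the following minimal sketch treats each optimization as a function from an intermediate representation to a new one, so that the phase ordering is simply the order of a list. Everything here is hypothetical: the toy three-field instruction format and the names constant_fold, dead_code_elim, and run_pipeline are assumptions made for the example and are not MIRV's actual interfaces.

```python
# Hypothetical sketch of a filter pipeline; none of these names come from MIRV.
# Toy IR: a list of (dest, op, args) tuples for straight-line code.

def constant_fold(ir):
    """Replace an 'add' of two integer literals with a 'const'."""
    out = []
    for dest, op, args in ir:
        if op == "add" and all(isinstance(a, int) for a in args):
            out.append((dest, "const", (args[0] + args[1],)))
        else:
            out.append((dest, op, args))
    return out

def dead_code_elim(ir, live_out=("result",)):
    """Drop instructions whose destination is not needed afterwards."""
    live = set(live_out)
    kept = []
    for dest, op, args in reversed(ir):
        if dest in live:
            kept.append((dest, op, args))
            live.discard(dest)
            # Named operands (strings) become live above this point.
            live.update(a for a in args if isinstance(a, str))
    kept.reverse()
    return kept

def run_pipeline(ir, filters):
    """Apply each filter in turn; the list order is the phase ordering."""
    for f in filters:
        ir = f(ir)
    return ir

program = [
    ("t1", "add", (1, 2)),
    ("t2", "add", (3, 4)),        # never used, so it should disappear
    ("result", "mul", ("t1", 5)),
]
optimized = run_pipeline(program, [constant_fold, dead_code_elim])
```

Reordering or removing entries in the list passed to run_pipeline is the only change needed to try a different phase ordering, which is the property the filter design is meant to provide.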
In this report we introduce MIRV and compare its performance to the GCC compiler. This report also introduces a package of SPEC binary executables which are compiled with GCC and MIRV at various optimization levels. The purpose of this document is to explain the compilation and simulation environment in which the binaries were produced and to summarize the performance differences between the compiled code. Several notable results are presented.
The organization of the rest of this paper is as follows. Section 2 describes the compilation environment that we used to generate the results shown. Similarly, Section 3 outlines the simulation environment. Section 4 introduces the performance graphs shown in the appendices and Section 5 describes some interesting observations made from the performance graphs. We conclude with Section 6. The appendices contain detailed compilation and simulation results and provide additional detail on the optimizations that were performed during compilation.
Compilation Environment
We tested seven compiler configurations. The first, labeled 'SSsup', is the SimpleScalar supplied binary, available at the SimpleScalar web site [2]. The next three configurations were compiled in our test environment with the GCC 2.7.2.3 port to the PISA instruction set. This tool is available from UC-Davis [7]. We also used a pre-release version of binutils 2.9.5 for the assembler and linker. These were slightly modified from sources at Cygnus [8]. The final three configurations were compiled with MIRV and used the same assembler and linker as the GCC builds.
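When scripting builds or simulations, the seven configurations can be enumerated rather than listed by hand. The sketch below is an illustration only: the label spelling follows the names used in the report's graph legends (for example gccO0 and mirvO2), and the helper names configurations and run_labels are hypothetical rather than part of any script distributed with MIRV.

```python
# Hypothetical helpers; the label format mirrors the graph legends
# (e.g. "art00-gccO0"), not any script shipped with MIRV.
OPT_LEVELS = ("O0", "O1", "O2")

def configurations():
    """Return the seven configuration labels: SSsup, then GCC and MIRV at each level."""
    labels = ["SSsup"]
    for compiler in ("gcc", "mirv"):
        labels.extend(f"{compiler}{level}" for level in OPT_LEVELS)
    return labels

def run_labels(benchmark):
    """Pair one benchmark name with every configuration label."""
    return [f"{benchmark}-{config}" for config in configurations()]

print(run_labels("art00"))
```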
The MIRV compiler implements the most common optimization passes. The exact order of application of the optimization filters is given in Table 4 in Appendix A. For comparison, Appendix B contains the optimizations applied in the GCC compiler.
MIRV always applies register coalescing and graph coloring register allocation in the backend, regardless of the optimization level. The allocator is implemented with the standard graph coloring algorithm except that it does not implement live range splitting or rematerialization [3]. This means that it is not fair to compare GCC -O0 with MIRV -O0, since GCC does not perform register allocation at the -O0 optimization level.
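For readers unfamiliar with graph coloring allocation, the following is a minimal sketch of the textbook simplify/select scheme, with no live range splitting or rematerialization: a value that cannot be simplified is simply marked for spilling. It is not MIRV's allocator. The function name color_graph, the dictionary representation of the interference graph, the highest-degree spill heuristic, and the omission of coalescing are all assumptions made to keep the example short.

```python
# Hypothetical sketch of simplify/select graph coloring; not MIRV's code.
def color_graph(interference, k):
    """Color a symmetric interference graph with k registers.

    interference maps each virtual register to the set of registers it
    conflicts with. Returns (assignment, spilled).
    """
    graph = {n: set(adj) for n, adj in interference.items()}
    work = {n: set(adj) for n, adj in graph.items()}
    stack, spilled = [], []

    # Simplify: repeatedly remove a node with fewer than k neighbours.
    while work:
        candidates = [n for n in sorted(work) if len(work[n]) < k]
        if candidates:
            node = candidates[0]
            stack.append(node)
        else:
            # No trivially colorable node: spill the highest-degree one
            # (a simplistic heuristic chosen for this example).
            node = max(sorted(work), key=lambda n: len(work[n]))
            spilled.append(node)
        for neighbour in work[node]:
            work[neighbour].discard(node)
        del work[node]

    # Select: pop nodes and give each the lowest register not used
    # by an already-colored neighbour.
    assignment = {}
    while stack:
        node = stack.pop()
        used = {assignment[m] for m in graph[node] if m in assignment}
        assignment[node] = min(c for c in range(k) if c not in used)
    return assignment, spilled

# A triangle a-b-c plus d, which conflicts only with a.
example = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
regs3, spills3 = color_graph(example, 3)   # three registers: no spills
regs2, spills2 = color_graph(example, 2)   # two registers: one value spills
```

With two registers the triangle cannot be colored, so one of its members is spilled; an allocator with live range splitting or rematerialization would have further options at that point.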
Simulation Environment
The SimpleScalar 3.0 sim-outorder simulator was used with default parameters [4]. Table 1 shows the relevant default parameter values. All simulations were performed in little-endian mode.
We used the SPEC95 integer benchmarks and several of the SPEC00 benchmarks [5, 6]. All benchmarks were run to completion on the data set indicated in the table; we modified the supplied input sets to allow the simulations to complete in a reasonable amount of time (about 100 million instructions). The benchmarks are described in Table 2 and the exact input sets are shown in Table 3.
SPEC Performance Graphs
The full set of graphs comparing MIRV to GCC can be found in Appendices C and D. These graphs show various metrics for each of the eight SPEC95 benchmarks and selected SPEC00 benchmarks. Table 6 explains each of the graphs and any special notes on how the data was gathered. For the SPEC95 benchmarks, we include the PISA binary supplied on the SimpleScalar website [2] as a comparison point. These benchmarks were compiled with the arguments "-O2 -funroll-loops". There are no supplied binaries for SPEC00 benchmarks, so no information appears for those in our graphs. The full set of results is attached in Appendix F.
The only anomalous behavior we observed during simulations was in the vortex benchmark, where we discovered that the SimpleScalar supplied binary had been compiled with the flag '-DOPTIMIZE'. The GCC and MIRV binaries that we initially built were not compiled with this flag because we did not know about it. The flag is a preprocessor definition that turns on various optimizations in the vortex source code itself. We added '-DOPTIMIZE' to our builds and the anomaly was resolved.
Performance Observations
Several interesting observations can be made from the data shown in Appendices C and D. These observations fall into several categories, which are examined in the following subsections. It is important to keep in mind the simulator configuration shown in Table 1.
Comparing MIRV to GCC
GCC performs no register allocation at -O0, whereas MIRV applies graph coloring allocation and register coalescing (simple copy propagation) at every optimization level. Since GCC and MIRV unoptimized code is otherwise very similar, we can use these two bars to estimate the importance of register allocation. For example, MIRV -O0 execution times are often 20% faster than GCC -O0 and sometimes much faster. This benefit is solely due to register allocation. MIRV -O1 and -O2 perform a little worse than GCC. This is borne out in the graphs on cycles and dynamic counts of instructions, memory references and branches. The dynamic instruction mix graphs (Appendix E) show that MIRV is uniformly higher than GCC in all categories of instructions, particularly in memory operations. When MIRV produces better code than GCC, it is often because it has reduced the number of 'other' instructions (this happens in go, ijpeg, vortex, and vortex00).
The graphs show that dynamic instruction count is often a very good indication of the number of cycles the benchmark will take to execute. However, there are several counter-examples. For instance, the MIRV -O2 instruction count for perl is 2% worse than for GCC -O2 but the binary executes 9.6% faster. The opposite happens on go.
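The perl counter-example can be read through the identity cycles = instructions / IPC: a binary that retires more instructions in fewer cycles must have a higher IPC. The short calculation below works this out under one stated assumption, namely that "9.6% faster" means 9.6% fewer cycles than the GCC -O2 binary; the function name implied_ipc_ratio is hypothetical.

```python
# Worked arithmetic for the perl example; assumes "9.6% faster" means
# 9.6% fewer cycles, which the report does not state explicitly.
def implied_ipc_ratio(instr_ratio, cycle_ratio):
    """IPC ratio (MIRV / GCC) implied by instruction and cycle ratios.

    IPC = instructions / cycles, so the ratio of IPCs is the ratio of
    instruction counts divided by the ratio of cycle counts.
    """
    return instr_ratio / cycle_ratio

instr_ratio = 1.02          # MIRV -O2 executes 2% more instructions
cycle_ratio = 1.0 - 0.096   # and, by assumption, takes 9.6% fewer cycles
ipc_ratio = implied_ipc_ratio(instr_ratio, cycle_ratio)
print(f"implied IPC ratio: {ipc_ratio:.3f}")
```

Under that assumption the MIRV binary would sustain roughly 13% higher IPC than the GCC binary on perl, which is why instruction count alone does not predict the cycle count here.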
Comparing SPEC95 to SPEC00
There are several characteristics that differentiate SPEC95 from SPEC00. IPC ranges from 1 to 2 for SPEC95 and 0.6 to 1.8 for SPEC00. The average number of instructions per branch is 4 to 6 for SPEC95 and 4 to 8 for SPEC00 (ignoring ijpeg and the unoptimized binaries).
SPEC00 instruction cache miss rates are very low except for the vortex benchmark. The instruction cache simulated in this work is 16KB. The floating point benchmarks art and equake have very small source code: each consists of a single source file, with 1270 and 1513 lines of source code, respectively. The integer benchmark mcf is similarly small at 2412 lines of code. These benchmarks are similar to compress, ijpeg, and li95 in the SPEC95 suite. The other SPEC95 benchmarks have a much higher miss ratio than SPEC00. SPEC00 vortex has a slightly higher miss rate than the SPEC95 version of vortex.
SPEC00 data cache miss rates are much higher than SPEC95. Whereas SPEC95 miss rates are generally less than 2% (5% for compress), SPEC00 miss rates are usually around 4%. art is a particularly notable example with up to a 40% miss rate. Within a given compiler, optimization generally makes the data miss rate worse. This is to be expected as optimizations cause more efficient use of registers, thus eliminating the "easy" load and store operations and leaving those that are essential to the algorithm. A prime example of this is the art benchmark, where the data cache miss rate increases from 15% to 40% as optimizations are enabled from -O0 to -O2. At the same time, however, the number of data references is cut by a factor of three. The low-hanging fruit has been harvested and the "essential" memory accesses remain in the benchmark. The unified L2 cache suffers a higher miss rate in SPEC00 as well.
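The art numbers can be checked with a short calculation: the absolute number of misses is the number of references multiplied by the miss rate, so a higher rate on a third as many references need not mean more misses. The reference count of 300 below is a hypothetical placeholder (only the factor of three and the two miss rates come from the text), and the function name misses is likewise an assumption.

```python
# Worked arithmetic for the art example. The reference count is a made-up
# placeholder; only the 3x reduction and the miss rates come from the text.
def misses(references, miss_rate):
    """Absolute miss count = data references * miss rate."""
    return references * miss_rate

refs_o0 = 300.0              # hypothetical data references at -O0 (millions)
refs_o2 = refs_o0 / 3.0      # references cut by a factor of three at -O2

misses_o0 = misses(refs_o0, 0.15)   # 15% miss rate at -O0
misses_o2 = misses(refs_o2, 0.40)   # 40% miss rate at -O2
relative = misses_o2 / misses_o0
print(f"misses at -O2 relative to -O0: {relative:.2f}")
```

With these figures the absolute miss count falls slightly (to about 0.89 of the -O0 value) even though the miss rate nearly triples, which is consistent with the view that optimization removes mostly the references that would have hit.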
The SPEC00 binaries presented here are much smaller than the binaries for SPEC95. This is one reason that the instruction cache performs so much better for SPEC00. On the other hand, the instruction window is much busier in SPEC00 than it is in SPEC95, as shown in the register-update-unit utilization graph. One might expect smaller programs to make less use of the instruction window, but because of the high data cache miss rates it appears that instructions are held up longer in the window.
To summarize the differences between SPEC00 and SPEC95, we saw that IPC and data cache performance were lower for the newer benchmarks, but that these programs exercised the instruction cache less because of their smaller code size. This points out the importance of selecting the appropriate set of benchmarks for a given architectural study. Instruction cache studies should probably avoid many of the SPEC00 benchmarks because they do not stress the instruction cache. On the other hand, data cache studies would emphasize SPEC00 because it strains the data side of the caching system much more than SPEC95. SPEC00 also seems to require a bigger instruction window to avoid window-full stalls. The two suites together seem to provide a nice complement of characteristics; most studies should use both suites.
Comparing Optimization Characteristics
MIRV and GCC optimizations exhibit similar characteristics across most of the benchmarks, but there are exceptions. For example, -O2 optimization usually produces code that runs slightly faster than -O1 code. However, in the case of the vortex benchmark, -O2 code is slightly worse than -O1 code for MIRV. This is due to register promotion, which in this case increases the register pressure to the point of introducing additional spill code.
Branch prediction accuracy is generally much worse for unoptimized binaries. One reason for this is simply the larger number of branches that are executed (20% fewer branches are executed in -O2 than in -O0). For both SPEC95 and SPEC00, prediction accuracies range from roughly 82% to 98% and usually optimizations increase prediction accuracy by 4% or more.
GCC optimizations usually increase the number of instructions retired per cycle (IPC) but for MIRV the opposite is the case.
Both compilers typically demonstrate a reduction in instruction cache miss rate with optimizations enabled. For vortex, however, MIRV optimizations result in an increase in instruction cache miss rate, while GCC optimizations improve instruction cache performance for this benchmark. For the li benchmark, the reverse occurs.
Obtaining and Installing the Binaries
The version 1 binaries used to produce the data in this report are available on the MIRV website [1], including the binaries supplied on the SimpleScalar website [2]. The README file there explains how to install the binaries.
Conclusion
This report has introduced the MIRV compiler. As its performance improves, we encourage architecture researchers to use these binaries in conjunction with the SimpleScalar simulation environment as examples of highly optimized programs. As they evolve, these will include advanced optimizations that are not available in GCC and so should be more representative of state-of-the-art compilation techniques.
Appendix B. GCC Optimizations
The table shows the optimization sequence when '-O3 -funroll-loops' is turned on. The following flags are enabled: '-fdefer-pop -fomit-frame-pointer -fcse-follow-jumps -fcse-skip-blocks -fexpensive-optimizations -fthread-jumps -fstrength-reduce -funroll-loops -fpeephole -fforce-mem -ffunction-cse functions -finline -fcaller-saves -fpcc-struct-return -frerun-cse-after-loop -fschedule-insns -fschedule-insns2 -fcommon -fgnu-linker -mgas -mgpOPT -mgpopt'. The table is somewhat incomplete because of the lack of documentation on GCC internal operations.
[Figure: dynamic instructions (millions), broken down into loads, stores, branches, and other, for art00, equake00, gzip00, mcf00, vortex00, and vpr00 under each configuration (SSsup, gccO0, gccO1, gccO2, mirvO0, mirvO1, mirvO2).]
[Figure: bytes (millions) for the SPEC95 benchmarks compress95, gcc95, go, ijpeg, li95, m88ksim, perl, and vortex.]

Appendix F. Detailed Results
