Do SPECmarks (the figures of merit obtained from running the SPEC benchmarks under certain specified conditions) accurately indicate the performance to be expected from real, live work loads? We measured miss ratios for the entire set of SPEC92 benchmarks for a variety of CPU cache configurations. We found instruction cache miss ratios in general, and data cache miss ratios for the integer benchmarks, to be quite low. Data cache miss ratios for the floating-point benchmarks are more in line with published measurements for real work loads.
SPEC benchmarks1 have become such an important measure of CPU performance that some system developers are parameterizing their designs to maximize SPEC benchmark performance, even when this might lead to lower performance on other, perhaps more realistic, work loads. Similarly, compiler writers have been concentrating on producing good code for the frequently executed inner loops of some of the SPEC benchmarks. Several factors, including strong industrial support for the System Performance Evaluation Consortium, the realistic nature of the benchmarks, and acceptable code portability, have led to the wide use of these programs for benchmarking purposes, and their consequent influence on system design.
SPEC benchmarks are a selection of nontrivial programs assembled to provide a standard set of realistic benchmarks for intersystem comparisons. See Price2 and Hinnant3 for a discussion of the many problems with the benchmarking situation prior to SPEC. To improve the verification and reproducibility of results, SPEC benchmark results must include a description of any source code modifications, compiler and operating system release numbers, machine characteristics, and most other factors that can affect the reported results.
The SPEC92 benchmark suite consists of six integer-intensive C programs (compress, eqntott, espresso, gcc, sc, and xlisp), and 14 floating-point-intensive programs (alvinn, doduc, ear, fpppp, hydro2d, mdljdp2, mdljsp2, nasa7, ora, spice, su2cor, swm256, tomcatv, and wave5). The SPEC benchmarking procedure is to run each program to completion on the target system, with only one user process active. The metric for each program is then the ratio of the runtime of the same program on a DEC VAX 11/780, as measured originally at the start of the SPEC effort, to the runtime on the target system, so that a faster system yields a larger ratio. The geometric mean of those ratios over the integer and floating-point-intensive programs yields the SPECint92 and SPECfp92 respectively, which are the figures of merit.
Considerable effort has been expended on creating computer systems (hardware and software) to optimize SPEC benchmark results. The recent very high benchmark results for the matrix300 SPEC89 program demonstrated the success of these efforts, and forced SPEC to exclude matrix300 from the SPEC92 benchmark release. These efforts raise two questions. In what ways should the system be designed to perform well on the SPEC benchmark suite? Is this a good idea?
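The figure-of-merit calculation described above can be sketched in a few lines. This is only an illustration: the program names and runtimes are hypothetical, and the sketch assumes each ratio is taken as reference (VAX 11/780) time over measured time, so that a larger value means a faster system.

```python
import math

def spec_ratio(reference_seconds, measured_seconds):
    """Per-program ratio: reference machine time over target system time."""
    return reference_seconds / measured_seconds

def geometric_mean(values):
    """Geometric mean computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical runtimes in seconds: (reference machine, target system).
runtimes = {
    "prog_a": (1000.0, 50.0),
    "prog_b": (2000.0, 25.0),
    "prog_c": (4000.0, 400.0),
}

ratios = [spec_ratio(ref, run) for ref, run in runtimes.values()]
specmark = geometric_mean(ratios)
print(ratios, round(specmark, 2))
```

Because the geometric mean multiplies the ratios, a very large ratio on a single program (as happened with matrix300) raises the overall figure noticeably even when the other programs are unchanged.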
One important aspect of CPU performance, and probably the most important of the architectural aspects (as opposed to technology parameters, such as circuit speed), is the performance of the memory hierarchy. SPEC benchmark results are quite sensitive to cache size. In terms of the SPEC benchmarks, our two questions change. What miss ratios can be expected when running the SPEC benchmarks on a machine with a cache of a given design? Are these miss ratios comparable to those for "typical" user work loads, or some definition of "typical"?
We present measurements of the cache miss ratios of the entire SPEC92 benchmark suite and comment on their potential use in the design of caches and memory hierarchies. We also compare the SPEC cache miss ratios to design target miss ratios, miss ratios measured using hardware monitors at Amdahl6 and on DEC VAX-series machines,7,8 miss ratios observed from very long address traces,9 and other miss ratios that include operating system and multiprogramming behavior. Note that miss ratios for multiprogrammed work loads with significant operating system activity are known to be high. We find that the miss ratios for the SPEC benchmarks are generally lower than should be expected from multiprogrammed work loads.
SPEC cache performance
We compiled and ran the SPEC programs on DECstations that contained the Mips R2000 and R3000 microprocessors running version 4.1 of the DEC Ultrix operating system. We used version 2.0 of the C compiler and version 2.1 of the Fortran compiler with the optimization level according to the SPEC Makefiles. We then used the Mips Pixie tool to generate address traces to feed directly to the Tycho cache simulator.
Pixie modifies the compiled code to generate a trace record for each load, store, and basic block entry; it then constructs trace records for all instruction fetches from the basic block records. Tycho uses algorithms that, for a given block size, simulate all cache sizes and associativities in a single pass through an address trace. Note that since our traces are derived from the Mips architecture, different results will be obtained for other CPUs and other compilers.
We varied cache size from 1 Kbyte to 1 Mbyte, set size from one (direct-mapped) to eight, and block size from 16 to 256 bytes. All caches used the Least Recently Used replacement algorithm and the lowest order available address bits to select the set. We simulated instruction, data, and unified caches, without periodic cache flushing, as the SPEC benchmarks are typically run in a uniprogrammed environment. Miss ratios represent the complete execution of a benchmark and include start-up as well as steady-state effects.
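The cache organization just described (LRU replacement, set selected by the lowest-order available address bits, no flushing) can be illustrated with a minimal simulator. This sketch is not Tycho: it handles one cache configuration per pass rather than all sizes and associativities at once, and the function name and the synthetic trace are assumptions made for illustration.

```python
from collections import OrderedDict

def simulate_cache(addresses, cache_bytes, block_bytes, ways):
    """Return (misses, references) for one set-associative LRU cache."""
    num_sets = cache_bytes // (block_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    references = 0
    for addr in addresses:
        references += 1
        block = addr // block_bytes
        lines = sets[block % num_sets]   # low-order block-address bits pick the set
        if block in lines:
            lines.move_to_end(block)     # hit: mark as most recently used
        else:
            misses += 1                  # miss: counted from a cold cache
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True
    return misses, references

# Hypothetical trace: two sequential passes over 2 Kbytes of 4-byte words.
trace = [a for _ in range(2) for a in range(0, 2048, 4)]

for size in (1024, 4096):
    misses, refs = simulate_cache(trace, cache_bytes=size, block_bytes=16, ways=1)
    print(size, misses / refs)
```

With the 1-Kbyte cache the second pass finds nothing left from the first, so the miss ratio stays at one miss per 16-byte block (0.25); with 4 Kbytes the whole array fits and only the start-up misses remain (0.125).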
The use of Pixie to generate address traces allows simulation of only user, and not system, references, and our data is for user code only. Table 2 shows the user and system times for an execution of each of the benchmarks when run on a DECstation 5000/240 Mips processor-based workstation. The system time accounts for 1.5 percent of the total runtime for the benchmark suite, and the linear average of the percentage of system time for each benchmark is 2.5 percent. The fraction of system time is sufficiently low that we believe user-state-only measurements of cache miss ratios are a very accurate approximation of the miss ratios when they include both user and system state memory accesses.
Table 2 also lists the number of instruction, data, and total user memory references made by each program. The SPEC92 release specifies that compress is run 20 times with the same input and gcc is run four times with the same input. The number of references we report here corresponds to one of these runs. Note that the trace reflects a 4-byte memory interface; the trace would be different for a different memory interface width. Note also that the trace includes only actual program loads, stores, and instruction fetches. It does not include the extra memory activity such as instruction prefetch that would occur on most machines. For analysis of some of the benchmark programs and their execution behavior, see Saavedra-Barrera and Smith14 and Saavedra-Barrera.15
To increase our confidence in our results, we compared them with two other studies that ran the SPEC benchmarks on a Mips EO00 microprocessor. Pnevmatikatos and Hill16 presented cache miss ratios for the four integer SPEC89 benchmarks (eqntott, espresso, gcc, and xlisp). They used a different compiler (gcc) and a tracing methodology that excluded library references. Nevertheless, most miss ratio differences are less than 0.01.
In a few cases, however, a seemingly small miss ratio difference translates into a substantial relative change.
We are inclined to place the most confidence in the results presented here, since this analysis has used much more mature and sophisticated compilers. But the comparison demonstrates that cache miss ratios, instruction counts, and related measures are, as might be expected, sensitive to the compiler used. We must thus caution readers that their actual mileage may vary.
Cmelik et al. give instruction counts for the SPEC89 benchmarks. With one exception, spice, their counts are close to ours. We cannot explain the difference for spice, although simulation runs at both Berkeley and Madison yielded consistent results.
Simulating these caches required 200 to 400 microseconds of CPU time per memory reference in each trace. Assuming an average 300 µs per memory reference, simulating all 20 SPEC benchmarks requires some 980 days or nearly 40 months of CPU time. Including false starts, simulation errors, and operating system bugs, we used three to four years of machine time to compute our results. This type of measurement would not have been possible if it had been necessary to pay for CPU time on a time-shared machine. (Workstations aren't free, but they are a lot cheaper than the same number of cycles on a time-shared machine.) With seven machines available for running simulations at Berkeley and Madison, we generated these results in less than seven months of calendar time.
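The cost figures above can be checked with a short back-of-the-envelope calculation. The sketch below takes the 980-day and 300-microsecond figures at face value and derives the implied number of simulated references; the month length (365.25/12 days) and the assumption that seven machines are kept perfectly busy are conventions chosen here, not values from the study.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 365.25 / 12       # assumed convention for converting days to months

cpu_days = 980                     # figure quoted in the text
seconds_per_reference = 300e-6     # assumed average of the 200 to 400 microsecond range
machines = 7

total_seconds = cpu_days * SECONDS_PER_DAY
implied_references = total_seconds / seconds_per_reference
cpu_months = cpu_days / DAYS_PER_MONTH
ideal_calendar_months = cpu_months / machines

print(f"{implied_references:.3e} references")
print(f"{cpu_months:.1f} CPU months, {ideal_calendar_months:.1f} calendar months on {machines} machines")
```

Under these conventions 980 days comes to roughly 32 months, so the "nearly 40 months" figure should be read as approximate. The ideal calendar time of under five months is consistent with the reported "less than seven months" once false starts and reruns are included.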
Results
In our simulations we varied the block (line) size from 16 to 256 bytes and set-associativity from one (direct mapping) to eight for instruction, data, and unified caches. For brevity, we do not include the complete set of results, but readers may access an electronic copy via anonymous FTP:
    ftp reggiano.cs.wisc.edu (or: ftp 128.105.8.27)
    reply to login: anonymous
    reply to passwd: type any non-null string here
    cd SPEC92
    get README
    get fullmissratios.ascii
    get fullmissratios.postscript.Z
    bye
We first examine instruction cache miss ratios for the different programs. For alvinn, compress, ear, eqntott, hydro2d, mdljdp2, mdljsp2, nasa7, ora, swm256, and tomcatv, instruction cache miss ratios are very low. They are generally less than 0.0001 for caches as small as a few kilobytes. These programs spend much of their execution time in a few small routines; the SPEC89 program matrix300, for example, spends about 99 percent of its execution time in one small basic block in the code.14,15,18 Miss ratios for sc, espresso, su2cor, xlisp, spice, and wave5 are only slightly larger, as miss ratios again fall below 0.0001 for cache sizes as small as 16 or 32 Kbytes. Instruction cache miss ratios are largest for doduc, gcc, and fpppp, yet are well below half a percent for caches as small as 64 or 128 Kbytes. None of the SPEC benchmarks makes significant use of more than 128 Kbytes of instruction cache.
Miss ratios for data caches are larger, especially for several of the floating-point Fortran benchmarks, but for the most part are quite low as cache size approaches 1 Mbyte. Miss ratios for ora, fpppp, xlisp, and doduc are the lowest among the SPEC suite, dropping below 1 percent for caches as small as 16 or 32 Kbytes, and falling below 0.0001 for a 64-Kbyte cache. Results for ear, mdljdp2, and espresso are also low, especially when the set size is greater than one, and somewhat larger for direct-mapped caches. Among the integer programs, compress, eqntott, and gcc exercise fairly large data caches; miss ratios remain above 1 percent until cache size reaches 512 Kbytes.
The floating-point programs nasa7, spice, su2cor, swm256, tomcatv, and wave5 exhibit the largest data cache miss ratios. Miss ratios for su2cor, nasa7, spice, and wave5 remain at several percent until the cache size reaches 1 Mbyte, at which point they fall below 1 percent. Swm256 and tomcatv require extremely large caches when the cache block size is small: their data cache miss ratios are over 12 percent and 6 percent respectively for a 1-Mbyte cache at a 16-byte block size. Each successive doubling of block size at 1 Mbyte reduces data cache miss ratios by almost half, and miss ratios do become less than 1 percent for a 128-byte block for tomcatv and a 256-byte block for swm256.
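The block-size arithmetic for swm256 and tomcatv can be worked through explicitly. The sketch below is a simplification rather than measured data: it starts from exactly 12 and 6 percent at a 16-byte block (the text says "over") and halves the miss ratio exactly at each doubling (the text says "almost half"). The function names are introduced here for illustration.

```python
def projected_miss_ratio(start_ratio, start_block, block, factor=0.5):
    """Scale start_ratio by `factor` once per doubling from start_block to block."""
    doublings = (block // start_block).bit_length() - 1
    return start_ratio * factor ** doublings

def first_block_below(start_ratio, threshold, blocks=(16, 32, 64, 128, 256)):
    """Smallest block size whose projected miss ratio is under threshold, else None."""
    for block in blocks:
        if projected_miss_ratio(start_ratio, blocks[0], block) < threshold:
            return block
    return None

# Assumed starting points at a 1-Mbyte cache and 16-byte blocks.
for name, start in (("swm256", 0.12), ("tomcatv", 0.06)):
    print(name, first_block_below(start, 0.01))
```

Under these assumptions the projection crosses 1 percent at 256 bytes for swm256 and 128 bytes for tomcatv, matching the block sizes quoted above.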
Unified (data and instruction) cache miss ratios usually fall between instruction and data cache miss ratios, as the strong locality in instruction references offsets the weaker locality in data references. We observed several instances in which unified cache miss rates were higher than corresponding data cache miss rates (doduc, fpppp, ora, xlisp). This behavior occurs mainly at larger cache sizes coupled with low associativities, and when separate instruction and data cache miss ratios have fallen to nearly zero. The low associativity causes instruction and data references to conflict for cache sets, while such conflicts do not occur in separate instruction and data caches. Note that a split, direct-mapped instruction/data cache pair is more like a two-way set-associative unified cache than a direct-mapped unified cache.
It is worth noting that there are a few anomalies in the data for the effect of associativity on miss ratio. Generally, miss ratios decrease with increased degrees of set associativity, since the probability of mapping conflicts decreases. It is possible, however, that miss ratios can increase with increasing associativity if certain reference patterns are present in the memory reference string. We noted just that effect at one or more data points for the fpppp, spice, tomcatv, and doduc miss ratios.
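One well-known reference pattern that produces this kind of anomaly under LRU replacement is a loop over slightly more blocks than the cache holds. The sketch below is a constructed example, not a reproduction of the behavior seen in fpppp, spice, tomcatv, or doduc; the helper name and the five-block loop are assumptions.

```python
from collections import OrderedDict

def lru_misses(block_trace, num_blocks, ways):
    """Count misses for an LRU cache that holds num_blocks blocks in total."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_trace:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # hit: refresh recency
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict least recently used
            lines[block] = True
    return misses

# Hypothetical reference string: loop over five blocks, ten times (50 references).
trace = [b for _ in range(10) for b in range(5)]

for ways in (1, 2, 4):
    print(ways, lru_misses(trace, num_blocks=4, ways=ways))
```

For this four-block cache the direct-mapped organization confines the conflict to one set (23 misses), while the two-way and fully associative organizations let the loop evict each block just before it is reused (32 and 50 misses).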
Evaluation
Let's compare some other studies before discussing whether the SPEC applications make suitable cache benchmarks.
Smith5 includes several measurements taken with a hardware monitor at Amdahl Corporation on various models of the Amdahl 470V machines. These machines ran a standard internal benchmark containing supervisor, commercial, and scientific code. Results showed that supervisor state miss ratios were much higher than problem state miss ratios. Also, the miss ratio for each of the user and supervisor states could be approximated by an equation of the form $m = a k^{b}$, where a and b are constants and k is the cache size in kilobytes.
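A power-law curve of this form is fixed by two points, which gives a quick way to see how such a fit behaves. The sketch below uses hypothetical miss ratios, not the Amdahl measurements, and the function names are introduced for illustration.

```python
import math

def fit_power_law(k1, m1, k2, m2):
    """Solve m = a * k**b exactly through two (cache size, miss ratio) points."""
    b = math.log(m2 / m1) / math.log(k2 / k1)
    a = m1 / k1 ** b
    return a, b

def miss_ratio(a, b, k):
    """Evaluate the fitted curve at cache size k (in kilobytes)."""
    return a * k ** b

# Hypothetical points: 8 percent at 16 Kbytes, 4 percent at 64 Kbytes.
a, b = fit_power_law(16, 0.08, 64, 0.04)
print(round(a, 4), round(b, 4), round(miss_ratio(a, b, 256), 4))
```

A negative exponent b means the miss ratio falls as the cache grows; in this constructed case every fourfold increase in cache size halves the miss ratio.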
Published hardware measurements provide cache miss ratios for VAX 11/780 and VAX 8800 computers. The 11/780 has an 8-Kbyte, write-through unified cache with an 8-byte block size and a set size of two. The 8800 has a 64-Kbyte, write-through, direct-mapped unified cache with a 64-byte block size. In both cases, the time-shared work loads were measured at DEC in an engineering environment.
Smith introduced the design target miss ratios (DTMRs) to represent typical levels of performance, averaged over a wide class of work loads, and ranging from workstations to time-shared mainframes. (In practice, miss ratios for workstations would probably be lower, and for large time-shared mainframes would probably be higher.) Smith synthesized them from real (hardware monitor) measurements that ex- [...] and to set-associative caches.
Agarwal et al.10 presented miss ratios that include the effects of operating system references and multiprogramming by using microcode to capture address traces from multitasked machines. These effects can more than double miss rates from those measured in a uniprogrammed, user-only environment. They used a varied set of 20 applications programs.
Borg et al.9 generated miss ratios for very long address traces using tools similar to our own; those traces were over 12 billion memory references long. The traces were used to evaluate the performance of a variety of caches. They used three individual traces and another that was a multiprogramming work load consisting of several jobs.
It is important to note that although some of these studies are rather old, we have been unable to find newer or better data. Many other studies used traces, but we believe those other work loads are no more representative. Were any of these real measurements to be repeated today, the programs and memories would be larger, and the miss ratios (for a given size cache) would be higher.
[...] floating-point, and complete SPEC92 suite across the entire range of simulation parameters. These averages represent the unweighted arithmetic mean of individual program miss ratios. The average miss ratio is calculated using the formula
$$\text{average miss ratio} = \frac{1}{n} \sum_{i=1}^{n} \frac{M_i}{R_i}$$
where $n$ is the total number of programs, $M_i$ is the number of misses for program $i$, and $R_i$ is the number of references for program $i$.
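The difference between an unweighted mean of per-program miss ratios and a miss ratio pooled over all references is easy to see numerically. The miss and reference counts below are hypothetical, and the function names are introduced here for illustration.

```python
def unweighted_mean_miss_ratio(counts):
    """Arithmetic mean of per-program miss ratios; counts is a list of (misses, references)."""
    return sum(m / r for m, r in counts) / len(counts)

def pooled_miss_ratio(counts):
    """Total misses over total references, which weights programs by trace length."""
    return sum(m for m, _ in counts) / sum(r for _, r in counts)

# Hypothetical programs: a short trace with 5 percent misses, a long one with 2 percent.
unequal = [(50, 1_000), (2_000, 100_000)]
# Same miss ratios, but both programs run for the same number of references.
equal = [(50, 1_000), (20, 1_000)]

print(unweighted_mean_miss_ratio(unequal), pooled_miss_ratio(unequal))
print(unweighted_mean_miss_ratio(equal), pooled_miss_ratio(equal))
```

With unequal trace lengths the pooled figure is dominated by the long program, while the unweighted mean treats each program equally. When every program contributes the same number of references the two measures coincide.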
The unweighted arithmetic mean of the program miss ratios gives the miss ratio of a work load in which each program runs for the same number of references. Figures 1 and 2 show average miss rates plotted against the design target miss ratios (labeled DTMR) and primary cache miss ratios from Borg, Kessler, and Wall for a multiprogrammed work load (labeled Borg). Unfortunately, miss ratios from the other studies are not available for separate instruction and data caches, but are plotted against SPEC unified cache results in Figure 3. Previous results based on different block sizes (VAX 11/780, VAX 8800, Agarwal et al.10) or different associativities (VAX 8800, Borg et al.9) have been adjusted for these parameters using ratios of miss ratios from prior studies.
A look at Figure 1 suggests that instruction cache miss ratios for the SPEC benchmarks are unusually low. They are as low as one fourth of the design target miss ratios and one half of Borg's miss ratios.
In Figure 2 we see that data cache miss ratios for the SPEC integer and floating-point benchmarks bracket the DTMRs for small cache sizes and are close for the larger sizes for which the DTMRs are defined. All of them are above the Borg et al. measurements. Both sets of SPEC benchmarks approach zero miss ratio for moderately large caches. We would not expect the miss ratios in a time-shared system to approach zero until the cache was as large as main memory because of misses due to task switching (cold start). Were the cache the same size as main memory, misses would appear as I/O activity, but would still occur.
Figure 3 contains unified cache measurements from the various other studies in addition to SPEC and design target miss ratios. These include Amdahl 470 supervisor and user state miss ratios (plots labeled 470.sup and 470.user), VAX 11/780 and VAX 8800 miss ratios (plots labeled VAX.780 and VAX.8800), and miss ratios from Agarwal et al.10 for a multiprogramming level of 3 (plots labeled Agarwal.mul3). (We plot the Amdahl data from the fitted curve in Smith; the original data points are not available.) Note that the VAX 8800 data was collected from a very heavily used time-shared system. The Amdahl 470 supervisor data was collected from the execution of a standard internal Amdahl commercial work load. For both the VAX 8800 and Amdahl data, the level of supervisor activity was quite high. Following in decreasing order of miss ratio are the DTMRs and Agarwal's multiprogrammed miss ratios. SPEC floating-point, VAX 11/780, and Amdahl 470 user state miss ratios follow, and the SPEC integer miss rates are smallest by a wide margin.
All of the data in the literature suggests that operating system activity significantly increases miss ratios. First, operating system code tends to loop less than user code, and so instruction miss ratios are high. Second, operating system routines are usually called into the cache by an exception, interrupt, or trap; run for a short time; and finally are replaced from the cache before they run again. They effectively always face a cold-start situation. Sanguinetti observes that for the Amdahl 580, routines must execute over 600 times per second to stay cache resident.
Third, operating system activity is associated with timesharing and high levels of multiprogramming; frequent task switching means that programs are constantly experiencing cold start. As illustrated by Figure 3, miss ratios for the SPEC benchmarks are considerably below those for any work load with significant operating system activity. As noted earlier and as shown in Table 2, the SPEC benchmarks actually contain very little operating system activity.
Mogul and Borg report similar differences in cache performance between computation-bound and multiprogrammed environments. The SPEC floating-point benchmark miss ratios are quite close to the DTMRs, the data from Agarwal et al., and the VAX 11/780 measurements. For large cache sizes they are also very close to the Amdahl 470 user program miss ratios. The SPEC integer benchmark miss ratios are the lowest.
WE CARRIED OUT THIS STUDY FOR TWO REASONS: to show measurements of the cache performance of the SPEC benchmarks and to comment on the usefulness of those benchmarks for cache and memory system design. While the cache performance of the SPEC benchmarks varies from program to program, we found that the floating-point bench- [...]
Comparisons with other studies show that the SPEC integer benchmarks have miss ratios much smaller than reported by any set of published measurements of hardware monitor results, those taken using a microcode tracer, or those from studies using very long traces. Miss ratios for the SPEC floating-point benchmarks seem consistent with previous measurements of user program miss ratios but are quite low relative to supervisor code miss ratios.
Note that no one unique work load or standard set of miss ratios exists; every environment will have its own work load and corresponding cache performance. From these measurements and comparisons, however, we conclude that miss ratios for the SPEC benchmarks could be considered representative of only a certain narrow environment: Unix workstations running user state CPU-bound jobs as the single active user process. The integer benchmarks have very low miss ratios and put very little stress on the memory system. The floating-point benchmarks provide reasonable measurements of memory system performance for user code, but they are still much better behaved than commercial and time-shared work loads.
The SPEC92 benchmarks are conspicuously lacking a significant operating system component. This lack affects their utility in two ways: miss ratios are very low, and the performance impacts of operating system functions themselves are not tested.
An important aspect of the validity of a benchmark suite is that the benchmarks affect the memory system in a manner similar to that of the work load represented. Our analysis helps readers determine whether the SPEC92 benchmark suite is suitable as a standard. We recommend that a similar analysis also be done for any subsequent release of the benchmarks.
