Single-Pass Memory System Evaluation for Multiprogramming Workloads by Conte, Thomas M. & Hwu, Wen-mei W.
May 1990 UILU-ENG-90-2214
CSG-122
COORDINATED SCIENCE LABORATORY
College of Engineering
SINGLE-PASS MEMORY SYSTEM EVALUATION FOR MULTIPROGRAMMING WORKLOADS
Thomas M. Conte Wen-mei W. Hwu
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
Single-Pass Memory System Evaluation For Multiprogramming Workloads
Thomas M. Conte Wen-mei W. Hwu
Center for Reliable and High-Performance Computing
University of Illinois
hwu@csg.uiuc.edu
Abstract
Modern memory systems are composed of levels of cache memories, a virtual memory
system, and a backing store. Varying more than a few design parameters and
measuring the performance of such systems has traditionally been constrained by the
high cost of simulation. Models of cache performance recently introduced reduce the
cost of simulation, but at the expense of accuracy of performance prediction. Stack-based
methods predict performance accurately using one pass over the trace for all cache
sizes, but these techniques have been limited to fully-associative organizations. This
paper presents a stack-based method of evaluating the performance of cache memories
using a recurrence/conflict model for the miss ratio. Unlike previous work, the
performance of realistic cache designs, such as direct-mapped caches, is predicted by the
method. The method also includes a new approach to the problem of the effects of
multiprogramming. This new technique separates the characteristics of the individual
program from those of the workload. The recurrence/conflict method is shown to be
practical, general, and powerful by comparing its performance to that of a popular
traditional cache simulator. The authors expect that the availability of such a tool will
have a large impact on future architectural studies of memory systems.
1 Introduction
Because of the role they play in the design of cost-effective memory systems, cache memories 
have occupied a special place in research into computer architecture. In 1986, Smith compiled 
a bibliography of 380 papers on the topic covering fifteen years of research [1]. A majority of 
these papers have focused on the performance evaluation for the design of cache memories. 
Some papers have evaluated cache performance as compared to other alternatives [2, 3]. In 
either case, cache design and evaluation is largely an empirical procedure. A benchmark 
set is selected and it is used to evaluate the cache performance for a design space. The 
performance evaluation technique of preference has been simulation. However, simulation is 
costly, limiting the design space and the number of benchmarks the designer can consider. 
Two directions have been undertaken in the literature into alternate cache performance
evaluation methods. An analytical approach was introduced by Denning in [4]. A recent
example of an analytical approach is presented by Agarwal et al. [5] and has been used for
design space exploration by Przybylski et al. [6].
An alternative to an analytical model is a hybrid approach described collectively as
one-pass or stack algorithms. These were introduced by Mattson et al. in [7]. They function
by exploiting properties of stacking replacement policies to evaluate all fully-associative 
cache sizes in one pass over the trace. A recent example of the evolution of these ideas is 
in Thompson and Smith [8], where one-pass algorithms are presented for fully-associative 
buffers for realistic policy decisions such as write back and sector mapping. Traiger and 
Slutz [9] present a method that addresses various levels of set associativities and block 
sizes in one pass, but the amount of collected information required to reconstruct the cache 
performance is large. Due to this large storage requirement, their technique is impractical.
Even when using the trusted simulation techniques for evaluation of cache memories, the
issue of approximating operating system effects is troublesome. Multiprogramming has the
effect of partially or completely flushing a buffer at arbitrary instants during execution.
One approach to this problem used in [10] was to systematically flush the cache at fixed
intervals. Using this technique, the designer can randomly insert context switches into a
simulation, but getting stable results requires significantly increasing the number of
simulations. Also, it has been shown that assuming fixed context switching intervals is overly
optimistic [11, 12]. Another approach is to estimate the cache performance using cold-start
miss ratios, but the assumption that no context is saved after a context switch has been shown
not to hold for large caches [10, 12]. Combining several reference streams into a stochastically
merged stream solves this problem, but at the cost of introducing workload choice (e.g., sets
of benchmarks) into the evaluation problem [13]. Lastly, none of these approaches considers
separately the voluntary context switching that occurs when a program makes a request of
the operating system.
This paper presents the recurrence/conflict method of evaluating the performance of
fully-associative, set-associative and direct-mapped cache organizations for all cache and
block sizes exactly using one pass over the trace. This method is based on the work of
Mattson et al., Traiger and Slutz, and Thompson and Smith, but alters the statistics
collected to accommodate a new model for miss ratio calculation [7, 8, 9]. This model for the
miss ratio substantially reduces the traditional storage requirements of the collected
information, making the method practical. A portion of the information collected can be used to
reconstruct multiprogramming effects due to both voluntary and involuntary (preemptive)
context switching. Preemption frequency and partial flushes of the buffer are
parameterized to separate workload considerations from benchmark considerations. To evaluate the
method’s practicality, the run time of its implementation is compared against a popular
traditional cache simulator. Results from a set of benchmarks are presented to demonstrate
the method’s operation.
2 A Method for Memory System Evaluation
A cache memory is a familiar concept. The dimension of a cache can be expressed as a
three-tuple, $(C, B, S)$, for a cache of size $2^C$ bytes, with block size $2^B$ bytes, and $2^S$ blocks
for each associativity set. Note that $C \ge B + S$. For example, a cache of dimension (10,6,0) is a 1KB
direct-mapped cache with a block size of 64 bytes. A cache of dimension (21,10,11) is of
size 2MB with 1KB-length blocks and it is fully-associative (such a cache models a modern
virtual memory system). The notation $(C, B, \infty)$ is an abbreviation for fully-associative
caches ($S = C - B$).
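The three-tuple notation can be unpacked mechanically. The following is a minimal sketch, reading the constraint as $C \ge B + S$ (which the fully-associative case $S = C - B$ requires); the function name `describe_cache` and the dictionary keys are illustrative and do not come from the report.

```python
def describe_cache(c, b, s):
    """Expand a cache dimension (C, B, S) into byte and block counts.

    Assumes sizes are powers of two: 2**c bytes total, 2**b bytes per
    block, and 2**s blocks per associativity set.
    """
    if c < b + s:
        raise ValueError("a cache dimension needs C >= B + S")
    return {
        "size_bytes": 2 ** c,
        "block_bytes": 2 ** b,
        "ways": 2 ** s,                  # blocks per set
        "sets": 2 ** (c - b - s),        # number of associativity sets
        "fully_associative": s == c - b,
    }


# The 2MB, 1KB-block, fully-associative example from the text.
print(describe_cache(21, 10, 11))
# A hypothetical 8KB, 32-byte-block, 4-way cache.
print(describe_cache(13, 5, 2))
```

A fully-associative cache is simply the case with a single set, so $(C, B, \infty)$ corresponds to calling the function with `s = c - b`.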
Common metrics of cache performance are miss ratios and traffic ratios. One method of
calculating the miss ratio, $\rho$, is to count the number of instances that a miss occurred in $N$
references. This number is the miss count, $M$, and the miss ratio is then

$$\rho = \frac{M}{N}. \tag{1}$$

The traffic ratio, $\sigma$, a measure of the traffic on the memory bus generated by the cache, can
be expressed as $\sigma = 2^B \rho$.
2.1 The recurrence/conflict model
Because the traffic ratio is derived from the miss ratio and the block size, the miss ratio 
suffices to characterize the performance of a cache memory. One method of calculating p is 
to use Equation 1. Another method of calculation is based on the observation that all hits 
occur due to recurring references. For example, consider the following string of references:
Reference number    1    2    3    4    5    6    7    8
Address           100  101  102  103  102  101  104  101
References to addresses 100, 103, and 104 occur only once and result in a miss regardless
of the cache organization. References to 101 and 102 occur more than once and hence
have potential for a hit, dependent upon the cache organization. Such references are called
recurring references, and there are three of them in the example (references 5, 6, and 8). The
references between a recurring reference and the previous reference to the same address are
termed the intervening references. For example, references 3, 4 and 5 are the intervening
references for the first recurring reference to address 101 (reference 6). There is a chance
that for a given cache organization the intervening references
will remove the recurring reference from the cache, resulting in a miss instead of a hit. Such a
situation is termed a conflict. If there are $R$ instances of recurring references and $K$ instances
of conflicts, the miss ratio can be expressed as
$$\rho = 1 - \frac{R - K}{N}. \tag{2}$$
This expression is termed the recurrence/conflict model for the miss ratio.
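As an illustration, the eight-reference example can be run through a tiny cache to check that Equation 1 and the recurrence/conflict expression give the same miss ratio. This is a minimal sketch under assumptions that are not in the report: a fully-associative LRU cache holding two one-address blocks, and the hypothetical function name `miss_ratio_two_ways`.

```python
from collections import OrderedDict


def miss_ratio_two_ways(trace, num_blocks):
    """Return (M/N, 1 - (R - K)/N, R, K) for a fully-associative LRU cache."""
    cache = OrderedDict()       # keys ordered from least to most recently used
    seen = set()                # every address referenced so far
    misses = recurrences = conflicts = 0
    for addr in trace:
        recurring = addr in seen
        seen.add(addr)
        if recurring:
            recurrences += 1
        if addr in cache:
            cache.move_to_end(addr)          # hit: make it most recently used
        else:
            misses += 1
            if recurring:
                conflicts += 1               # a recurrence turned into a miss
            cache[addr] = True
            if len(cache) > num_blocks:
                cache.popitem(last=False)    # evict the least recently used
    n = len(trace)
    return misses / n, 1 - (recurrences - conflicts) / n, recurrences, conflicts


trace = [100, 101, 102, 103, 102, 101, 104, 101]
print(miss_ratio_two_ways(trace, num_blocks=2))
# (0.75, 0.75, 3, 1)
```

With two blocks, only reference 6 (the first recurrence of address 101) is a conflict, so $R = 3$, $K = 1$ and both expressions give $6/8$.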
The method of calculating the miss ratio accurately for a large class of cache organizations
is based on the recurrence/conflict model. The method involves the calculation of two arrays,
r[B] and κ[C, B, S], using one pass over the reference string. An algorithm to perform these
computations is presented in Figures 1 and 2. The procedure recurrence_conflict(a) is
applied in turn to each referenced address. The array stack[B] is an array of stacks managed
by the routines push(·), top_of_stack(·), depth(·), and repush(·). The items kept on a stack
are addresses. Note the two functions zero_out_lsb(·) and count_trailing_zeros(·). The
function zero_out_lsb(a, B) returns a with the B least significant bits set to zero (i.e., the
block address of address a). The function count_trailing_zeros(·) returns the number of
trailing zeros in the binary representation of a number. The algorithm explores a predefined
maximum design space, delimited by the parameters $C_{max}$, $B_{max}$, and $S_{max}$. Also, a special
column is maintained in κ[C][B][S] for conflicts in fully-associative caches, κ[C, B, ∞]. Note
that any conflict that occurs in a cache of size (C, B, S) also occurs in a cache of size
(C − 1, B, S). The while loop of procedure process_cycle (Figure 2) implements this
observation. The miss ratio for a cache of dimension (c, b, s) can be calculated from r[B] and
κ[C][B][S] using Equation 3.

$$\rho(c, b, s) = 1 - \frac{1}{N}\left( r[b] - \sum_{j=c}^{C_{max}} \kappa[j, b, s] \right). \tag{3}$$

recurrence_conflict(a, voluntary_cs):
begin
    N ← N + 1
    for B ← 0 to B_max do
    begin
        block_addr ← zero_out_lsb(a, B)
        if on_stack(stack[B], block_addr) then
            d ← depth(stack[B], block_addr)
            process_cycle(B, d, block_addr)
            repush(stack[B], block_addr)
        else
            push(stack[B], block_addr)
        end
        cs_count(block_addr) ← 1
        unmark_voluntary_cs(block_addr)
        if voluntary_cs then mark_voluntary_cs(block_addr)
    end
end
Figure 1: Driver routine for the recurrence/conflict algorithm.
Since the recurrence/conflict algorithm is based on the LRU stack algorithm presented by
Mattson et al. in [7], it is of complexity $O(N \lg N)$ on average. The storage required
for the resulting r[B] and κ[C, B, S] is $B_{max} + (C_{max} + 1) \times (B_{max} + 1) \times (S_{max} + 2) + 1$
words (typically, a C language long). Also, two arrays, $M_I[C][B][S]$ and $M_V[C][B][S]$, are
calculated to account for multiprogramming (see Section 2.2 below). Hence, for a design
space of size $C_{max} = 31$ (2GB), $B_{max} = 12$ (4KB), and $S_{max} = 3$ (up to 8 ways, and full
associativity), the storage requirements are approximately 6K words. This allows the
statistics to be readily stored as a disk file. Since the arrays tend to be sparse, the disk file
is smaller than this upper bound in practice.
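The storage estimate can be reproduced with a few lines of arithmetic. The sketch below assumes that the two multiprogramming arrays have the same shape as the conflict array (the text gives them the same three indices); the function name `rc_storage_words` is illustrative.

```python
def rc_storage_words(c_max, b_max, s_max):
    """Words needed for r[B] and for one array indexed by [C][B][S].

    The S dimension has s_max + 2 entries: S = 0..s_max plus the special
    fully-associative column.
    """
    r_words = b_max + 1
    cbs_words = (c_max + 1) * (b_max + 1) * (s_max + 2)
    return r_words, cbs_words


r_words, cbs_words = rc_storage_words(c_max=31, b_max=12, s_max=3)
basic = r_words + cbs_words                 # r[B] plus the conflict array
with_multiprogramming = r_words + 3 * cbs_words   # plus the two M arrays (assumed same shape)
print(basic, with_multiprogramming)
# 2093 6253
```

Under that assumption the total of 6253 words agrees with the "approximately 6K words" quoted for this design space.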
process_cycle(B, d, block_addr):
begin
    r[B] ← r[B] + 1
    if top_of_stack(stack[B]) = block_addr then return
    Let a ∈ stack[B] and depth(stack[B], a) = d − 1
    cs_count(a) ← cs_count(a) + cs_count(block_addr)
    if marked_voluntary_cs(block_addr) then mark_voluntary_cs(a)
    num_unique ← 0
    cs_points ← 0
    voluntary_cs ← false
    for a ∈ stack[B] and depth(stack[B], a) < d do
    begin
        dist ← count_trailing_zeros(|a − block_addr|)
        p[dist] ← p[dist] + 1
        max_dist ← max(max_dist, dist)
        num_unique ← num_unique + 1
        if marked_voluntary_cs(block_addr) then voluntary_cs ← true
        cs_points ← cs_points + cs_count(a)
    end
    FA_cachesize ← ⌊lg num_unique⌋
    if FA_cachesize > C_max then FA_cachesize ← C_max
    κ[FA_cachesize, B, ∞] ← κ[FA_cachesize, B, ∞] + 1
    M_I[FA_cachesize][B][S] ← M_I[FA_cachesize][B][S] + cs_points
    dist ← max_dist
    i ← 1
    sum ← 0
    for S ← 0 to S_max do
    begin
        while dist > 0 and sum < i do
        begin
            sum ← sum + p[dist]
            dist ← dist − 1
        end
        C_min,conf ← 0
        if sum ≥ i then
            C_min,conf ← dist + S + 1
            κ[C_min,conf][B][S] ← κ[C_min,conf][B][S] + 1
        end
        M_I[C_min,conf][B][S] ← M_I[C_min,conf][B][S] + cs_points
        if voluntary_cs then M_V[C_min,conf][B][S] ← M_V[C_min,conf][B][S] + 1
        i ← 2 × i
    end
end
Figure 2: The process_cycle procedure to calculate r[B] and κ[C, B, S].
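The single-pass idea can be sketched in a much simpler form than Figure 2. The code below is not the authors' procedure: it handles one block size only, keeps the LRU stack in a Python list, and indexes conflict counts by the number of set-index bits `k` and the way exponent `s` (so the cache dimension is $C = B + k + s$) rather than by $C$ directly. It uses XOR instead of the absolute difference, which has the same number of trailing zeros for two distinct block numbers. The names `one_pass`, `misses_from_counts` and `simulate_misses` are assumptions made for illustration, and `simulate_misses` is a conventional one-configuration simulator included only as a cross-check.

```python
from collections import defaultdict


def ctz(x):
    """Number of trailing zero bits of a positive integer."""
    return (x & -x).bit_length() - 1


def one_pass(trace, block_bits, max_set_bits, max_way_bits):
    """Return (N, R, K), where K[(k, s)] counts conflicts for an LRU cache
    with 2**k sets (bit-selection mapping) and 2**s ways."""
    stack = []                       # block numbers, most recently used first
    n = r = 0
    conflicts = defaultdict(int)
    for addr in trace:
        n += 1
        block = addr >> block_bits
        if block in stack:
            r += 1                   # recurring reference
            depth = stack.index(block)
            # hist[t]: intervening blocks whose lowest differing bit is bit t
            hist = defaultdict(int)
            for other in stack[:depth]:
                hist[ctz(other ^ block)] += 1
            for k in range(max_set_bits + 1):
                # intervening blocks that map to the same set when there are 2**k sets
                same_set = sum(cnt for t, cnt in hist.items() if t >= k)
                for s in range(max_way_bits + 1):
                    if same_set >= 2 ** s:
                        conflicts[(k, s)] += 1
            stack.pop(depth)
        stack.insert(0, block)
    return n, r, conflicts


def misses_from_counts(n, r, conflicts, k, s):
    """Miss count implied by the recurrence/conflict model: N - (R - K)."""
    return n - (r - conflicts[(k, s)])


def simulate_misses(trace, block_bits, k, s):
    """Direct LRU simulation of a single configuration (cross-check only)."""
    sets = defaultdict(list)
    misses = 0
    for addr in trace:
        block = addr >> block_bits
        lines = sets[block & ((1 << k) - 1)]
        if block in lines:
            lines.remove(block)
        else:
            misses += 1
            if len(lines) == 2 ** s:
                lines.pop()          # evict the least recently used line
        lines.insert(0, block)
    return misses


trace = [100, 101, 102, 103, 102, 101, 104, 101]
n, r, conflicts = one_pass(trace, block_bits=0, max_set_bits=2, max_way_bits=1)
for k in range(3):
    for s in range(2):
        print(k, s, misses_from_counts(n, r, conflicts, k, s) / n)
```

One call to `one_pass` yields the counts for every `(k, s)` pair, whereas `simulate_misses` has to be rerun for each configuration; that difference is the source of the method's advantage.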
2.2 Multiprogramming effects
Estimating the effects of multiprogramming on cache performance is a well-known
problem [10]. Several techniques have been employed to approximate these effects. The cold
miss ratio vs. warm miss ratio technique was examined by Easton in [12, 14]. Examples 
of statistical approaches can be found in [11, 13]. This paper presents a method based on 
the recurrence/conflict model. Multiprogramming is divided into two categories: voluntary 
context switching and involuntary context switching. These categories are explained below.
Voluntary context switching
A process performs a voluntary context switch when the continuation of its execution depends 
on a system service which may take a long time to finish. The frequency and timing of a 
voluntary context switch is solely a characteristic of the benchmark. The number of processes 
executed before a process returns from a context switch is, however, a function of the system 
load and the operating system scheduling policy. For example, the working set of a process 
may have been purged from the cache before it re-enters the run state after a context switch. 
This results in a degraded cache performance as compared to an ideal execution of the same 
benchmark without any context switching.
There are two pieces of information that are associated with context switching. One is 
the number of potential victims, defined as the number of non-conflicting recurring references 
which may be converted from a hit to a miss. This information is a function of the benchmark 
and the cache dimension. The method presented in this paper provides this information 
exactly. However, the fraction of the potential victims which are actually converted to misses
is a function of the system’s load and the operating system’s scheduling policy. Hence, this
fraction is modeled as a parameter, ξ. The designer can vary the parameter value between
0% and 100% to examine the changes in design decisions based on information collected in
only one pass. This feature distinguishes this method from most of the previous ones, where
varying this parameter requires a re-simulation.
The total number of potential victims of all voluntary context switches is measured using 
the recurrence/conflict method. Each voluntary context switch point is marked in the trace, 
and the potential victims of this context switch point are identified as those non-conflicting 
recurring references which occur across the context switch point. This is implemented by 
marking references that occur immediately before voluntary context switch points.
The array $M_V[C][B][S]$ is used to record the number of potential victims of voluntary
context switching. The method of updating $M_V[\cdot]$ is included in Figure 2. If $M_V[c][b][s]$ is
equal to $n$ at the end of the execution, it indicates that for all caches $(c', b, s)$, $c' \ge c$, $n$ of
all the hits can potentially be converted to misses due to voluntary context switches. Given
a percentage of flushed context across context switches, ξ, one can expect $\xi n$ of
the hits to be converted to misses. The miss ratio for a cache of dimension $(c, b, s)$ in the
presence of voluntary context switching becomes

$$\rho(c, b, s) = 1 - \frac{1}{N}\left( r[b] - \sum_{j=c}^{C_{max}} \kappa[j, b, s] - \xi \sum_{j=0}^{c} M_V[j][b][s] \right). \tag{4}$$
Involuntary context switching
Involuntary context switching occurs due to external events such as timer-implemented
preemption and I/O device interrupts. The frequency and occurrence of involuntary context
switching is a function of the system load and the operating system’s scheduling policy, but
not a characteristic of the program. Therefore, it is assumed that an involuntary context
switch has an equal probability of occurring after any reference. With this assumption, the
recurrence/conflict method derives the average number of potential victims, $V_I$, due to each
involuntary context switch. A parameter, $Q$, is defined as the effective quantum (average
preemption interval). Hence, $N/Q$ is the total number of involuntary context switches
expected for the entire reference string. Therefore, the total number of hits that are converted
to misses is $(\xi N V_I)/Q$. As with ξ, one can vary $Q$ over an arbitrary range to observe the impact
of involuntary context switching frequency on the design decisions.
To derive the average number of potential victims due to each involuntary context switch,
one can sum the number of potential victims for all possible switching points in the reference
string and divide this sum by the number of possible switch points ($N$). This is given in the
following formula,

$$V_I = \frac{1}{N} \sum_{\text{all switching points}} (\text{number of potential victims for a switch point}). \tag{5}$$

By exchanging the roles of the context switches and the potential victims, Equation 5 can
be rewritten in the following form,

$$V_I = \frac{1}{N} \sum_{\text{all potential victims}} (\text{number of switching points affecting a potential victim}). \tag{6}$$

Equation 6 fits naturally into the recurrence/conflict method.
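The exchange of summation between Equations 5 and 6 can be checked numerically on the small reference string of Section 2.1. The sketch below makes several illustrative assumptions: each potential victim is represented by the pair (previous reference, recurring reference), a switch occurring after reference t affects a victim when previous ≤ t < recurring, and the cache is taken to be large enough that all three recurrences are hits. The function names are hypothetical.

```python
def victims_per_switch_point(victims, n):
    """Equation 5 view: for every switch point, count the victims it affects."""
    return sum(
        sum(1 for first, last in victims if first <= t < last)
        for t in range(1, n + 1)
    )


def switch_points_per_victim(victims):
    """Equation 6 view: for every victim, count the switch points affecting it."""
    return sum(last - first for first, last in victims)


n = 8
# (previous reference, recurring reference) for addresses 102, 101 and 101.
victims = [(3, 5), (2, 6), (6, 8)]
total_5 = victims_per_switch_point(victims, n)
total_6 = switch_points_per_victim(victims)
print(total_5 / n, total_6 / n)
# 1.0 1.0
```

The second form needs only one number per recurring reference, which is why a per-element counter on the stack is enough to accumulate it during the single pass.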
Due to the large number of context switching points involved, a counter, cs_count(·), is
kept for each element on the stack. Each time a new stack element is created, this counter
is set to 1. When a recurring reference is processed, the context switching count of its
stack element is accumulated into that of the element above it before it is promoted to the
top of the stack. In this way, all the references originally below the element will see the
same number of context switching points above them. The context switching count of the
promoted element is then reset to one. (See Figures 1 and 2.)
The array $M_I[C][B][S]$ is used to record the total number of potential victims of all
involuntary context switching points. If $M_I[c][b][s]$ is equal to $n$ at the end of the execution,
it indicates that for all caches $(c', b, s)$, $c' \ge c$, $n$ of all the hits will potentially be converted
to misses due to involuntary context switching. The average number of potential victims per
involuntary context switch for a cache of configuration $(c, b, s)$ is

$$V_I = \frac{1}{N} \sum_{j=0}^{c} M_I[j][b][s]. \tag{7}$$

The miss ratio for a cache of configuration $(c, b, s)$ under multiprogramming is expressed in
Equation 8.

$$\rho(c, b, s) = 1 - \frac{1}{N}\left( r[b] - \sum_{j=c}^{C_{max}} \kappa[j, b, s] - \xi \sum_{j=0}^{c} M_V[j][b][s] - \frac{\xi N V_I}{Q} \right). \tag{8}$$
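Once the single pass has produced the arrays, evaluating a multiprogramming scenario is a matter of arithmetic. The following sketch assumes one reading of the formulas in this section: conflicts are summed over cache sizes from c up to the maximum, potential victims are summed from 0 up to c, a fraction ξ of the voluntary victims is lost, and involuntary switching removes $(\xi N V_I)/Q$ hits. The function name, the one-dimensional list layout (one list per fixed block size and associativity), and all the numbers are hypothetical.

```python
def multiprogramming_miss_ratio(n, r_b, kappa, m_v, m_i, c, xi, q):
    """Miss ratio for cache size exponent c under the assumptions above.

    kappa, m_v and m_i are lists indexed by cache size exponent j for a
    fixed block size and associativity.
    """
    conflicts = sum(kappa[c:])               # conflicts for sizes j = c .. C_max
    voluntary = xi * sum(m_v[: c + 1])       # voluntary victims for j = 0 .. c
    v_i = sum(m_i[: c + 1]) / n              # average victims per involuntary switch
    involuntary = xi * n * v_i / q           # N/Q switches, each costing xi * V_I hits
    hits = r_b - conflicts - voluntary - involuntary
    return 1 - hits / n


# Hypothetical statistics for a design space with C_max = 4.
n, r_b = 1000, 600
kappa = [0, 0, 50, 30, 20]
m_v = [10, 5, 5, 0, 0]
m_i = [2000, 1000, 500, 500, 0]

for xi in (0.0, 0.5, 1.0):
    for q in (100, 1000):
        rho = multiprogramming_miss_ratio(n, r_b, kappa, m_v, m_i, c=3, xi=xi, q=q)
        print(xi, q, round(rho, 4))
```

Sweeping ξ and Q in the loop reuses the same collected statistics each time, which is the property the text contrasts with methods that need a re-simulation per parameter value.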
2.3 Trace collection
A method of collecting the trace of a benchmark program is to annotate the executable with
special probe instructions. As these probes are executed, local-scope dynamic behavior is
recorded. Such a method is termed keyhole experimentation, to emphasize that it is a
dynamic dual of the retargetable “peephole” techniques used for local optimization [15]. Keyhole
experimentation has been used to generate profiles of programs’ behavior, although the
potential for more than just profile information gathering exists. The keyhole probes can
be placed by the compiler (e.g., GPROF [16]), the assembler (e.g., TRAPEDS [17]), or a
separate object-code modifier (e.g., PIXIE [18]). Using keyhole experimentation at the
compiler level is of greatest use to architects, since the compiler possesses information about the
program’s data and instruction structure before optimization. The Architects Workbench
(CARA), created by Flynn at Stanford [19], is one such tool. The System Parameter
Independent Keyhole Experimenter (SPIKE) is a compiler-independent tool similar to CARA,
constructed by the authors. The current version of SPIKE has been fitted into the GNU CC
compiler [20], since GNU CC is capable of producing code for a variety of architectures.
3 Experimental results
The success of a cache performance evaluation method depends on its practicality. To
investigate the practicality of the recurrence/conflict method, a set of benchmark programs
was compiled for the MC68020 and their instruction reference behavior was instrumented
using SPIKE. The benchmarks are summarized in Table 1.

Table 1: The benchmark set.
Benchmark   No. references   Description
grep        4.1M             The grep program from Unix, used for a search through /usr/dict/words
tex         2.7M             The TeX typesetter, using the ‘TripTeX’ diagnostic input
yacc        722K             The LALR(1) parser-generator from Unix, with the grammar from make as input
The run time for the recurrence/conflict method was compared against Dinero III, a
reliable public-domain cache simulator constructed by Mark Hill. These results are presented
in Table 2. The minutes of (user-mode) run time were collected for each benchmark using
an unloaded Sun 3/280. Note that although tex had approximately 1.2M fewer references
than grep, it took longer to run. This is due to the nature of stack algorithms: the less
locality present in a program, the larger the average stack depth.

Table 2: Running time versus Dinero III.
Benchmark   Recurrence/conflict model   Dinero III (21,4,∞)   Dinero III (21,4,0)   Average ratio RCM/Dinero
grep        1:38                        0:21                  0:20                  4.8
tex         4:35                        0:17                  0:16                  16
yacc        0:53                        0:03                  0:03                  18

The worst slowdown was by a factor of 18. However, given that the design space explored
contained 31 × 10 × 5, or approximately 1500, cache dimensions for each level in the memory
system’s hierarchy, the recurrence/conflict model has a great advantage over conventional
simulation.
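A rough sense of the advantage can be had from the two figures quoted above. The sketch assumes that a conventional simulator evaluates one cache dimension per run and that every run costs about the same; both are simplifying assumptions, and the function name is illustrative.

```python
def single_pass_advantage(design_points, slowdown):
    """Conventional runs replaced per unit of single-pass cost."""
    return design_points / slowdown


design_points = 31 * 10 * 5        # cache dimensions in the explored design space
worst_slowdown = 18                # worst single-pass slowdown from Table 2
advantage = single_pass_advantage(design_points, worst_slowdown)
print(design_points, round(advantage))
# 1550 86
```

Even at the worst observed slowdown, covering the whole design space by conventional simulation would take on the order of 86 times longer under these assumptions.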
Since tex had the most interesting locality, results of the miss ratio for tex are presented
in Figure 3. Set associativity is represented as a solid line for $S = 0$, a dotted line for $S = 2$,
and a dashed line for $S = \infty$. (Because of its high performance, $S = \infty$ is only visible in the
graph of $B = 3$.) After the execution of the recurrence/conflict method, the time required
to generate the entire set of miss ratios for Figure 3 was under a second of user time. As
many points as feasible were checked using Dinero, and all agreed with 100% accuracy.
To see the effects of multiprogramming, tex was evaluated assuming $B = 3$, for ξ =
100%, 90%, 80% and $Q$ = 100, 1000. The results are presented in Figure 4. Unfortunately,
voluntary context switching information is not yet available in SPIKE at the time of this
writing. Therefore, only the results of involuntary context switching were evaluated. The
results are presented as the difference between the miss ratios of uniprogramming and of
multiprogramming ($\Delta\rho$). Note that the preemption interval dominated for $Q = 1000$, whereas
the percentage of flushed context (ξ) had a large effect for $Q = 100$. This implies that, for
tex, beyond a certain $Q$ saved context has little bearing on instruction cache performance.
[Figure 3 plots: six panels, one for each of B = 3, 4, 5, 6, 7 and 8, each plotted against C.]
Figure 3: Miss ratios for tex for various cache dimensions.
[Figure 4 plots: six panels, one for each combination of ξ = 100%, 90%, 80% and Q = 100, 1000, each plotting Δρ (0 to 0.1) against C (0 to 30).]
Figure 4: Multiprogramming miss ratios for tex.
4 Conclusions
This paper has presented a method to evaluate efficiently a very large design space for 
cache memories. When used to evaluate cache hierarchies satisfying the inclusion property 
(see [21]), an entire memory system can be evaluated in one pass. Although the algorithm
presented omitted the issues of write-back and sector-mapping for brevity, the stack
algorithm extensions of Thompson and Smith are compatible with the recurrence/conflict 
method [8]. Hence, the method is general. The method was shown to be efficient and hence 
profitable to use.
The recurrence/conflict method is applicable to both design and architectural research. 
Combined with design criteria such as described in [6], there is the potential of an automated 
memory system design process. Since it evaluates a large memory system design space in 
one pass, techniques for architectural studies into other interacting system tradeoffs can 
be simplified and broadened in scope. Hence, there are a large number of future research 
directions possible using the recurrence/conflict method.
The inclusion of context switching effects into the method is an advance over previous
work, as it cleanly separates the behavior of the benchmarks from the multiprogrammed
performance characteristics they exhibit. Previous approaches involved measuring snapshots
of actual multiprogramming and using these traces for cache simulation. Such approaches
are restricted to phenomenological conclusions since the mix of executing processes and
the interprocess timings are not adjustable after measurement. This illustrates a powerful
feature of the recurrence/conflict model of the miss ratio: external effects that degrade cache
performance, such as context switching or coherence protocol invalidations, can be modeled
as additional types of conflicts, thereby isolating the performance of different design tradeoffs.
[For interested parties, a stable version of the tool written in portable C is freely available
from the authors.]
Addendum
This report was presented for review to the International Symposium on Computer
Architecture Program Committee in November of 1989. In December of 1989, Mark Hill and
Alan Smith published an article in IEEE Transactions on Computers, entitled “Evaluating
associativity in CPU caches” (see [22]). Although Hill and Smith did not make the
distinction between recurrences and conflicts, the presented algorithm is similar to the RCM
method. Since it is common in Science for two distinct research groups to discover an idea, 
and common also for each group to have different insight, this report is being made available 
to present our insights into stack-based memory hierarchy analysis. The material in this 
report discussing evaluation of multiprogramming effects (context switching) is our own and 
not present in the Hill and Smith paper.
-  T. M. Conte and W. W. Hwu, March, 1990
Acknowledgements
The authors would like to thank Sadun Anik, David Griffith and all members of the
IMPACT research group for their support, comments and suggestions. This research has been
supported by the National Science Foundation (NSF) under Grant MIP-8809478, a donation
from NCR, the National Aeronautics and Space Administration (NASA) under Contract
NASA NAG 1-613 in cooperation with the Illinois Computer Laboratory for Aerospace
Systems and Software (ICLASS), and the Office of Naval Research under Contract
N00014-88-K-0656.
References
[1] A. J. Smith, “Bibliography of readings on CPU cache memories and related topics,” 
Comput. Architecture News, vol. 14, pp. 22-42, Jan. 1986.
[2] J. R. Goodman and W.-C. Hsu, “On the use of registers vs. cache to minimize memory
traffic,” in Proc. 13th Annu. Int’l Symp. on Comput. Arch., pp. 375-383, Jan. 1986.
[3] R. J. Eickenmeyer and J. H. Patel, “Performance evaluation of on-chip register and 
cache organizations,” in Proc. 15th Annu. Int’l Symp. on Comput. Arch., (Honolulu, 
Hawaii), pp. 64-72, May 1988.
[4] P. J. Denning and S. C. Schwartz, “Properties of the working-set model,”
Communications ACM, vol. 15, pp. 191-198, Mar. 1972.
[5] A. Agarwal, M. Horowitz, and J. Hennessy, “An analytical cache model,” ACM Trans. 
Computer Systems, vol. 7, pp. 184-215, May 1989.
[6] S. Przybylski, M. Horowitz, and J. Hennessy, “Characteristics of performance-optimal 
multi-level cache hierarchies,” in Proc. 16th Annu. Int’l Symp. on Comput. Arch., 
(Jerusalem, Israel), pp. 114-121, June 1989.
[7] R. L. Mattson, J. Gercsei, D. R. Slutz, and I. L. Traiger, “Evaluation techniques for
storage hierarchies,” IBM Systems J., vol. 9, no. 2, pp. 78-117, 1970.
[8] J. G. Thompson and A. J. Smith, “Efficient (stack) algorithms for analysis of write-back 
and sector memories,” ACM Trans. Computer Systems, vol. 7, pp. 78-117, Feb. 1989.
[9] I. L. Traiger and D. R. Slutz, “One-pass techniques for the evaluation of memory
hierarchies,” IBM Research Report RJ 892, IBM, San Jose, CA, July 1971.
[10] A. J. Smith, “Cache memories,” ACM Computing Surveys, vol. 14, no. 3, pp. 473-530, 
1982.
[11] I. J. Haikala, “Cache hit ratios with geometric task switch intervals,” in Proc. 11th 
Annu. Int’l Symp. on Comput. Arch., (Ann Arbor, MI), pp. 364-371, June 1984.
[12] M. C. Easton, “Computation of cold-start miss ratios,” IEEE Trans. Computers,
vol. C-27, pp. 404-408, May 1978.
[13] G. S. Shedler and D. R. Slutz, “Derivation of miss ratios for merged access streams,” 
IBM J. Research and Development, vol. 20, pp. 505-517, Sept. 1976.
[14] M. C. Easton and R. Fagin, “Cold-start vs. warm-start miss ratios,” Communications 
ACM, vol. 21, pp. 866-872, Oct. 1978.
[15] J. A. Davidson and C. W. Fraser, “The design and application of a retargetable peephole
optimizer,” ACM Trans. Prog. Lang. and Systems, vol. 2, pp. 191-202, Apr. 1980.
[16] S. L. Graham, P. B. Kessler, and M. K. McKusick, “gprof: A call graph execution 
profiler,” in Proc. 1982 SIGPLAN Symp. on Compiler Construction, pp. 120-126, June 
1982.
[17] C. B. Stunkel and W. K. Fuchs, “TRAPEDS: producing traces for multicomputers via
execution driven simulation,” in Proc. ACM SIGMETRICS ’89 and PERFORMANCE
’89 Int’l Conf. on Measurement and Modeling of Comput. Sys., (Berkeley, CA),
pp. 70-78, May 1989.
[18] MIPS Computer Systems, MIPS language programmer’s guide, 1986.
[19] C. L. Mitchell and M. J. Flynn, “A workbench for computer architects,” Design & Test,
pp. 19-29, Feb. 1988.
[20] R. M. Stallman, Using and porting GNU CC. Free Software Foundation, Inc., 1989.
[21] J.-L. Baer and W.-H. Wang, “Architectural choices for multi-level cache hierarchies,” 
in Proc. 16th Int’l Conf. on Parallel Processing, pp. 258-261, Aug. 1987.
[22] M. D. Hill and A. J. Smith, “Evaluating associativity in CPU caches,” IEEE Trans. 
Computers, vol. C-38, pp. 1612-1630, Dec. 1989.
