Combining Sampling with Single-Pass Techniques for Efficient Cache Simulation by Conte, Thomas M. & Hwu, Wen-mei W.
December 1991 UILU-ENG-91-2254 
CRHC-91-32
Center for Reliable and High-Performance Computing
COMBINING SAMPLING WITH 
SINGLE-PASS TECHNIQUES
FOR EFFICIENT 
CACHE SIMULATION
T. M. Conte and W.-M. Hwu
Coordinated Science Laboratory 
College of Engineering
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
REPORT DOCUMENTATION PAGE
Security classification of this page: Unclassified
1a. Report security classification: Unclassified
1b. Restrictive markings: None
3. Distribution/availability of report: Approved for public release; distribution unlimited
4. Performing organization report number(s): UILU-ENG-91-2254 (CRHC-91-32)
6a. Name of performing organization: Coordinated Science Lab, University of Illinois
6b. Office symbol: N/A
6c. Address: 1101 W. Springfield Avenue, Urbana, IL 61801
7a. Name of monitoring organization: National Cash Register; National Science Foundation
7b. Address: Denver, CO; Washington, DC
8a. Name of funding/sponsoring organization: see 7a
8c. Address: see 7b
Title: Combining Sampling with Single-Pass Techniques for Efficient Cache Simulation
12. Personal author(s): Conte, T. M. and Hwu, W.-M.
13a. Type of report: Technical
14. Date of report: 1991 December 18
15. Page count: 30
18. Subject terms: cache simulation, cache performance evaluation, single-pass techniques
21. Abstract security classification: Unclassified

Combining Sampling with Single-Pass Techniques for
Efficient Cache Simulation
Thomas M. Conte    Wen-mei W. Hwu
Center for Reliable and High-Performance Computing 
University of Illinois 
1101 West Springfield Avenue 
Urbana, Illinois 61801 
hwu@crhc.uiuc.edu
December, 1991
Abstract
Single-pass methods for cache simulation can calculate the performance of multi­
ple cache designs in one pass over the address trace. The primary drawback of these 
methods is their speed. Statistical sampling uses statistical techniques to condense 
the address trace. Sampling is used to improve the simulation speed by sacrificing a 
small amount of accuracy in the results. This paper presents a new approach to cache 
performance evaluation by extending the cache evaluation problem to include the trace 
collection problem and by merging single-pass cache simulation techniques with statis­
tical sampling techniques. The new approach both improves the speed of single-pass 
techniques and significantly improves the accuracy achievable with statistical sampling. 
Empirical data is presented to verify the viability of these techniques.
1 Introduction
Memory hierarchies composed of cache memories are so common and so crucial to
high-performance computer architecture design that performance evaluation of cache
memories has received phenomenal attention. Smith recently catalogued 487 technical papers
and reports that deal with some aspect of caching [1]. The holy grail of cache performance
evaluation is a fast-yet-accurate cache evaluation method. To this end, researchers have
devised analytical models and novel simulation approaches [2] [3] [4]. One inherent
difficulty with cache performance evaluation is the size of the address traces of real
programs. Many methods have been proposed to reduce trace size, including compacting the
trace by using cache properties, using program properties, and exploiting compression
techniques such as Lempel-Ziv compression [5]. These techniques are lossless since no
information is removed. A further trace compression technique, statistical sampling, is
lossy [6]. Sampling involves taking a fixed number of contiguous, fixed-size samples from
the trace and then using this sampled trace to perform the cache simulation. Sampling has
been shown to produce reasonably accurate results and has received considerable interest
recently. One reason sampling succeeds so well is that the prime metric of cache
performance, the miss ratio, is an arithmetic average and hence easy to approximate
statistically. The most difficult problem to solve for sampling-based cache simulation is
estimating the cache contents at the beginning of each sample (i.e., the effects of the
lost references). We will return to this issue shortly.
The cache performance evaluation process involves collecting a trace, compressing it as
much as possible, and then simulating each possible cache dimension with this trace. To
reduce the number of required simulations, single-pass cache simulators can be used.
Such simulators can simulate multiple cache dimensions all at once by exploiting the
inclusion property of stacking replacement algorithms (LRU is the best-known member of
this class of replacement algorithms). This method has been extended to include the rigid
placement/replacement algorithms used in direct-mapped caches [7].
Several researchers have proposed that since a trace is so costly to store, it should be
consumed by the simulator while it is being generated (termed concurrent trace
consumption) [8]. With traditional multiple-cache simulations, concurrent trace
consumption still requires that the traced program be run time and time again. The obvious
improvement is to combine single-pass cache simulation with concurrent trace consumption.
However, where traditional cache simulators take $O(n)$ time and $O(1)$ space for a trace
of length $n$, single-pass methods take $O(n^2)$ time and $O(1)$ space.
This paper presents a viable method for combining concurrent trace consumption,
single-pass cache simulation and statistical sampling. Since the entire trace is available
during concurrent trace consumption, the issue of how to predict the contents of the cache
between samples is removed. Experimental results are presented demonstrating that cache
performance evaluation using this method is fast. Furthermore, the new sampling technique
requires a significantly smaller sample size than traditional sampling to achieve the same
accuracy.
The remainder of this paper reviews the central concepts of concurrent trace consumption,
single-pass cache methods, and statistical sampling. It then discusses the extensions to
single-pass methods for sampling. Three of the four integer benchmarks from the SPEC
benchmark set version 1.0 are used for an empirical demonstration of the method.
2 Concurrent Trace Consumption
Since traces may be several billion references long, the most viable storage medium for
the trace is archival storage such as reel-to-reel tape. Because archival storage has
lower throughput than mass-storage disks, storing the trace can be time consuming as well
as space consuming. The slowest stage of the cache simulation process then becomes the
interface between the trace and the cache simulator.
The idea of concurrent trace consumption is to use the trace while it is being generated [8]. 
The production of the trace can be performed using hardware monitoring. However, this 
would involve special hardware design and that level of resource commitment is not generally 
available. More commonly, concurrent consumption is used in conjunction with a software 
tracing tool. This section discusses a compiler-based instrumentation tool. Such a tool adds 
additional code to a program to emit a trace when the program is run. Similar approaches 
have been used by AE, MPtrace, and pixie [5],[9],[10].
2.1 Spike
This paper uses Spike to modify programs to emit traces concurrently. Spike is a
compiler-based tool written by Michael Golden and the authors [11]. Spike adds code to
programs compiled with the GNU C compiler (currently, version 1.40). Spike works by
surrounding all load and store operations with instructions that save the relevant
addresses in a buffer. Instructions are also added at the beginning of each basic block to
check whether the buffer is full. To guarantee that the buffer does not overflow during
the execution of the previous basic block, the buffer is created with a spare amount of
space. The buffer is emptied by shipping its contents to the simulator over a
communication path. When the program is executed, a Unix socket is opened to a remote host
to serve as the communication path. The simulator is then started on that host. A program
compiled with Spike and annotated in this manner is referred to as a spiked program.
2.2 Tracing limits
The ultimate limit to the speed of any cache simulation that uses concurrent trace con­
sumption is the speed of the spiked program with no simulation to consume the trace. This 
limit corresponds to overhead due to two factors: (1) the extra inserted instructions used 
to perform the tracing, and (2) the cost of using the communication channel. The limit is 
termed the baseline time and is expressed in total minutes of run time that the unloaded 
spiked program takes to complete.
Methods can be used to encode the trace to reduce the number of inserted instructions and 
the cost of using the communication channel. A decoder is then built for this encoded trace 
using compile-time information [5]. These approaches are best suited for non-concurrent 
trace storage approaches since the decoder adds an additional stage to the concurrent trace 
consumption pipeline. For this reason, Spike does not use such encodings for data-memory 
traces.
3 Single-pass cache simulation
A traditional cache simulator uses a data structure that is an exact replica of the tag
store of the cache being simulated. The simulation involves updating this data structure
at each reference. When an address in the trace is not present in the tag store structure,
the corresponding cache miss is recorded. The advantages of such a technique are its
efficiency and simplicity. The time complexity for such an algorithm is $O(n)$ in $n$
inputs. A simple array can be used for the tag store of a direct-mapped cache. (For
non-direct-mapped configurations, the replacement policy has to be included, which results
in a slightly more complex data structure.) Since the tag store does not change in size
during simulation, the space complexity is $O(1)$.
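The traditional simulator described above can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions: the function name `simulate_direct_mapped` and its parameters (`size_log2` and `block_log2`, the base-two logarithms of the cache size and block size in bytes) are hypothetical and do not come from the original simulator.

```python
def simulate_direct_mapped(trace, size_log2, block_log2):
    """Count misses for one direct-mapped cache (hypothetical helper).

    The cache holds 2**size_log2 bytes in blocks of 2**block_log2 bytes.
    """
    num_blocks = 1 << (size_log2 - block_log2)
    tag_store = [None] * num_blocks        # fixed-size replica of the tag store
    misses = 0
    for addr in trace:                     # one O(1) update per reference
        block = addr >> block_log2         # drop the block-offset bits
        index = block % num_blocks         # low block bits select the entry
        tag = block // num_blocks          # remaining high bits form the tag
        if tag_store[index] != tag:        # empty entry or different tag: miss
            misses += 1
            tag_store[index] = tag
    return misses


if __name__ == "__main__":
    trace = [0, 1, 2, 3, 1, 2, 1, 2]
    # A two-byte cache with one-byte blocks; a different cache needs another pass.
    print(simulate_direct_mapped(trace, size_log2=1, block_log2=0))
```

Each call evaluates exactly one cache configuration, which is why a design-space study with this kind of simulator needs one pass over the trace per configuration.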
The disadvantage of the traditional cache simulator is its lack of generality. A simulation 
must be performed for each configuration of cache under study. Hence the term multiple-pass 
cache simulator can be used to describe the traditional simulator since it requires multiple 
passes over the trace.
Single-pass cache simulation techniques rely on the inclusion property of stacking
replacement algorithms. Exploitation of this property allows this class of simulators to
find the miss ratios for an entire design space of cache dimensions with one pass over the
trace. The space complexity of these algorithms is directly proportional to the static
program size. Hence, it is $O(1)$. The disadvantage of these approaches is their time
complexity, which is $O(n^2)$ [3]. This section gives a brief introduction to single-pass
methods. For a more detailed account, the reader is referred to [3] [4]. The particular
single-pass simulation approach presented in this paper is based on the
recurrence/conflict model of the miss ratio. The model is introduced below, followed by a
brief description of the simulation method. This section closes with a suggested
implementation strategy for the algorithm.
3.1 Recurrences and conflicts
The metric used in many memory system studies is the miss ratio. This is the ratio of
the number of references that are not satisfied by a cache at a level of the memory system
hierarchy over the total number of references. The miss ratio has served as a good metric
for memory systems since it is a characteristic of the workload (e.g., the memory trace) yet
independent of the access time of the memory elements. Therefore, a given miss ratio can
be used to decide whether a potential memory element technology will meet the required
access time for the memory system. The recurrence/conflict model of the miss ratio is
best illustrated with an example. Consider the trace of Figure 1. The recurrences in the
trace are accesses e, f, g and h. Without context switching, all four recurrences would
produce a hit in an infinite cache. In the ideal case of an infinite cache in the absence of
context switching, the miss ratio may be expressed as,
\[ \rho = \frac{N - R}{N} \qquad (1) \]
where $R$ is the total number of recurrences and $N$ is the total number of references.
Non-ideal behavior occurs due to conflicts. A dimensional conflict is defined as an event
which converts a recurrence into a miss due to limited cache capacity or mapping
inflexibility. For illustration, consider a direct-mapped cache composed of two one-byte
blocks, shown in Figure 2. (Note that such a small cache would be impractical to build.) A
miss occurs for the recurrence e because reference d purges address 1 from the cache due
to insufficient cache capacity. Hence, d represents a dimensional conflict for the
recurrence e. The other misses, a, b, c and d, occur because these are the first
references to addresses 0, 1, 2 and 3, respectively. Therefore, the following formula can
be used for deriving the cache miss ratio, $\rho$, for a given trace, a given cache
dimension and a given pattern of context switching:
\[ \rho = \frac{N - (R - D)}{N} \qquad (2) \]
where $D$ is the total number of dimensional conflicts. This is a general model and can be
extended to account for other effects, such as conflicts due to multiprocessor cache
coherence [12].

Reference:  a  b  c  d  e  f  g  h
Address:    0  1  2  3  1  2  1  2
Figure 1: An example trace of addresses.

Reference:  a     b     c     d      e     f    g    h
Address:    0     1     2     3      1     2    1    2
Result:     miss  miss  miss  miss*  miss  hit  hit  hit
block 0:    0     0     2     2      2     2    2    2
block 1:    -     1     1     3      1     1    1    1
* Dimensional conflict
Figure 2: An example two-block direct-mapped cache behavior.
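Equation (2) can be checked on the example trace with a short Python sketch. This is an illustration only: the helper name `recurrence_conflict_counts` is hypothetical, the cache is the two-block, one-byte-block direct-mapped cache of Figure 2, and a dimensional conflict is counted once for every recurrence that it turns into a miss.

```python
def recurrence_conflict_counts(trace, num_blocks):
    """Return (N, R, D) for a direct-mapped cache of one-byte blocks."""
    seen = set()                     # addresses referenced so far
    cache = [None] * num_blocks      # address currently held by each block
    n = r = d = 0
    for addr in trace:
        n += 1
        index = addr % num_blocks
        if addr in seen:             # a recurrence: hits in an infinite cache
            r += 1
            if cache[index] != addr:
                d += 1               # purged earlier: a dimensional conflict
        seen.add(addr)
        cache[index] = addr
    return n, r, d


if __name__ == "__main__":
    N, R, D = recurrence_conflict_counts([0, 1, 2, 3, 1, 2, 1, 2], num_blocks=2)
    rho_infinite = (N - R) / N       # Equation (1)
    rho = (N - (R - D)) / N          # Equation (2)
    print(N, R, D, rho_infinite, rho)
```

For the trace of Figure 1 this gives $N = 8$, $R = 4$ and $D = 1$, so Equation (1) yields 4/8 and Equation (2) yields 5/8, matching the five misses marked in Figure 2.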
3.2 Reference streams and cache dimensions
The formal abstraction of a benchmark's trace is termed a "reference stream." This is a
sequence of address references, $w(k)$, of length $N$ ($0 \le k < N$). The addresses are
addresses in the lowest level of a cache hierarchy, which is assumed to be a linear space
(e.g., the virtual space). When they are required, such references will be represented by
lower-case Greek letters, such as $\alpha$, $\beta$, $\gamma$. The reference stream is
assumed to be generated by a single process in a multiprogramming system. The time
variable, $k$, is a measure of the system clock. Also, a reference will be called a
voluntary context-switch point if the benchmark relinquished the CPU after the reference
(e.g., a system call was performed).
The dimension of a cache is expressed using the notation $(C, B, S)$, for a cache of size
$2^C$ bytes, with block size $2^B$ bytes, and $2^S$ blocks contained in each associativity
set. The term set size is used to mean associativity level, or the number of blocks per
set. Cache size is the total number of bytes per cache. Block size has been called line
size elsewhere [13]. Note that $C \ge B + S$. The notation $(C, B, \infty)$ is an
abbreviation for the dimension of a fully-associative cache ($S = C - B$). For example, a
cache of dimension $(10, 6, 0)$ is a 1KB direct-mapped cache with a block size of 64 bytes;
and a cache of dimension $(21, 10, 11)$ (alternately, $(21, 10, \infty)$) is of size 2MB
with 1KB-length blocks and is fully-associative. A dash is substituted for an entry in the
triple to indicate all caches of that dimension. Hence, $(-, 5, 1)$ are all caches with
block size 32 bytes and 2-way associativity. Caches are assumed to use LRU replacement and
map addresses into sets using bit selection [4].
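The $(C, B, S)$ notation can be made concrete with a small decoding helper. The Python sketch below is illustrative; the function name `describe_dimension` and the use of `math.inf` to stand for the $\infty$ entry are assumptions made here, not part of the paper's notation.

```python
from math import inf


def describe_dimension(C, B, S):
    """Decode (C, B, S) into (cache bytes, block bytes, blocks per set, sets)."""
    if S == inf:                 # fully associative: a single set holds all blocks
        S = C - B
    if C < B + S:
        raise ValueError("a cache dimension requires C >= B + S")
    cache_bytes = 1 << C
    block_bytes = 1 << B
    blocks_per_set = 1 << S
    num_sets = 1 << (C - B - S)
    return cache_bytes, block_bytes, blocks_per_set, num_sets


if __name__ == "__main__":
    print(describe_dimension(10, 6, 0))     # 1KB direct-mapped, 64-byte blocks
    print(describe_dimension(21, 10, inf))  # 2MB fully associative, 1KB blocks
```

Under these definitions the number of sets is $2^{C-B-S}$, which is why a direct-mapped cache ($S = 0$) has as many sets as blocks and a fully-associative cache has exactly one.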
It is useful to partition the reference stream by setting the block offset portion of all
addresses in the stream to zero. This produces a block reference stream, $w_B(k)$, defined
such that,
\[ w_B(k) = 2^B \left\lfloor \frac{w(k)}{2^B} \right\rfloor \]
In binary, this is equivalent to setting the least-significant $B$ bits to zero. The number
of recurrences is measured for the block reference stream, and denoted $R[B]$. Dimensional
conflicts, $D[C, B, S]$, are measured for each cache dimension using a single-pass
technique [4],[7],[14].
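The block reference stream is simple to compute with shifts. The following Python sketch is illustrative only; the function name `block_reference` is an assumption.

```python
def block_reference(w, B):
    """Zero the least-significant B bits of address w (block size 2**B bytes)."""
    return (w >> B) << B


if __name__ == "__main__":
    stream = [0, 1, 2, 3, 4, 5, 9]
    # With B = 2 (four-byte blocks), addresses 0..3 collapse onto block 0.
    print([block_reference(w, 2) for w in stream])
```

The shift form agrees with the floor-based definition, since `(w >> B) << B` equals `2**B * (w // 2**B)` for non-negative integers.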
3.3 Least-recently-used (LRU) stack operation
An LRU stack operates as follows: when an address, $w_B(k) = \alpha$, is encountered in the
block reference stream, the LRU stack is checked to see if $\alpha$ is present on the
stack. If $\alpha$ is not present, it is pushed onto the stack. However, if $\alpha$ is
present (i.e., it is a recurring reference), it is removed from the stack, then repushed
onto the stack. This is illustrated in Figure 3 for the example reference stream at the
beginning of this section (Figure 1). LRU stacks were first introduced by Mattson et
al. in [3].

Reference:     0  1  2  3  1  2  1  2
Stack (top):   0  1  2  3  1  2  1  2
                  0  1  2  3  1  2  1
                     0  1  2  3  3  3
                        0  0  0  0  0
Figure 3: An example of LRU stack operation.
An LRU stack is represented as $S_B(k)$, maintained for a block size $B$ at time $k$. The
$i$th ordered item of $S_B(k)$ is expressed as $S_B(k)[i]$. The stack may also be expressed
as an ordered list, such that
\[ S_B(k) = \{ S_B(k)[0], S_B(k)[1], \ldots, S_B(k)[|S_B(k)|] \}. \]
The following operations are defined for the stack:
the push($\cdot$) function,
\[ \mathrm{push}(S_B(k), \alpha) = \{ \alpha, S_B(k)[0], S_B(k)[1], \ldots, S_B(k)[|S_B(k)|] \}, \]
the where($\cdot$) function,
\[ \mathrm{where}(S_B(k), \alpha) = i \quad \text{if } S_B(k)[i] = \alpha, \]
and the repush($\cdot$) function,
\[ \mathrm{repush}(S_B(k), \alpha) = \{ \alpha, S_B(k)[0], S_B(k)[1], \ldots, S_B(k)[\mathrm{where}(S_B(k), \alpha) - 1], S_B(k)[\mathrm{where}(S_B(k), \alpha) + 1], \ldots, S_B(k)[|S_B(k)|] \}. \]
$\mathrm{where}(S_B(k), \alpha)$ and $\mathrm{repush}(S_B(k), \alpha)$ are undefined when
$\alpha \notin S_B(k)$. When $S_B(k)$ and $\alpha$ are understood, it is convenient to
define $\Delta = \mathrm{where}(S_B(k), \alpha)$. Note that push($\cdot$) and
repush($\cdot$) are defined as side-effect-free functions, rather than procedures. This is
to remove dependence on the time variable, $k$.
For an address $\alpha = w_B(k)$, the least-recently used (LRU) management policy for a
stack is shown in Figure 4. In Step 1.1, the references between the top of stack and the
recurring reference are referred to as the set
$\Gamma = \{ \beta_i \mid \beta_i = S_B(k-1)[i],\ 0 \le i < \Delta \}$. The LRU policy is
essentially a definition for calculating $S_B(k)$ from $S_B(k-1)$ and $\alpha$.

1. if $\alpha \in S_B(k-1)$ then
   1.1 do_recurrence($\alpha$, $\Gamma$)
   1.2 $S_B(k) \leftarrow \mathrm{repush}(S_B(k-1), \alpha)$
2. else $S_B(k) \leftarrow \mathrm{push}(S_B(k-1), \alpha)$
3. $N \leftarrow N + 1$
Figure 4: The least-recently used management policy for a stack, $S_B(k)$ (adapted from
Mattson et al.).
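The stack operations and the LRU policy can be written down almost verbatim as side-effect-free functions. The Python sketch below is an illustration under stated assumptions: stacks are modeled as tuples with the top at index 0, `lru_step` is a hypothetical name for one application of the policy, and it returns the intervening references instead of calling do_recurrence.

```python
def push(stack, a):
    """Return a new stack with a on top."""
    return (a,) + stack


def where(stack, a):
    """Return the depth of a; raises ValueError when a is absent (undefined)."""
    return stack.index(a)


def repush(stack, a):
    """Return a new stack with a moved from its current depth to the top."""
    depth = where(stack, a)
    return (a,) + stack[:depth] + stack[depth + 1:]


def lru_step(stack, a):
    """One step of the LRU policy: returns (new stack, intervening refs or None)."""
    if a in stack:
        gamma = stack[:where(stack, a)]   # references above the recurring one
        return repush(stack, a), gamma
    return push(stack, a), None


if __name__ == "__main__":
    stack = ()
    for a in [0, 1, 2, 3, 1, 2, 1, 2]:    # the trace of Figure 1
        stack, gamma = lru_step(stack, a)
        print(a, stack, gamma)
```

Run on the trace of Figure 1, the successive stacks reproduce the columns of Figure 3, and the first recurrence (address 1) sees the intervening references 3 and 2.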
3.4 Recurrence/conflict-based single-pass simulation
A single-pass cache simulation algorithm is created by expanding the do_recurrence
procedure of Figure 4 [4]. A version of this procedure that uses the recurrence/conflict
model is presented in Figure 5. This algorithm is based on the original algorithm of
Traiger and Slutz [7]. However, where Traiger and Slutz recorded temporal localities, this
algorithm records recurrences and conflicts. Since temporal locality functions can occupy
considerable space, using recurrences and conflicts is an advantage. In this respect, the
recurrence/conflict approach is more similar to the algorithm of Hill and Smith [4]. It is
presented here for pedagogical purposes and as a prelude to the single-pass sampling
techniques that will be discussed in the next section. The remainder of this section
explains Figure 5.
3.4.1 The operation of do_recurrence
Whenever a reference is found on the stack, it is a recurrence. Therefore, the calculation
of the number of recurrences ($R[B]$) is implemented by recording the number of times the
procedure of Figure 5 is called (Step 1). The remainder of the algorithm is devoted to
calculating the dimensional conflicts ($D[C, B, S]$). This is now explained.
The loop over all intervening references of Step 2 calculates the raw information for
determining two classes of cache organizations. The maintenance of the number of unique
references ($u$) in Step 2.1 is used to calculate the largest-sized fully-associative cache
with a dimensional conflict ($C_\infty$). This calculation is done in Steps 3 and 4 by
taking the lg (log base two) of this count. The remainder of Step 2 calculates a histogram
of a function of the current reference ($\alpha$) and each intervening reference
($\beta_i$) (Step 2.4). This function is the lowest power-of-two factor of the arithmetic
difference between the two references (Steps 2.2 and 2.3). For a range of direct-mapped
caches, this function is equivalent to the largest cache size in which a miss will still
occur ($C_{mc}$). The remainder of the procedure uses this information to calculate this
cache size for all associativities (Steps 5-8).
do_recurrence($\alpha$, $\Gamma$):
1 $R[B] \leftarrow R[B] + 1$
2 for $\beta_i \in \Gamma$ do
   2.1 $u \leftarrow u + 1$
   2.2 $d \leftarrow |\beta_i - \alpha|$
   2.3 $z \leftarrow \mathrm{ctz}(d)$
   2.4 $p[z] \leftarrow p[z] + 1$
   2.5 $z_{max} \leftarrow \max(z, z_{max})$
3 $C_\infty \leftarrow \lfloor \lg u \rfloor + B$
4 $D[C_\infty, B, \infty] \leftarrow D[C_\infty, B, \infty] + 1$
5 $z \leftarrow z_{max}$
6 $S_{target} \leftarrow 1$
7 $n_{ss} \leftarrow 0$
8 for $s \leftarrow 0$ to $S_{max}$
   8.1 $C_{mc} \leftarrow B$
   8.2 while $z \ge 0$ and $n_{ss} < S_{target}$
      8.2.1 $n_{ss} \leftarrow p[z]$
      8.2.2 $z \leftarrow z - 1$
   8.3 if $n_{ss} \ge S_{target}$ then
      8.3.1 $C_{mc} \leftarrow z + s + 1$
   8.4 $D[C_{mc}, B, s] \leftarrow D[C_{mc}, B, s] + 1$
   8.5 $S_{target} \leftarrow 2 \times S_{target}$

Notation:
Symbol         Definition
$\alpha$       Current reference
$\beta_i$      Intervening references from $\Gamma$
$u$            Number of unique references
$d$            Address distance
ctz($d$)       Counts trailing zeros (in binary) for $d$
$z$            Count of trailing zeros
$p[z]$         Histogram of counts of trailing zeros
$z_{max}$      Maximum trailing zeros number
$S_{target}$   Target set size
$n_{ss}$       Number of references in the same set
$C_{mc}$       Largest cache with a dimensional conflict
$N$            Total number of references
Figure 5: The recurrence/conflict single-pass cache simulation algorithm.
The histogram ($p[z]$) is processed for all associativities by scanning the histogram from
largest to smallest potential conflicting cache size. A set size can be thought of as
conflict tolerance. The larger the set size, the more conflicts between $\alpha$ and
$\beta_i$ can occur before a miss occurs. In Steps 6 to 8, the set sizes are considered in
increasing order to see how many conflicts can be tolerated. For each set size, the largest
cache size in which a miss will occur ($C_{mc}$) is the product of the same cache size for
a direct-mapped cache and the set size (Step 8.3.1; note that addition of these exponents
of base 2 implies multiplication). If no conflicts remain in the histogram, the only
conflicts accounted for are those that occur in caches containing a single block
(Step 8.1).
A more detailed discussion for a related single-pass cache algorithm is presented in [4].
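The role of the trailing-zero count in Steps 2.2 and 2.3 can be illustrated in isolation. The Python sketch below is not the full procedure of Figure 5; it only shows, under the bit-selection mapping assumed in Section 3.2, why the trailing-zero count of the address difference gives the largest direct-mapped cache in which two references collide. The function names `ctz` and `same_frame_direct_mapped` are assumptions made for this illustration.

```python
def ctz(d):
    """Count trailing zero bits of a positive integer d."""
    if d <= 0:
        raise ValueError("ctz is only defined here for d > 0")
    return (d & -d).bit_length() - 1   # d & -d isolates the lowest set bit


def same_frame_direct_mapped(alpha, beta, C):
    """True if block addresses alpha and beta share a frame in a 2**C-byte cache.

    With bit selection, a direct-mapped cache of 2**C bytes places a block
    using the low C address bits (the offset bits are already zero in the
    block reference stream), so two blocks collide when those bits agree.
    """
    return alpha % (1 << C) == beta % (1 << C)


if __name__ == "__main__":
    alpha, beta = 0x1000, 0x3000
    z = ctz(abs(beta - alpha))
    print(z)                                           # exponent of the largest colliding cache
    print(same_frame_direct_mapped(alpha, beta, z))      # collide at 2**z bytes
    print(same_frame_direct_mapped(alpha, beta, z + 1))  # separated at 2**(z+1) bytes
```

Two distinct block addresses collide in every direct-mapped cache of size $2^C$ with $C \le \mathrm{ctz}(|\beta_i - \alpha|)$ and in none that is larger, which is the property the histogram $p[z]$ accumulates.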
3.5 Implementation issues
Efficient implementation of a single-pass algorithm can result in a constant speedup over a 
straight-forward yet inefficient implementation. Although such a speedup does not change 
the asymptotic behavior, it still has significant effect on the speed of simulation. This section 
closes with a discussion of some implementation issues.
The stack itself can be implemented as an array, yet such an implementation would 
require compaction when a reference is repushed. The most-efficient implementation known 
to the authors is a doubly-linked list. This is illustrated in Figure 6. The advantage of such 
an implementation is that it is easy to delete a stack frame from the middle of the stack in 
constant time without the need for compaction.
In Mattson et al. [3], the set-existence operation, $\alpha \in S_B(k-1)$ (Step 1 of
Figure 4), was determined by scanning the whole stack from top to bottom. This can be
modified by using a table lookup, where each entry of the new table contains a pointer to
the stack frame. This is illustrated in Figure 7. The current implementation of the
recurrence/conflict-based single-pass algorithm uses a hash table for this table lookup.
It has been shown that such a hash table construction can be used to improve the
complexity of single-pass methods restricted to fully-associative caches [15].

Figure 6: The stack implemented as a doubly-linked list.
Figure 7: Use of table lookup to find arbitrary entries on the stack.
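The combination of a doubly-linked list and a lookup table can be sketched as follows. This Python fragment is an illustration, not the authors' implementation: the class names `Node` and `LRUStack` are hypothetical, and a Python dictionary stands in for the hash table.

```python
class Node:
    """One stack frame in the doubly-linked list."""
    __slots__ = ("addr", "prev", "next")

    def __init__(self, addr):
        self.addr = addr
        self.prev = None
        self.next = None


class LRUStack:
    def __init__(self):
        self.top = None
        self.table = {}                  # address -> stack frame (hash lookup)

    def _unlink(self, node):
        # Remove a frame from anywhere in the list in constant time.
        if node.prev is not None:
            node.prev.next = node.next
        else:
            self.top = node.next
        if node.next is not None:
            node.next.prev = node.prev
        node.prev = node.next = None

    def _push_node(self, node):
        node.next = self.top
        if self.top is not None:
            self.top.prev = node
        self.top = node

    def reference(self, addr):
        """Apply the LRU policy; return True when addr is a recurrence."""
        node = self.table.get(addr)      # set-existence test without a scan
        if node is None:
            node = Node(addr)
            self.table[addr] = node
            self._push_node(node)
            return False
        self._unlink(node)               # repush: no compaction required
        self._push_node(node)
        return True

    def contents(self):
        out, node = [], self.top
        while node is not None:
            out.append(node.addr)
            node = node.next
        return out


if __name__ == "__main__":
    stack = LRUStack()
    print([stack.reference(a) for a in [0, 1, 2, 3, 1, 2, 1, 2]])
    print(stack.contents())
```

With this layout both the existence test and the repush take constant time; only the walk over the intervening references in do_recurrence still depends on the depth of the recurring reference.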
The dilemma with a single-pass cache implementation is that although caching systems 
are being simulated, the memory behavior of the implementation tends to be rather poor.
This occurs because the stack frames are allocated sequentially according to the order of 
initial reference to locations in the trace. However, as simulation progresses, stack frames 
that are sequential in the stack tend not to be allocated in the same virtual memory page. 
This results in poor paging rates when the doubly-linked list is traversed.
One approach to the memory behavior dilemma is as follows: when a repush occurs, the
repushed stack frame is repushed not by manipulating the doubly-linked list pointers but by
exchanging it byte-for-byte with the stack frame that was allocated before the top of
stack. The doubly-linked list pointers are then repaired to maintain the correct stacking
order. This has the effect of forcing the blocks near the top of the stack to be contiguous
and therefore exploiting the locality of the trace. When this was implemented in the
recurrence/conflict-based single-pass cache simulator, it resulted in a 10% to 15%
improvement in run time and a near-perfect (close to 0%) page fault rate.
4 Statistical Sampling of Address Traces
Cache simulation is mainly involved with summarizing the cache performance due to the
entire trace by reporting a small set of statistics. By far the most common statistic is
the miss ratio; its decomposition into recurrences and conflicts was discussed in the
previous section. Since the miss ratio is an arithmetic average over time, it can be
accurately predicted by statistically sampling the trace [6].
4.1 Formal definitions of sampling
Consider a reference stream, $w_B(k)$. Statistical sampling takes $N_s$ samples of length
$L_s$ from this trace. Since during concurrent trace consumption the actual length of the
trace is not known a priori, the number of references between samples is also fixed. This
sample gap has length $L_G$. For simplicity, it is assumed that the trace length is an
integral multiple of $L_s + L_G$ (i.e., $N = N_s \times (L_s + L_G)$).
Note that each sample is a contiguous block of $L_s$ references. These samples are then
applied to the cache simulator in the order they were taken from the trace. The state of
the cache is unknown between each sample. The major issue in sampling is how to repair the
state of the cache between the application of each sample.
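The sampling layout can be expressed as a simple index test. The Python sketch below assumes that each sample occupies the first $L_s$ references of every period of $L_s + L_G$ references; the function name `in_sample` and the parameter names `Ls` and `Lg` are chosen here for illustration.

```python
def in_sample(k, Ls, Lg):
    """True if reference k falls inside a sample (samples lead each period)."""
    return k % (Ls + Lg) < Ls


if __name__ == "__main__":
    Ls, Lg, Ns = 2, 3, 2
    N = Ns * (Ls + Lg)                     # trace length is a multiple of Ls + Lg
    sampled = [k for k in range(N) if in_sample(k, Ls, Lg)]
    print(sampled, len(sampled) == Ns * Ls)
```

Because both the sample length and the gap are fixed, the test needs no knowledge of the total trace length, which suits concurrent trace consumption.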
4.2 State repair techniques
Perhaps the simplest state-repair technique flushes the cache between each sample. Such
flushing causes references inside the sample that would have been recurrences in the full
trace to appear to be unique references. Therefore, the miss ratio is highly inflated.
Another method is to wait until the cache is warmed up. One warm-up criterion might 
be to use the first half of the sample to prime the cache and then record recurrences and 
conflicts for the second half of the sample. This works well for small caches, but forces ever 
larger sample sizes as cache size grows.
Other methods are to wait until each set in the cache is filled with references before
recurrences and conflicts are recorded for that set [16], or redefine the miss ratio in terms
of the lifetime of references [17]. However, these approaches are limited to multiple-pass
algorithms. In a single-pass algorithm, although the set membership is known when
conflicts are determined, the state of each set in each possible cache configuration needs
to be maintained. This requires a huge amount of information.
A final method is to remove all unique references from the trace. To elaborate on this idea,
suppose the recurrence/conflict single-pass method was extended to measure the number of
references whose status was unknown. It is traditional to term these references fill
references. Let $F[B]$ be the count of fill references for block size $B$. The miss ratio
using this technique is then,
\[ \rho = 1 - \frac{R[B] - D[C, B, S]}{N - F[B]}, \]
re-arranging:
\[ \rho = 1 - \frac{F[B] + R[B] - D[C, B, S] - \rho F[B]}{N}, \]
(from [18]). This shows that removing the fill references is equivalent to weighting these
references by the miss ratio of the remainder of the trace. Figure 8 shows a single-pass
algorithm modified for this form of sampling. This algorithm is applied to every reference
in the sample. Between samples the LRU stack is flushed ($S_B \leftarrow \emptyset$).
1. for $i \leftarrow 0$ to $N_s - 1$
   1.1 $S_B(0) \leftarrow \emptyset$
   1.2 for $j \leftarrow 0$ to $L_s$
      1.2.1 $\alpha \leftarrow w_B(i \times (L_s + L_G) + j)$
      1.2.2 if $\alpha \in S_B(j-1)$ then
         1.2.2.1 do_recurrence($\alpha$, $\Gamma$)
         1.2.2.2 $S_B(j) \leftarrow \mathrm{repush}(S_B(j-1), \alpha)$
      1.2.3 else
         1.2.3.1 $S_B(j) \leftarrow \mathrm{push}(S_B(j-1), \alpha)$
         1.2.3.2 $F[B] \leftarrow F[B] + 1$
      1.2.4 $N \leftarrow N + 1$
Figure 8: A generic single-pass cache simulation algorithm extended for sampling.
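The fill-removal scheme can be sketched for the simplest case of a single fully-associative LRU cache. The Python fragment below is an illustration under stated assumptions, not the algorithm of Figure 5: the function name `fill_removal_miss_ratio` is hypothetical, each sample is taken to hold exactly `Ls` references, and a recurrence is counted as a dimensional conflict when its stack depth is at least the cache capacity in blocks (the standard stack-distance test for a fully-associative LRU cache).

```python
def fill_removal_miss_ratio(trace, Ls, Lg, capacity_blocks):
    """Sampled miss ratio with the stack flushed between samples.

    trace holds block addresses; the cache is fully associative with
    capacity_blocks blocks. Returns None if every sampled reference is a fill.
    """
    N = R = D = F = 0
    period = Ls + Lg
    for start in range(0, len(trace), period):
        stack = []                           # flushed at the start of each sample
        for a in trace[start:start + Ls]:
            N += 1
            if a in stack:
                R += 1
                if stack.index(a) >= capacity_blocks:
                    D += 1                   # too deep to still be cached
                stack.remove(a)
            else:
                F += 1                       # status unknown: a fill reference
            stack.insert(0, a)
    return 1 - (R - D) / (N - F) if N > F else None


if __name__ == "__main__":
    trace = [0, 1, 0, 1, 7, 7, 7, 7, 2, 3, 2, 3, 7, 7, 7, 7]
    print(fill_removal_miss_ratio(trace, Ls=4, Lg=4, capacity_blocks=1))
    print(fill_removal_miss_ratio(trace, Ls=4, Lg=4, capacity_blocks=2))
```

Because the stack is empty at the start of every sample, the first touch of each block inside a sample is always a fill, and only the remaining references contribute to the estimate.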
4.3 A no-state-loss sampling technique
In a situation where concurrent trace consumption is occurring, the entire trace is
available even if sampling is occurring at the input to the cache simulator. It is possible
to make use of these excluded references. Consider Figures 4 and 5 from the previous
section. The shorter LRU stack maintenance algorithm of Figure 4 could be applied to all
the references in the trace, saving the more complicated and costly application of
do_recurrence() of Figure 5 for the references inside the sample. The power of this
approach is two-fold. First, if the simpler algorithm exploits a lookup table for stack
blocks, it can be made $O(N)$, although the more complicated algorithm must still remain
$O(N^2)$. Furthermore, since the stack is maintained, whether a reference is a recurrence
or not is known for all references inside the sample. Hence, the status of all references
is known. This approach is therefore termed a no-state-loss sampling technique.
The modified sampling algorithm is shown in Figure 9, where the predicate sampling(&) 
test to see if reference a =  ws(k) falls inside a sample or not.
1. i f  a  € 5 b (& — 1) then
1.1 if sampling(&) then do_recurrence(a, T)
1.2 Ssik) <— repush(5B(& — 1), <*),
2. else Ss(k) <— pu sh (Ss(k — 1), a )
3 if sampling(fc) then N <— N +  1
Figure 9: A no-state-loss approach to extending a single-pass cache simulation algorithm for 
sampling.
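The no-state-loss variant of Figure 9 can be sketched the same way. This is an illustrative sketch only: the stack is a plain Python list (so stack maintenance here is not O(N); the lookup-table optimization mentioned above is omitted), do_recurrence() is again replaced by a stack-depth histogram, and the names simulate_no_state_loss and periodic_sampler are hypothetical. The periodic predicate is one assumed way to realize sampling(k).

```python
from collections import Counter


def periodic_sampler(sample_len, gap_len):
    """Return a predicate k -> bool: True when reference k lies in a sample."""
    period = sample_len + gap_len
    return lambda k: (k % period) < sample_len


def simulate_no_state_loss(trace, in_sample):
    """Maintain the LRU stack for every reference; record only sampled ones.

    in_sample(k) plays the role of the sampling(k) predicate of Figure 9.
    Returns (N, depth_hist).
    """
    stack = []              # never flushed: no state is lost between samples
    n_refs = 0              # N: counted only inside samples
    depth_hist = Counter()  # assumed stand-in for do_recurrence()

    for k, block in enumerate(trace):
        sampled = in_sample(k)
        if block in stack:
            depth = stack.index(block)
            if sampled:
                depth_hist[depth] += 1  # costly step only inside samples
            stack.pop(depth)            # repush: remove, then re-insert on top
        stack.insert(0, block)          # push or repush for every reference
        if sampled:
            n_refs += 1
    return n_refs, depth_hist
```

On a trace where a block is touched before a gap and again inside the next sample, this version records the second touch as a recurrence at its true stack depth, whereas a flushed stack would treat it as a first reference.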
Even though there is no state lost in the sampled simulation, it is still only valid to 
calculate N and R[b] inside each sample and not using the whole trace (recall that R[b] is
calculated inside of the procedure do_recurrence()). The references used to calculate N 
and R[b] must be the same as the references used to calculate conflicts. However, this will 
result in a constant error between the no-state-loss approach and the full-trace miss ratios. 
The size of this error will be investigated in the next section.
A qualification must be made concerning this approach. It is not as fast as simple 
sampling techniques such as the fill reference approach outlined earlier. Some processing 
must be done for all references. The argument in favor of the no-state-loss approach is that 
when concurrent trace consumption is used, the unsampled references would be discarded in 
O(N) time regardless of whether they were somehow processed or not. The question remains 
whether this modification results in high accuracy and performance. The following section 
will address these issues.
5 Experimental Verification
Two issues are important in the evaluation of the single-pass sampling method suggested in 
this paper: (1) determining whether the method is accurate, and (2) determining whether the 
method is time-efficient. The issue of accuracy is approached in this section by investigating 
the difference between actual (full trace) and sample miss ratios. Since the method incurs no 
state loss, its accuracy should be higher than that of the fill-removal method. Therefore, 
comparisons are made between the method and fill-removal.
Time-efficiency is approached by comparing the speedup of single-pass sampling methods 
over single-pass methods without sampling. There is an inherent tradeoff between high speed 
(small sample size) and high accuracy (large sample size). For this reason, the product of the 
time-efficiency and the accuracy measures is calculated.
Two members of the SPEC benchmark set, version 1.0 [19] were chosen for the empirical 
verification. Spike was used to instrument the data memory behavior of these programs. 
Included in the process were all the Unix C and math library functions [20], [21]. The two 
benchmarks were SPEC/008.espresso and SPEC/001.gcc1.35, herein referred to simply as 
espresso and gcc.
There are two benchmark characteristics relevant to the measurements presented in this 
section. The first of these is the total amount of time the spiked benchmark takes when no 
cache simulation is done (baseline time). The baseline time is the lowest achievable execution 
time for each spiked benchmark (see Section 2). No cache simulation can perform faster 
than this time. The second characteristic is the total number of data memory references. 
Any sampling strategy seeks to achieve high accuracy using only a small fraction of these 
references. All measurements were performed using an unloaded Sun SPARCstation IPC 
with 36MB of physical memory.
Table 1: Relevant benchmark characteristics.
Benchmark   Baseline time (:min)   Total number of references
espresso    :50:00                 1.5 × 10^8
gcc         :12:30                 3.3 × 10^7
5.1 Metrics
Memory systems are designed to meet an access time requirement. Cache memories allow the 
designer to decrease the cost of the system by using high-performance and expensive memory 
efficiently. These goals often lead the designer in search of cache memory designs that have
miss ratios smaller than or equal to some criterion value. For this reason, the smaller the 
absolute difference between two miss ratios, the less the chance that the difference will affect 
the final design decisions. The metric for accuracy used in this section was selected with this 
design procedure in mind. The accuracy metric is calculated as the arithmetic difference 
between the full-trace miss ratio and the sampled miss ratio. This metric is termed the 
sample miss ratio error. This metric in general will have larger values than a relative error 
metric.
Time-efficiency of the simulation is quantified in terms of wall clock time on the unloaded 
Sun SPARCstation IPC. To compare run times, the speedup is calculated as the ratio of the 
time the simulation took without sampling over the time the simulation took with sampling.
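Both metrics are simple enough to state as code. The sketch below assumes the error is kept signed (full-trace minus sampled) and that run times are given in any common unit; the function names are hypothetical.

```python
def sample_miss_ratio_error(full_trace_miss_ratio, sampled_miss_ratio):
    """Arithmetic (signed) difference between full-trace and sampled miss ratios."""
    return full_trace_miss_ratio - sampled_miss_ratio


def speedup(time_without_sampling, time_with_sampling):
    """Ratio of unsampled run time to sampled run time (same unit for both)."""
    return time_without_sampling / time_with_sampling
```

For example, speedup(120, 30) gives 4.0 for a hypothetical simulation that takes 120 minutes without sampling and 30 minutes with it.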
5.2 Accuracy measurement
For each benchmark, a run was performed without sampling to obtain the full-trace miss 
ratio. The length of the full trace was used to adjust the sample gap, L_g, so that the number 
of samples, N_s, would be 40. The benchmarks were then run using sample sizes of 
L_s = 50K, 100K, 200K, and 400K references. The design space explored included one run 
per block size, for three block sizes of 32B, 64B, and 128B, and four set associativities of 
direct-mapped, 2-way associative, 4-way associative, and fully-associative. This resulted in 
over 380 cache dimensions for each block size, or 5700 miss ratios per benchmark.
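The gap adjustment can be illustrated with a short calculation. The sketch assumes that samples and gaps tile the trace evenly, so that N_s × (L_s + L_g) is approximately the trace length, which is what the indexing in Figure 8 implies; the paper does not give the exact rule, and the function name sample_gap is hypothetical.

```python
def sample_gap(trace_len, n_samples, sample_len):
    """Gap length L_g such that n_samples periods of (L_s + L_g) span the trace."""
    period = trace_len // n_samples  # references per sample-plus-gap period
    if period < sample_len:
        raise ValueError("trace too short for this many samples of this size")
    return period - sample_len
```

With the rounded espresso reference count from Table 1 (1.5 × 10^8), 40 samples of 50K references give sample_gap(150_000_000, 40, 50_000) = 3,700,000 references between samples under this assumption.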
The presentation of all the data would take considerable space. Instead, several endpoints 
were selected. Only the two associativities, direct-mapped and fully-associative, will be used 
in this section. Figure 10 presents the sample miss ratio error for espresso and gcc with 
block size 32B for a sample size of 50K references.
[Figure 10 plots not reproduced: two panels, one titled "GCC (sample size 50K, block size 32B)"; y-axis: sample miss ratio error; x-axis: log (base 2) cache size, 10 to 35.]
Figure 10: Sample miss ratio error for espresso and gcc: block size 32B, sample size 50K.
The constant error that appears as cache size increases is due to the difference between 
R[B]/N for the no-state-loss sampled case and 
R[B]/N for the full-trace case. This error was predicted in Section 4.3. The error in all cases 
is very small, <  0.01%. The reader is invited to compare the accuracy of the no-state-loss 
technique with the state-loss sampling techniques (cf. [6], [16], [17]). The results presented 
here are several orders of magnitude more accurate than previous techniques. This is as 
expected since the cache contents are maintained between samples.
The largest error in this figure occurs for the smallest cache size, a 1KB direct-mapped 
cache (e.g., cache (10, 5, 0)). The sample miss ratio error for this cache is 0.63%. For small 
caches, dimensional conflicts account for a high percentage of the miss ratio. Dimensional 
conflicts that occur outside the samples are not recorded, causing the error to be largest for 
small caches.
An important question to answer is how sample size affects accuracy for the no-state-loss 
technique. The previous example used a relatively small sample size of 50K references 
(Laha, et al. suggested a size of 100K references [6]). To summarize sample miss ratio errors 
across all cache sizes and associativity levels, the RMS value was calculated for the errors. 
This is reported for the espresso benchmark across each block size in Figure 11. This figure 
indicates that smaller block sizes seem to require a larger sample size. Furthermore, there 
is a distinct knee in two of the three curves, and this knee occurs between 50K and 100K 
references for blocks of size 32B, and between 100K and 200K references for 64B blocks. 
This seems to suggest that sample sizes of 100K to 200K references are appropriate for 
no-state-loss sampling techniques. For these sample sizes, the RMS sample miss ratio error is 
bounded to < 0.0008.
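The RMS summary used here is the standard root-mean-square of the per-cache errors. A minimal sketch, assuming the errors for one block size and sample size have been collected into a list (the function name rms is hypothetical):

```python
import math


def rms(errors):
    """Root-mean-square of a non-empty sequence of sample miss ratio errors."""
    errors = list(errors)
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```

Because each error is squared, positive and negative errors do not cancel, and a few large errors (such as those of very small caches) dominate the summary.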
[Figure 11 plots not reproduced: two panels, one titled "GCC"; y-axis: RMS sample miss ratio error.]
Figure 11: RMS sample miss ratio error vs. block size for espresso and gcc.
5.3 Time-efficiency measurement
The run-times (and speedups over full-trace simulation) for the no-state-loss technique are 
presented in Table 2 across block sizes and sample sizes for both benchmarks.
Table 2: Run-time (and speedup) of no-state-loss sampling over full-trace single-pass methods.

espresso: Run-time (hr:min) & (speedup)
Block size   Full    50K          100K         200K         400K
32B          11:26   1:50 (6.3)   1:51 (6.2)   2:00 (5.7)   2:15 (5.1)
64B          9:06    1:44 (6.3)   1:46 (5.2)   2:03 (4.4)   1:59 (4.6)
128B         4:21    1:42 (2.6)   1:42 (2.6)   2:05 (2.1)   1:52 (2.3)

gcc: Run-time (hr:min) & (speedup)
Block size   Full    50K          100K         200K         400K
32B          2:18    :32 (4.4)    :38 (3.6)    :53 (2.6)    1:17 (1.8)
64B          1:20    :27 (3.0)    :31 (2.6)    :38 (2.1)    :52 (1.5)
128B         1:00    :25 (2.4)    :28 (2.2)    :33 (1.9)    :41 (1.5)
As sample size increases, run time tends to increase. This effect is more pronounced for 
gcc than for espresso. The baseline time for gcc (:12:30) is four times smaller than that of 
espresso (:50:00), whereas the trace size of gcc (3.3 × 10^7) is 4.5 times smaller than that of 
espresso (1.5 × 10^8). This effect might be caused by a higher burst rate of buffer flushes for 
espresso over gcc. The locality of the reference stream itself also has a strong effect on run time 
since it determines the amount of the LRU stack that must be traversed for each reference 
by do_recurrence(). Larger block sizes tend to produce smaller LRU stacks since larger 
block sizes allow more references to map into the same block. This explains why larger block 
sizes perform better than smaller block sizes.
The speedup figures in Table 2 indicate that the no-state-loss technique can result in 
run-time improvements by factors of 1.5 to over six. Combining these results with 
the accuracy results of the previous section suggests that optimal sample sizes are between 
100K and 200K references. In this range of sample sizes, these speedups are close to the 
speedups for the smallest sample size considered (50K references).
6 Conclusion
In this paper we propose a viable method for cache performance analysis. The limits of trace 
storage were overcome by exploiting concurrent trace consumption. Single-pass methods 
were used to explore a broad design space. To improve the speed of single-pass methods, 
efficient implementation was discussed and a novel approach to statistical sampling was 
introduced. Experimental data was presented to support the claims of accuracy and speed.
The experimental data suggests that the best sample sizes using no-state-loss sampling 
with single-pass techniques are in the range of 100K to 200K references. One can expect a 
speedup of 2 to 6 times in using these techniques over unmodified single-pass techniques. 
Furthermore, the accuracy of these techniques is high, with an absolute sample miss ratio 
error bounded by ±0.01%. These facts demonstrate that the technique presented is a 
worthwhile tool for cache performance analysis.
Acknowledgements
The authors would like to thank Sadun Anik and all members of the IMPACT research group 
for their support, comments and suggestions. Special thanks to Michael Golden for his work 
on Spike. Thanks also to John Fu and Janak Patel for their comments on early versions of 
these ideas.
This research has been supported by Dr. Lee Hoevel at NCR, the National Science 
Foundation (NSF) under Grant MIP-8809478, and by an equipment donation from the 
Hewlett-Packard company.
References
[1] A. J. Smith, “A second bibliography on cache memories,” Comput. Architecture News, 
vol. 19, pp. 138-153, June 1991.
[2] A. Agarwal, M. Horowitz, and J. Hennessy, “An analytical cache model,” ACM Trans. 
Computer Systems, vol. 7, pp. 184-215, May 1989.
[3] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger, “Evaluation techniques for 
storage hierarchies,” IBM Systems J., vol. 9, no. 2, pp. 78-117, 1970.
[4] M. D. Hill and A. J. Smith, “Evaluating associativity in CPU caches,” IEEE Trans. 
Computers, vol. C-38, pp. 1612-1630, Dec. 1989.
[5] J. R. Larus, “Abstract execution: a technique for efficiently tracing programs,” tech. 
rep., Computer Sciences Department, University of Wisconsin-Madison, Feb. 1990.
[6] S. Laha, J. A. Patel, and R. K. Iyer, “Accurate low-cost methods for performance 
evaluation of cache memory systems,” IEEE Trans. Computers, vol. C-37, pp. 1325-1336, 
Feb. 1988.
[7] I. L. Traiger and D. R. Slutz, “One-pass techniques for the evaluation of memory 
hierarchies,” IBM Research Report RJ 892, IBM, San Jose, CA, July 1971.
[8] C. B. Stunkel and W. K. Fuchs, “TRAPEDS: producing traces for multicomputers via 
execution driven simulation,” in Proc. ACM SIGMETRICS ’89 and PERFORMANCE 
’89 Intl. Conf. on Measurement and Modeling of Comput. Sys., (Berkeley, CA), pp. 70-78, 
May 1989.
[9] MIPS Computer Systems, MIPS language programmer’s guide, 1986.
[10] S. J. Eggers, D. R. Keppel, E. J. Koldinger, and H. M. Levy, “Techniques for efficient 
inline tracing on a shared-memory multiprocessor,” in Proc. ACM SIGMETRICS ’90 
Conf. on Measurement and Modeling of Comput. Sys., pp. 37-45, May 1990.
[11] M. L. Golden, “Issues in trace collection through program instrumentation,” 
Master’s thesis, Department of Electrical and Computer Engineering, University of Illinois, 
Urbana-Champaign, Illinois, 1991.
[12] J. G. Thompson, Efficient analysis of caching systems. PhD thesis, Computer 
Science Division, University of California, Berkeley, California, Oct. 1987. Report No. 
UCB/CSD 87/374.
[13] A. J. Smith, “Cache memories,” ACM Computing Surveys, vol. 14, no. 3, pp. 473-530, 
1982.
[14] T. M. Conte and W. W. Hwu, “Single-pass memory system evaluation for 
multiprogramming workloads,” Tech. Rep. CSG-122, Center for Reliable and High-Performance 
Computing, University of Illinois, Urbana, IL, May 1990.
[15] Y. H. Kim, M. D. Hill, and D. A. Wood, “Implementing stack simulation for 
highly-associative memories,” in Proc. ACM SIGMETRICS ’91 Conf. on Measurement and 
Modeling of Comput. Sys., pp. 212-213, May 1991.
[16] J. W. C. Fu and J. H. Patel, “How to simulate 100 billion references cheaply,” Tech. 
Rep. CRHC-91-30, Center for Reliable and High-Performance Computing, University 
of Illinois, Urbana, IL, Nov. 1991.
[17] D. A. Wood, M. D. Hill, and R. E. Kessler, “A model for estimating trace-sample 
miss ratios,” in Proc. ACM SIGMETRICS ’91 Conf. on Measurement and Modeling of 
Comput. Sys., pp. 79-89, May 1991.
[18] H. S. Stone, High-performance computer architecture. New York, NY: Addison-Wesley, 
1990.
[19] “Spec newsletter,” Feb. 1989. SPEC, Fremont, CA.
[20] T. M. Conte and W. W. Hwu, “Benchmark characterization,” IEEE Computer, 
pp. 48-56, Jan. 1991.
[21] S. I. Feldman, D. M. Gay, M. W. Maimone, and N. L. Schryer, “A Fortran-to-C 
converter,” Computing Science Tech. Report 149, AT&T Bell Laboratories, Murray Hill, 
NJ, June 1990.
