Static locality analysis for cache management by Sánchez Baeza, Francisco J. et al.
Static Locality Analysis for Cache Management 
F. Jesús Sánchez, Antonio González and Mateo Valero
Universitat Politècnica de Catalunya
Department of Computer Architecture
c/ Jordi Girona, 1-3 - Mòdul D6
08034 - Barcelona (SPAIN)
E-mail: {fran,antonio,mateo}@ac.upc.es
0-8186-8090-3/97 $10.00 © 1997 IEEE
Abstract
Most memory references in numerical codes correspond to array
references whose indices are affine functions of surrounding loop
indices. These array references follow a regular, predictable memory
pattern that can be analyzed at compile time. This analysis can provide
valuable information, such as the locality exhibited by the program,
which can be used to implement a more intelligent caching strategy.
In this paper we propose a static locality analysis oriented to the
management of data caches. We show that previous proposals on locality
analysis are not appropriate when the programs have a high conflict
miss ratio. This paper extends those proposals by introducing a
compile-time interference analysis that significantly improves their
performance.
We first show how this analysis can be used to characterize the dynamic
locality properties of numerical codes. This evaluation shows, for
instance, that a large percentage of references exhibit only temporal
locality and another significant percentage does not exhibit any type
of locality. This motivates the use of a dual data cache, which has a
module specialized to exploit temporal locality, and a selective cache
respectively. Then, the performance provided by these two cache
organizations is evaluated. In both organizations, the static locality
analysis is responsible for tagging each memory instruction according
to the particular type(s) of locality that it exhibits.
1. Introduction
High-performance microprocessors rely on an efficient cache memory
organization to mitigate the increasing gap between processor and
memory speeds. Despite the tremendous research effort that has been
devoted to this topic, current cache organizations make poor use of the
cache capacity. For instance, it is shown in [10] that most programs
require considerably less cache memory than what a typical superscalar
processor has.
One of the drawbacks of conventional cache organizations is that they
perform a blind management of all memory references, that is, all of
them are handled in the same way: if the reference misses, a new block
is brought into cache at the expense of replacing another.
When a reference does not exhibit any type of locality, this results in
cache pollution and memory bandwidth waste. The pollution is due to the
placement in cache of a non-reusable block whereas the memory bandwidth
waste is caused by the additional data brought from L2 cache to L1
cache in the same block as the requested data. To cope with this issue,
some current microprocessors provide memory reference instructions that
can bypass the cache.
When a reference has only temporal locality (i.e., only one data
element of each cache block referenced by it is used by itself or any
other instruction), it also results in cache pollution and memory
bandwidth waste since only one element of the new block will be used.
To overcome this problem, a cache could provide an additional module to
store those data elements with just temporal locality. This was for
instance proposed in the dual data cache organization [7].
In this paper we propose a static locality analysis to manage both
selective data cache and dual data cache organizations. The locality
analysis is inspired by the proposals presented in [18], [14] and [3].
However, these proposals do not consider conflict misses when
performing the locality analysis. This may cause the analysis to be
very inaccurate for programs with a high conflict miss ratio. To
overcome this problem, we propose to extend those previous proposals
for locality analysis with an interference analysis module. We show
that very simple interference analysis schemes may deliver a quite
accurate estimation of the locality of most [...]
The paper provides quantitative statistics about the different types of
locality exhibited by the nested loops of the SpecFP95 benchmarks. We
find for instance that a substantial percentage of references do not
exhibit locality due to cache conflicts for some programs and that a
significant percentage of references exhibit just temporal locality.
These observations motivate the use of selective and dual data caches.
The rest of the paper is organized as follows. Section 2 
reviews the related work. The static locality analysis is pre- 
sented in section 3. The experimental framework is 
described in section 4. Section 5 analyses the locality 
exhibited by loop nests of numerical codes. The application 
of the locality analysis to the selective cache and dual data 
cache is discussed in section 6. Finally, the main conclu- 
sions of this work are summarized in section 7. 
2. Related work 
Selective caching (also called cache bypassing) is a
feature of current microprocessors like the PowerPC [15].
In the literature there are a number of proposals on that
topic for both instruction and data caches. Some remarkable
works for data caches are [6], [1], [7] and [17]. The
scheme proposed in [6] is based on a compile-time estimation
of data lifetimes. The mechanism proposed in [1] identifies
non-cacheable data by means of profiling. The
scheme proposed in [7] is based on a run-time managed
history table of the most recent load/store instructions. The
approaches proposed in [17] are either hardware-based or
make use of simple schemes based on profiling.
The selective cache considered in this paper is like a 
conventional cache in which all the memory instructions 
have an additional bit that is set up by the compiler. In case 
of a cache miss, this bit controls whether a new block is 
brought from L2 cache and placed in L1 or just the missing 
data is requested from L2 and it bypasses L1 cache. We 
assume a 64-bit data bus between L1 and L2, thus, this is 
the bandwidth spent by any bypassing request regardless of 
the actual size of the required data. 
The dual data cache was proposed in [7]. It is composed
of two modules, called temporal and spatial. The
former is targeted to exploit just temporal locality. The latter
is designed to exploit spatial locality, in addition to temporal
locality if a reference exhibits both types of locality.
In consequence, the temporal module has very short blocks
(one 64-bit word is assumed in this study) and the spatial
cache has larger blocks (32 bytes per block is assumed
here). Figure 1 shows the basic block diagram of the dual
data cache. In this case, the compiler sets up a two-bit field
of each memory instruction that indicates one of three
possible actions in case of a cache miss: a) bring a new long
block (32 bytes) and place it in the spatial module; b) bring
a new short block (8 bytes) and place it in the temporal
module; and c) bring just the requested data, which requires
one 64-bit word transaction due to the assumed bus width,
and do not place it in any module.
Figure 1. Block diagram of the dual data cache (a spatial cache and a
temporal cache, both receiving data from the L2 cache).
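The three miss actions of the dual data cache can be sketched as a small lookup. This is only an illustration under the block sizes and bus width assumed in the text; the names `Tag` and `miss_action` are hypothetical and do not come from the paper.

```python
from enum import Enum

# Sizes assumed in the text: 32-byte spatial blocks, 8-byte temporal
# blocks, and a 64-bit (8-byte) bus between L1 and L2.
SPATIAL_BLOCK_BYTES = 32
TEMPORAL_BLOCK_BYTES = 8
BUS_WORD_BYTES = 8


class Tag(Enum):
    """Hypothetical encoding of the two-bit field set by the compiler."""
    SPATIAL = 0
    TEMPORAL = 1
    BYPASS = 2


def miss_action(tag):
    """Return (module that receives the data, bytes requested from L2)."""
    if tag is Tag.SPATIAL:
        return ("spatial", SPATIAL_BLOCK_BYTES)
    if tag is Tag.TEMPORAL:
        return ("temporal", TEMPORAL_BLOCK_BYTES)
    # Bypass: one bus transaction, nothing is placed in either module.
    return (None, BUS_WORD_BYTES)


for tag in Tag:
    print(tag.name, miss_action(tag))
```

A selective cache corresponds to the same idea with only two outcomes: fetch a full block into L1, or bypass it with a single 64-bit transaction.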
The main difference between the dual data cache
architecture considered here and that proposed in [7] is
that in the latter, memory references were tagged at execution
time using additional hardware called the locality prediction
table.
An architecture with some similarities to the dual
data cache has been recently proposed in [12]. Like the dual
data cache, it has different modules for different types of
locality. However, the allocation of data to the modules is
done initially by a very simple heuristic based on the data
type and then it may be changed by profiling or dynamically
by means of hardware that monitors their behavior.
A software managed data cache is provided in the HP 
PA-7200 [5]. In this machine, every memory instruction 
includes a hint called spatial locality only that indicates 
that the data referenced by that instruction exhibits spatial 
locality but not temporal locality. The first level of the 
memory hierarchy of the PA-7200 consists of two modules: 
the assist cache and the off-chip cache. The former stores 
all the data referenced by any instruction while the latter 
stores the data replaced in the assist cache if the spatial 
locality hint is not set. In consequence, the assist cache is 
targeted to any type of reference while the off-chip cache is 
targeted to store all the data except that with just spatial 
locality. 
Static locality analysis has been previously used for 
different purposes. The most remarkable proposals are targeted
to: a) improve the locality of loop nests by transformations
like interchange, reversal, skewing and tiling [18];
and b) software prefetching schemes [13], [14], [3]. However,
to our knowledge, this is the first study that addresses
the use of a static locality analysis oriented to the management
of selective and dual data caches. In addition, our proposal
extends previous schemes by incorporating an
interference analysis. This analysis is crucial for the accu- 
racy of the locality analysis for those cases in which con- 
flict misses are the main source of cache misses, which 
usually is the case for small cache memories. For instance, 
more than half of the cache misses of the SpecFP95 benchmarks
for an 8 Kbyte 2-way associative cache are conflict
misses [8]. An interference analysis was proposed in [16] 
in order to estimate the number of conflict misses at com- 
pile time. That analysis is different and much more com- 
plex than the one proposed in this paper since it had a 
different objective. They tried to quantify at compile time 
the number of conflict misses whereas our objective is just 
to identify those cases in which cache interference will 
inhibit completely the exploitation of locality. 
Regarding the quantitative evaluation of the locality 
exhibited by loop nests of numerical codes, this has been 
the focus of a recent paper [11]. This paper improves that
work mainly in the following two ways. First, for each
memory reference we consider all types of reuse that it
exhibits (see section 3.1 for a definition of reuse) whereas
that work only considered the last one in the order given by
the source code. For each type of reuse, our analysis quantifies
the amount of cache volume required to exploit it.
Second, we consider any loop nest whereas that analysis
only considered a subset of them with some particular features:
at most 3 deep and with only one loop at each level.
3. Static locality analysis
The static locality analysis consists of three steps that
are described below: reuse analysis, interference analysis
and volume analysis.
We restrict the locality analysis to references inside
loops, which represent the majority of references. The
locality analysis estimates the type of locality for both scalar
and vector references. For the latter, the locality analysis
is performed just for array references where the array indices
are affine (i.e., linear) functions of surrounding loop
indices. In the analyzed benchmarks, the references that
were handled by the analysis represent about 90% of the
total. For the remaining references, it is assumed that they
exhibit spatial and temporal locality.
3.1. Reuse analysis
The locality analysis starts by computing the reuse
properties of each load/store instruction as proposed in
[18]. An instruction has self-temporal reuse if the same
data is referenced by at least two different iterations of the
loop. It exhibits self-spatial reuse if the same data block is
referenced by at least two different iterations. Likewise,
different instructions may access the same locations or the
same data blocks. In these cases, it is said that the instructions
have group-temporal reuse and group-spatial reuse
respectively. For group reuse, we distinguish which reference
accesses the common data/block first. This is called
the leading reference whereas the other one, which is the
one that can benefit from reuse, is called the trailing reference.
Obviously, a reference may have several types of
reuse.
Reuse is a measure that is inherent in a given program
and does not depend on the order in which instructions are
later executed. For instance, it is the same for both in-order
and out-of-order execution processors.
Besides, it is almost independent of the particular cache
organization. In particular, the temporal reuse is completely
independent whereas the spatial reuse just depends
on the cache block size.
The reuse of each memory instruction is computed following
the methodology described in [18]. The results are
represented as a vector space that identifies the loops in
which reuse is found (each dimension corresponds to a
loop). We distinguish between two types of temporal and
spatial reuse:
1) Unitary: the vector has only one element different
from zero, that is, vector $(0, \ldots, 0, n_i, 0, \ldots, 0)$ indicates
that this reference has reuse after $n_i$ iterations of loop $i$.
2) Combined: the vector has more than one element different
from zero, that is, vector $(0, \ldots, 0, n_i, n_{i+1}, \ldots, n_N)$
indicates that this reference has reuse after $n_i$ iterations
of loop $i$, $n_{i+1}$ iterations of loop $i+1$ and so on.
The result of this phase is a list of the different reuses
exhibited by each reference indicating the loop(s) for
which each reuse holds. For instance, the reuse analysis of
the sample code of Figure 2 will produce the following
result:
Reference | Reuse in J   | Reuse in I
A(J)      | self-spatial | N.A.
B(I,J)    | no reuse     | self-spatial
C(I,J)    | no reuse     | group-temporal (trailing) with C(I+1,J) and self-spatial
C(I+1,J)  | no reuse     | self-spatial
D(I,J)    | no reuse     | self-spatial
E(1,J)    | no reuse     | self-temporal
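As an illustration of the unitary self-reuse cases above, the sketch below classifies one affine reference per loop from its byte stride. It is a strong simplification of the vector-space method of [18], not the authors' implementation: combined and group reuse are ignored, and the 8-byte element size, the leading dimension of 1000 and the 32-byte block are assumed values chosen only for the example.

```python
def self_reuse(stride_bytes, block_bytes):
    """Classify the self reuse carried by one loop for an affine reference.

    stride_bytes is the distance between the addresses touched by two
    consecutive iterations of that loop (the loop's coefficient in bytes).
    """
    if stride_bytes == 0:
        return "self-temporal"   # same datum in every iteration
    if abs(stride_bytes) < block_bytes:
        return "self-spatial"    # consecutive accesses can share a block
    return "no reuse"


# Hypothetical layout: Fortran column-major arrays of 8-byte elements
# with a leading dimension of 1000 elements, and 32-byte cache blocks.
BLOCK_BYTES = 32
strides = {
    "B(I,J)": {"J": 8000, "I": 8},
    "E(1,J)": {"J": 8000, "I": 0},
    "A(J)": {"J": 8},
}
for name, per_loop in strides.items():
    for loop, stride in per_loop.items():
        print(name, "loop", loop, "->", self_reuse(stride, BLOCK_BYTES))
```

Under these assumptions the classification agrees with the rows for B, E and A in the result listed above.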
3.2. Interference analysis 
Whereas the reuse analysis is almost independent of
any particular cache organization, the interference analysis
is not. For the interference analysis, we assume in this
paper a direct-mapped organization for the selective cache and
the spatial module of the dual data cache. The extension to
other organizations such as set-associative caches is possible,
but it is beyond the scope of this paper.
Figure 2. Sample code.
This analysis tries to identify groups of memory 
instructions that will repeatedly produce conflict misses 
due to interferences among them. There are two types of 
interferences: self-interferences and group-interferences. 
The former are those caused by different executions of the 
same instruction. The latter are produced by different 
instructions that reference either the same or different
memory structures.
Interferences prevent the exploitation of the reuse
exhibited by the interfering instructions.
Traditionally the effect of interferences has been
taken into account by setting the "effective" cache size to
be a fraction of the actual cache size [13]. This simple
scheme does not consider the reference characteristics at all 
and may result in most cases in either an overestimation or 
an underestimation of the effect of interferences. Besides, 
interferences are not uniformly distributed over all memory 
references and therefore, their contribution should be mea- 
sured for each reference independently. The effect of mem- 
ory conflicts may be very high for some programs as it has 
been previously reported. Therefore, a more accurate esti- 
mation is crucial for the performance of the locality analy- 
sis. 
3.2.1. Self-interferences. Self-interferences occur when 
different data blocks referenced by the same instruction are 
mapped onto the same cache location. 
An affine array reference will generate sequences of 
memory references at addresses separated by a constant 
stride. The self-interference analysis only considers those 
instructions with a stride larger than or equal to the block 
size. Otherwise, the instruction exhibits self-spatial reuse 
that can be exploited before any potential self-interference 
happens. 
If the stride is a multiple of the block size, self-interferences
will occur in a direct-mapped cache if the number of
blocks of the cache is lower than the length of the sequence
multiplied by $2^{\text{stride-family}}$. The stride family defined by $x$ is
the set of strides $\sigma \cdot 2^x$ with $\sigma$ odd [9]. All the strides belonging
to the same family (e.g., $12 = 3 \cdot 2^2$ and $20 = 5 \cdot 2^2$ belong
to family 2) have the same behavior from the point of view
of self-interference. Therefore, to approximate the effect of
self-interferences, the volume of cache (see section 3.3)
consumed by a reference is multiplied by $2^{\text{stride-family}}$.
If the stride is not a multiple of the block size, the stride
is rounded up to the next multiple of the block size and the
above scheme is applied.
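A minimal sketch of the stride-family computation follows. It assumes that strides are expressed in cache blocks and reads the self-interference criterion as "cache blocks < sequence length times 2^family"; both the unit and the function names are assumptions made for illustration.

```python
def stride_family(stride_blocks):
    """Return x such that stride_blocks = sigma * 2**x with sigma odd."""
    if stride_blocks <= 0:
        raise ValueError("stride must be a positive number of blocks")
    family = 0
    while stride_blocks % 2 == 0:
        stride_blocks //= 2
        family += 1
    return family


def stride_in_blocks(stride_bytes, block_bytes):
    """Round a byte stride up to the next multiple of the block size."""
    return -(-stride_bytes // block_bytes)  # ceiling division


def self_interferes(sequence_length, stride_blocks, cache_blocks):
    """Assumed test for a direct-mapped cache with cache_blocks blocks."""
    return cache_blocks < sequence_length * 2 ** stride_family(stride_blocks)


print(stride_family(12), stride_family(20))  # both strides are in family 2
print(self_interferes(100, 4, 256))          # 256 < 100 * 4
```

The factor `2 ** stride_family(...)` is the same multiplier that the text applies to the cache volume consumed by the reference.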
3.2.2. Cross-interferences. The cross-interference analysis
focuses on identifying those groups of references that
access different data blocks that map onto the same cache
location for every iteration of the loop. These interferences
will completely inhibit the exploitation of any reuse exhibited
by the interfering instructions. This analysis is only
applied to references whose variables are allocated at
compile time, that is, those variables whose base address
and size of every dimension are statically known. We have
measured that more than 75% of dynamic references for
the benchmarks considered in this paper meet these conditions.
Extending the analysis to deal with other types of
interferences is a future extension of this work.
The interference analysis is applied between the reuse 
and the volume phases, because its result can modify the 
volume of data fetched by each loop. The analysis consists 
of the following steps: 
1) For each affine array reference, compute an expression
that determines the effective memory address as
a function of the initial address and the loop indices:
$\mathit{EffAddress} = \mathit{IniAddress} + \sum_{i=1}^{N} r_i \cdot I_i$
where $I_i$ is the index variable of loop $i$ in a nest of
depth $N$ and $r_i$ is its constant coefficient.
2) Build an interference graph for each basic block. A
conflict between two references $R_1$ and $R_2$ is assumed
if they are mapped onto cache at a distance lower than
the block size:
$|R_1 \bmod \mathit{CacheSize} - R_2 \bmod \mathit{CacheSize}| < \mathit{BlockSize}$
Potential conflicts are analyzed for each pair of references
and they are identified in the interference graph
by means of an edge.
3) Remove interferences. If two instructions with some
type of reuse interfere, their respective reuse cannot
be exploited since the block brought into cache by either
of them will be evicted immediately by the other,
before it is reused. The objective of this step is to tag
some of the interfering instructions as non-cacheable
so that the remaining instructions do not interfere and
therefore their reuse can be exploited.
The algorithm works as follows: in the interference
graph, the node with the maximum number of edges
is chosen. This reference is labeled as non-cacheable,
and its edges are removed. Then, the process is
repeated until the graph has no edges.
Figure 3. Interference analysis for the code of Figure 2.
If we apply this analysis to the example of Figure 2, the
results are shown in Figure 3. We have supposed that the
initial interference graph is the one on the left. The selected
reference is D(I,J). Thus, this reference will be tagged as
non-cacheable and it will not be cached despite having
reuse. However, the reuse exhibited by C(I,J) and
C(I+1,J) can be exploited.
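The conflict test and the greedy removal step can be sketched as follows. This is an illustrative reading of the steps above, not the authors' code; the example graph is hypothetical (Figure 3 itself is not reproduced here) and simply assumes that D(I,J) conflicts with both references to C.

```python
def conflicts(addr1, addr2, cache_size, block_size):
    """Conflict test from step 2, with addresses and sizes in bytes."""
    return abs(addr1 % cache_size - addr2 % cache_size) < block_size


def remove_interferences(graph):
    """Greedy step 3: tag the node with most edges until no edge remains.

    graph maps each reference name to the set of references it conflicts
    with. Returns the references tagged as non-cacheable, in order.
    """
    graph = {node: set(edges) for node, edges in graph.items()}  # copy
    non_cacheable = []
    while any(graph.values()):
        victim = max(graph, key=lambda node: len(graph[node]))
        non_cacheable.append(victim)
        for other in graph[victim]:
            graph[other].discard(victim)
        graph[victim].clear()
    return non_cacheable


# Hypothetical initial graph: D(I,J) conflicts with both C references.
example = {
    "B(I,J)": set(),
    "C(I,J)": {"D(I,J)"},
    "C(I+1,J)": {"D(I,J)"},
    "D(I,J)": {"C(I,J)", "C(I+1,J)"},
}
print(remove_interferences(example))
```

Ties between nodes with the same number of edges are broken here by dictionary order, which is an arbitrary choice of this sketch.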
3.3. Volume analysis 
Another factor that may inhibit the exploitation of
reuse is the limited storage of cache memory. In other
words, if the number of different data blocks that are referenced
between two consecutive reuses of the same block is
higher than the cache capacity, this reuse cannot be
exploited (1). The resulting cache miss is usually called a
capacity miss.
1. This is true for LRU replacement. For other replacement policies,
this is just an approximation.
This requires determining the amount of data that is
used by each reference in each loop. This amount of data
depends on:
(a) Type of reuse: calculated in the previous step.
(b) Loop bounds: if they are unknown at compile-time,
they are estimated using the approximation proposed
by D. Bernstein et al. [3]: each memory reference R is
represented in the following way:
$R = c_1 + \sum_{j=2}^{M} c_j \cdot \prod_{k=1}^{j-1} D_k = r_0 + \sum_{i=1}^{N} r_i \cdot I_i$
where $M$ is the number of dimensions of the variable,
$D_k$ represents the size of dimension $k$, and $c_j$ represents
the subexpression in dimension $j$.
Then, the last sum is sorted by order of decreasing
magnitude of coefficients $r_i$. The estimation is based
on the assumption that a well behaved vector reference
will access different locations for different values
of the loop indices appearing in the expression.
The estimated loop bounds are computed as follows:
$B_i = r_{i-1} / r_i$, if $i > 1$
$B_i = \mathit{ArraySize} / r_i$, if $i = 1$
(a default value is used if the array size is
unknown at compile-time).
We use a simplification if the reference expression has
only one loop index. In this case the estimation is
based on the assumption that a vector subindex does not
exceed the corresponding dimension ($c_i < D_i$).
Unknown: $V(R,i) = V(R,i+1) \cdot B_i$
None: $V(R,i) = V(R,i+1) \cdot B_i \cdot 2^{\text{stride-family}}$
Unitary temporal: $V(R,i) = V(R,i+1)$
Combined temporal or spatial: [...]
Group trailing: $V(R,i) = V_L(R,i) \cdot T_{R,i}$, where $T_{R,i} = \frac{r_L - r_T}{r_i \cdot B_i}$
Figure 4. Contributed volume of a reference to a
loop. $r_L$ and $r_T$ are the coefficients of the
leading and trailing references respectively
and $B_i$ is the upper bound of loop $i$.
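The loop-bound estimate can be illustrated with a short sketch. It assumes that the first estimated bound is the array size divided by the largest coefficient and that each following bound is the ratio of consecutive coefficients; the default value of 100 and the function name are hypothetical.

```python
def estimate_loop_bounds(coefficients, array_size=None, default_bound=100):
    """Estimate loop bounds from the coefficients of an affine reference.

    coefficients maps a loop name to its nonzero coefficient r_i, in the
    same units as array_size (for example, array elements).
    """
    ordered = sorted(coefficients.items(),
                     key=lambda item: abs(item[1]), reverse=True)
    bounds = {}
    previous = None
    for loop, coeff in ordered:
        coeff = abs(coeff)
        if coeff == 0:
            raise ValueError("only loops that appear in the expression")
        if previous is None:
            # Largest coefficient: bounded by the size of the array.
            bounds[loop] = (array_size // coeff
                            if array_size is not None else default_bound)
        else:
            # Next coefficient: ratio of consecutive coefficients.
            bounds[loop] = previous // coeff
        previous = coeff
    return bounds


# Hypothetical array B(1000,10) referenced as B(I,J): r_I = 1, r_J = 1000.
print(estimate_loop_bounds({"I": 1, "J": 1000}, array_size=10000))
```

For this hypothetical declaration the sketch recovers 1000 iterations for loop I and 10 for loop J, which are the dimension sizes, as expected for a well behaved reference.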
The analysis follows these steps:
1) Calculate for each memory reference the number of
cache blocks that are accessed in one iteration of each
loop. Figure 4 shows how the contributed volume of
a reference R to a loop i, which is denoted by $V(R,i)$,
is computed according to the type of reuse (note that
$V(R,N) = 1$):
- Unknown: we suppose the worst case, that is, every
  access uses a new cache block.
- None: every access uses a new cache block too, but
  the volume is modified by the stride family in order
  to take into account the effect of self-interferences.
  In the case of the selective and the dual data
  caches, these accesses are bypassed and,
  therefore, do not affect the computed volume.
- Unitary temporal: loop i accesses the same
  data set as loop i+1 and thus, the volume is not
  increased.
- Unitary spatial: the volume is increased according
  to the number of elements that fit in a cache block.
  This number is given by the stride of the reference
  sequence divided by the cache block size.
- Combined temporal or spatial: the volume is computed
  by multiplying the previous volume by a factor
  that represents the iterations where there is no
  reuse and thus, a new cache block is needed. Each
  loop with reuse contributes to the expression by
  means of the percentage of the total iterations that
  exploit reuse.
- Group trailing: the volume is computed as the volume
  of its leading reference multiplied by a factor
  $T_{R,i}$ that represents the percentage of iterations
  needed to exploit the group reuse with respect to
  the total number of iterations of the loop. Since the
  trailing and leading references only differ in the
  independent term $r_0$, this factor is computed as the
  difference between independent terms divided by
  the coefficient that affects the current loop index ($r_i$)
  and by the loop count ($B_i$).
2) A reuse in a loop b can be exploited if $\sum_R V(R,b)$ is
not higher than the cache size. Otherwise, each
attempt to reuse a data element will result in a capacity
miss.
If we apply this analysis to our example of Figure 2 the 
results are the following, assuming that the block size is 4 
data elements and the cache has 256 blocks: 
Reference | Contributed volume to loop I | Contributed volume to loop J
B(I,J)    | 1                            | 250
C(I,J)    | 1                            | 250
C(I+1,J)  | 0                            | 0
D(I,J)    | 1                            | 250
E(1,J)    | 1                            | 10
Total     | 4                            | 760
A(J)      |                              | 1
Total     | 4                            | 761
Consequently, only reuse across loop I can be
exploited. Therefore, the reference A(J) will result in
repetitive cache misses even though it has spatial reuse.
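The capacity check for this example can be reproduced with a few lines. The volumes are the ones listed in the results above and the cache has 256 blocks as stated in the text; the helper names are hypothetical.

```python
CACHE_BLOCKS = 256

# (volume contributed to loop I, volume contributed to loop J);
# None marks the entry that does not apply (A(J) is outside loop I).
volumes = {
    "B(I,J)": (1, 250),
    "C(I,J)": (1, 250),
    "C(I+1,J)": (0, 0),
    "D(I,J)": (1, 250),
    "E(1,J)": (1, 10),
    "A(J)": (None, 1),
}


def total_volume(volumes, column):
    """Sum the contributed volumes of all references for one loop."""
    return sum(v[column] for v in volumes.values() if v[column] is not None)


def reuse_exploitable(volumes, column, cache_blocks=CACHE_BLOCKS):
    """Reuse across a loop is exploitable if the total volume fits."""
    return total_volume(volumes, column) <= cache_blocks


for name, column in (("I", 0), ("J", 1)):
    print("loop", name, total_volume(volumes, column),
          reuse_exploitable(volumes, column))
```

With these numbers the total for loop I is 4 blocks, which fits, whereas the total for loop J is 761 blocks, which exceeds the 256 blocks of the cache.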
After the locality analysis is done, each memory 
instruction is tagged accordingly: references with no reuse 
are tagged as bypass, and the rest as cacheable in the selec- 
tive cache and as temporal or spatial in the dual data cache. 
If the reference can only exploit temporal reuse, it is tagged
as temporal and it is tagged as spatial otherwise. An addi- 
tional constraint in the dual data cache is that the references 
that exhibit group locality have to be allocated to the same 
module. 
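The tagging rule can be sketched as a small function. This is an illustration only: the representation of reuse types as strings and the function name are assumptions, and the additional constraint that references with group locality share a module is not modelled.

```python
def tag_reference(exploitable_reuses, organization):
    """Tag one memory instruction after the locality analysis.

    exploitable_reuses: set of reuse types that survive the interference
    and volume analyses, e.g. {"self-temporal", "group-spatial"}.
    organization: "selective" or "dual".
    """
    if not exploitable_reuses:
        return "bypass"
    if organization == "selective":
        return "cacheable"
    if all(reuse.endswith("temporal") for reuse in exploitable_reuses):
        return "temporal"
    return "spatial"


print(tag_reference(set(), "dual"))
print(tag_reference({"self-temporal"}, "dual"))
print(tag_reference({"self-temporal", "self-spatial"}, "dual"))
print(tag_reference({"self-spatial"}, "selective"))
```

A full implementation would add a second pass that moves the leading and trailing references of each group reuse into the same module.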
4. Experimental framework 
The cache experiments presented in this paper have 
been performed using the following SpecFP95 bench- 
marks: tomcatv, swim, su2cor, hydro2d, mgrid, applu and 
turb3d. All of them are written in Fortran.
The locality analysis has been implemented using the 
ICTINEO toolset [2]. ICTINEO is a source to source trans- 
lator that produces a code in which each sentence has a 
semantics similar to that of current machine instructions. 
Currently, ICTINEO assumes an infinite number of regis- 
ters and thus, the references produced by spill code are not 
considered in this work. Optimizations usually applied by 
current compilers are implemented in ICTINEO and are 
applied to the resulting code. In this way, the resulting code 
is very similar to the code generated by a production com- 
piler. 
Memory references are instrumented according to the 
locality analysis results, and the trace obtained from the 
execution of instrumented code feeds a cache simulator of 
a selective and a dual data cache. A conventional cache is 
also simulated for comparison. The results presented in this 
paper correspond to the execution of the first 100 million
memory instructions of each benchmark.
5. Statistics of loop nest locality 
The locality analysis previously presented can be used 
to obtain quantitative measures of the locality exhibited by 
loop nests. 
We are interested in all types of reuse exhibited by 
each single memory reference. Consider for instance the 
following code: 
DO J=1,M
   DO I=1,N
      ... A(I) ...
      ... A(I+1) ...
      ... A(I) ...
   END DO
END DO
Our analysis will conclude that for loop I, the first and
third references exhibit group-temporal reuse. Group-temporal
reuse is also exhibited by the second and first references
(in this case the reuse is after one iteration). Besides,
each reference exhibits self-spatial reuse. Now, considering
loop J, we have that the three references exhibit self-temporal
reuse and any pair of references exhibits group-temporal
reuse. Assuming that the interference analysis does not
detect any interference and that the size of vector A is not
higher than the cache capacity, the analysis will conclude
that all the reuse can be exploited (the reuse across loop J
requires a larger volume than the reuse across loop I, but it
is still within the limits of the cache size).
Figure 5. Reuse statistics. The different types of reuse are denoted by: UR (unknown reuse), NR (no reuse),
ST (self-temporal), GT (group-temporal), SS (self-spatial) and GS (group-spatial).
(One panel per benchmark: TOMCATV, SWIM, SU2COR, HYDRO2D, MGRID, APPLU and TURB3D, plus AVERAGE;
horizontal axis: line size of 8, 16, 32 and 64 bytes.)
Considering only the last type of reuse in program
order as proposed in [11], the analysis would detect only a
subset of the different reuses (1). In particular, it would
observe just group-spatial reuse for every memory reference.
This could suggest that for the above code it is not
worthwhile to exploit temporal locality, which is not
the case.
Figure 5 shows the reuse statistics for the loop nests of 
the considered benchmarks. Each bar corresponds to the 
dynamic frequency of a different type of reuse. Since spa- 
tial reuse depends on the cache block size, different bars are 
drawn for a block size ranging from 8 to 64 bytes. The fig- 
ure shows the reuse characteristics of each benchmark and 
the average among them. Notice that the sum of the fre- 
quencies of the different types of reuse may be greater than 
1 since a reference may exhibit more than one type of 
reuse. 
Several conclusions can be drawn from Figure 5. First,
we can see that on average, self-temporal and self-spatial
reuse are the most frequent and neither of them is dominant.
Group-temporal reuse is also quite common whereas group-spatial
reuse is relatively infrequent. As expected, these
results differ from those presented in [11], where it was
reported for instance that self-temporal reuse was the least
frequent type of reuse (2). The dominant type of reuse varies
significantly for the different benchmarks. Self-temporal is
dominant for tomcatv, applu and turb3d. Self-temporal and
group-temporal are the most frequent for mgrid. Self-spatial
is dominant for swim, su2cor and hydro2d. Group-spatial
is always the least common type of reuse. Notice also
that on average, the locality analysis can determine the
reuse exhibited by about 90% of the executed instructions.
Finally, it can be observed that almost all the references 
exhibit some type of reuse. 
Figure 6 shows the percentage of executed instructions
that exhibit just one type of reuse, either spatial or temporal.
From now on, the figures present just average statistics
over the different benchmarks. From this graph it can be
concluded that temporal reuse is the most common type of
single reuse, which may suggest the inclusion of a module
specialized to exploit temporal locality as it is the case of
the dual data cache.
1. In [11], what we call reuse is referred to as locality. However, to be
consistent with the rest of this paper, we have changed their terminology
according to our definition.
2. Another reason for the discrepancy is the different benchmark
suite.
Figure 6. Percentage of instructions with just one
type of reuse: no reuse (NR), temporal
(TR) or spatial (SR).
(Average over the benchmarks; horizontal axis: line size of 8, 16, 32 and 64 bytes.)
Figure 7. Exploiting temporal reuse only
(One curve per benchmark: tomcatv, swim, su2cor, hydro2d, mgrid, applu and turb3d;
horizontal axis: cache size from 64 bytes to 256 Kbytes.)
Figure 7 shows the percentage of temporal reuse that 
can be exploited with a fully-associative cache that is used 
only for references that exhibit just temporal reuse (here a 
fully-associative cache is modelled by just not considering 
cache interferences). In this case, the line size is 8 bytes 
(one double precision float) since a larger line does not 
make sense because spatial reuse is not present. From this 
figure we can conclude that a 16 line (128 byte) temporal 
module is enough to exploit most of the single temporal 
reuse. As we have seen in Figure 6, these references repre- 
sent about 35% of the total. Thus, this will be the size of the 
temporal module of the dual data cache for the experiments 
of the next section. 
Figure 8. Percentage of reuse exploited with a varying cache
size without interferences (average over the benchmarks;
x-axis: cache size in bytes, 64 to 256K; one curve per line
size: 8, 16, 32 and 64 bytes).
As pointed out above, a given instruction can have 
several types of reuse. Given a particular cache organiza- 
tion, we define the percentage of reuse that is exploited as 
the number of executed instructions that can exploit at least 
one type of reuse divided by the number of executed 
instructions that have at least one type of reuse. 
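The definition above can be written down directly. The sketch below assumes each executed instruction is summarized by two hypothetical flags: whether it has at least one type of reuse, and whether the cache organization under study lets it exploit at least one of them.

```python
def percent_reuse_exploited(instructions):
    """instructions: list of (has_reuse, exploits_reuse) boolean pairs,
    one pair per executed instruction (hypothetical representation)."""
    with_reuse = sum(1 for has_reuse, _ in instructions if has_reuse)
    exploited = sum(
        1 for has_reuse, exploits in instructions if has_reuse and exploits
    )
    if with_reuse == 0:
        return 0.0  # no instruction has reuse, so nothing can be exploited
    return 100.0 * exploited / with_reuse
```

Instructions without any reuse are excluded from both the numerator and the denominator, so they do not affect the percentage.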
Figure 8 shows the percentage of reuse that can be
exploited for a varying cache size with a line size ranging
from 8 to 64 bytes and neglecting the effect of
cross-interferences. It can be seen that a cache size of about 1 Kbyte
with lines larger than 8 bytes can capture some reuse for
practically all the instructions of the analyzed programs
that have some reuse.
Since almost all the references exhibit some type of
reuse (as shown in Figure 5) and this reuse can
be exploited with a relatively small volume, a locality
analysis that did not include an interference analysis would
incorrectly conclude that it is worthwhile to cache almost
all memory references. The percentage of reuse that would
be exploited by this approach would be significantly lower
than expected due to interferences. This is shown in Figure
9 for a varying cache capacity and line size. For instance,
comparing the graphs of Figure 8 and Figure 9 for an
8-Kbyte capacity and a 32-byte line size, it can be observed
that without interferences nearly 100% of the reuse can be
exploited, but only 80% of it is actually exploited when
considering the effect of interferences. For some programs
with a high conflict miss ratio, the effect of interferences is
much more noticeable. This is the case, for instance, of
tomcatv. Figure 10 compares the percentage of reuse that
can be exploited with a varying cache size and a line size of
32 bytes. Whereas 1 Kbyte would be enough to exploit all the
reuse if there were no interferences, when considering
interferences the reuse exploited with an 8-Kbyte cache is just 28%.
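To illustrate how interferences can destroy reuse that would otherwise be exploitable, the following sketch simulates a direct-mapped cache with the 8-Kbyte capacity and 32-byte lines used in the text. The two arrays a and b, their base addresses and the helper names are hypothetical and chosen only to provoke or avoid conflicts; this is not the paper's interference analysis.

```python
def direct_mapped_hits(addresses, cache_bytes=8192, line_bytes=32):
    """Count hits of a byte-address trace on a direct-mapped cache."""
    num_sets = cache_bytes // line_bytes
    resident = [None] * num_sets  # line number currently stored in each set
    hits = 0
    for addr in addresses:
        line = addr // line_bytes
        index = line % num_sets
        if resident[index] == line:
            hits += 1
        else:
            resident[index] = line  # miss: the new line replaces the old one
    return hits


def interleaved_trace(base_a, base_b, n, elem_bytes=8):
    """Addresses touched by a loop reading a[i] and b[i] for i in 0..n-1."""
    trace = []
    for i in range(n):
        trace.append(base_a + i * elem_bytes)
        trace.append(base_b + i * elem_bytes)
    return trace
```

When b starts exactly one cache size after a (base addresses 0 and 8192), a[i] and b[i] always map to the same set and evict each other, so none of the 128 accesses of a 64-iteration loop hits. Shifting b by one line (base address 8224) removes the conflict and lets three out of every four accesses hit through spatial reuse.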
Figure 9. Percentage of reuse exploited with a varying cache
size considering interferences (average over the benchmarks;
x-axis: cache size in bytes, 64 to 256K; one curve per line
size: 8, 16, 32 and 64 bytes).
Figure 10. Percentage of reuse exploited with a varying cache
size with/without interferences for tomcatv (x-axis: cache
size in bytes, 64 to 256K).
Selective caching can play an important role in reducing
the negative effect of interferences. Applying the
interference analysis presented in section 3.2, reuse can be
exploited more effectively, as shown in Figure 11. For
instance, a 4-Kbyte selective cache can exploit more reuse
than an 8-Kbyte conventional cache. The differences
observed in Figure 11 are much larger for programs with
many interferences (tomcatv and swim).
Figure 11. Percentage of reuse exploited with a selective
cache, varying the cache size and compared with a
conventional cache (average over the benchmarks; x-axis:
cache size in bytes, 64 to 256K).
6. Applying the locality analysis to the selective and dual data caches
In this section, we present two types of results: first the
accuracy of the locality analysis is evaluated, and then the
performance of the selective and dual data caches is
compared against that of a conventional cache.
For the latter, it is assumed that the cache memory is
connected to the next level of the memory hierarchy by
means of an 8-byte bus. The latency of the next memory
level is assumed to be 5 cycles plus an extra cycle per
64-bit word. The conventional and selective caches are
direct-mapped, write-allocate and copy-back. Cache size is 8
Kbytes and block size is 32 bytes. The spatial module of the
dual data cache is like a conventional cache. The temporal
module is a very small (16 words) fully-associative buffer.
This size has been shown to be sufficient to store
practically all memory references that exhibit only temporal
locality (see section 5).
Table 1 shows the results of the locality analysis
applied to the selective cache.
Benchmark   %Bypass       %C-Hit   %B-Miss
tomcatv     42.94         71.18    84.37
swim        [illegible]   89.09    82.06
su2cor      [missing]     93.11    0.83
hydro2d     0.05          84.44    69.15
mgrid       0.04          97.19    38.62
applu       1.92          94.51    9.67
turb3d      5.68          96.73    38.71
Table 1. Locality results for the selective cache
The first column indicates the percentage of memory 
references that are bypassed. The second column lists the 
hit ratio for the references that are cached. The last column 
shows the miss ratio of bypass references on a conventional 
cache, which caches all references. The last two columns
provide an evaluation of the locality analysis. An accurate
locality analysis should result in a high hit ratio for cached
data and in a high miss ratio for non-cached data. One can
see in Table 1 that the hit ratio of cached references is near
or above 90% for most programs. On the other hand, the
miss ratio of bypassed references on a conventional cache
is high, except in some cases in which the percentage of
bypassed references is very low and therefore the results are
not significant (su2cor, mgrid, applu and turb3d).
Table 2 shows similar results for the locality analysis
applied to the dual data cache. The second and third
columns show the dynamic percentage of references labeled
respectively as temporal or spatial by the locality
analysis. The columns labeled as T-Hit and S-Hit list the hit ratio
in the temporal and spatial modules respectively. The
relatively high hit ratio of cached references proves again the
accuracy of the locality analysis.
Benchmark   [header lost]   %Temporal   %Spatial   T-Hit   S-Hit
applu       4.53            41.00       54.47      94.25   92.23
turb3d      5.38            79.29       15.33      99.90   80.38
Table 2. Locality results for the dual data cache
Figure 12 shows a comparison among conventional,
selective and dual data caches in terms of hit ratio, average
memory access time and average number of words fetched
from the next memory level per memory reference. These
figures are divided into programs with low locality (tomcatv
and swim) and high locality (the others).
Figure 12 shows that the selective cache and the dual
data cache provide a significant improvement in the first
group of benchmarks. Compared with a conventional
cache, they reduce the average memory access time by
about 25% and the amount of data fetched by about 65%.
Notice that this latter benefit may be very effective in
reducing memory bandwidth requirements, since memory
bandwidth is expected to be an important limitation for
future microprocessors [4]. In the
second group of benchmarks, where the memory behavior
on a conventional cache was already good (see Figure 12b),
the new cache architectures slightly improve the
performance except for one benchmark (applu), which
experiences a small increase in average memory access time.
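As a rough illustration of the memory model assumed in this section (5 cycles plus one extra cycle per 64-bit word, 32-byte blocks), the sketch below computes the latency of fetching a block and a simple average memory access time. The function names, the one-cycle hit time and the formula hit_time + miss_ratio * penalty are assumptions made for illustration, not details taken from the paper.

```python
WORD_BYTES = 8        # 64-bit word, as in the text
BASE_LATENCY = 5      # cycles of latency of the next memory level, as in the text
CYCLES_PER_WORD = 1   # extra cycle per 64-bit word transferred, as in the text


def fetch_latency(num_bytes):
    """Cycles needed to fetch num_bytes from the next memory level."""
    words = -(-num_bytes // WORD_BYTES)  # ceiling division
    return BASE_LATENCY + CYCLES_PER_WORD * words


def average_access_time(hit_ratio, hit_time=1, block_bytes=32):
    """Assumed model: hit_time + miss_ratio * block fetch latency."""
    return hit_time + (1.0 - hit_ratio) * fetch_latency(block_bytes)
```

Under these assumptions a 32-byte block costs 9 cycles to fetch while a single 64-bit word costs 6, which hints at why fetching fewer words per reference can reduce both traffic and access time.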
The dual data cache provides very little improvement 
compared with the selective cache. This lack of significant 
enhancement may be due to the small number of cache 
Figure 12. Comparison among conventional, selective and dual
data caches.
entries required to exploit temporal locality. Because of 
that, they cause few interferences in the selective cache. 
The extra bit (or two bits for the dual data cache) used
to manage the cache does not come free. The most obvious
implementation would reduce the range of the constant
displacement in memory instructions by a factor of two or
four. If the displacement field has 16 bits (which is typical
for current architectures) and can be used to address 64KB
of data, in the modified instruction set that value is
reduced to 32KB or 16KB. This may incur extra
instructions if the addressed data is larger. We have measured for
the benchmarks considered in this paper that only 2.41% of
the dynamic memory instructions executed need extra
instructions (compared with memory instructions that have a
displacement of 16 bits) using 14 bits of displacement, and
1.32% using 15 bits, which confirms that the penalty
introduced by these additional instructions is negligible.
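The arithmetic behind the reduced displacement range can be checked with a short sketch. The function name is hypothetical, and the range is taken to be 2^bits bytes, following the statement that a 16-bit field addresses 64KB of data.

```python
def displacement_range_bytes(field_bits, locality_bits=0):
    """Bytes addressable when locality_bits of the field are used as tags."""
    return 1 << (field_bits - locality_bits)


KB = 1024
full = displacement_range_bytes(16)          # unmodified instruction set
selective = displacement_range_bytes(16, 1)  # one bit for the selective cache
dual = displacement_range_bytes(16, 2)       # two bits for the dual data cache
```

Each bit taken from the field halves the range, which gives the 64KB, 32KB and 16KB figures quoted above.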
7. Conclusions 
A static locality analysis oriented to the management
of cache memories has been presented. The analysis
extends previous proposals with an interference
analysis step, which has been shown to be crucial for the
accuracy of the analysis in some programs where conflict
misses are dominant.
The analysis has been used to characterize the locality 
exhibited by loop nests of numerical codes. It has been 
shown that self-temporal and self-spatial reuse are domi- 
nant, closely followed by group-temporal reuse. It has been 
measured that about 35% of the references exhibit only 
temporal reuse and that this reuse can be exploited with a 
very small fully-associative buffer, which motivates the use 
of the dual data cache. 
It has also been shown that interferences cause a
significant degradation of cache performance: they largely
increase the volume required to exploit a given
percentage of reuse. This negative effect can be
significantly reduced by a selective caching strategy.
Applying the locality analysis to the management of a
selective cache and a dual data cache, it has been observed
that, compared with a conventional cache, these cache
architectures provide a significant reduction in average
memory access time and in the amount of data fetched from
the next memory level, especially for programs with poor
locality.
Acknowledgments 
This work has been supported by the Spanish Ministry 
of Education under contract CICYT TIC 429/95 and by the 
Catalan CIRIT under grant FI-DT/96-3.083. 
References 
[1] S.G. Abraham, R.S. Sugumar, B.R. Rau and R. Gupta
"Predictability of Load/Store Instruction Latencies"
in Proc. of MICRO-26, pp. 139-152, 1993
[2] E. Ayguadé, C. Barrado, A. González, J. Labarta, D. López,
S. Moreno, D. Padua, F. Reig, Q. Riera and M. Valero
"Ictineo: a Tool for Research on ILP"
in Supercomputing '96, Research Exhibit "Polaris at Work"
[3] D. Bernstein, D. Cohen, A. Freund and D.E. Maydan
"Compiler Techniques for Data Prefetching on the PowerPC"
in Proc. of PACT 95, pp. 19-26, 1995
[4] D. Burger, J.R. Goodman and A. Kägi
"Memory Bandwidth Limitations of Future Microprocessors"
in Proc. of ISCA 96, pp. 78-89, May 1996
[5] K.K. Chan, C.C. Hay, J.R. Keller, G.P. Kurpanek, F.X. Schumacher
and J. Sheng
"Design of the HP PA 7200 CPU"
Hewlett-Packard Journal, Feb. 1996
[6] C-H. Chi and H. Dietz
"Unified Management of Registers and Cache Using Liveness
and Cache Bypass"
in Proc. of PLDI 89, pp. 344-355, June 1989
[7] A. González, C. Aliagas and M. Valero
"A Data Cache with Multiple Caching Strategies Tuned to
Different Types of Locality"
in Proc. of ICS 95, pp. 338-347, 1995
[8] A. González, M. Valero, N. Topham and J.M. Parcerisa
"Eliminating Cache Conflict Misses Through XOR-Based
Placement Functions"
in Proc. of ICS 97, 1997
[9] D.T. Harper III and D.A. Linebarger
"A Dynamic Storage Scheme for Conflict Free Vector Access"
in Proc. of the 14th ISCA, pp. 72-77, 1987
[10] A.S. Huang and J.P. Shen
"A Limit Study of Local Memory Requirements Using
Value Reuse Profiles"
in Proc. of MICRO-28, pp. 71-81, Dec. 1995
[11] K. McKinley and O. Temam
"A Quantitative Analysis of Loop Nest Locality"
in Proc. of ASPLOS-VII, pp. 94-104, Oct. 1996
[12] V. Milutinovic, B. Markovic, M. Tomasevic and M. Tremblay
"The Split Temporal/Spatial Cache: Initial Performance
Analysis"
in Proc. of SCIzzL-5, March 1996
[13] T.C. Mowry, M.S. Lam and A. Gupta
"Design and Evaluation of a Compiler Algorithm for
Prefetching"
in Proc. of ASPLOS-V, pp. 62-73, Oct. 1992
[14] T.C. Mowry
"Tolerating Latency Through Software-Controlled Data
Prefetching"
PhD Thesis, Stanford University, CSL-TR-94-628, 1994
[15] J.M. Stone and R.P. Fitzgerald
"Storage in the PowerPC"
IEEE Micro, vol. 15, no. 2, pp. 50-58, April 1995
[16] O. Temam, C. Fricker and W. Jalby
"Cache Interference Phenomena"
in Proc. of SIGMETRICS 94, pp. 261-271
[17] G. Tyson, M. Farrens, J. Matthews and A. Pleszkun
"A Modified Approach to Data Cache Management"
in Proc. of MICRO-28, pp. 93-103, Dec. 1995
[18] M.E. Wolf and M.S. Lam
"A Data Locality Optimizing Algorithm"
in Proc. of PLDI 91, pp. 30-44, 1991
