Implications of Memory Mappings on Cache Misses
Daniela Genius, Jörn Eisenbiegler
Institut für Programmstrukturen und Datenorganisation
Zirkel
Fakultät für Informatik
Universität Karlsruhe, Karlsruhe, Germany
email: {genius,eisen}@ipd.info.uni-karlsruhe.de
June
Abstract
This paper proposes an optimization by an alternative approach to memory mapping. Low set associativity allows representing cache lines by corresponding memory areas. With the help of the notion of temporal reuse in the innermost loop, the behaviour of values in the cache is modelled. Combining these values into cache lines so that spatial reuse is considered demands an alternative memory mapping. Memory mappings with a low expectation of conflicts are achieved by the random placement of arrays in memory. Experiments show a significant increase of cache misses for a worst-case placement, as well as a reduction of cache misses achieved by improving reuse.
1 Introduction
Scientific applications are particularly sensitive w.r.t. cache performance, as large amounts of data are regularly accessed in nested loops. On the other hand, a lot can be gained if a compiler makes use of this regularity. A source of inefficiency often neglected is the competition for a single physical cache line. This may lead to cache thrashing, where data is evicted that will shortly be reloaded into the same cache line, in turn evicting other data that is still needed. First-level caches almost always have low associativity. Typically two-way (Intel Pentium) or four-way set associativity or, more seldom, direct mapping (DEC Alpha) is implemented. Fully associative caches are expensive and thus seldom used, as vendors prefer economical solutions. When high performance is the issue, the problem of making good use of the cache mostly remains with the programmer. This typically takes a lot of hand optimization and would better be accomplished by the compiler.
Compiler-controlled optimization techniques exploit compile-time information such as loop boundaries and variable life ranges. Classical compiler optimizations [WL, MCT] work on the model of a fully associative cache. They focus on locality improvement, i.e. they try to prevent data from exceeding cache size. Up to now, the problem of improving behaviour for caches with less flexible mappings has very seldom been tackled. Caches with low set associativity, including direct mapping as an extreme case, have a property that can be used for optimization: in contrast to fully associative caches, they allow conclusions to be drawn from the memory layout back to cache behaviour, as the cache line resp. set a memory location maps to is known. A goal that has also often been neglected is to aggressively achieve so-called spatial locality. Cache lines are often rather large, so that for regular accesses to small data items, as is typical for scientific codes, a constant improvement is achieved.
The present paper proposes a profound change of the memory mapping. A reuse-based strategy is proposed that captures the behaviour of cache lines more exactly than so far. Reuse of data is considered in two stages: firstly, representing life times of values in the cache captures temporal reuse. Secondly, combining such values into cache lines for multiple accesses aggressively improves spatial reuse. The knowledge gained about the location of a virtual cache line in memory allows detecting potential conflicts. The code remains unchanged except for index transformations. The impact of the optimizations is validated by experiment for typical scientific computing benchmarks. Negative effects from cache thrashing range from ca.  to .
Basic terminology can be found in section 2. Section 3 contains the assumptions made. The optimization algorithm is presented in section 4. Results of preliminary measurements are shown in section 5. In section 6 recent related work is summed up and distinguished from the approach taken here. In the final section an overview is given of directions of further work.
2 Basic Notions
Figure 1 shows three levels of a possibly larger memory hierarchy. Low associativity is assumed, i.e. memory areas bearing the same colour map to one cache line resp. set of cache lines bearing the corresponding colour (thin arrows).¹ The fat arrows represent the actions by the compiler. The notion of cache line, usually denoting a physical cache line, needs to be defined more precisely. A cache consists of k physical cache lines of a length of l bytes each. A variable of t bytes length can be considered a cache value in memory. In the context of this paper, variables are array elements. The cache value life time denotes the time during which this value is present in the cache. Virtual cache lines are composed of several cache values. In caches with low associativity they can be represented via one of m/k memory locations, where m is the memory size. They are mapped to physical cache lines. Figure 1 sums up these notions. For example, the DEC Alpha first-level data cache has a size of  bytes, partitioned into k =  lines of  bytes each. In scientific computing, common types are float and long, i.e. t is  resp.  bytes on the Alpha's C compiler.

[Figure 1: Setting: Terminology for Low-associativity Caches. Components shown: CPU registers, cache with k lines/sets (0 to k-1), main memory (mapping modulo k), a virtual cache line of l bytes holding values of t bytes (l: cache line size, t: data type size), and the compiler translating source code to machine code and determining the memory mapping.]
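The mapping of memory locations to physical cache lines described above can be sketched in C as follows. This is a minimal illustration, not the authors' implementation: the geometry (256 lines of 32 bytes) and the function names `cache_line_of` and `values_per_line` are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache geometry: k cache lines of l bytes each. */
enum { CACHE_LINES = 256, LINE_BYTES = 32 };

/* Index of the physical cache line an address maps to in a
   direct-mapped cache: line-sized blocks are assigned modulo k. */
size_t cache_line_of(uintptr_t addr)
{
    return (size_t)((addr / LINE_BYTES) % CACHE_LINES);
}

/* Number of cache values of t bytes that form one virtual cache line (l / t). */
size_t values_per_line(size_t t)
{
    assert(t > 0 && LINE_BYTES % t == 0);
    return LINE_BYTES / t;
}
```

Two addresses whose distance is a multiple of `CACHE_LINES * LINE_BYTES` yield the same index, which is exactly the situation in which they compete for one physical cache line.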
Whenever a value requested in a calculation is present neither in registers nor in the cache, a cache miss occurs. Compulsory misses occur when filling up an empty cache. Capacity misses happen when the data does not fit fully into the cache. Conflict misses are due to the competition of memory locations for the same cache line; such misses are specific to direct mapped caches. We generalize this notion to conflicts between sets. In a direct mapped cache, the line that is replaced on a miss is determined by a modulo calculation on the memory address. Full associativity means free choice of cache location, allowing a choice among candidates for replacement, e.g. LRU or FIFO, see [HP]. In a set associative cache this replacement policy is applied to smaller units, so-called sets. As the design of fully associative caches is complex and expensive, mostly a low set associativity or a direct mapping is chosen in practice.

    for (i = 0; i < SIZE; i++)
      for (j = 0; j < SIZE; j++)
        for (k = 0; k < SIZE; k++)
          c[i][j] += a[i][k] * b[k][j];

    for (k = 0; k < SIZE; k++)
      x[k] = u[k] + r*(z[k] + r*y[k]) +
             t*(u[k+3] + r*(u[k+2] + r*u[k+1]) +
             t*(u[k+6] + r*(u[k+5] + r*u[k+4])));

Figure 2: Matrix Multiply and Livermore Kernel

¹ In the following, the method is presented for direct mappings; modifications for higher set associativity are mentioned where necessary, otherwise just replace "line" with "set".
Cache thrashing occurs when a value is required that has just been evicted, in turn evicting a value that will be required in the near future, etc. On a read miss a value is loaded into one of the registers; at the same time the corresponding memory location and its surrounding values, depending on the line size and alignment, are loaded into the cache.
In a loop nest the main cache pressure stems from accessing arrays depending on the
loop indices The access pattern is described by a vector which is given by a matrix J of
multipliers and a constant vector

d of displacements Assume a loop nest of depth m with
the corresponding index vector

i and a n	dimensional array For a triply nested loop and a
twodimensional array a
i
 
 i

 i


 J 
 
  
  

and

d 
 




The general form is



j
  
   j
 m






j
n 
   j
nm






i
 



i
m







d
 



d
m



 The array indices are thus
generated by an ane mapping of the loop counter vector
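Consider two loop nests typical in scientific computing: matrix multiply and kernel  from the Livermore loops. In the first example, the access pattern is
\[
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} i \\ j \\ k \end{pmatrix} \text{ for } c, \quad
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \\ k \end{pmatrix} \text{ for } a, \quad \text{and} \quad
\begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} i \\ j \\ k \end{pmatrix} \text{ for } b.
\]
The access patterns for u in the second example are trivially $(1)(k) + (0), \ldots, (1)(k) + (6)$, as there is only one loop. References differing only in the displacement are uniformly generated.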
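The affine access patterns above can be evaluated mechanically. The following C sketch computes the array index vector from $J$, the loop index vector and $\vec{d}$; the function name `affine_index` and the bound `MAX_DIM` are assumptions made for this illustration.

```c
#include <assert.h>

enum { MAX_DIM = 4 };  /* assumed upper bound on loop depth and array rank */

/* Computes idx = J * iv + d for an n-dimensional array
   referenced in a loop nest of depth m. */
void affine_index(int n, int m, const int J[][MAX_DIM],
                  const int iv[], const int d[], int idx[])
{
    assert(n <= MAX_DIM && m <= MAX_DIM);
    for (int r = 0; r < n; r++) {
        idx[r] = d[r];
        for (int c = 0; c < m; c++)
            idx[r] += J[r][c] * iv[c];
    }
}
```

For the reference to b in matrix multiply, the loop indices (i, j, k) = (2, 3, 4) are mapped to the element b[4][3]; for a uniformly generated reference such as u[k+6], only `d` differs from that of u[k].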
Temporal reuse of data in the cache occurs when the same data item is accessed several times. This was expressed formally by [WL]: temporal reuse is in direction $\vec{r}_t$ if $J \vec{r}_t = \vec{0}$; these directions span a vector space. This form of reuse is present for reference c[i][j], where
\[
J = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}.
\]
The reuse vector space is spanned by $(0, 0, 1)^T$.
Spatial reuse means accesses to data in the same cache line. Spatial reuse occurs if accesses are made to data that lie in memory sequentially, or at least with a distance of less than the cache line size $l/t$. All but the innermost index must be identical in order to achieve self-spatial reuse. Let $J_s$ be $J$ where all elements of the last row are set to 0; then the self-spatial reuse vectors fulfil $J_s \vec{r}_s = \vec{0}$. Self-temporal implies self-spatial reuse. For c in example 1,
\[
J = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad
J_s = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.
\]
Reuse has been defined w.r.t. unlimited cache size. Given limited cache size, locality denotes the property that no item is evicted that is still required. Reuse is a prerequisite for locality; note, however, that reuse does not imply locality except for unlimited cache size.

3 Starting Points
For the considerations in this paper, a restriction to perfectly nested loops without branches is made. Furthermore, only affine index functions are considered and indirect addressing is excluded; these prerequisites are fulfilled in many applications, e.g. from scientific computing.
We restrict ourselves to array accesses without loop-carried dependencies for simplicity of presentation. Otherwise, data flow analysis for array references will have to be applied, as proposed by Feautrier [Fea]. It is assumed that array sizes are known to the compiler. The focus is on algorithms whose behaviour depends only on the input size, a subclass of the large class of oblivious algorithms defined by Löwe and Zimmermann [ZL]. Exclusively reuse w.r.t. the innermost loop is considered, as the greatest effects can be achieved here. We concentrate on data caches; instruction caches are not considered. Unless loops are unrolled extensively, as e.g. in [DJ], it is legitimate to leave instruction cache behaviour out of the focus.
As noted above, the scientific applications for which the optimizations apply are oblivious and perfectly nested. References in the matrix multiplication algorithm (figure 2) are not uniformly generated. There are three nested loops. In the Livermore kernel there is only one loop; all references are uniformly generated. None of the running examples contains loop-carried dependencies.
4 Deriving the Memory Mapping
Conventionally, compilers map arrays to memory in the following way. Let there be an $n$-dimensional array, $\mathit{size}_k$ denoting the number of data items in dimension $k$. There are $m$ nested loops, denoted by $\vec{i}$. Row-major order is assumed. The memory mapping $f$ is a function $f: \mathbb{N}^n \to \mathbb{N}$ of the loop indices:
\[
f(i_1, \ldots, i_n) = i_1 \cdot \mathit{size}_2 \cdots \mathit{size}_n + i_2 \cdot \mathit{size}_3 \cdots \mathit{size}_n + \cdots + i_n
\]
Obviously, such an arbitrary mapping of arrays often causes low reuse, e.g. for matrix b in example 1. A more adequate mapping must consider reuse and the potential for conflicts:
\[
f(x_1, \ldots, x_n) = x_1 \cdot \mathit{size}_2 \cdots \mathit{size}_n + x_2 \cdot \mathit{size}_3 \cdots \mathit{size}_n + \cdots + x_n + \delta
\]
where $\delta$ denotes a displacement of the array in memory.²
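A mapping of this shape, row-major order plus a displacement, can be written down in a few lines of C. The sketch below is illustrative only; the function name `mapped_offset` and the parameter name `delta` for the displacement are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major offset (in elements) of index x[0..n-1] in an array with
   extents size[0..n-1], shifted by a displacement delta. */
size_t mapped_offset(int n, const size_t size[], const size_t x[], size_t delta)
{
    size_t off = 0;
    for (int q = 0; q < n; q++) {
        assert(x[q] < size[q]);
        off = off * size[q] + x[q];  /* Horner form of the sum of products */
    }
    return off + delta;
}
```

Permuting the entries of `x` (and of `size` accordingly) changes which index varies fastest in memory, while `delta` only shifts the array as a whole; these are the two degrees of freedom the mapping offers.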


Note that such a mapping is always correct, but may incur different cost due to the cache miss penalty. If the parameters $x_q$ and $\delta$ are chosen adequately, the following two goals are achieved:
1. Data should be combined into cache lines according to memory accesses. This can only be achieved by aligning data in memory (parameter $x_q$).
2. Data should be mapped to memory (parameter $\delta$) in order to avoid conflicts.
Figure 3 depicts possible conflicts due to memory mapping in the running examples. Arrays a and b resp. vectors u and x aligned to the same memory address modulo cache size k (same colour) may cause cache thrashing. In section 5 this effect is shown by experiment.
An overview of the algorithm is given before the steps are applied to the running examples in the following. There are three main requirements for a single loop nest:
1. adequately represent temporal reuse
2. improve spatial reuse by a new memory mapping
3. determine a conflict-minimal memory mapping
In the following subsections, the thus decomposed goals will be fulfilled by the steps of the optimization algorithm.

[Figure 3: Conflicts due to Memory Mapping. a) Example 1: arrays A, B, C; b) Example 2: vectors u and x. Data accessed in the same loop iteration conflict when mapped to the same cache line.]

² As code is not modified, moving cache lines in order to avoid conflicts is impossible and far too fine-grained.
4.1 Cache Value Life Times
Temporal reuse has to be respected. To this end, an item should be present in the cache as long as it is still accessed. As a feasible heuristic is of interest, the goal must be to derive a pattern dependent on loop indices rather than fully unrolling the loop nest. For this reason a restriction to the innermost loop is made.
As the register allocation phase is assumed to be completed, physical registers have to be taken into account [Bri]. Deriving cache value life times must take place after locality optimizations, because they might influence the instruction execution sequence.
Figure 4 sketches the cases that can occur. Memory accesses are depicted by full dots, purely register operations by empty dots.³ Number 1 shows the general case where reuse is present. A remarkable fact is that often data are accessed only once, so that the cache value life time can be depicted as a single dot (number 2). The third case is purely register computation, affecting registers but not the cache (number 3). Assume that load resp. store operations can be extracted on source code level. Given an array assignment, e.g. one that computes a[i] from b[i] and k, the reference to array a on its left hand side is a store operation. The right hand side, which in source code might consist of many operations, usually contains loads from several array locations, here b[i]. Cache value life times are a representation of temporal reuse (figure 5). The life times of matrix elements in example 1 are mostly dot-shaped. In matrix multiplication (see section 2), self-temporal reuse w.r.t. the innermost loop can only be found in matrix c. Extensive self-temporal reuse of u is present in the Livermore kernel: cache value life times are chains of length 7. Elements of vectors x, y and z are not reused; their life ranges are dot-shaped.
Summing up, temporal reuse in the innermost loop has been captured by the notion of cache value life time. Now this notion will be employed in order to derive an exact picture of a cache line.

[Figure 4: Relating Cache Value Life Times (CVL) to Variable Life Times (VL), cases 1 to 3]

³ Note that reading a value from memory after definition of the corresponding variable is not reasonable, whereas it may happen that a value is stored before its last access via register.
[Figure 5: Cache Value Life Times for Running Examples (innermost loop unrolled). Left: matrix multiply over loop indices i, j, k, with a[i,k]: self-temporal reuse in middle loop; b[k,j]: self-temporal reuse in outermost loop; c[i,j]: self-temporal reuse in innermost loop. Right: self-temporal reuse of u(k) across iterations k-6, ..., k.]
4.2 Virtual Cache Lines
The next goal is twofold: firstly, fix the values which belong together in one cache line, in order to reduce the degrees of freedom for the following stages and to manage the complexity of conflict detection. Secondly, enhance spatial locality. Note that the latter improvement is limited by $l/t$; however, neglecting this option means giving away valuable chances for improvement.
The first step should fill up cache lines as well as possible. In order to deterministically partition the array into cache lines, the notion of reuse described above is employed. Spatial reuse is exploited by a modified memory mapping function $\tilde{f}: \mathbb{N}^n \to \mathbb{N}$ if $r$ determines the sequence that is placed in memory. A restriction is made to those accesses dependent on the innermost loop index. Reuse is now determined via $J$ and $\tilde{J}$ as described in section 2.
The basic idea is to reflect those loop transformations of [WL] aiming at the innermost loop by the memory mapping. For the moment, the approach is restricted to transposition and related techniques for simplicity. Transposition can be considered a "loop interchange" concerning just one set of uniformly generated references, especially references to each array of example 1 separately. This excludes e.g. blocking techniques. I.e., only $\vec{i}$ has to be adapted to guarantee that the access pattern remains unchanged, preserving correctness:
\[
\tilde{J} \cdot \tilde{\vec{i}} + \vec{d} = J \cdot \vec{i} + \vec{d}
\]
For an $n$-dimensional array and $\tilde{\vec{i}}$,
\[
\tilde{f}(\tilde{i}_1, \ldots, \tilde{i}_n) = \tilde{i}_1 \cdot \mathit{size}_2 \cdots \mathit{size}_n + \cdots + \tilde{i}_n
\]
maps the array elements for improved reuse in the innermost loop.
Array a already has self-spatial reuse of factor $l/t$ in the innermost loop; accesses to c were already examined in section 2. They are independent of the innermost loop index and thus of no interest. For b,
\[
J = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad
J_s = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}.
\]
I.e., interchanging loops for b w.r.t. the standard memory mapping $f$ would create the same reuse behaviour as for a: traversing the rows of an array mapped row-major. Swapping loop indices k and j yields $\tilde{J}$ with
\[
\tilde{J}_s = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix};
\]
$\vec{i}$ has to be adapted: $\tilde{\vec{i}} = (i, k, j)^T$. The mapping prescribes to map array b column-wise in example 1. The indices k and j have to be swapped for accesses to b in the source code, yielding c[i][j] += a[i][k] * b[j][k]. Note that this effect, also known as matrix transposition, cannot be achieved by loop transformations. On the right of figure 5 only the accesses to u in the body of the Livermore loop are shown. As there is temporal reuse, this implies spatial reuse.
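The index swap for b can be illustrated in C. This sketch stores b column-wise by an explicit copy, which stands in for the memory mapping chosen at compile time; the names `matmul`, `matmul_bt`, `store_columnwise` and the tiny `SIZE` are assumptions for illustration.

```c
#include <assert.h>

enum { SIZE = 4 };  /* small illustrative problem size (assumption) */

/* Standard mapping: b is traversed along a column in the innermost loop,
   so consecutive accesses are SIZE elements apart in memory. */
void matmul(double c[SIZE][SIZE], double a[SIZE][SIZE], double b[SIZE][SIZE])
{
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            for (int k = 0; k < SIZE; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Modified mapping: bt holds b column-wise, i.e. bt[j][k] == b[k][j],
   so the innermost loop walks through adjacent memory for a and bt. */
void matmul_bt(double c[SIZE][SIZE], double a[SIZE][SIZE], double bt[SIZE][SIZE])
{
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            for (int k = 0; k < SIZE; k++)
                c[i][j] += a[i][k] * bt[j][k];
}

/* Explicit copy standing in for the column-wise memory mapping of b. */
void store_columnwise(double bt[SIZE][SIZE], double b[SIZE][SIZE])
{
    for (int k = 0; k < SIZE; k++)
        for (int j = 0; j < SIZE; j++)
            bt[j][k] = b[k][j];
}
```

Both variants compute the same products in the same order; only the addresses touched by the accesses to b differ.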

The method presented so far is conservative, as it disallows e.g. the merging of arrays⁴ at this point, because this complicates the derivation of $\tilde{J}$ and $\tilde{\vec{i}}$.
Memory is virtually subdivided into cache lines (see section 2). Technically, however, attention must be paid to the fact that any memory space delivered by routines such as malloc is aligned to element size only, whereas alignment to cache line boundaries is needed.
Until now, arrays have been considered separately. In the following, the interaction between arrays in memory will have to be considered.
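One way to obtain memory aligned to cache line boundaries in C is the C11 function `aligned_alloc`. The following is a sketch under the assumption of a 32-byte line; the wrapper name `alloc_line_aligned` is hypothetical, and overflow of the size computation is not handled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { LINE_BYTES = 32 };  /* assumed cache line size l */

/* Allocates n elements of elem_size bytes starting on a cache line
   boundary. C11 aligned_alloc expects the size to be a multiple of the
   alignment, so the request is rounded up. Release with free(). */
void *alloc_line_aligned(size_t n, size_t elem_size)
{
    size_t bytes = n * elem_size;
    size_t rounded = (bytes + LINE_BYTES - 1) / LINE_BYTES * LINE_BYTES;
    if (rounded == 0)
        rounded = LINE_BYTES;
    return aligned_alloc(LINE_BYTES, rounded);
}
```

A displacement for an array can then be realized by allocating a slightly larger block and starting the array a whole number of lines after the aligned address.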
4.3 Cache Line Allocation
The task is now to derive more information on conflicts, avoiding cache thrashing. The number of conflicts has to be derived as the cost for use in an objective function. Let k be an integer so that two memory locations map to the same cache line, i.e. their memory addresses are identical modulo k, the machine-dependent number of cache lines in a direct mapped cache. The goal is to determine an optimal displacement $\delta$ for each array in order to minimize conflicts with all other arrays.
By restricting to the innermost loop, a cyclic representation of cache values in loop iterations can be derived in analogy to Hendren et al. [HGAM]. It should be noted that the problem is easier for cache lines than for registers. For a practical application, cache lines cl are represented by the corresponding memory address modulo k, subscripted by the arrays they stem from. Then the notion of conflict can be modelled easily. Let a and b be two arrays whose memory mapping has already been determined. In the same iteration $i_m$ of the innermost loop, memory is mapped to the same cache location if
\[
cl_a(i_m) - cl_b(i_m) = c \cdot k, \quad c \in \mathbb{N}.
\]
As an array is always represented by one cache line, $\delta$ can be chosen as displacement to the array starting address.
It might look like finding an adequate displacement is a very simple task. However, the interaction between all arrays has to be taken into account. Obviously this approach is rather expensive, as the conflict-cost-minimal combination among all starting address combinations of the n arrays has to be found. Experimental results (section 5) will show that randomization can be employed here.
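The conflict condition and a randomized choice of displacements can be sketched in C as follows. The sketch makes simplifying assumptions that are not taken from the text: arrays are traversed in lockstep, so two arrays conflict exactly when their starting lines coincide modulo k, and at most 16 arrays are handled. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Number of array pairs whose starting cache lines coincide modulo k. */
size_t count_conflicts(const size_t start_line[], size_t n, size_t k)
{
    size_t conflicts = 0;
    for (size_t a = 0; a < n; a++)
        for (size_t b = a + 1; b < n; b++)
            if (start_line[a] % k == start_line[b] % k)
                conflicts++;
    return conflicts;
}

/* Draws a random displacement in [0, k) for every array and keeps the
   best of `trials` attempts. Returns the conflict count of the
   assignment left in delta[]. */
size_t randomize_displacements(size_t delta[], size_t n, size_t k, int trials)
{
    assert(n > 0 && n <= 16 && k > 0 && trials > 0);
    size_t best = (size_t)-1;
    size_t cand[16];
    for (int t = 0; t < trials; t++) {
        for (size_t a = 0; a < n; a++)
            cand[a] = (size_t)rand() % k;
        size_t c = count_conflicts(cand, n, k);
        if (c < best) {
            best = c;
            for (size_t a = 0; a < n; a++)
                delta[a] = cand[a];
        }
    }
    return best;
}
```

Since only a small fraction of the k possible lines collides with a given array, a random draw is unlikely to produce a conflict when n is small compared to k, which is the intuition behind preferring randomization over an exhaustive search.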
5 Experimental Results
The DEC Alpha  memory hierarchy is well documented. Measurements are made on this architecture, whose first-level data cache is direct mapped. The analysis tool ATOM [SE] allows specifying e.g. cache analyses in an elegant and flexible way. The tool allows analysis of C code on the DEC Alpha architecture by instrumenting binaries. It is used in order to get a realistic picture of cache behaviour. In the following, some preliminary measurements for a selection of scientific applications are presented. As already mentioned in section 2, there are k =  cache lines of l =  bytes each. The data type size t is  for a floating point value.⁵ To give a clear picture, the absolute number of references resp. misses for the loop nest in question is included. Run times denote user times in msec, as the mean of  runs. Weak optimization means optimization with respect to cache line combination, while memory is allocated more or less randomly. The worst case of strong optimization, mapping the starting addresses of all arrays to the same cache line, is shown in the rows labelled "thrashing". The different number of references is due to alignment.

problem size | optimization | references | misses | miss rate | time [ms]
             | none         |            |        |           |
             | weak         |            |        |           |
             | thrashing    |            |        |           |
             | none         |            |        |           |
             | weak         |            |        |           |
             | thrashing    |            |        |           |
Table 1: Evaluating Optimizations for Example 1

program | size | optimization | references | misses | miss rate | time [ms]
LL      |      | none         |            |        |           |
LL      |      | thrashing    |            |        |           |
LL      |      | none         |            |        |           |
LL      |      | thrashing    |            |        |           |
LL      |      | none         |            |        |           |
LL      |      | thrashing    |            |        |           |
filter  |      | none         |            |        |           |
filter  |      | thrashing    |            |        |           |
Table 2: Conflict Miss Reduction: Livermore Kernels, Filter

⁴ Preliminary measurements not contained in the present paper have shown that merging arrays indeed has potential for some loop nests.

⁵ In order to capture the miss rate of the loop nest in question only, the number of references as well as cache misses incurred by the rest of the program is subtracted from the total numbers. As the loop nests are at the end of the program code, the results do not differ significantly from those obtained by modifying the ATOM cache tool to measure the loop nest only.
A comparison of different methods of optimizing matrix multiplication is shown in table 1. Mapping $\tilde{f}$, i.e. the combination into cache lines, yields a transposition of matrix b, decreasing cache misses significantly. Aligning in this case reduces the miss rate by about two thirds. Provoking cache thrashing by aligning all arrays modulo  has little effect here. A matrix size of  was additionally examined, because the usual negative effects of matrix sizes of a power of 2 [PNDN] have to be excluded. Surprisingly, cache thrashing effects are even more significant here. The size of the examples was chosen in order not to exceed the second-level cache. Embedding the blocking algorithms of [WL] into our framework is not too difficult and should achieve better results.
Table 2 shows results for the Livermore benchmark set and a filter loop nest typical for image processing. As already mentioned, the combination of values into cache lines has no effect on the memory mapping here. The dramatic increase in cache misses in the Livermore kernels after alignment modulo  is due to cache thrashing.
The increase in cache misses incurred by strong cache alignment is significant. This indicates that a certain amount of "disorder" is desirable in order to avoid cache line conflicts. Cache thrashing effects are stronger when life times are longer, as is the case in the Livermore kernels. As there are many good and only a few very bad choices of $\delta$, the expected number of conflicts is low. Thus a randomized choice of displacements is very promising [MR], which is confirmed by the good results for the weakly optimized case. Secondly, run time improvements fall somewhat short of expectations. This is because in pipelined processors with separate functional units, cache misses affect run time only when the floating point pipeline is not completely filled; otherwise the actions overlap. The classes of problems for which the method is applicable with significant effect on run time will have to be determined.

6 Related Work
Classical approaches to cache optimization aiming at the reduction of cache capacity misses by improving locality can, as noted above, be found e.g. in [WL] and in the work of McKinley et al. [MCT]. In contrast, reuse information is directly used to control the memory mapping here. By restricting to the innermost loop, compile-time complexity is significantly lower than theirs. Moreover, strictly speaking, the notion of locality applies to fully associative caches only. The more realistic case of low-associativity caches is not considered there. The combination of values into cache lines was examined in the context of cache analysis by Rawat [Raw]. By not taking reuse information into consideration, the estimations made by Rawat's method are too coarse and often overestimate cache misses significantly. Panda, Nicolau et al. [PNDN] show a simple but striking approach to the problem of conflict misses in data caches, which might provide an alternative approach to conflicts between arrays. Hashemi, Kaeli and Calder [HKC] very recently applied cache line coloring for direct mapped instruction caches in order to obtain conflict-minimal mappings for procedures. Their optimizations are based on trace-driven simulation and validated for the SPEC benchmark suite. In the approach presented here, by restricting to scientific applications, more information can be obtained at compile time. Furthermore, data caches are challenging due to varying access patterns and the huge amount of data.
7 Conclusions and Future Work
By exploiting the mapping properties of low-associativity caches, a structured approach to cache optimization is presented. Accounting for a composition of cache values into cache lines that respects temporal and spatial reuse, a method for deriving conflict-minimal memory mappings is proposed. Using a framework from locality optimization makes some of its transformations available for memory mapping. More complicated mapping functions will have to be examined. The optimization presented here is complementary to classical loop transformations. Measurements document the flexibility of the approach, which can be applied to a large class of scientific programs. Expectations w.r.t. negative effects of over-alignment due to cache thrashing have been confirmed.
However, there is still a lot of potential for further optimization. In the following, only the most important options are mentioned.
So far, the source code is hand-optimized. The next step will be to embed the scheme into an experimental compiler. Interaction with other compiler phases such as register allocation and instruction scheduling has to be considered.
For the moment, each loop nest is analyzed separately. Arrays may be accessed in different ways in several loop nests that are part of a program. They may be replicated with respect to different access patterns. However, care must be taken not to exceed main memory size, which would occur when large arrays are considered. Alternatively, arrays would have to be remapped at run time, which is an expensive action. By choosing between alternative parameter settings, Eisenbiegler has specialized the cost-directed configuration approach described by Moldenhauer [Mol] for the purpose of compiler-supported data distribution for multiprocessors [Eis]. Currently we are developing a method to support the choice between remapping and replication in analogy to the latter method. Data dependencies between array references have to be taken into account.
Finally, innermost loops are not specific to the area of scientific computing. In more general areas of application, data type sizes vary and structures of access are much less regular. It is an open question whether the method can be generalized.
Acknowledgement: We wish to thank Uwe Assmann, Thilo Gaul, Gerhard Goos and Sylvain Lelait for productive discussions. Furthermore, we thank the authors of ATOM, Amitabh Srivastava and Alan Eustace, for providing their tool.

References
[Bri] Preston Briggs. Register Allocation via Graph Coloring. PhD thesis, Rice University, April.
[DJ] Jack W. Davidson and Sanjay Jinturkar. Aggressive loop unrolling in a retargetable optimizing compiler. In Compiler Construction, LNCS, April.
[Eis] Jörn Eisenbiegler. Datenverteilung als Konfigurationsproblem (data distribution as configuration problem). Technical report, Universität Karlsruhe (TH), Fakultät für Informatik, May.
[Fea] Paul Feautrier. Dataflow analysis of scalar and array references. Int. J. of Parallel Programming, February.
[HGAM] L. Hendren, G. Gao, E. Altman, and C. Mukerji. A register allocation framework based on hierarchical cyclic interval graphs. In Proc. Int. Conf. on Compiler Construction, LNCS. Springer-Verlag.
[HKC] Amir H. Hashemi, David R. Kaeli, and Brad Calder. Efficient procedure mapping using cache line coloring. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June.
[HP] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2nd edition.
[MCT] Kathryn S. McKinley, Steve Carr, and Chau-Wen Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, July.
[Mol] Horst Moldenhauer. Kostenbasierte Konfigurierung für Programme und SW-Architekturen (cost-based configuration of programs and software architectures). PhD thesis, University of Karlsruhe (TH), June.
[MR] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press.
[PNDN] Preeti Ranjan Panda, Hiroshi Nakamura, Nikil D. Dutt, and A. Nicolau. Improving cache performance through tiling and data alignment. In IRREGULAR, Springer LNCS.
[Raw] Jai Rawat. Static analysis of cache performance for real-time programming. Technical report, Iowa State University, November.
[SE] Amitabh Srivastava and Alan Eustace. ATOM: A system for building customized program analysis tools. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, June.
[WL] Michael E. Wolf and Monica S. Lam. A data locality optimizing algorithm. SIGPLAN Notices, June. Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.
[ZL] Wolf Zimmermann and Welf Löwe. An approach to machine-independent parallel programming. In CONPAR, VAPP IV: Joint International Conference on Vector and Parallel Processing, LNCS. Springer-Verlag.