NP-Hardness of Cache Mapping by Li, Zhiyuan & Xu, Rong
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
2004 
NP-Hardness of Cache Mapping 
Zhiyuan Li 




Li, Zhiyuan and Xu, Rong, "NP-Hardness of Cache Mapping" (2004). Department of Computer Science 
Technical Reports. Paper 1588. 
https://docs.lib.purdue.edu/cstech/1588 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 
NP-HARDNESS OF CACHE MAPPING 
Zhiyuan Li 
Rong Xu 
Department of Computer Sciences 
Purdue University 
West Lafayette, IN 47907 








{li, xur}@cs.purdue.edu
Abstract 
Processors such as the Intel StrongARM SA-1110 and
the Intel XScale provide flexible control over the cache
management to achieve better cache utilization. Programs
can specify the cache mapping policy for each virtual page,
i.e. mapping it to the main cache, the mini-cache, or neither.
For the latter case, the page is marked as noncacheable. In
this paper, we model the cache mapping problem and prove
that finding the optimal cache mapping is NP-hard.
1 Introduction 
The issue of reducing the average memory access time
continues to receive widespread attention. One of the
hardware approaches proposed in recent studies [9, 8, 11, 2]
uses horizontally partitioned data caches. This approach
maintains multiple data caches at the same level in the cache
hierarchy. Different caches may have different structures.
There are several advantages to this approach:
• Different memory addresses may exhibit different
  locality behaviors and some may have no locality at
  all. By carefully mapping different data to different
  subcaches, we may get a higher overall cache hit ratio.
• Smaller subcaches allow a faster CPU clock because
  of the shorter cache hit time.
• On a partitioned cache, it is possible to probe just
  one of the subcaches during a data access. This can
  result in a substantial energy saving [8, 11, 1, 7],
  which is especially important to handheld devices and
  embedded systems.
Cache bypass is another technique to reduce the average
memory access time. It keeps non-reusable data items out of
the cache in order to use the cache space to retain reusable
data. For the data items exhibiting a low locality, cache
bypass also reduces the amount of data fetched from the
main memory, because only the target data item, instead of
the whole cache line, needs to be transferred. A number of
hardware solutions [6, 10] have been proposed to monitor
the memory access patterns and to make bypass decisions.
* Technical Report CSD-TR-03-001, Department of Computer Sciences,
Purdue University, West Lafayette, IN 47907, January 2004.
† The author names are listed in alphabetical order.
Processors such as the Intel StrongARM SA-1110 [4]
and the Intel XScale [5] allow application programs or
compilers to specify the cache mapping policy for each
virtual page. The processor contains a relatively small-sized
mini-cache in parallel with the main data cache. Each 
page can be mapped to either the main cache or the 
mini-cache, or marked as noncacheable (for cache bypass). 
Intel Developer's Manual [4] states that the mini-cache is 
designed to prevent thrashing on the main data cache. Its 
typical use is to store large data structures such that accesses 
to these data structures do not interfere with the data in the 
main data cache. 
The support for multiple cache policies provides the 
potential benefits of both the horizontally partitioned data 
cache and the bypass cache. However, to take advantage 
of this feature, one must carefully specify the cache policy 
for each individual virtual page. We call the process of
specifying the mapping between the virtual pages and the
caches "cache mapping". Although Intel gives guidelines
for cache mapping as mentioned above, applying them 
to real programs faces several challenges: (1) It is often 
difficult to predict the cache reuse pattern in non-numerical 
programs. Even the programmer may not predict the cache 
behavior accurately. (2) The decision on cache mapping 
cannot be made for a single data object in isolation without 
considering other objects stored in the same page. 
In this paper, we use memory profiling to study page- 
level cache mapping. We model the cache mapping problem 
and prove that its optimal solution is NP-hard. 
The rest of this paper is organized as follows: in Section 
2, we briefly review the cache system in Intel StrongARM 
SA-1110 and discuss why caches with cache mapping
can perform better than traditional caches. In Section 3,
we model the cache mapping problem and prove it to be
NP-hard.
2 Cache Mapping Problem 
The Intel StrongARM SA-1110 processor [4] employs
two logically separate data caches, i.e. the main data cache
and the mini-cache. The 8K-byte main data cache is 32-way
set associative with round-robin replacement. The 512-byte
mini-cache is 2-way set associative with LRU replacement.
The cache line size is 32 bytes on both caches. For each data
cache access, both caches are probed in parallel. However,
a particular memory block can exist in only one of the two
caches at any time.
Both the main cache and the mini-cache are indexed
and tagged by virtual addresses. All memory blocks in
the same virtual page will be mapped to the same cache.
The mapping is controlled by the bufferable bit (B) and the
cacheable bit (C) in the page table entry in the MMU. If
B=1 and C=1, which is the default, the page is mapped to
the main data cache. If B=0 and C=1, it is mapped to the
mini-cache. If C=0, then the page is noncacheable and its
accesses bypass both caches. This mechanism gives the
compiler or the application program the ability to control
page-to-cache mapping by modifying the B and C bits in the
MMU. Note that we need to flush the caches and the TLB
entries for consistency after changing the page mapping.
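The B/C encoding described above can be summarized in a few lines of Python. This is only an illustrative sketch: the function name `cache_policy` and the string labels are hypothetical and are not part of any Intel interface; only the three bit combinations come from the text.

```python
def cache_policy(b: int, c: int) -> str:
    """Decode the bufferable (B) and cacheable (C) page-table bits.

    Hypothetical helper for illustration; the labels are our own.
    """
    if c == 0:
        # C=0: the page is noncacheable and bypasses both caches.
        return "noncacheable"
    # C=1: B selects between the main data cache (default) and the mini-cache.
    return "main" if b == 1 else "mini"


print(cache_policy(1, 1))  # main
print(cache_policy(0, 1))  # mini
print(cache_policy(0, 0))  # noncacheable
```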
By carefully mapping different references to different
caches, we can achieve better cache utilization than
traditional caches through better cache replacement.
Under certain circumstances, an optimizing compiler may
be able to use software means to influence the replacement
policy (memory overlay and reference reordering are two
well-known such techniques). Unfortunately, dependence
relations in a program often prevent compilers from reordering
memory references. Opportunities for memory overlay
are also limited because many variables may remain live
at the same time. Independently indexed caches offer
opportunities to manipulate the replacement policy without
being constrained by dependence relations or live variable
information.
Assuming all the caches are fully associative with
LRU replacement, consider the following trace where each
variable represents a memory block: x0 x1 x1 x2 x2
x0. If we have a single cache of size smaller than or equal
to 2 cache lines, then a capacity miss will occur at the
second reference to block x0.
In contrast, if we have two independently indexed caches,
each of which has one cache line, then we can allocate x0
to one cache and x1 and x2 to the other cache. No capacity
miss occurs.
The idea of reducing capacity misses by selecting
memory references to bypass the cache can be illustrated
via the following example: x0 x1 x2 x0. In this
memory trace, there are no reuses for blocks x1 and x2. If
we let the references to these blocks bypass the cache, then
x0 can be reused for a cache size as small as one cache line.
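Both examples can be checked with a small simulation. The sketch below assumes fully associative LRU caches, as in the text; the function `count_hits`, the cache names and the dictionary-based placement are hypothetical choices made for illustration only.

```python
from collections import OrderedDict


def count_hits(trace, placement, capacity):
    """Count cache hits for a block trace.

    placement maps a block to a cache name, or to None for bypass;
    capacity maps a cache name to its size in cache lines.
    Each cache is modeled as fully associative with LRU replacement.
    """
    caches = {name: OrderedDict() for name in capacity}
    hits = 0
    for block in trace:
        name = placement.get(block)
        if name is None or capacity[name] == 0:
            continue  # bypassed access: never a hit, never allocated
        lines = caches[name]
        if block in lines:
            hits += 1
            lines.move_to_end(block)  # mark as most recently used
        else:
            if len(lines) >= capacity[name]:
                lines.popitem(last=False)  # evict the least recently used
            lines[block] = True
    return hits


trace1 = ["x0", "x1", "x1", "x2", "x2", "x0"]
single = count_hits(trace1, {"x0": "c", "x1": "c", "x2": "c"}, {"c": 2})
split = count_hits(trace1, {"x0": "a", "x1": "b", "x2": "b"}, {"a": 1, "b": 1})
print(single, split)  # 2 3

trace2 = ["x0", "x1", "x2", "x0"]
cached = count_hits(trace2, {"x0": "c", "x1": "c", "x2": "c"}, {"c": 1})
bypass = count_hits(trace2, {"x0": "c", "x1": None, "x2": None}, {"c": 1})
print(cached, bypass)  # 0 1
```

With the same total capacity of two lines, the partitioned placement turns the final reference to x0 into a hit, and in the second trace bypassing x1 and x2 lets a one-line cache retain x0.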
3 The Optimal Cache Mapping 
We define the CACHE-MAPPING problem as the
following: given a memory trace, determine the best
page-to-cache mapping such that the average memory
access time is minimized. Since we do not reduce the
number of references, the objective of minimizing the
average memory access time is the same as minimizing the
total memory access time, which can be expressed by the
following formula:

$$
\begin{aligned}
T ={} & T_{hit\_in\_main} * N_{main} * h_{main} + {} \\
      & T_{miss\_in\_main} * N_{main} * (1 - h_{main}) + {} \\
      & T_{hit\_in\_mini} * N_{mini} * h_{mini} + {} \\
      & T_{miss\_in\_mini} * N_{mini} * (1 - h_{mini}) + {} \\
      & \sum_{i=1}^{N_{noncacheable}} (\tau + S_i / B)
\end{aligned}
\qquad (1)
$$

where $T_{hit\_in\_main}$ and $T_{hit\_in\_mini}$ denote the access time for a
cache hit at the main cache and the mini-cache respectively.
$T_{miss\_in\_main}$ and $T_{miss\_in\_mini}$ denote the average access
time for a cache miss at the main cache and the mini-cache
respectively. The term $\tau$ is the delay for accessing the first
byte of any data in the main memory, $B$ the memory bus
bandwidth, and $S_i$ the size of the $i$th noncacheable memory
access. $N_{main}$ and $N_{mini}$ are the total numbers of accesses
to the main cache and the mini-cache. $N_{noncacheable}$
is the number of noncacheable accesses. $h_{main}$ and
$h_{mini}$ represent the hit ratios for the main cache and the
mini-cache.
Formally, the page-to-cache mapping is done by assigning
each virtual memory page to one of the three mutually
exclusive sets, $Set_{main}$, $Set_{mini}$ and $Set_{noncacheable}$.
$Set_{main}$ contains the pages mapped to the main cache;
$Set_{mini}$ contains the pages mapped to the mini-cache and
$Set_{noncacheable}$ contains the noncacheable pages.
We make the following assumptions to simplify the
CACHE-MAPPING problem:
• $T_{hit\_in\_main} = T_{hit\_in\_mini}$ and $T_{miss\_in\_main} =
  T_{miss\_in\_mini}$. Hence, we simply use the terms $T_{hit}$
  and $T_{miss}$, respectively. The StrongARM SA-1110
  processor probes both caches in parallel, so it is
  necessary to have $T_{hit\_in\_main} = T_{hit\_in\_mini}$. Since
  the main memory operation and the bus transmission
  dominate the cache miss penalty, $T_{miss\_in\_main}$ and
  $T_{miss\_in\_mini}$ are approximately equal.
• All the data items targeted by noncacheable accesses
  are of the same size. This assumption is reasonable even
  if the system supports accesses in the burst mode. This
  is because, in compiler-generated code, the majority
  of the loads and stores access one word at a time.
  We denote the time to load or to store a single
  noncacheable word by $T_{noncacheable}$.
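With these assumptions, we can simplify the memory
access time in Formula (1) to

$$
\begin{aligned}
T ={} & T_{miss} * (N_{main} + N_{mini} + N_{noncacheable}) - {} \\
      & (T_{miss} - T_{hit}) * (N_{main} * h_{main} + N_{mini} * h_{mini}) - {} \\
      & (T_{miss} - T_{noncacheable}) * N_{noncacheable}
\end{aligned}
\qquad (2)
$$

Notice that, under the condition of $T_{noncacheable} = T_{miss}$,
$T$ is minimized if and only if the total number of hits,
i.e. $N_{main} * h_{main} + N_{mini} * h_{mini}$, is maximized.
In the following, we shall first prove the problem of
maximizing the cache hits to be NP-hard. We then prove the
NP-hardness of the CACHE-MAPPING problem without
the condition of $T_{noncacheable} = T_{miss}$.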
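As a sanity check on the simplification from Formula (1) to Formula (2), the following sketch evaluates both forms under the two assumptions above, i.e. a single $T_{hit}$, a single $T_{miss}$, and every noncacheable access costing $T_{noncacheable}$. The function and parameter names are hypothetical, and exact rational arithmetic is used only to avoid rounding noise in the comparison.

```python
from fractions import Fraction


def access_time_formula1(n_main, h_main, n_mini, h_mini, n_nc,
                         t_hit, t_miss, t_nc):
    """Formula (1) with the simplifying assumptions already applied."""
    return (t_hit * n_main * h_main
            + t_miss * n_main * (1 - h_main)
            + t_hit * n_mini * h_mini
            + t_miss * n_mini * (1 - h_mini)
            + n_nc * t_nc)  # each noncacheable access costs t_nc


def access_time_formula2(n_main, h_main, n_mini, h_mini, n_nc,
                         t_hit, t_miss, t_nc):
    """Formula (2): all-miss cost minus the savings of hits and bypasses."""
    return (t_miss * (n_main + n_mini + n_nc)
            - (t_miss - t_hit) * (n_main * h_main + n_mini * h_mini)
            - (t_miss - t_nc) * n_nc)


# Arbitrary sample values chosen for illustration only.
args = (1000, Fraction(3, 4), 200, Fraction(1, 2), 50, 1, 60, 25)
print(access_time_formula1(*args) == access_time_formula2(*args))  # True
```

The two expressions agree because each hit replaces a $T_{miss}$ term by $T_{hit}$ and each noncacheable access replaces a $T_{miss}$ term by $T_{noncacheable}$.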
3.1 NP-hardness proof 
Definition 1. CACHE-MAPPING problem:
Instance: A main cache of size $S$, a mini-cache of size
$S_{mini}$, each having either the LRU or the Round-Robin
replacement policy, a page size $S_p$, a set ($P$) of virtual
pages such that each page $P_i \in P$ contains memory
blocks $(i, 1), (i, 2), \ldots, (i, S_p)$, and a sequence of memory
accesses $A = a_1, \ldots, a_m$ to the memory blocks introduced
above. The number of distinct memory blocks accessed in
the sequence is assumed to be greater than the size of each
cache.
Solution: a partition of pages in $P$ into $Set_{main}$, $Set_{mini}$
and $Set_{noncacheable}$, such that the memory access time $T$
defined in Equation 2 is minimized.
Definition 2. MAX-HIT problem:
Instance: A main cache of size $S$, a mini-cache of size
$S_{mini}$, each having either the LRU or the Round-Robin
replacement policy, a page size $S_p$, a set ($P$) of virtual
pages such that each page $P_i \in P$ contains memory
blocks $(i, 1), (i, 2), \ldots, (i, S_p)$, and a sequence of memory
accesses $A = a_1, \ldots, a_m$ to the memory blocks introduced
above. The number of distinct memory blocks accessed in
the sequence is assumed to be greater than the size of each
cache.
Solution: a partition of pages in $P$ into $Set_{main}$, $Set_{mini}$
and $Set_{noncacheable}$, such that the total number of cache
hits of $A$ is maximized.
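To make the MAX-HIT definition concrete, the sketch below solves tiny instances by exhaustive search over all $3^{|P|}$ page partitions. It is an illustration only: it assumes fully associative LRU caches with sizes counted in memory blocks, represents a block as a `(page, index)` pair, and uses hypothetical function names. The exponential search is not a practical algorithm.

```python
from collections import OrderedDict
from itertools import product


def lru_hits(accesses, capacity):
    """Hits of one fully associative LRU cache on its own access stream."""
    lines, hits = OrderedDict(), 0
    for block in accesses:
        if block in lines:
            hits += 1
            lines.move_to_end(block)
        elif capacity > 0:
            if len(lines) >= capacity:
                lines.popitem(last=False)  # evict least recently used
            lines[block] = True
    return hits


def max_hit_bruteforce(trace, S, S_mini):
    """Try every assignment of pages to main / mini / noncacheable."""
    pages = sorted({page for page, _ in trace})
    best_hits, best_map = -1, None
    for choice in product(("main", "mini", "noncacheable"), repeat=len(pages)):
        mapping = dict(zip(pages, choice))
        # The two caches are independent, so each sees its own sub-trace.
        main = [b for b in trace if mapping[b[0]] == "main"]
        mini = [b for b in trace if mapping[b[0]] == "mini"]
        hits = lru_hits(main, S) + lru_hits(mini, S_mini)
        if hits > best_hits:
            best_hits, best_map = hits, mapping
    return best_hits, best_map


# Page "a" holds x0; page "b" holds x1 and x2 (trace x0 x1 x1 x2 x2 x0).
trace = [("a", 1), ("b", 1), ("b", 1), ("b", 2), ("b", 2), ("a", 1)]
hits, mapping = max_hit_bruteforce(trace, S=1, S_mini=1)
print(hits)  # 3
```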
Lemma 1: MAX-HIT is NP-hard in terms of the length of
the memory-access sequence if $\max(S, S_{mini}) \leq S_p - 1$
and $S \neq S_{mini}$.
Proof: We reduce MAX2SAT [3] to MAX-HIT. The
MAX2SAT problem is defined as: given a set of clauses,
each being a disjunction of at most two literals¹, and
an integer $K$, decide whether there is a truth assignment that
satisfies at least $K$ of the clauses. Given an instance of
MAX2SAT, we construct a sequence of memory accesses
which consists of a prefix and a postfix. The prefix enforces
a one-to-one correspondence between the truth assignment
in MAX2SAT and the page placement in MAX-HIT.
The postfix is transformed from the clauses of the given
MAX2SAT instance.
Throughout this proof, we use the notation $A(i, b)$ to
represent a reference to the $b$th memory block in page $P_i$
(if $b = 0$, $A(i, b)$ is null). The notation $A[(i, b_1), (i, b_2)]$
denotes the series $A(i, b_1), A(i, b_1 - 1), \ldots, A(i, b_2)$. If
$b_1 < b_2$, $A[(i, b_1), (i, b_2)]$ is empty. Let $N$ be the length (i.e.
the number of clauses) of the MAX2SAT instance. Without
loss of generality, we assume $S > S_{mini}$.
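The access notation can be read as small list-building helpers. The sketch below is one hypothetical encoding, not taken from the report: an access is a `(page, block)` pair, `A` returns a list of zero or one accesses, and `A_run` produces the descending series.

```python
def A(i, b):
    """A(i, b): one access to block b of page P_i; null (empty) if b == 0."""
    return [] if b == 0 else [(i, b)]


def A_run(i, b1, b2):
    """A[(i, b1), (i, b2)]: blocks b1, b1-1, ..., b2 of page P_i.

    The result is empty when b1 < b2. Block index 0 is skipped,
    matching the convention that A(i, 0) is null.
    """
    return [(i, b) for b in range(b1, b2 - 1, -1) if b != 0]


print(A_run("v", 3, 1))  # [('v', 3), ('v', 2), ('v', 1)]
print(A_run("v", 1, 2))  # []
print(A("v", 0))         # []
```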
We first construct the prefix. For each variable $v$ in
MAX2SAT, we introduce 3 virtual pages, $P_v$, $P_{\neg v}$ and $P_{v''}$.
The prefix is the concatenation of the following memory
accesses for each variable $v$:

$A[(v, S), (v, 1)]\ A[(\neg v, S), (\neg v, 1)]\ A(v'', 1)$ ... (repeat $2 * N + 1$ more times)

By placing $P_{v''}$ in $Set_{mini}$ (i.e. mapping it to the
mini-cache), either $P_v$ or $P_{\neg v}$ in $Set_{main}$ (i.e. mapping it
to the main cache) and the other page in $Set_{noncacheable}$,
the prefix has $(2 * N + 1) * (S + 1)$ cache hits, the maximum
possible.
Table 1 defines the rules to transform a clause to the
memory accesses in the postfix. In the table, $\alpha$ and $\beta$
denote two literals of distinct variables in the clause. $\alpha'$
and $\beta'$ are the opposite literals respectively. For each clause
in the MAX2SAT instance, we apply one of the two rules
exactly once. Since $S \leq S_p - 1$, we have at least $S + 1$
memory blocks in each virtual page. Given the cache sizes,
the memory accesses introduced for each clause have 2
potential cache hits (marked in bold). Therefore, the total
number of cache hits in the postfix is at most $2 * N$.

Table 1. Clause transformation rules under $S \leq S_p - 1$

Clause                Memory access sequence
$\alpha$              $A[(\alpha, S+1), (\alpha, 1)]\ \mathbf{A(\alpha, 1)}$
$\alpha \vee \beta$   $A[(\alpha, S+1), (\alpha, 1)]\ \mathbf{A(\alpha, 1)}\ A[(\alpha', S+1), (\alpha', 2)]\ A(\beta', S+1)\ A(\alpha', 1)\ A[(\beta', S), (\beta', 1)]\ \mathbf{A(\alpha', 1)}$

We obtain the whole memory access sequence by
appending the postfix to the prefix. In the optimal solution,
exactly one of the two pages associated with each variable
should be mapped to the main cache:
• Mapping $P_v$ or $P_{\neg v}$ to the mini-cache will not produce
  any hit in the mini-cache for the memory accesses in the
  prefix. As a result, only page $P_{v''}$ should be mapped to
  the mini-cache; otherwise, we lose at least $(2 * N + 1)$
  hits.
• If there exists a variable $v$ such that both $P_v$ and $P_{\neg v}$
  are in $Set_{main}$, the accesses to $P_v$ and $P_{\neg v}$ in the
  prefix are cache misses. By placing one of them in
  $Set_{noncacheable}$, the cache hit count will increase by at
  least $(2 * N + 1) * S$. In the postfix, while this may force
  some accesses to become cache misses, the decrease
  is no more than $2 * N$.
• By the same argument, if there exists a variable $v$
  such that both $P_v$ and $P_{\neg v}$ are in $Set_{noncacheable}$, by
  placing one of them in $Set_{main}$, the total cache hit
  count will be increased.
¹ $v$ and $\neg v$ are two opposite literals of variable $v$.
Therefore, we can build the following one-to-one
mapping between the truth assignment and the page
placement:
• $P_v \in Set_{main} \Leftrightarrow v$ is true.
• $P_{\neg v} \in Set_{main} \Leftrightarrow v$ is false.
Under this mapping, the transformation rules guarantee
that a satisfied clause will increase the cache hit count by
exactly 1, and an unsatisfied clause will not affect the cache
hit count:
• If $\alpha$ is true (which means $P_\alpha$ is in $Set_{main}$), the
  memory access $A(\alpha, 1)$ is a hit, but no other listed
  accesses can be hits.
• If $\alpha$ is false (which means $P_{\alpha'}$ is in $Set_{main}$), the
  memory access $A(\alpha, 1)$ will not be a cache hit.
  - If $\beta$ is true, $P_{\beta'}$ is in $Set_{noncacheable}$, so
    $A(\alpha', 1)$ is a hit.
  - If $\beta$ is false, $P_{\beta'}$ is in $Set_{main}$, and $A(\alpha', 1)$ is a
    miss. The cache hit count will not increase.
With this property, it is easy to see that maximizing the
cache hit count is equivalent to maximizing the number of
satisfied clauses. An optimal MAX-HIT solution will derive
an optimal solution of the MAX2SAT instance.
Finally, the trace length constructed in this reduction is
a polynomial function of $N$. Therefore, MAX-HIT is NP-hard. □
Note that in the proof, we do not assume any
associativity property for the caches. The proof is valid for
directly-mapped, set-associative or full-associative caches.
Lemma 2: Lemma 1 remains correct if $S = S_{mini}$.
Proof: We will use the following prefix,

$A[(v, S), (v, 1)]\ A[(\neg v, S), (\neg v, 1)]\ A(v'', 1)\ A[(v'', S), (v'', 1)]$ ... (repeat $2 * N + 1$ more times)

In the optimal mapping, $P_{v''}$ will definitely be mapped to
one of the two caches. Exactly one of $P_v$ and $P_{\neg v}$ will be
mapped to the other cache. If $P_v$ is mapped to the cache, $v$
is true; otherwise, $v$ is false. □
Lemma 3: If $\max(S, S_{mini}) \geq S_p$, MAX-HIT is still NP-hard
in terms of the length of the memory access sequence
for fully-associative caches.
Proof: We still reduce MAX2SAT to MAX-HIT, using the
same notations as in Lemma 1. We assign 2 pages, $P_v$ and
$P_{\neg v}$, for each variable $v$ in the MAX2SAT instance.
Let $m = \lfloor S / S_p \rfloor$, $S_{new} = S \,\%\, S_p$, $m1 = \lfloor S_{mini} / S_p \rfloor$,
and $S1_{new} = S_{mini} \,\%\, S_p$. We introduce $m + m1 + 1$
padding pages, numbered from 1 to $m + m1 + 1$.
Let $Seq_{padding}$ represent the following memory access
sequence to the padding pages:

$Seq_{padding} = A[(1, S_p), (1, 1)] \ldots A[(m-1, S_p), (m-1, 1)]\ A[(m, S_{new}), (m, 1)]\ A[(m+1, S_p), (m+1, 1)] \ldots A[(m+m1, S_p), (m+m1, 1)]\ A[(m+m1+1, S1_{new}), (m+m1+1, 1)]$

$Seq_{padding}$ accesses $S + S_{mini} - S_p$ memory blocks. If $S_{new} = 0$,
then page $P_m$ does not exist. If $S1_{new} = 0$, then page
$P_{m+m1+1}$ does not exist. Accesses to such nonexistent
pages should be removed throughout the proof.
For each clause in the MAX2SAT instance, we use one
of the two rules in Table 2 exactly once to generate a
sequence of memory accesses in the postfix. Each clause
contains at most $2 * (S + S_{mini}) + S_p + 4$ memory accesses.
So the total number of cache hits for all the clauses does not
exceed $(2 * (S + S_{mini}) + S_p + 4) * N$.

Table 2. Clause transformation rules under $S \geq S_p$

Clause                Memory access sequence
$\alpha$              $Seq_{padding}\ A(m, S_{new} + 1)\ A(m + m1 + 1, S1_{new} + 1)\ A[(\alpha, S_p), (\alpha, 1)]\ A(\alpha, 1)$
$\alpha \vee \beta$   $Seq_{padding}\ A(m, S_{new} + 1)\ A(m + m1 + 1, S1_{new} + 1)\ A[(\alpha, S_p), (\alpha, 1)]\ A(\alpha, 1)\ A[(\alpha', S_p), (\alpha', 1)]\ Seq_{padding}\ A[(\beta', S_p), (\beta', 1)]\ A(\alpha', 1)$

Let $Seq\_lit(v, R)$ be $R$ repetitions of $Seq_{padding}\ Seq_{padding}\ A[(v, S_p), (v, 1)]\ A[(\neg v, S_p), (\neg v, 1)]$.
The prefix contains the following memory access sequence:

$Seq\_lit(v_1, R)\ Seq\_lit(v_2, R) \ldots Seq\_lit(v_{N_v}, R)$

where $N_v$ is the number of variables in the MAX2SAT
instance.
We first examine what kind of cache mapping maximizes
the number of cache hits in $Seq\_lit(v, R)$ for each $v$, where
$R > 1$.
• All memory blocks which are accessed in $Seq\_lit(v, R)$
  and are not placed in $Set_{noncacheable}$ should exactly
  cover both caches. Hence, if $S_{new} > 0$, then $P_m$
  should not be in $Set_{noncacheable}$. Neither should
  $P_{m+m1+1}$ if $S1_{new} > 0$. Further, if $S1_{new} \neq S_{new}$,
  then $P_m$ should be mapped to the main cache and
  $P_{m+m1+1}$ to the mini-cache. If $S1_{new} = S_{new}$, then
  either page can be mapped to the main cache, the other
  to the mini-cache.
• For each variable $v$, exactly one of $P_v$ and $P_{\neg v}$ should
  be mapped to either the main cache or the mini-cache.
  If both were in $Set_{noncacheable}$, then we would leave
  $S_p$ cache lines unused. On the other hand, if both
  $P_v$ and $P_{\neg v}$ were mapped to the caches, then one
  of the padding pages we introduced would be in
  $Set_{noncacheable}$. Since each padding page is accessed
  more frequently than either of $P_v$ and $P_{\neg v}$, placing
  a padding page in $Set_{noncacheable}$ instead would lose
  cache hits.
For any of the variables, if either of the two requirements
mentioned above is unsatisfied, then we would lose at least
$R - 1$ cache hits in the prefix. Since the postfix can have
no more than $(2 * (S + S_{mini}) + S_p + 4) * N$ cache hits,
we simply let $R = (2 * (S + S_{mini}) + S_p + 4) * N + 2$
to make sure that any optimal mapping will satisfy the two
requirements above for all variables.
It is important to note that all those pages which are
assigned to the variables and not in $Set_{noncacheable}$ will be
mapped to the same cache (i.e. either all in $Set_{main}$ or all
in $Set_{mini}$). Thus, since both caches are fully associative,
the memory access sequence corresponding to each satisfied
clause (see Table 2) will have exactly one cache hit, and an
unsatisfied clause will generate no cache hits. Maximizing
the number of satisfied clauses is equivalent to maximizing
the number of cache hits in the entire memory access
sequence. The MAX-HIT problem is NP-hard. □
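The padding construction used in the proof of Lemma 3 can be sketched as follows, which also checks the stated count of $S + S_{mini} - S_p$ accessed blocks. The function names are hypothetical, sizes are counted in memory blocks, and the sketch assumes $S \geq S_p$ so that $m \geq 1$.

```python
def padding_sequence(S, S_mini, S_p):
    """Build Seq_padding as a list of (page, block) accesses."""
    if S < S_p:
        raise ValueError("this sketch assumes S >= S_p")
    m, S_new = divmod(S, S_p)          # m = floor(S / S_p), S_new = S % S_p
    m1, S1_new = divmod(S_mini, S_p)   # same split for the mini-cache

    def run(page, top):
        # A[(page, top), (page, 1)]; empty when top == 0 (page absent)
        return [(page, b) for b in range(top, 0, -1)]

    seq = []
    for page in range(1, m):                  # pages 1 .. m-1 are full
        seq += run(page, S_p)
    seq += run(m, S_new)                      # page m is partial
    for page in range(m + 1, m + m1 + 1):     # pages m+1 .. m+m1 are full
        seq += run(page, S_p)
    seq += run(m + m1 + 1, S1_new)            # last page is partial
    return seq


def lemma3_repetitions(S, S_mini, S_p, N):
    """R = (2 * (S + S_mini) + S_p + 4) * N + 2, as chosen in the proof."""
    return (2 * (S + S_mini) + S_p + 4) * N + 2


seq = padding_sequence(S=10, S_mini=5, S_p=4)
print(len(seq))  # 11, i.e. 10 + 5 - 4
print(lemma3_repetitions(10, 5, 4, N=3))  # 116
```

The padding blocks together with one variable page of $S_p$ blocks fill the two caches exactly, which is what the first requirement in the proof relies on.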
Lemma 4: For virtual-indexed caches, Lemma 3 remains
correct if the cache is set-associative or direct-mapped.
Proof: If $S > S_p$ and the cache is virtual-indexed, the proof
of Lemma 3 can be modified to work for the set-associative
or direct-mapped cache. We manipulate the virtual page
numbers to make sure that every page created for the
variables is mapped to the lowest $S_p$ cache lines. We also
make sure that $P_1$ through $P_m$ cover cache-line indices from
$S_p + 1$ to $S$, and that $P_{m+1}$ through $P_{m+m1+1}$ cover cache-line
indices from 1 to $S_{mini}$. Notice that, for convenience, we
write the lowest cache index as 1 (instead of 0 by convention).
These treatments will ensure that, by the rules in Table 2, a
satisfied clause will generate exactly one cache hit and an
unsatisfied clause will generate no cache hits. □
For the real-indexed cache with the set-associative or
directly-mapped organization, the number of hits
is not fixed under any mapping scheme (as long as
the real-page assignment is unpredictable). So the cache
mapping problem would be ill-defined.
Theorem: CACHE-MAPPING is NP-hard in terms of the
length of the memory trace.
Proof: Suppose $T_{miss} \neq T_{noncacheable}$. The optimal cache
mapping given in the proofs of Lemmas 1 through 4 remains
optimal, as long as we make the prefix sufficiently long. We
show how this is done in Lemma 1. In the prefix, we repeat
memory accesses for each variable $v$ $R$ more times, instead
of $2 * N + 1$ more times. Suppose we change the page
mapping from the optimal in Lemma 1. This would increase
$T$ in the prefix by at least $R * (T_{noncacheable} - T_{hit})$, but at
the same time we may decrease $T$ in the postfix by no more
than $2(S + 1) * N * (T_{miss} - T_{noncacheable})$. If we let $R =
2(S + 1) * N * (T_{miss} - T_{noncacheable}) / (T_{noncacheable} -
T_{hit}) + 1$, the optimal mapping in Lemma 1 will remain
optimal for CACHE-MAPPING. The number of satisfied
clauses in MAX2SAT equals the number of cache hits in
the optimal solution for CACHE-MAPPING. □
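The choice of $R$ in this proof can be checked numerically. The sketch below is an illustration under two assumptions that are not stated in the proof: it takes $T_{hit} < T_{noncacheable} < T_{miss}$, and it rounds the quotient down before adding 1 so that $R$ is an integer. The function name is hypothetical.

```python
from fractions import Fraction
from math import floor


def prefix_repetitions(S, N, t_hit, t_miss, t_nc):
    """Integer variant of R = 2(S+1)*N*(T_miss - T_nc)/(T_nc - T_hit) + 1."""
    t_hit, t_miss, t_nc = (Fraction(t) for t in (t_hit, t_miss, t_nc))
    if not t_hit < t_nc < t_miss:
        raise ValueError("sketch assumes T_hit < T_noncacheable < T_miss")
    # Upper bound on the time a different mapping could save in the postfix.
    postfix_gain = 2 * (S + 1) * N * (t_miss - t_nc)
    # Rounding down before adding 1 is our assumption, not the paper's.
    R = floor(postfix_gain / (t_nc - t_hit)) + 1
    # The prefix penalty R * (T_nc - T_hit) must exceed that bound.
    assert R * (t_nc - t_hit) > postfix_gain
    return R


print(prefix_repetitions(S=4, N=3, t_hit=1, t_miss=50, t_nc=20))  # 48
```

In this example the postfix bound is 900 time units and 48 repetitions cost at least 912 in the prefix, so deviating from the Lemma 1 mapping cannot pay off.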
If we remove the mini-cache from CACHE-MAPPING,
the problem becomes how to optimally select the
noncacheable virtual pages. The proofs of Lemmas 1 through 4
and the Theorem remain valid with slight adjustments. (In
the proof of Lemma 1, we simply remove memory accesses
to $A(v'', 1)$ from the prefix.) Hence, this special case of
CACHE-MAPPING remains NP-hard.
Acknowledgments 
This work is sponsored by the National Science Foundation
through grants CCR-0208760, ACI/ITR-0082834, and
CCR-9975309.
References
[1] D. H. Albonesi. Selective cache ways: On-demand cache
resource allocation. In Proceedings of the 32nd International
Symposium on Microarchitecture, pages 248-259, 1999.
[2] K. K. Chan, C. C. Hay, J. R. Keller, G. P. Kurpanek, F. X.
Schumacher, and J. Zheng. Design of HP PA 7200 CPU. In
Hewlett-Packard Journal, February 1996.
[3] M. R. Garey, D. S. Johnson, and L. J. Stockmeyer.
Some simplified NP-complete graph problems. Theoretical
Computer Science, 1:237-267, 1976.
[4] Intel Corporation. Intel StrongARM SA-1110 microprocessor
developer's manual.
http://www.intel.com/design/strong/manuals/278240.htm,
October 2001.
[5] Intel Corporation. Intel PXA250 and PXA210 application
processor developer's manual.
http://www.intel.com/design/pca/applicationsprocessors/manuals/278693.htm,
February 2002.
[6] T. L. Johnson and W. W. Hwu. Run-time adaptive
cache hierarchy management via reference analysis. In
Proceedings of the 24th International Symposium on
Computer Architecture, pages 315-326, 1997.
[7] J. Kin, M. Gupta, and W. H. Mangione-Smith. The filter
cache: An energy efficient memory structure. In International
Symposium on Microarchitecture, pages 184-193,
1997.
[8] H. S. Lee and G. S. Tyson. Region-based caching: an
energy-delay efficient memory architecture for embedded
processors. In Proceedings of the international conference
on Compilers, architectures, and synthesis for embedded
systems, pages 120-127. ACM Press, 2000.
[9] J. A. Rivers and E. S. Davidson. Reducing conflicts
in direct-mapped caches with a temporality-based design.
In Proceedings of the 1996 International Conference on
Parallel Processing, volume 1, pages 154-163, 1996.
[10] G. Tyson, M. Farrens, J. Matthews, and A. R. Pleszkun.
A modified approach to data cache management. In
Proceedings of the 28th Annual ACM/IEEE International
Symposium on Microarchitecture, pages 93-103, 1995.
[11] O. S. Unsal, I. Koren, C. M. Krishna, and C. A. Moritz. The
minimax cache: An energy-efficient framework for media
processors. In HPCA, pages 131-140, 2002.