Emulating Variable Block Size Caches by Muthulaxmi, S
Abstract 
The increasing disparity between processor speeds and main memory access times is a
major limiting factor to program performance. Cache memories are small, high speed
buffer memories introduced between the processor and the main memory, which hold
the most recently used code and data, thus reducing this performance bottleneck. Direct
mapped caches are preferred as first level caches as they have the least access time.
The performance of a direct mapped cache of a given size largely depends on the block size
of the cache. Small block sizes result in lower memory traffic, but need a substantially
larger tag space. Large block sizes have higher memory traffic owing to a large number of
conflict misses and the implicit prefetching of data. Existing solutions to reduce tag space
and miss ratios in direct mapped caches led to the design of subblock caches and decoupled
sectored caches. However, both of these suffer from a large number of conflicts and a large tag
space (especially with processors moving to 64 bit addresses). Our goal is to propose a
cache design that reduces miss ratios, memory traffic, and tag space requirements without
compromising on the access time of a direct mapped cache.
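
The dependence of tag space on block size can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name, the 16 KiB cache size, and the decision to ignore valid and dirty bits are assumptions made here, not figures from the thesis. For a direct mapped cache of fixed capacity, the tag width per block does not change with block size, so total tag storage scales with the number of blocks.

```python
def tag_storage_bits(cache_bytes, block_bytes, addr_bits=64):
    """Total tag storage, in bits, of a direct mapped cache.

    Illustrative assumptions: sizes are powers of two, addresses are
    byte addresses, and valid/dirty bits are not counted.
    """
    num_blocks = cache_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1  # log2(block size)
    index_bits = num_blocks.bit_length() - 1    # log2(number of blocks)
    tag_bits = addr_bits - index_bits - offset_bits
    return num_blocks * tag_bits


# Hypothetical 16 KiB cache with 64-bit addresses: each tag is 50 bits wide.
small_blocks = tag_storage_bits(16 * 1024, 16)  # 1024 blocks -> 51200 bits
large_blocks = tag_storage_bits(16 * 1024, 64)  # 256 blocks  -> 12800 bits
print(small_blocks, large_blocks)
```

Under these assumptions, moving from 64-byte to 16-byte blocks quadruples the tag storage, which is the cost that small block sizes pay for their lower memory traffic.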
In this thesis, we evaluate a new scheme to emulate a variable block size cache. The
scheme uses a subtagged cache (proposed in this thesis) in conjunction with prefetching. In
a subtagged cache, a portion of the tag of a cache block is associated with each subblock
in the block. This allows subblocks from different memory blocks to co-reside in the
cache block at the same time. This exploits temporal locality by reducing false conflicts
at the block level. To exploit spatial locality, we determine access patterns within cache
blocks and prefetch multiple subblocks on a cache miss. There is a significant variation
in performance of the new schemes across platforms. In particular, cache-block level
conflicts between areas of the address space contribute significantly to miss ratios, even
when we use subtagging. We observed that a significant performance improvement occurs
if the program stack is remapped so as to reduce the number of conflicting tag bits in
a cache conflict, and the conflicting bits are stored with the subblock. Overall, on suitable
machine-compiler platforms where data regions are allocated in a cache-conscious fashion,
we find that the new schemes for emulating variable block sizes outperform previously
proposed alternatives, both in terms of miss ratios and memory traffic.
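
The idea of associating part of the tag with each subblock can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the design evaluated in the thesis: the class name `SubtaggedCache`, the choice of the low-order tag bits as the subtag, the replacement policy, and all sizes are hypothetical, and prefetching is not modelled.

```python
class SubtaggedCache:
    """Toy direct mapped cache: one shared block tag plus a subtag per subblock.

    Assumption made for illustration: the tag is split into high bits stored
    once per cache block and low bits (the subtag) stored with each subblock.
    """

    def __init__(self, num_blocks=4, subblocks_per_block=4,
                 subblock_bytes=8, subtag_bits=2):
        self.num_blocks = num_blocks
        self.sub_per_block = subblocks_per_block
        self.subblock_bytes = subblock_bytes
        self.subtag_bits = subtag_bits
        self.block_tag = [None] * num_blocks
        # subtags[i][s] is None while subblock s of cache block i is invalid.
        self.subtags = [[None] * subblocks_per_block for _ in range(num_blocks)]

    def _split(self, addr):
        sub_addr = addr // self.subblock_bytes
        sub = sub_addr % self.sub_per_block
        blk_addr = sub_addr // self.sub_per_block
        index = blk_addr % self.num_blocks
        tag = blk_addr // self.num_blocks
        subtag = tag % (1 << self.subtag_bits)  # low tag bits, per subblock
        high = tag >> self.subtag_bits          # high tag bits, per block
        return index, sub, high, subtag

    def access(self, addr):
        """Return True on a hit; on a miss, fill only the requested subblock."""
        index, sub, high, subtag = self._split(addr)
        if self.block_tag[index] != high:
            # The shared high bits differ: evict the whole cache block.
            self.block_tag[index] = high
            self.subtags[index] = [None] * self.sub_per_block
        hit = self.subtags[index][sub] == subtag
        self.subtags[index][sub] = subtag
        return hit


cache = SubtaggedCache()
# Addresses 0 and 136 map to the same cache block but to different subblocks
# of different memory blocks; they differ only in the subtag bits.
results = [cache.access(a) for a in (0, 136, 0, 136)]
print(results)
```

In this model the two addresses co-reside in one cache block, so the repeated accesses hit, whereas a conventional direct mapped cache with a single tag per block would evict one to make room for the other on every access.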
