Improving Cache Performance by Selective Cache Bypass by Chi, Chi-Hung & Dietz, Henry
Purdue University
Purdue e-Pubs
Department of Electrical and Computer
Engineering Technical Reports
Department of Electrical and Computer
Engineering
7-1-1988






Chi, Chi-Hung and Dietz, Henry, "Improving Cache Performance by Selective Cache Bypass" (1988). Department of Electrical and












School of Electrical Engineering
Purdue University
West Lafayette, Indiana 47907
Improving Cache Performance by 
Selective Cache Bypass
Chi-Hung Chi
School of Electrical Engineering 
Purdue University 




School of Electrical Engineering 
Purdue University 
West Lafayette, IN 47907 
hankd@ee.ecn.purdue.edu
(317) 494 3357
In traditional cache-based computers, all memory references are made through
cache. However, a significant number of items which are referenced in a program are
referenced so infrequently that other cache traffic is certain to “bump” these items from
cache before they are referenced again. In such cases, not only is there no benefit in
placing the item in cache, but there is the additional overhead of “bumping” some other
item out of cache to make room for this useless cache entry. Where a cache line is larger
than a processor word, there is an additional penalty in loading the entire line from
memory into cache, whereas the reference could have been satisfied with a single word
fetch. Simulations have shown that these effects typically degrade cache-based system
performance (average reference time) by 10% to 30%.
This performance loss is due to cache pollution; by simply forcing “polluting”
references to directly reference main memory, bypassing the cache, much of this
performance can be regained. The technique proposed in this paper involves the use of new
hardware, called a Bypass-Cache, which, under program control, will determine
whether each reference should be through the cache or bypassing the cache and
referencing main memory directly. Several inexpensive heuristics for the compiler to
determine how to make each reference are given.
Keywords: bypass-cache, cache-pollution, cache, compiler-analysis,
compiler-optimization, execution-time.
Presentation materials needed: overhead projector.
Purdue University TR-EE 88-36
1. Introduction
Advances in supercomputing and semiconductor technologies have made it possible
to design and build high performance computer systems with many processors. However,
the performance of these systems is often limited by memory reference bandwidth. While
the execution of each operation has become very fast, the time to fetch each datum from
main memory (or from another processor’s local memory) is at least an order of
magnitude longer than the processor operation time, and also an order of magnitude
longer than the reference time from on-chip or local memory. Use of a cache seems a
natural way to attack this mismatch.
It is widely accepted that cache memory is a cost-effective way to improve system
performance by using locality properties to improve apparent average memory access
time. Significant reductions in the average data/instruction access time have been
achieved using very simple cache placement/replacement policies implemented in
hardware [Bel74]. If anything, the success of cache has been too complete; the desirability
of caching items is rarely questioned and basic research on cache design generally has
been reduced to the level of benchmarking and fine-tuning a few well-known parameters.
For example, since cache reference time is so much less than main memory reference
time, it is commonly held that as many data as possible should be placed in cache. One
typically measures the efficacy of a cache design by determining the cache hit ratio: the
fraction of memory references which are satisfied by cache entries. The problem is simply
that it is not always beneficial to fetch a line into the cache on a cache miss, even if the
cache is infinitely large; increasing cache hit ratio sometimes reduces system
performance. Other criteria like memory traffic have occasionally been used instead of
cache hit ratio, but these measures are also somewhat imprecise and indirect. If one wants
to minimize total memory reference time, then that is the obvious measure by which cache
performance should be judged. Throughout this paper, cache performance is measured in
terms of the effect on total memory reference time.
Why are the more commonly used cache performance criteria inaccurate measures
of system performance? There is always an overhead associated with fetching a line from
memory into cache. If the benefit gained from having that line in cache is not greater
than the overhead that loading the cache line implies, then it is faster to reference the
data of that line directly from main memory. This is true even if the cache is infinitely
large, but even more dramatically true with smaller caches. If some mechanism can be
used to selectively disable or bypass the cache for those references which cache cannot
improve:
[1] the cost of loading the cache with these lines is saved and
[2] for finite-size caches, more cache space becomes available to other references and the
probability of accidentally replacing useful lines (those lines that can help improve
system performance) is reduced; there will be less cache pollution.
Simulation results, reported in Section 4, strongly support this view. An average of 10%  
to 80% reduction in total reference time can be achieved simply by using the proposed 
cache bypass mechanism.
Section 2 of this paper presents a survey of current cache designs and bypass
concepts. Section 3 discusses the cache bypass mechanism and how the cache bypass control
information can be implemented in practical hardware. Section 4 presents simulation
results. Continuing research on the cache bypass mechanism is described in Section 5.
2. Current Cache Designs and Bypass Concepts
Before investigating the mechanism for, and benefits of, selective cache bypass, it is
useful to briefly survey existing cache management policies; in part, this highlights where
the extra performance comes from, but it also clarifies the constraints these traditional
policies impose on the cache bypass mechanism. Examples illustrate why some
constraints imposed by previous cache replacement policies often cause a large decrease in
system performance, as well as how eliminating some of these constraints can regain
much of the lost performance.
This discussion illustrates the importance of cache bypass and motivates research
on this topic. In the last part of this section, we briefly describe the cache bypass
mechanism used in the C1 minisupercomputer manufactured by Convex Computer
Corporation [Con86]. Although the strategy used for cache bypass in the C1 is very
limited, it does demonstrate the importance of incorporating a bypass mechanism.
2.1. Traditional Replacement Policies
Replacement policy is defined as the set of rules by which the choice of which cache 
line to replace is made when the cache is full and a new line is to be fetched from the 
main memory into the cache [HwB84]. Replacement policies such as LRU (least recently 
used), random replacement, FIFO (first-in first-out), etc., are commonly used in current 
cache designs.
Although each of these traditional cache replacement policies has its own unique
technique for placing and/or replacing cache lines, the option of deciding not to put the
requested line in cache was not considered. In all conventional cache replacement policies,
immediately after each reference, the line referenced is in cache. This implies that
whenever a cache miss occurs, the missed line must be fetched into the cache, and this
line fetch is independent of whether the fetched line would improve system performance.
The main argument for this constraint is that, since the reference time of data in
cache is much smaller than that from main memory, and given the spatial and temporal
behavior of program references [Spi77], having the currently referenced line in cache has a
high probability of improving system performance. While this argument is generally
true, it is possible to predict with good certainty exactly which lines will not contribute to
improving performance; without such prediction, it is easy to envision scenarios where the
cache would replace lines it should have kept with lines that will never again be
referenced. This leads to a worst-case scenario in which a machine runs slower with cache
than without it. By bypassing the cache, and hence avoiding this pollution, this
worst-case scenario is averted.
An example of this problem is easily constructed. Suppose there is a
fully-associative cache of size two, line size one, and the memory reference string is
1 2 3 1 2 3. (It is interesting to note that this example is exactly the kind of reference
sequence one would get in executing a typical loop which references more data than there
are cache cells, which is well known to be the worst case for LRU.) With the cost of
different types of memory references shown in Table 1 (and the line-style used to
represent each), the cache content after each reference with random replacement, LRU,
and modified LRU with cache bypass mechanism are shown in Figures 1, 2, and 3.









Cost          Type of Reference
Tc + Tp       Reference through cache (with fetch to empty cache line)
Tc + 2(Tp)    Reference through cache (with replacement of a cache line)
Table 1: Cost for Each Type of Memory Reference
Figure 1: Random Replacement Transactions for 1 2 3 1 2 3
Figure 2: LRU Transactions for 1 2 3 1 2 3
Figure 3: Modified LRU with Cache Bypass for 1 2 3 1 2 3
Cache Policy    Cost                 Cost with Tp = Tr = 10Tc    Cost(Cache Policy) / Cost(Optimal)
Optimal         2Tp + 2Tr + 4Tc      44 Tc                       1.000
Random          7.75Tp + 6Tc         83.5 Tc                     1.898
LRU             10Tp + 6Tc           106 Tc                      2.409
Table 2: Comparison of Execution Times for 1 2 3 1 2 3
The total reference costs using these three policies are shown in Table 2. In this
table, it can be seen that the ratio of Cost_Random / Cost_Bypass is 1.898 and the ratio of
Cost_LRU / Cost_Bypass is 2.409.
Notice that while placing data 1 and 2 in cache can improve system performance, 
placing datum 3 in cache actually decreases the system performance. Unfortunately, if 
bypass of the cache is not considered, the resulting performance is the worst possible — in 
fact, it is worse than if no cache were present. With selective cache bypass, one might 
simply reference datum 3 directly from main memory; yet the cache would speed-up 
references to data 1 and 2.
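The costs in Table 2 can be checked with a short simulation. The following is a minimal sketch, not the simulator used for the results in this paper: it assumes Tp = Tr = 10 Tc, charges Tc + Tp for a fetch into an empty entry and Tc + 2Tp for a fetch that replaces a line (as in Table 1), and charges Tr for a bypassed reference. The function names and the exhaustive enumeration used for random replacement are illustrative choices.

```python
from fractions import Fraction

# Assumed cost parameters (Tp = Tr = 10 Tc, as in Table 2).
TC, TP, TR = 1, 10, 10


def lru_cost(refs, capacity, bypass=frozenset()):
    """Total reference time with LRU; items in `bypass` go straight to memory."""
    cache, cost = [], 0          # cache[0] is the least recently used item
    for r in refs:
        if r in bypass:
            cost += TR           # bypassed reference: cache is left untouched
        elif r in cache:
            cache.remove(r)
            cache.append(r)      # hit: move to most recently used position
            cost += TC
        elif len(cache) < capacity:
            cache.append(r)      # miss, fetch into an empty entry
            cost += TC + TP
        else:
            cache.pop(0)         # miss, replace the least recently used item
            cache.append(r)
            cost += TC + 2 * TP
    return cost


def random_expected_cost(refs, capacity, cache=()):
    """Expected total time under uniformly random replacement (enumeration)."""
    if not refs:
        return Fraction(0)
    r, rest = refs[0], refs[1:]
    if r in cache:
        return TC + random_expected_cost(rest, capacity, cache)
    if len(cache) < capacity:
        return TC + TP + random_expected_cost(rest, capacity, cache + (r,))
    total = Fraction(0)
    for victim in cache:         # each resident item is equally likely to go
        survivors = tuple(x for x in cache if x != victim)
        total += random_expected_cost(rest, capacity, survivors + (r,))
    return TC + 2 * TP + total / len(cache)


refs = (1, 2, 3, 1, 2, 3)
print("LRU:            ", lru_cost(refs, 2))
print("LRU + bypass 3: ", lru_cost(refs, 2, bypass={3}))
print("Random (mean):  ", float(random_expected_cost(refs, 2)))
print("No cache:       ", len(refs) * TR)
```

Under these assumptions the sketch gives 106 Tc for LRU, 44 Tc when datum 3 is bypassed, and an expected 83.5 Tc for random replacement, matching Table 2; six direct references cost 60 Tc, which is why LRU without bypass is slower here than having no cache at all.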
2.2. History of Cache Bypass
Although not commonly accepted as part of traditional cache design, cache bypass is 
not entirely new.
Nearly all cache-based computers have some provision for disabling the cache so
that memory-mapped I/O transactions can take place. However, the idea of
enabling/disabling the cache for each memory reference is not well supported by most of
these systems (presumably the possibility had not been considered). These systems
typically require an entire instruction to be executed to change the cache enable state.
Despite this, such systems can be used to implement cache bypass where several
consecutive references should be bypassed.
Some machine designers also recognized that the performance of cache could be
improved by simultaneously requesting each datum from both main memory and cache.
In this scheme, if the item is found in the cache then the cached value is used and the
main memory request is cancelled or ignored. If not, the item is returned directly from
main memory to the processor, simultaneously initiating a cache update for that datum’s
line. This technique does improve performance, but may require fairly expensive
hardware and does not avert cache pollution; it merely reduces the cost of referencing
“through” the cache.
Somewhat closer in spirit to our approach, Convex Computer Corporation has
implemented a selective cache bypass mechanism in their C1 minisupercomputer. The
strategy employed is [Con86]:
Upon load or store, the physical control unit either writes the referenced data
into its cache or bypasses the cache and accesses main memory directly, leaving
the cache unmodified. All aligned 64-bit vector loads and stores result in cache
bypass. Loads and stores of aligned, contiguous 32-bit vector elements bypass
the cache as well. Since vector accesses dominate supercomputer-class
applications software, cache bypass opportunities occur frequently.
Apparently, the cache bypass mechanism is employed only on vector operations because
the C1 has a cache with a set size of one; hence, loading a vector register had the effect of
totally flushing the cache, obviously negating any benefits of caching. In any case, the
Convex scheme is quite reasonable, and was sufficiently new so as to be patented (patent
pending?); the problem is that it equates “vector” with “bypass,” and this isn’t really
correct. Some vectors should be cached and some scalars shouldn’t be, but on the average
the Convex scheme is right often enough to yield a big improvement.
In contrast, the current proposal for cache bypass is to use a compile-time static
analysis of the reference behavior of each program to compute a “cache/bypass” tag for
each memory reference the compiled code makes. These tags are used at runtime to
control a cache enable/disable line.
3. Implementing Cache Bypass
As shown in the example of Section 2.1, LRU referencing of all data through the 
cache actually performed worse than if no cache were present.
There are two main reasons for this phenomenon. First, there is often a large time
overhead implied in moving lines of data between cache and main memory. This
overhead increases as the cache line size is increased. Consequently, fetching a line into
cache can improve system performance iff the total number of references to data in that
line (before that line is replaced) is such that the savings in referencing cache outweigh
the overhead of moving that line between cache and main memory. If not, the total time to
make these references will be minimized by ignoring the cache and bypassing it to directly
reference main memory. Even if the cache is infinitely large, this still holds.
Second, since all real caches are finite, placing one line in cache generally means that
some other line cannot be in cache. Hence, placing infrequently referenced lines into
cache not only adds a large overhead to total memory access time, but also prevents
speed-up that could have been gained if some other (more heavily referenced) line were
placed in cache. This effect is what we call “cache pollution.”
Since minimizing the total memory access time is our goal in selective cache bypass
and the total access time depends on both the architectural design and the
implementation technology of the cache and main memory, some details must be supplied.
In the remainder of this paper, we have chosen to discuss cache bypass assuming that the
supplied information is that of a typical system; this greatly simplifies the following
discussion and reduces the number of graphs needed to support the rest of the paper. For
example, the simulations and examples presented in this paper are based on the
assumption that LRU is the basic cache management technique and that “typical” CMOS
or NMOS ICs implement the relevant system components. This implies, for example, that a
main memory reference takes about 10 times as long as a cache reference; in reality,
this ratio varies from about 2:1 to greater than 50:1. Of course, the use of specific
numbers in the examples and discussion is not indicative of the technique requiring those
exact numbers: the technique works for most reasonable cache organizations; only the
percentage benefit gained varies.
In Section 3.1, a brief discussion of current IC technologies and their impact on
memory access time is given. Criteria or rules to determine whether a reference request
should bypass the cache and reference main memory directly are presented in
Section 3.2. Section 3.3 gives a very simple and cheap, yet efficient, way to incorporate a
cache bypass mechanism with an LRU policy. Practical implementation schemes for
cache bypass control signals to be added to existing systems are presented in Section 3.4.
3.1. Integrated Circuit Technologies
Integrated circuit (IC) technology is one of the major parameters in the criteria for
the cache bypass mechanism (discussed in the next section). Hence, a brief survey of
current IC technologies and their impact on off-chip and on-chip memory reference time
is necessary. Table 3 gives the on-chip and off-chip memory access times for some of the
current integrated circuit technologies [MiF86]. From this table, we see that the ratio of
off-chip (off-package) to on-chip memory access times is at least 10. Using this ratio, an
estimate of the minimum reference frequency that a line needs to justify its placement in
cache can be obtained.
Type of Access                                       Silicon CMOS/SOS   Silicon NMOS   GaAs
On-chip memory access                                10-20ns            10-20ns        0.5-2.0ns
Off-chip on-package memory access                    40-80ns            20-40ns        4-10ns
Off-chip off-package memory access                   100-200ns          100-200ns      20-80ns
Ratio of off-chip on-package to on-chip access       4                  2              5-8
Ratio of off-chip off-package to on-chip access      10                 10             40
Table 3: Memory Access Time of Different IC Technologies
3.2. Criteria for Cache Bypass Mechanism
Throughout the current work, the main focus is the reduction of total memory
reference time for a program. Hence, the criteria proposed here are based on the comparison
between the time overhead involved in having a line in cache and the total reference time
saved by referencing data in a line in cache.
The time overhead of placing a line in cache is the transfer time for all data of that
line from main memory to cache. If any dirty line (see footnote 1) is bumped out of a
write-back cache, a similar transfer time to update the main memory is also included in
this overhead. Since the amount of data transferred between main memory and cache is
constant for a cache design, this overhead depends only on architecture design and
implementation technology, and is independent of program behavior.
1. A line in cache is considered dirty iff some portion of the value it contains
does not match the value stored in the corresponding main memory line.
On the other hand, the time savings for placing a line in cache accumulate every
time data in that line is referenced. Hence, the savings are, in addition, program
dependent.
There are additional factors which can influence the costs and the savings of 
placing/replacing a line in cache, resulting in slightly different cache bypass decisions of 
references in a program. For example, if a reference is going to bypass the cache and 
directly reference main memory, the average probability of bumping a line from cache 
decreases, and cache space could also be viewed as available to other lines.
These effects are easily recognized and advantageously used in the cache bypass
mechanism. In fact, a complete analytical model of the cache bypass mechanism for
common cache replacement policies, taking all these factors into consideration, can easily
be derived from the compiler-driven cache (SCP) model [ChD87] [ChiD88]. While the SCP
model can fully account for cache bypass, and can promise optimal performance, the
complete SCP model does entail relatively complex analysis and compiler technology; hence,
the technique presented here is a sub-optimal, but quite effective and simple,
approximation to the SCP model (see footnote 2).
To define an algorithm for determining when to bypass the cache for a particular 
reference, some definitions and notations are useful.
overhead(i) = time overhead of placing/replacing line i in cache
saving(i) = time saving of having line i in cache before it is replaced
n(i) = total number of references to line i in cache before it is replaced
With the cost notations defined in Table 1, overhead(i) and saving(i) are as follows:
If no dirty line is bumped out of cache, the overhead is:
overhead(i) = Tp
If a dirty line is replaced (bumped) from the cache, then the overhead is:
overhead(i) = 2 * Tp
The savings for having line i in cache (before it is replaced) is:
saving(i) = n(i) * (Tr - Tc)
In order for a reference to line i to bypass the cache, the overhead overhead(i) must be
greater than or equal to the total time savings saving(i). Only when the savings exceed
the overhead can the placement of line i in cache contribute to improving system
performance.
2. In fact, if the SCP model is used with a more radically redesigned cache,
performance is much better than using a Bypass-Cache and the analysis is
essentially the same. Hence, we feel that if one wants to achieve optimal
performance, one should be willing to make the more drastic hardware and
software changes to support it; here, we have simply given a technique
whereby only trivial hardware and software changes result in large, but
sub-optimal, performance gains.
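The bypass criterion of this section can be written out directly. The following is a minimal sketch under stated assumptions: times are integer multiples of Tc, the function and parameter names are illustrative and not part of any existing tool, and the break-even helper is simply the smallest n(i) for which saving(i) exceeds overhead(i).

```python
def overhead(tp, dirty_victim=False):
    """overhead(i): Tp, or 2 * Tp when a dirty line must be written back."""
    return 2 * tp if dirty_victim else tp


def saving(n_refs, tr, tc):
    """saving(i) = n(i) * (Tr - Tc)."""
    return n_refs * (tr - tc)


def should_bypass(n_refs, tp, tr, tc, dirty_victim=False):
    """Bypass the cache when overhead(i) >= saving(i)."""
    return overhead(tp, dirty_victim) >= saving(n_refs, tr, tc)


def min_refs_to_cache(tp, tr, tc, dirty_victim=False):
    """Smallest n(i) with saving(i) > overhead(i); assumes integer times."""
    return overhead(tp, dirty_victim) // (tr - tc) + 1


# Example with the "typical" ratios used in this paper: Tp = Tr = 10 Tc.
print(min_refs_to_cache(tp=10, tr=10, tc=1))                      # clean victim
print(min_refs_to_cache(tp=10, tr=10, tc=1, dirty_victim=True))   # dirty victim
```

With Tp = Tr = 10 Tc, a line must be referenced in cache at least twice before replacement to justify its placement, or at least three times if placing it forces a dirty line to be written back.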
3.3. Algorithm for LRU Bypass-Cache
In this section, LRU (least recently used) cache replacement is chosen as the basic
scheme and the cache bypass control is added on top of this policy. We have chosen to
discuss an LRU Bypass-Cache because the basic LRU policy is probably the most
commonly used and most commonly trusted to yield good performance. Hence, the
comparisons of simulated performance with/without cache bypass (in Section 4) are very good
estimates of the expected improvement derived by converting commonly available
computers to use Bypass-Cache instead of traditional cache.
In this section, a fast, simple, efficient (yet sub-optimal) algorithm to determine
when a reference should bypass the cache is proposed. The algorithm is based on the
concept of a trace, as discussed in trace scheduling techniques used for automatic
parallelizing compilers [Ell85]. The procedure to determine, for each reference in the program,
whether to bypass or to reference through the cache is:
1. Perform traditional flow analysis and build the program flow graph. (This step
should be considered “free” because any good compiler will use this same analysis to
aid in generating efficient code.)
2. For each trace (a possible control flow path which has not yet been processed), do
the following:
a. Mark all references in this trace as “cachable” (put in cache).
b. Scan this trace, keeping track of which items would be resident in cache
assuming that all items marked as cachable are always referenced through the cache
and that LRU is used to determine which item is bumped from cache when line
replacement occurs. As the references are scanned, the time overhead and
savings realized for each cachable line are accumulated. As a simple heuristic, the
savings for referencing an item within a loop is multiplied by a factor of 10
(see footnote 3).
c. At the end of the trace, mark all references which have a larger overhead than
savings as “non-cachable”.
d. The above set of markings can be somewhat improved, although not made
optimal, by repeating steps 2b and 2c. Such repetition is, however, completely
optional. All the simulation results given in this paper used only a single pass.
This algorithm, although very crude and simple, reaps speedups ranging from a few
percent to a factor of nearly 100, depending on the cache configuration and the
benchmark used. Speedups greater than 2 are not unusual for commonly used cache
configurations.
3. This is a rough approximation to weighting each reference in the trace by its
expected number of executions; it assumes each loop executes an average
of 10 times. If the compiler has a better estimate, this can be used instead.
Techniques for the compiler to make more intelligent estimates of expected
execution frequencies are discussed in [Die87].
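Steps 2a through 2c for a single trace can be sketched as follows. This is an illustrative reading of the procedure, not the compiler pass used for the simulations: it assumes a fully-associative cache with a line size of one, represents a trace as a list of hypothetical (line, in_loop) pairs, counts savings only on hits, charges 2 * Tp whenever a line is replaced (as in Table 1), and uses Tp = Tr = 10 Tc.

```python
from collections import OrderedDict

# Assumed cost parameters and loop weight (see footnote 3 for the weight).
TC, TP, TR = 1, 10, 10
LOOP_WEIGHT = 10


def mark_trace(trace, capacity):
    """Single pass over one trace; returns {line: True if cachable}.

    trace    -- list of (line, in_loop) pairs in execution order
    capacity -- number of lines the (fully-associative) cache holds
    """
    cachable = {line: True for line, _ in trace}        # step 2a
    overhead = {line: 0 for line in cachable}
    saving = {line: 0 for line in cachable}
    cache = OrderedDict()                               # oldest entry first

    for line, in_loop in trace:                         # step 2b
        if line in cache:
            cache.move_to_end(line)                     # hit: now most recent
            weight = LOOP_WEIGHT if in_loop else 1
            saving[line] += weight * (TR - TC)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)               # bump the LRU line
                overhead[line] += 2 * TP
            else:
                overhead[line] += TP
            cache[line] = True

    for line in cachable:                               # step 2c
        if overhead[line] > saving[line]:
            cachable[line] = False
    return cachable


# Two reused lines (a, b) interleaved with lines that are referenced once.
names = "a x1 b x2 a x3 b x4 a b".split()
marks = mark_trace([(n, False) for n in names], capacity=4)
print(marks)
```

In this example the reused lines a and b each accumulate two hits (a saving of 18 Tc against an overhead of 10 Tc) and stay cachable, while the lines referenced only once are marked non-cachable. Step 2c uses a strict comparison, as stated in the procedure above.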
3.4. Implementation of Bypass Control
With the results of compiler analysis of a program (or with statistical results
gleaned from previous runs), the bypass/cache question is easily answered with good
enough accuracy so as to permit huge performance increases. However, this information
must be transmitted to the Bypass-Cache control logic for each reference. The
information for each reference requires only a single bit: a 1 means “bypass” and 0 means “go
through the cache.” The natural question is how does the compiler get this one bit of
information for each reference into the Bypass-Cache control at runtime?
There are a number of alternative solutions to this problem and each of these
solutions trades off some resources or capabilities.
The conceptually easiest and most efficient way to transmit this cache bypass
information is to embed a bit in each instruction for each memory reference the instruction
may cause. For new machine design, this is fairly convenient; reserving a control bit to
obtain speedups of total memory access time by factors of 2 or more is virtually always
worthwhile. Also, existing machines with at least one currently unused bit in each
instruction should probably use this implementation.
Alternatively, the instruction set of the machine can be expanded to include explicit
Bypass-Cache control instructions. In fact, these instructions exist for virtually all
computers which have cache. An extreme example of this explicit cache control is the IBM
801, where individual cache lines can be explicitly allocated and deallocated; most systems
simply permit the cache to be enabled/disabled as a whole. Since bypasses may come in
“clumps”, even this crude bypass control can gain some improvement; however, bypasses
do not always come in clumps. By defining a new instruction specifically to implement
Bypass-Cache control, one could permit each cache control instruction to set the pattern
of bypass/cache decisions for the next n references, where n is somewhat less than the
machine word length. Again, some performance would be gained, but the high frequency
of Bypass-Cache control instructions would limit performance.
While all the above schemes have some merit, there is another scheme which both
permits a cache control bit to be associated with each instruction and does not require
changes in the instruction set design or encoding. In current machine designs, the
addressable space is typically very large and programs rarely use the entire addressable
space of the machine. Thus, it is possible to trade one address bit (e.g., the most
significant bit of an address) for use as the control bit for the Bypass-Cache. In fact, this
solution is suggested by Intel in their 80386 programmer’s reference manual [Int86] as a
way to provide a cache control bit for use in multiprocessor cache coherency control.
In the worst case, this effectively reduces the addressable space by 50% (see footnote 4).
Of course, it also causes the compiler writer a bit of grief in that not only must all
addresses be correctly tagged, but the compiler must also be careful about operations such
as pointer arithmetic or comparisons.
4. The actual address space may not be affected because address mapping
mechanisms may be able to circumvent the loss.
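The address-bit scheme can be illustrated with a few lines of bit manipulation. This is only a sketch: the 32-bit address width and the helper names are assumptions made for the example, and in a real system the cache controller, not software, would strip the tag.

```python
ADDRESS_BITS = 32                        # assumed address width
BYPASS_BIT = 1 << (ADDRESS_BITS - 1)     # most significant bit carries the tag
ADDRESS_MASK = BYPASS_BIT - 1            # remaining bits are the real address


def tag_address(addr, bypass):
    """Return addr with the bypass bit set (1 = bypass, 0 = through cache)."""
    if addr & ~ADDRESS_MASK:
        raise ValueError("address does not fit in the reduced address space")
    return addr | BYPASS_BIT if bypass else addr


def decode_address(tagged):
    """Split a tagged address into (real address, bypass flag)."""
    return tagged & ADDRESS_MASK, bool(tagged & BYPASS_BIT)


through = tag_address(0x1000, bypass=False)
direct = tag_address(0x1000, bypass=True)
print(hex(through), hex(direct))
print(decode_address(direct))
# The two tagged values name the same location but do not compare equal,
# which is the pointer-comparison hazard the compiler must handle.
print(through == direct)
```

The final comparison shows why the compiler must strip or normalize the tag before pointer comparisons or arithmetic.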
Other methods, such as using a separate cache controller to explicitly control the 
cache (similar to the remote PC idea [Rad83]) are also possible. However, the overhead 
and the synchronization cost involved may be too large to be practical.
4. Simulation Results
To measure the effect of cache bypass in reducing total reference time, detailed
simulation of the LRU Bypass-Cache was performed using the single-pass compiler
algorithm described above. For comparison, the same simulations were performed using a
conventional LRU cache with the same configuration as the Bypass-Cache.
The benchmark programs were taken from the DARPA MIPS package, and are
widely used as benchmarks of cache and/or system performance. Data are given for four
of these programs:
Bubble
A typical bubble sort program, executed on a set of 500 random data.
Puzzle
This is a compute-bound program from Forest Baskett, run with a size of 511.
Realmm
A program which performs a matrix multiplication of two real matrices, each of
which is 40 by 40.
Tower
The standard recursive tower-of-Hanoi solution, given the problem of moving 18
disks.
Each of the programs was simulated for about 500,000 references of execution; hence
“cold start” cache effects are negligible.
Since our primary concern is minimizing the total reference time, rather than
maximizing hit ratio, it was also necessary to assume specific ratios of reference times for each
of the different types of reference. The cost functions used for the data in this paper were
based on cost estimates for a typical CMOS-based system:
• Cost of referencing data from cache is 1 time unit.
• Cost of referencing data from main memory is 10 time units.
• Cost of placing a line in an empty or non-dirty cache entry is
10 + (line_size - 1) * 7 time units.
The fact that fetching/storing n consecutive data into/from cache in one request takes
less time than fetching/storing n data in n requests is reflected in the above costs. We
were actually quite generous in this assumption, using a formula giving a 30% benefit for
multi-word fetch/store; however, this simply has the effect of making the benefit due to
Bypass-Cache appear smaller.
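The line placement cost above, and the benefit it grants to multi-word transfers, can be tabulated with a small sketch. The function names are illustrative; the benefit is computed here relative to fetching each word of the line with a separate 10-unit main memory reference.

```python
CACHE_REF = 1      # cost of a reference satisfied from cache
MEMORY_REF = 10    # cost of a single-word main memory reference


def line_fill_cost(line_size):
    """Cost of placing a line in an empty or non-dirty cache entry."""
    return 10 + (line_size - 1) * 7


def multiword_benefit(line_size):
    """Fraction saved versus fetching each word with its own memory request."""
    return 1 - line_fill_cost(line_size) / (MEMORY_REF * line_size)


for size in (1, 2, 4, 8, 16, 32):
    print(size, line_fill_cost(size), round(multiword_benefit(size), 3))
```

The benefit is zero for a line size of one and approaches 30% as the line size grows, since each additional word costs 7 units rather than 10. The table also shows how quickly the placement overhead grows with line size relative to the 10-unit cost of a single bypassed reference.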
To make the simulations as complete as possible, all possible power-of-2 cache
organizations (e.g., different line sizes, set sizes) for a fixed cache size of 128 words
(see footnote 5) were simulated and are presented in this paper. The absolute reference
times for the different benchmarks naturally differ; however, the speedups and curve
shapes are fairly consistent across all the simulations.
Figures 4 through 7 graph speedup of total memory reference times with
Bypass-Cache as compared to the same configuration conventional cache. Each curve in the
graphs is marked with the power-of-2 which was used as the associative set size. These
graphs clearly demonstrate that the speedup in total memory reference time using
Bypass-Cache is very large; in fact, it is plotted on a log scale, and averages about 2.
The speedup with Bypass-Cache is usually smallest for a line size of one or two.
With an increase in line size (leaving cache size and set size fixed), the speedup with
Bypass-Cache increases greatly. This confirms the argument given in Section 3: a larger
line size implies a larger overhead in cache line placement and replacement. Although the
total number of references to a line increases with increasing line size, this increase is
much less than the increase in overhead. Consequently, cache more easily becomes polluted,
and the Bypass-Cache becomes more critical in improving system performance.
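One way to make the overhead argument concrete is a break-even count under the cost model of this section. The sketch below is an illustration, not a result from the report. It assumes that the first reference to a line pays the full line fill, that the remaining references hit in cache, and that each bypassed reference costs one main-memory access. It ignores the further cost of evicting a useful line, which would only raise the threshold.

```python
CACHE_HIT_COST = 1     # time units, from the cost list in this section
MEMORY_REF_COST = 10
EXTRA_WORD_COST = 7


def line_fill_cost(line_size):
    """Cost of placing a line in an empty or non-dirty cache entry."""
    return MEMORY_REF_COST + (line_size - 1) * EXTRA_WORD_COST


def break_even_references(line_size):
    """Smallest k for which caching a line beats bypassing all k references to it.

    Assumption: the first reference pays the line fill, the remaining k - 1
    references hit in cache, and a bypassed reference costs one memory access.
    """
    fill = line_fill_cost(line_size)
    k = 1
    while fill + (k - 1) * CACHE_HIT_COST >= k * MEMORY_REF_COST:
        k += 1
    return k


if __name__ == "__main__":
    for line_size in (1, 2, 4, 8, 16):
        print(line_size, break_even_references(line_size))
```

Under these assumptions a one-word line pays off after 2 references, while a 16-word line needs 13 references before eviction, which illustrates why polluting references become more expensive as line size grows.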
These curves also show that the speedup with Bypass-Cache is usually smaller for
caches with small set size (fixed cache size and line size). Although the cause of this is not
yet known, we suspect that it is related to the increase in traffic seen by each cache set
as set size grows (because there are fewer sets). Even though the speedup is much smaller
in these cases, it is still typically about 1.2 (i.e., 20 percent).
Figure 8 shows the total reference time for the Tower benchmark. The dotted lines
indicate the times taken using conventional cache, whereas the solid lines show the times
taken with Bypass-Cache.
Aside from the obvious benefit in using Bypass-Cache, this graph suggests an
interesting general cache design rule. If the total memory reference time is to be
minimized, rather than the hit-ratio maximized, it is usually better to choose
small line size and small set size. This makes perfect sense in that although large line
sizes increase hit-ratio, they imply overhead increases which are greater than the hit-ratio
increases; in fact, exponentially greater. That increasing set size is not beneficial is less
intuitive, but probably is related to the increased traffic per set and use of a poor
replacement algorithm (i.e., one can do a whole lot better than LRU [ChD87]).
Footnote 5: About 500 simulations were performed, encompassing a wide variety of
cache sizes and configurations. However, all the simulation results obtained
were very consistent, hence we have chosen to present only the data for the
largest cache size we examined, 128 words. Other simulation data are
available upon request.
For Bypass-Cache, the difference in total memory access time for different line sizes
(with the same cache size and set size) is not as great as that for cache without bypass.
This is true because a lot of cache pollution can be avoided with Bypass-Cache.
Figure 4: Speedup in Total Reference Time for Bubble (horizontal axis: line size, log scale)
Figure 5: Speedup in Total Reference Time for Puzzle (horizontal axis: line size, log scale)
Figure 6: Speedup in Total Reference Time for Realmm (horizontal axis: line size, log scale)
Figure 7: Speedup in Total Reference Time for Tower (horizontal axis: line size, log scale)
Figure 8: Total Reference Time With/Without Bypass for Tower (horizontal axis: line size,
log scale; with bypass is solid lines, without bypass is dotted lines)
5. Conclusion
In this paper, we present a new cache design, Bypass-Cache, which is able to
avert polluting the cache by bypassing the cache for entries for which caching would not
result in faster total execution time. From our simulation results, we see that the
speedup is tremendous, with an average of about 2. Various methods for implementing
the Bypass-Cache architecture are presented as well as an outline of the compiler
technology required for its effective use.
Perhaps the most significant result, however, is that cache hit ratio is not
necessarily related to the total reference time. This will be discussed more deeply in a
later paper.
Acknowledgements
Thanks to the members of CARP (the Compiler-oriented Architecture Research
group at Purdue) for their useful comments on this work. Special thanks to George
Adams for his suggestions concerning the presentation of the results and also for coining
the name Bypass-Cache.
References
[AlB86] Allen, R., Baumgartner, D., Kennedy, K., Porterfield, A., “PTOOL: A Semi-Automatic Parallel Programming Assistant,” 1986 International Conference on Parallel Processing, August 1986, pp. 164-170.
[Bel74] Belady, L.A., Palermo, F.P., “On-line Measurement of Paging Behavior by the Multi-valued MIN Algorithm,” IBM Research and Development, 18, 1, January 1974, pp. 2-19.
[BuC86] Burke, M., Cytron, R., “Interprocedural Dependence Analysis and Paral-
“C1 Processor Series: Architecture,” Convex Computer Corporation, 1986.
Chi, C.H., Dietz, H., “Compiler-Driven Cache Policy,” Technical Report EE-87-21, Purdue University, May 1987.
Chi, C.H., Dietz, H., “Register Allocation for GaAs Computer Systems,” Proceedings of the 1988 Hawaii International Conference on Systems Sciences, January 1988, pp. 266-274.
Dietz, H. G., The Refined-Language Approach To Compiling For Parallel Supercomputers, Ph.D. Dissertation, Polytechnic University, June 1987.
Ellis, J. R., Bulldog: A Compiler for VLIW Architectures, 1985 ACM Doctoral Dissertation Award, MIT Press, 1986.
Hwang, K., Briggs, F.A., Computer Architecture and Parallel Processing, McGraw Hill Book Company, 1984.
Intel Corporation, 80386 Programmer’s Reference Manual, 1986, pp. 11-6.
Radin, G., “The 801 Minicomputer,” IBM Journal of Research and Development, May 1983, pp. 237-246.
Smith, A.J., “Cache Memories,” Computing Surveys, Vol. 14, No. 3, September 1982, pp. 473-530.
Spirn, J., Program Behavior: Models and Measurements, Elsevier-North Holland, N.Y., 1977.
