Purdue University

Purdue e-Pubs
Department of Electrical and Computer
Engineering Technical Reports

Department of Electrical and Computer
Engineering

7-1-1988

Improving Cache Performance by Selective Cache
Bypass
Chi-Hung Chi
Purdue University, chi@ec.ecn.purdue.edu

Henry Dietz
Purdue University, hankd@ee.ecn.purdue.edu

Follow this and additional works at: https://docs.lib.purdue.edu/ecetr
Chi, Chi-Hung and Dietz, Henry, "Improving Cache Performance by Selective Cache Bypass" (1988). Department of Electrical and
Computer Engineering Technical Reports. Paper 617.
https://docs.lib.purdue.edu/ecetr/617

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact epubs@purdue.edu for
additional information.


Improving Cache
Performance by
Selective Cache Bypass
Chi-Hung Chi
Henry Dietz

TR-EE 88-36
July 1988
School of Electrical Engineering
Purdue University
West Lafayette, Indiana 47907

Improving Cache Performance by
Selective Cache Bypass

Chi-Hung Chi
School of Electrical Engineering
Purdue University
West Lafayette, IN 47907
chi@ec.ecn.purdue.edu
(317) 494 3353

Henry Dietz
School of Electrical Engineering
Purdue University
West Lafayette, IN 47907
hankd@ee.ecn.purdue.edu
(317) 494 3357

In traditional cache-based computers, all memory references are made through the
cache. However, a significant number of the items referenced in a program are
referenced so infrequently that other cache traffic is certain to “bump” these items from
cache before they are referenced again. In such cases, not only is there no benefit in
placing the item in cache, but there is the additional overhead of “bumping” some other
item out of cache to make room for this useless cache entry. Where a cache line is larger
than a processor word, there is an additional penalty in loading the entire line from
memory into cache, whereas the reference could have been satisfied with a single word
fetch. Simulations have shown that these effects typically degrade cache-based system
performance (average reference time) by 10% to 30%.
This performance loss is due to cache pollution; by simply forcing “polluting”
references to reference main memory directly, bypassing the cache, much of this
performance can be regained. The technique proposed in this paper involves the use of new
hardware, called a Bypass-Cache, which, under program control, will determine
whether each reference should go through the cache or bypass the cache and reference
main memory directly. Several inexpensive heuristics for the compiler to determine
how to make each reference are given.
Keywords: bypass-cache, cache-pollution, cache, compiler-analysis, compiler-optimization,
execution-time.
Presentation materials needed: overhead projector.

1. Introduction
Advances in supercomputing and semiconductor technologies have made it possible
to design and build high performance computer systems with many processors. However,
the performance of these systems is often limited by memory reference bandwidth. While
the execution of each operation has become very fast, the time to fetch each datum from
main memory (or from another processor’s local memory) is at least an order of magnitude
longer than the processor operation time, and also an order of magnitude longer than
the reference time from on-chip or local memory. Use of a cache seems a natural way to
attack this mismatch.
It is widely accepted that cache memory is a cost effective way to improve system
performance by using locality properties to improve apparent average memory access
time. Significant reductions in the average data/instruction access time have been
achieved using very simple cache placement/replacement policies implemented in
hardware [Bel74]. If anything, the success of cache has been too complete; the desirability
of caching items is rarely questioned and basic research on cache design generally has
been reduced to the level of benchmarking and fine-tuning a few well-known parameters.
For example, since cache reference time is so much less than main memory reference
time, it is commonly held that as many data as possible should be placed in cache. One
typically measures the efficacy of a cache design by determining the cache hit ratio, the
fraction of memory references which are satisfied by cache entries. The problem is simply
that it is not always beneficial to fetch a line into the cache on a cache miss, even if the
cache is infinitely large: increasing cache hit ratio sometimes reduces system
performance. Other criteria like memory traffic have occasionally been used instead of cache
hit ratio, but these measures are also somewhat imprecise and indirect. If one wants to
minimize total memory reference time, then that is the obvious measure by which cache
performance should be judged. Throughout this paper, cache performance is measured in
terms of the effect on total memory reference time.
Why are the more commonly used cache performance criteria inaccurate measures
of system performance? There is always an overhead associated with fetching a line from
memory into cache. If the benefit gained from having that line in cache is not greater
than the overhead that loading the cache line implies, then it is faster to reference the
data of that line directly from main memory. This is true even if the cache is infinitely
large, but even more dramatically true with smaller caches. If some mechanism can be
used to selectively disable or bypass the cache for those references which cache cannot
improve:
[1] the cost of loading the cache with these lines is saved and
[2] for finite-size caches, more cache space becomes available to other references and the
    probability of accidentally replacing useful lines (those lines that can help improve
    system performance) is reduced; there will be less cache pollution.
Simulation results, reported in Section 4, strongly support this view. An average of 10%
to 80% reduction in total reference time can be achieved simply by using the proposed
cache bypass mechanism.
Section 2 of this paper presents a survey of current cache designs and bypass
concepts. Section 3 discusses the cache bypass mechanism and how the cache bypass control
information can be implemented in practical hardware. Section 4 presents simulation
results. Continuing research on the cache bypass mechanism is described in Section 5.
2. Current Cache Designs and Bypass Concepts
Before investigating the mechanism for, and benefits of, selective cache bypass, it is
useful to briefly survey existing cache management policies; in part, this highlights where
the extra performance comes from, but it also clarifies the constraints these traditional
policies impose on the cache bypass mechanism. Examples illustrate why some
constraints imposed by previous cache replacement policies often cause a large decrease in
system performance, as well as how eliminating some of these constraints can regain
much of the lost performance.
This discussion serves the purpose of illustrating the importance of cache bypass and
of giving motivation to research this topic. In the last part of this section, we briefly
describe the cache bypass mechanism used in the C1 minisupercomputer manufactured by
Convex Computer Corporation [Con86]. Although the strategy used for cache bypass in
the C1 is very limited, it does demonstrate the importance of incorporating a bypass
mechanism.
2.1. Traditional Replacement Policies
Replacement policy is defined as the set of rules by which the choice of which cache
line to replace is made when the cache is full and a new line is to be fetched from
main memory into the cache [HwB84]. Replacement policies such as LRU (least recently
used), random replacement, FIFO (first-in first-out), etc., are commonly used in current
cache designs.
Although each of these traditional cache replacement policies has its own unique
technique for placing and/or replacing cache lines, the option of deciding not to put the
requested line in cache was not considered. In all conventional cache replacement policies,
immediately after each reference, the line referenced is in cache. This implies that
whenever a cache miss occurs, the missed line must be fetched into the cache, and
this line fetch is independent of whether the fetched line would improve system
performance.
The main argument for this constraint is that, since the reference time of data in cache
is much smaller than that from main memory, and given the spatial and temporal behavior
of program references [Spi77], having the currently referenced line in cache has a high
probability of improving system performance. While this argument is generally
true, it is possible to predict with good certainty exactly which lines will not contribute to
improving performance; without such prediction, it is easy to envision scenarios where the
cache would replace lines it should have kept with lines that will never again be
referenced. This leads to a worst-case scenario in which a machine runs slower with cache
than without it. By bypassing the cache, and hence avoiding this pollution, this worst-case
scenario is averted.
An example of this problem is easily constructed. Suppose there is a fully-associative
cache of size two, line size one, and the memory reference string is

    1 2 3 1 2 3.

(It is interesting to note that this example is exactly the kind of reference sequence one
would get in executing a typical loop which references more data than there are cache
cells, which is well known to be the worst case for LRU.) With the cost of different types
of memory references shown in Table 1 (and the line style used to represent each), the
cache contents after each reference with random replacement, LRU, and modified LRU
with cache bypass mechanism are shown in Figures 1, 2, and 3.

Cost (Time)      Type of Reference
T_c              Reference from Cache
T_r              Reference from Main Memory
T_c + T_p        Reference through Cache (with Fetch to Empty Cache Line)
T_c + 2(T_p)     Reference through Cache (with Replacement of a Cache Line)

(Line Pattern column: graphic line styles keyed to Figures 1 to 3, not reproduced.)

Table 1: Cost for Each Type of Memory Reference

Figure 1: Random Replacement Transactions for 1 2 3 1 2 3

Figure 2: LRU Transactions for 1 2 3 1 2 3

Figure 3: Modified LRU with Cache Bypass for 1 2 3 1 2 3

Cache Policy   Cost                     Cost with T_p = T_r = 10 T_c   Cost(Cache Policy) / Cost(Optimal)
Optimal        2 T_p + 2 T_r + 4 T_c    44 T_c                         1.000
Random         7.75 T_p + 6 T_c         83.5 T_c                       1.898
LRU            10 T_p + 6 T_c           106 T_c                        2.409

Table 2: Comparison of Execution Times for 1 2 3 1 2 3
The total reference costs using these three policies are shown in Table 2. In this
table, it can be seen that the ratio Cost_Random / Cost_Bypass is 1.898 and the ratio
Cost_LRU / Cost_Bypass is 2.409.
Notice that while placing data 1 and 2 in cache can improve system performance,
placing datum 3 in cache actually decreases the system performance. Unfortunately, if
bypass of the cache is not considered, the resulting LRU performance is the worst possible;
in fact, it is worse than if no cache were present. With selective cache bypass, one might
simply reference datum 3 directly from main memory; yet the cache would speed up
references to data 1 and 2.
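
The costs in Table 2 can be checked mechanically. The following Python sketch replays the reference string under the Table 1 cost model, assuming T_c = 1 and T_p = T_r = 10. The function names are illustrative and do not come from the report, and the random-replacement figure is computed as an expected value over equally likely victim choices, which is an assumption about how that table entry was obtained.

```python
from fractions import Fraction

T_C, T_R, T_P = 1, 10, 10  # assumed unit costs: T_p = T_r = 10 T_c


def lru_cost(refs, size, bypass=frozenset()):
    """Total time for a fully associative LRU cache with line size one.

    Items in `bypass` are read directly from main memory and never cached.
    Every replacement is charged 2 * T_P, following Table 1.
    """
    cache, total = [], 0  # least recently used item first
    for item in refs:
        if item in bypass:
            total += T_R
        elif item in cache:
            cache.remove(item)
            cache.append(item)
            total += T_C
        elif len(cache) < size:
            cache.append(item)
            total += T_C + T_P
        else:
            cache.pop(0)          # bump the least recently used item
            cache.append(item)
            total += T_C + 2 * T_P
    return total


def random_cost(refs, size, cache=frozenset()):
    """Expected total time when the victim is chosen uniformly at random."""
    if not refs:
        return Fraction(0)
    item, rest = refs[0], refs[1:]
    if item in cache:
        return T_C + random_cost(rest, size, cache)
    if len(cache) < size:
        return T_C + T_P + random_cost(rest, size, cache | {item})
    outcomes = [random_cost(rest, size, (cache - {v}) | {item}) for v in cache]
    return T_C + 2 * T_P + sum(outcomes) / len(outcomes)


refs = (1, 2, 3, 1, 2, 3)
print(lru_cost(refs, 2))              # plain LRU: 106
print(lru_cost(refs, 2, bypass={3}))  # LRU with datum 3 bypassed: 44
print(float(random_cost(refs, 2)))    # expected random replacement: 83.5
```

For comparison, referencing all six items directly from main memory would cost 60 under the same assumptions, which is why plain LRU is worse than having no cache at all for this string.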

2.2. History of Cache Bypass
Although not commonly accepted as part of traditional cache design, cache bypass is
not entirely new.
Nearly all cache-based computers have some provision for disabling the cache so
that memory-mapped I/O transactions can take place. However, the idea of
enabling/disabling the cache for each memory reference is not well supported by most of
these systems (presumably the possibility had not been considered). These systems
typically require an entire instruction to be executed to change the cache enable state.
Despite this, such systems can be used to implement cache bypass where several
consecutive references should be bypassed.
Some machine designers also recognized that the performance of cache could be
improved by simultaneously requesting each datum from both main memory and cache.
In this scheme, if the item is found in the cache then the cached value is used and the
main memory request is cancelled or ignored. If not, the item is returned directly from
main memory to the processor, simultaneously initiating a cache update for that datum’s
line. This technique does improve performance, but may require fairly expensive
hardware and does not avert cache pollution; it merely reduces the cost of referencing
“through” the cache.
Somewhat closer in spirit to our approach, Convex Computer Corporation has
implemented a selective cache bypass mechanism in their C1 minisupercomputer. The
strategy employed is [Con86]:
    Upon load or store, the physical control unit either writes the referenced data
    into its cache or bypasses the cache and accesses main memory directly, leaving
    the cache unmodified. All aligned 64-bit vector loads and stores result in cache
    bypass. Loads and stores of aligned, contiguous 32-bit vector elements bypass
    the cache as well. Since vector accesses dominate supercomputer-class
    applications software, cache bypass opportunities occur frequently.
Apparently, the cache bypass mechanism is employed only on vector operations because
the C1 has a cache with a set size of one; hence, loading a vector register had the effect of
totally flushing the cache, obviously negating any benefits of caching. In any case, the
Convex scheme is quite reasonable, and was sufficiently new so as to be patented (patent
pending?); the problem is that it equates “vector” with “bypass,” and this isn’t really
correct. Some vectors should be cached and some scalars shouldn’t be, but on the average
the Convex scheme is right often enough to yield a big improvement.
In contrast, the current proposal for cache bypass is to use a compile-time static
analysis of the reference behavior of each program to compute a “cache/bypass” tag for
each memory reference the compiled code makes. These tags are used at runtime to
control a cache enable/disable line.
3. Implementing Cache Bypass
As shown in the example of Section 2.1, LRU referencing of all data through the
cache actually performed worse than if no cache were present.
There are two main reasons for this phenomenon. First, there is often a large time
overhead implied in moving lines of data between cache and main memory. This
overhead increases as the cache line size is increased. Consequently, fetching a line into cache
can improve system performance iff the total number of references to data in that line
(before that line is replaced) is such that the savings in referencing cache outweigh the
overhead of moving that line between cache and main memory. If not, the total time to
make these references will be minimized by ignoring the cache and referencing main
memory directly. Even if the cache is infinitely large, this still holds.
Second, since all real caches are finite, placing one line in cache generally means that
some other line cannot be in cache. Hence, placing infrequently referenced lines into
cache not only adds a large overhead to total memory access time, but also prevents
speed-up that could have been gained if some other (more heavily referenced) line were
placed in cache. This effect is what we call “cache pollution.”
Since minimizing the total memory access time is our goal in selective cache bypass
and the total access time depends on both the architectural design and the implementation
technology of the cache and main memory, some details must be supplied. In the
remainder of this paper, we have chosen to discuss cache bypass assuming that the
supplied information is that of a typical system; this greatly simplifies the following
discussion and reduces the number of graphs needed to support the rest of the paper. For
example, the simulations and examples presented in this paper are based on the
assumption that LRU is the basic cache management technique and that “typical” CMOS or
NMOS ICs implement the relevant system components. This implies, for example, that a
main memory reference takes about 10 times as long as a cache reference; in reality,
this ratio varies from about 2:1 to greater than 50:1. Of course, the use of specific
numbers in the examples and discussion is not indicative of the technique requiring those
exact numbers: the technique works for most reasonable cache organizations; only the
percentage benefit gained varies.
In Section 3.1, a brief discussion of current IC technologies and their impact on
memory access time is given. Criteria or rules to determine whether a reference request
should bypass the cache and reference main memory directly are presented in
Section 3.2. Section 3.3 gives a very simple and cheap, yet efficient, way to incorporate a
cache bypass mechanism with an LRU policy. Practical implementation schemes for
cache bypass control signals to be added to existing systems are presented in Section 3.4.

3.1. Integrated Circuit Technologies
Integrated circuit (IC) technology is one of the major parameters in the criteria for
the cache bypass mechanism (discussed in the next section). Hence, a brief survey of
current IC technologies and their impact on off-chip and on-chip memory reference time
is necessary. Table 3 gives the on-chip and off-chip memory access times for some of the
current integrated circuit technologies [MiF86]. From this table, we see that the ratio of
off-chip (off-package) to on-chip memory access times is at least 10. Using this ratio, an
estimate of the minimum reference frequency that a line needs to justify its placement in
cache can be obtained.
Type of Access                                           Silicon CMOS/SOS   Silicon NMOS   GaAs
On-chip memory access                                    10-20 ns           10-20 ns       0.5-2.0 ns
Off-chip on-package memory access                        40-80 ns           20-40 ns       4-10 ns
Off-chip off-package memory access                       100-200 ns         100-200 ns     20-80 ns
Ratio of off-chip on-package to on-chip memory access    4                  2              5-8
Ratio of off-chip off-package to on-chip memory access   10                 10             40

Table 3. Memory Access Time of Different IC Technologies

3.2. Criteria for Cache Bypass Mechanism
Throughout the current work, the main focus is the reduction of total memory
reference time for a program. Hence, the criteria proposed here are based on the comparison
between the time overhead involved in having a line in cache and the total reference time
saved by referencing data in a line in cache.
The time overhead of placing a line in cache is the transfer time for all data of that
line from main memory to cache. If any dirty (footnote 1) line is bumped out of a
write-back cache, a similar transfer time to update the main memory is also included in this
overhead. Since the amount of data transferred between main memory and cache is constant
for a cache design, this overhead depends only on architecture design and implementation
technology, and is independent of program behavior.

1. A line in cache is considered dirty iff some portion of the value it contains
   does not match the value stored in the corresponding main memory line.
On the other hand, the time savings for placing a line in cache accumulate every
time data in that line is referenced. Hence, the savings are, in addition, program
dependent.
There are additional factors which can influence the costs and the savings of
placing/replacing a line in cache, resulting in slightly different cache bypass decisions for
references in a program. For example, if a reference is going to bypass the cache and
directly reference main memory, the average probability of bumping a line from cache
decreases, and cache space could also be viewed as available to other lines.
These effects are easily recognized and advantageously used in the cache bypass
mechanism. In fact, a complete analytical model of the cache bypass mechanism for
common cache replacement policies to take all these factors into consideration can easily be
derived from the compiler-driven cache (SCP) model [ChD87] [ChiD88]. While the SCP
model can fully account for cache bypass, and can promise optimal performance, the
complete SCP model does entail relatively complex analysis and compiler technology; hence,
the technique presented here is a sub-optimal, but quite effective and simple,
approximation to the SCP model (footnote 2).
To define an algorithm for determining when to bypass the cache for a particular
reference, some definitions and notations are useful.

    overhead(i) = time overhead of placing/replacing line i in cache
    saving(i)   = time saving of having line i in cache before it is replaced
    n(i)        = total number of references to line i in cache before it is replaced

With the cost notations defined in Table 1, overhead(i) and saving(i) are as follows.
If no dirty line is bumped out of cache, the overhead is:

    overhead(i) = T_p

If a dirty line is replaced (bumped) from the cache, then the overhead is:

    overhead(i) = 2 * T_p

The savings for having line i in cache (before it is replaced) is:

    saving(i) = n(i) * (T_r - T_c)

For a reference to line i to bypass the cache, the overhead overhead(i) must be
greater than or equal to the total time savings saving(i). Only when saving(i) exceeds
overhead(i) can the placement of line i in cache improve system performance.

2. In fact, if the SCP model is used with a more radically redesigned cache,
   performance is much better than using a Bypass-Cache and the analysis is
   essentially the same. Hence, we feel that if one wants to achieve optimal
   performance, one should be willing to make the more drastic hardware and
   software changes to support it; here, we have simply given a technique
   whereby only trivial hardware and software changes result in large, but
   sub-optimal, performance gains.
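
As a worked illustration of this criterion, the Python sketch below evaluates overhead(i) and saving(i) and reports the smallest n(i) that justifies caching a line. The function names and the integer time units (T_c = 1, T_r = T_p = 10, matching the 10:1 ratio assumed in Section 3) are choices made for this example only.

```python
def overhead(t_p, dirty_victim=False):
    """Time overhead of placing a line: T_p, or 2 * T_p if a dirty line is bumped."""
    return 2 * t_p if dirty_victim else t_p


def saving(n, t_r, t_c):
    """Time saved by n references to a cached line instead of main memory."""
    return n * (t_r - t_c)


def should_bypass(n, t_c, t_r, t_p, dirty_victim=False):
    """Bypass when the overhead is greater than or equal to the savings."""
    return overhead(t_p, dirty_victim) >= saving(n, t_r, t_c)


def min_refs_to_cache(t_c, t_r, t_p, dirty_victim=False):
    """Smallest n(i) whose savings strictly exceed the overhead (integer times)."""
    return overhead(t_p, dirty_victim) // (t_r - t_c) + 1


print(should_bypass(1, 1, 10, 10))                     # True: 10 >= 9
print(min_refs_to_cache(1, 10, 10))                    # 2
print(min_refs_to_cache(1, 10, 10, dirty_victim=True)) # 3
```

Under these assumed costs a line must be referenced at least twice before it is replaced, or three times if placing it bumps a dirty line, for caching to pay off.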
3.3. Algorithm for LRU Bypass-Cache
In this section, LRU (least recently used) cache replacement is chosen as the basic
scheme and the cache bypass control is added on top of this policy. We have chosen to
discuss an LRU Bypass-Cache because the basic LRU policy is probably the most
commonly used and most commonly trusted to yield good performance. Hence, the
comparisons of simulated performance with/without cache bypass (in Section 4) are very good
estimates of the expected improvement derived by converting commonly available
computers to use Bypass-Cache instead of traditional cache.
In this section, a fast, simple, efficient (yet sub-optimal) algorithm to determine
when a reference should bypass the cache is proposed. The algorithm is based on the
concept of a trace, as discussed in trace scheduling techniques used for automatic
parallelizing compilers [Ell85]. The procedure to determine, for each reference in the program,
whether to bypass or to reference through the cache is:
1. Perform traditional flow analysis and build the program flow graph. (This step
   should be considered “free” because any good compiler will use this same analysis to
   aid in generating efficient code.)
2. For each trace (a possible control flow path which has not yet been processed), do
   the following:
   a. Mark all references in this trace as “cachable” (put in cache).
   b. Scan this trace, keeping track of which items would be resident in cache
      assuming that all items marked as cachable are always referenced through the cache
      and that LRU is used to determine which item is bumped from cache when line
      replacement occurs. As the references are scanned, the time overhead and
      savings realized for each cachable line are accumulated. As a simple heuristic, the
      savings for referencing an item within a loop is multiplied by a factor of 10
      (footnote 3).
   c. At the end of the trace, mark all references which have a larger overhead than
      savings as “non-cachable”.
   d. The above set of markings can be somewhat improved, although not made
      optimal, by repeating steps 2b and 2c. Such repetition is, however, completely
      optional. All the simulation results given in this paper used only a single pass.
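
A single pass of steps 2a to 2c might be sketched in Python as follows. This is only an illustration under simplifying assumptions that are not taken from the report: the cache is fully associative, a trace is a list of (line, in_loop) pairs, victims are never dirty (so each miss costs T_p), and lines rather than individual reference sites are marked. The names mark_trace and LOOP_WEIGHT are hypothetical.

```python
from collections import OrderedDict, defaultdict

LOOP_WEIGHT = 10  # heuristic multiplier for savings inside a loop (step 2b)


def mark_trace(trace, cache_lines, t_c=1, t_r=10, t_p=10):
    """Return the set of lines marked non-cachable after one pass over a trace.

    `trace` is a sequence of (line, in_loop) pairs in control-flow order.
    """
    cache = OrderedDict()          # simulated LRU cache, oldest line first
    overhead = defaultdict(int)
    saving = defaultdict(int)
    for line, in_loop in trace:    # step 2a: every reference starts out cachable
        if line in cache:
            cache.move_to_end(line)
        else:
            if len(cache) >= cache_lines:
                cache.popitem(last=False)  # bump the least recently used line
            cache[line] = True
            overhead[line] += t_p          # assumes the victim is never dirty
        saving[line] += (t_r - t_c) * (LOOP_WEIGHT if in_loop else 1)
    # step 2c: lines with more overhead than savings become non-cachable
    return {line for line in overhead if overhead[line] > saving[line]}


trace = [("a", False), ("x", True), ("y", True),
         ("x", True), ("y", True), ("b", False)]
print(sorted(mark_trace(trace, cache_lines=2)))  # ['a', 'b']
```

In this example the two lines touched once outside the loop are marked non-cachable, while the loop-resident lines x and y remain cachable.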

This algorithm, although very crude and simple, reaps speedups ranging from a few
percent to a factor of nearly 100, depending on the cache configuration and the
benchmark used. Speedups greater than 2 are not unusual for commonly used cache
configurations.

3. This is a rough approximation to weighting each reference in the trace by its
   expected number of executions; it assumes each loop executes an average
   of 10 times. If the compiler has a better estimate, this can be used instead.
   Techniques for the compiler to make more intelligent estimates of expected
   execution frequencies are discussed in [Die87].
3.4. Implementation of Bypass Control
With the results of compiler analysis of a program (or with statistical results
gleaned from previous runs), the bypass/cache question is easily answered with good
enough accuracy so as to permit huge performance increases. However, this information
must be transmitted to the Bypass-Cache control logic for each reference. The
information for each reference requires only a single bit: a 1 means “bypass” and 0 means “go
through the cache.” The natural question is how does the compiler get this one bit of
information for each reference into the Bypass-Cache control at runtime?
There are a number of alternative solutions to this problem and each of these
solutions trades off some resources or capabilities.
The conceptually easiest and most efficient way to transmit this cache bypass
information is to embed a bit in each instruction for each memory reference the instruction
may cause. For new machine designs, this is fairly convenient; reserving a control bit to
obtain speedups of total memory access time by factors of 2 or more is virtually always
worthwhile. Also, existing machines with at least one currently unused bit in each
instruction should probably use this implementation.

Alternatively, the instruction set of the machine can be expanded to include explicit
Bypass-Cache control instructions. In fact, these instructions exist for virtually all
computers which have cache. An extreme example of this explicit cache control is the IBM
801, where individual cache lines can be explicitly allocated and deallocated; most systems
simply permit the cache to be enabled/disabled as a whole. Since bypasses may come in
“clumps”, even this crude bypass control can gain some improvement; however, bypasses
do not always come in clumps. By defining a new instruction specifically to implement
Bypass-Cache control, one could permit each cache control instruction to set the pattern
of bypass/cache decisions for the next n references, where n is somewhat less than the
machine word length. Again, some performance would be gained, but the high frequency
of Bypass-Cache control instructions would limit performance.
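
One way to picture such a pattern-setting instruction is as a bit mask covering the next n references. The encoding below (bit k set means the k-th upcoming reference bypasses the cache) and both function names are assumptions made for illustration; the report does not specify an encoding.

```python
def pack_pattern(decisions):
    """Pack bypass decisions (True = bypass) into an integer, first reference in bit 0."""
    mask = 0
    for position, bypass in enumerate(decisions):
        if bypass:
            mask |= 1 << position
    return mask


def pattern_operands(decisions, n):
    """Split a decision sequence into operands for successive control instructions.

    Each operand covers at most n references, so one control instruction is
    needed for every n memory references.
    """
    return [pack_pattern(decisions[i:i + n]) for i in range(0, len(decisions), n)]


# Decisions for 1 2 3 1 2 3 with datum 3 bypassed
print(pack_pattern([False, False, True, False, False, True]))  # 36 (0b100100)
print(pattern_operands([True] * 5, n=4))                        # [15, 1]
```

The second call shows the cost noted above: five references with n = 4 already require two control instructions.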
While all the above schemes have some merit, there is another scheme which both
permits a cache control bit to be associated with each instruction and does not require
changes in the instruction set design or encoding. In current machine designs, the
addressable space is typically very large and programs rarely use the entire addressable
space of the machine. Thus, it is possible to trade one address bit (e.g., the most
significant bit of an address) for use as the control bit for the Bypass-Cache. In fact, this
solution is suggested by Intel in their 80386 programmer’s reference manual [Int86] as a
way to provide a cache control bit for use in multiprocessor cache coherency control.
In the worst case, this effectively reduces the addressable space by 50% (footnote 4).
Of course, it also causes the compiler writer a bit of grief in that not only must all
addresses be correctly tagged, but the compiler must also be careful about operations such
as pointer arithmetic or comparisons.

4. The actual address space may not be affected because address mapping
   mechanisms may be able to circumvent the loss.
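
The address-bit scheme can be illustrated with a few lines of Python. The 32-bit address width and the helper names are assumptions for this sketch; the point is that tagged addresses must be masked before pointer comparison or arithmetic.

```python
ADDRESS_BITS = 32                       # assumed address width
BYPASS_BIT = 1 << (ADDRESS_BITS - 1)    # most significant bit carries the tag
ADDRESS_MASK = BYPASS_BIT - 1           # remaining bits form the real address


def tag(address, bypass):
    """Return the address with the bypass control bit set or cleared."""
    address &= ADDRESS_MASK
    return (address | BYPASS_BIT) if bypass else address


def decode(tagged):
    """Split a tagged address into (real address, bypass flag)."""
    return tagged & ADDRESS_MASK, bool(tagged & BYPASS_BIT)


def same_location(a, b):
    """Compare two tagged addresses while ignoring the control bit."""
    return (a & ADDRESS_MASK) == (b & ADDRESS_MASK)


cached = tag(0x1000, bypass=False)
direct = tag(0x1000, bypass=True)
print(hex(direct))                    # 0x80001000
print(cached == direct)               # False: a naive comparison is fooled
print(same_location(cached, direct))  # True
```

The naive comparison failing on two references to the same word is the kind of hazard the compiler must guard against.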
Other methods, such as using a separate cache controller to explicitly control the
cache (similar to the remote PC idea [Rad83]), are also possible. However, the overhead
and the synchronization cost involved may be too large to be practical.
4. Simulation Results
To measure the effect of cache bypass in reducing total reference time, detailed
simulation of the LRU Bypass-Cache was performed using the single-pass compiler
algorithm described above. For comparison, the same simulations were performed using a
conventional LRU cache with the same configuration as the Bypass-Cache.
The benchmark programs were taken from the DARPA MIPS package, and are
widely used as benchmarks of cache and/or system performance. Data are given for four
of these programs:
Bubble
    A typical bubble sort program, executed on a set of 500 random data.
Puzzle
    This is a compute-bound program from Forest Baskett, run with a size of 511.
Realmm
    A program which performs a matrix multiplication of two real matrices, each of
    which is 40 by 40.
Tower
    The standard recursive tower-of-Hanoi solution, given the problem of moving 18
    disks.
Each of the programs was simulated for about 500,000 references of execution; hence,
“cold start” cache effects are negligible.
Since our prim ary concern is minimizing the to tal reference tim e, rath er than max
imizing h it ratio, it was also necessary to assume specific ratios of reference tim es for each
of the different types of reference. The cost functions used for the d ata in this paper were
based on cost estim ates for a typical CMOS-based system:
•
•
•

Cost
Cost
Cost
tim e

of referencing d ata from cache is I tim e unit.
of referencing d ata from main memory is 10 t i in e units.
of placing a line in an em pty or non-dirty cache entry is 10 -f- (line_size - I) * 7
units.

The fact that fetching/storing n consecutive data into/from cache in one request takes
less time than fetching/storing n data in n requests is reflected in the above costs. We
were actually quite generous in this assumption, using a formula giving a 30% benefit for
multi-word fetch/store; however, this simply has the effect of making the benefit due to
Bypass-Cache appear smaller.
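The cost functions above can be written out as a minimal sketch. The constant and function names are ours, chosen for illustration, and reading the 30% benefit as a discount on each word after the first (7 time units instead of 10) is our interpretation of the line-placement formula.

```python
# Cost estimates from the text, in time units. Names are illustrative only.
CACHE_REF = 1    # reference satisfied from cache
MEMORY_REF = 10  # single-word reference to main memory
EXTRA_WORD = 7   # each additional word of a multi-word line transfer


def line_fill_cost(line_size):
    """Cost of placing a line in an empty or non-dirty cache entry."""
    return MEMORY_REF + (line_size - 1) * EXTRA_WORD


def separate_fetch_cost(n):
    """Cost of fetching n words as n independent main-memory references."""
    return n * MEMORY_REF


def per_word_discount():
    """Assumed meaning of the 30% benefit: saving on each word after the first."""
    return 1 - EXTRA_WORD / MEMORY_REF


if __name__ == "__main__":
    for size in (1, 2, 4, 8, 16):
        print(size, line_fill_cost(size), separate_fetch_cost(size))
```

For a four-word line this gives 10 + 3 * 7 = 31 time units, against 40 for four separate references, so the saving on a whole line is always somewhat below 30%.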


To make the simulations as complete as possible, all possible power-of-2 cache
organizations (e.g., different line sizes, set sizes) for a fixed cache size of 128 words (see
footnote 5) were simulated and are presented in this paper. The absolute reference times for
the different benchmarks naturally differ; however, the speedups and curve shapes are fairly
consistent across all the simulations.
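As an illustration of what "all possible power-of-2 cache organizations" could mean for a 128-word cache, the sketch below enumerates every (line size, set size, number of sets) triple whose product is 128. The function name is hypothetical, and whether the extreme cases (for example a single 128-word line) were among the simulated configurations is an assumption on our part.

```python
def power_of_two_organizations(cache_words=128):
    """List (line_size, set_size, num_sets) triples, all powers of two,
    with line_size * set_size * num_sets == cache_words."""
    organizations = []
    line_size = 1
    while line_size <= cache_words:
        set_size = 1
        while line_size * set_size <= cache_words:
            num_sets = cache_words // (line_size * set_size)
            organizations.append((line_size, set_size, num_sets))
            set_size *= 2
        line_size *= 2
    return organizations


if __name__ == "__main__":
    for line_size, set_size, num_sets in power_of_two_organizations():
        print(f"line={line_size:3d} words  set={set_size:3d} lines  sets={num_sets:3d}")
```

Here set size is the number of lines per associative set, so a set size of 1 is a direct-mapped cache and a single set is a fully associative one.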
Figures 4 through 7 graph the speedup of total memory reference time with Bypass-Cache
as compared to a conventional cache of the same configuration. Each curve in the
graphs is marked with the power-of-2 which was used as the associative set size. These
graphs clearly demonstrate that the speedup in total memory reference time using
Bypass-Cache is very large: it is plotted on a log scale, and averages about 2.
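To make the speedup measure concrete, here is a minimal sketch of a trace-driven LRU set-associative cache with an optional bypass path. It is not the simulator used for the figures. It assumes a read-only trace (no dirty-line write-back cost), assumes a miss is charged only the line-placement cost, and takes the bypass decision as a per-reference flag where the paper obtains it from the compiler. The function names and the toy trace are hypothetical.

```python
from collections import OrderedDict

CACHE_REF, MEMORY_REF, EXTRA_WORD = 1, 10, 7  # time units from the cost estimates


def total_time(trace, cache_words, line_size, set_size, use_bypass):
    """Total reference time for a trace of (address, bypass_flag) pairs."""
    num_sets = cache_words // (line_size * set_size)
    sets = [OrderedDict() for _ in range(num_sets)]  # each maps line -> True, in LRU order
    time = 0
    for address, bypass in trace:
        line = address // line_size
        entries = sets[line % num_sets]
        if line in entries:                  # hit: refresh the LRU position
            entries.move_to_end(line)
            time += CACHE_REF
        elif use_bypass and bypass:          # bypassed miss: one word from memory
            time += MEMORY_REF
        else:                                # ordinary miss: place the whole line
            if len(entries) >= set_size:
                entries.popitem(last=False)  # evict the least recently used line
            entries[line] = True
            time += MEMORY_REF + (line_size - 1) * EXTRA_WORD
    return time


def toy_trace(n):
    """Hypothetical trace: a reused 64-word array interleaved with single-use data."""
    trace = []
    for i in range(n):
        trace.append((i % 64, False))       # reused data, worth caching
        trace.append((1024 + 8 * i, True))  # touched once, marked for bypass
    return trace


if __name__ == "__main__":
    trace = toy_trace(1000)
    conventional = total_time(trace, 128, 8, 2, use_bypass=False)
    bypass = total_time(trace, 128, 8, 2, use_bypass=True)
    print("speedup:", conventional / bypass)
```

The speedup plotted in the figures corresponds to the ratio of the two totals, conventional over Bypass-Cache, for the same cache size, line size, and set size.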
The speedup with Bypass-Cache is usually smallest for a line size of one or two.
With an increase in line size (leaving cache size and set size fixed), the speedup with
Bypass-Cache increases greatly. This agrees with the argument given in Section 3: a larger
line size implies a larger overhead in cache line placement and replacement. Although the
total number of references to a line increases with increasing line size, this increase is
much less than the increase in overhead. Consequently, the cache more easily becomes
polluted, and the Bypass-Cache becomes more critical in improving system performance.
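A small worked calculation illustrates why placement overhead outpaces reuse as lines grow. It assumes the cost estimates given earlier in this section, assumes that the line placement itself satisfies the first reference, and ignores the cost of evicting a useful line. Under those assumptions, it finds the smallest number of references a line must receive before caching it is cheaper than bypassing every one of them. The function name is ours.

```python
CACHE_REF, MEMORY_REF, EXTRA_WORD = 1, 10, 7  # time units from the cost estimates


def break_even_references(line_size):
    """Smallest k for which caching a line beats bypassing all k references to it.

    Cached cost:   line placement + (k - 1) cache hits
    Bypassed cost: k single-word main-memory references
    """
    fill = MEMORY_REF + (line_size - 1) * EXTRA_WORD
    k = 1
    while fill + (k - 1) * CACHE_REF >= k * MEMORY_REF:
        k += 1
    return k


if __name__ == "__main__":
    for line_size in (1, 2, 4, 8, 16, 32):
        print(line_size, break_even_references(line_size))
```

With these numbers a one-word line pays for itself on its second reference, while a 16-word line needs 13 references before it is evicted, which is the sense in which the overhead grows faster than the reuse.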
These curves also show that the speedup with Bypass-Cache is usually smaller for
caches with small set size (fixed cache size and line size). Although the cause of this is not
yet known, we suspect that this is related to the increase in traffic seen by each cache set
(because there are fewer sets). Even though the speedup is much smaller in these cases, it
is still typically about 1.2 (i.e., a 20 percent improvement).
Figure 8 shows the total reference time for the Tower benchmark. The dotted lines
indicate the times taken using conventional cache, whereas the solid lines show the times
taken with Bypass-Cache.
Aside from the obvious benefit in using Bypass-Cache, this graph suggests an
interesting general cache design rule. If the total memory reference time is to be
minimized, rather than the hit-ratio maximized, it is usually better to choose
small line size and small set size. This makes perfect sense in that although large line
sizes increase hit-ratio, they imply overhead increases which are greater than the hit-ratio
increases; in fact, exponentially greater. That increasing set size is not beneficial is less
intuitive, but probably is related to the increased traffic per set and use of a poor
replacement algorithm (i.e., one can do much better than LRU [ChD87]).

Footnote 5: About 500 simulations were performed, encompassing a wide variety of
cache sizes and configurations. However, all the simulation results obtained
were very consistent, hence we have chosen to present only the data for the
largest cache size we examined, 128 words. Other simulation data are
available upon request.

For Bypass-Cache, the difference in total memory access time for different line sizes
(with the same cache size and set size) is not as great as that for cache without bypass.
This is true because a lot of cache pollution can be avoided with Bypass-Cache.

Figure 4: Speedup in Total Reference Time for Bubble
[Plot: Speedup (log scale) versus Line Size (log scale)]

Figure 5: Speedup in Total Reference Time for Puzzle
[Plot: Speedup (log scale) versus Line Size (log scale)]

Figure 6: Speedup in Total Reference Time for Realmm
[Plot: Speedup (log scale) versus Line Size (log scale)]

Figure 7: Speedup in Total Reference Time for Tower
[Plot: Speedup (log scale) versus Line Size (log scale)]

Figure 8: Total Reference Time WITH / WITHOUT Bypass for Tower
(WITH is solid lines, WITHOUT is dotted lines)
[Plot: Time, from 1E+07 to 1E+09, versus Line Size (log scale)]

5. Conclusion
In this paper, we present a new cache design, Bypass-Cache, which is able to
avert polluting the cache by bypassing the cache for entries for which caching would not
result in faster total execution time. From our simulation results, we see that the
speedup is tremendous, with an average of about 2. Various methods for implementing
the Bypass-Cache architecture are presented as well as an outline of the compiler
technology required for its effective use.


Perhaps the most significant result, however, is that cache hit ratio is not
necessarily related to the total reference time. This will be discussed more deeply in a
later paper.
Acknowledgements
Thanks to the members of CARP (the Compiler-oriented Architecture Research
group at Purdue) for their useful comments on this work. Special thanks to George
Adams for his suggestions concerning the presentation of the results and also for coining
the name Bypass-Cache.
References
[AlB86]  Allen, R., Baumgartner, D., Kennedy, K., Porterfield, A., "PTOOL: A
         Semi-Automatic Parallel Programming Assistant," 1986 International
         Conference on Parallel Processing, August 1986, pp. 164-170.
[Bel74]  Belady, L.A., Palermo, F.P., "On-line Measurement of Paging Behavior by
         the Multi-valued MIN Algorithm," IBM Research and Development, 18, 1,
         January, 1974, pp. 2-19.
[BuC86]  Burke, M., Cytron, R., "Interprocedural Dependence Analysis and
         Parallelization," SIGPLAN Symposium on Compiler Construction, 1986, pp.
         613-641.
[Con86]  "C1 Processor Series: Architecture," Convex Computer Corporation, 1986.
[ChD87]  Chi, C.H., Dietz, H., "Compiler-Driven Cache Policy," Technical Report
         EE-87-21, Purdue University, May, 1987.
[ChD88]  Chi, C.H., Dietz, H., "Register Allocation for GaAs Computer Systems,"
         Proceedings of the 1988 Hawaii International Conference on Systems
         Sciences, January 1988, pp. 266-274.
[Die87]  Dietz, H. G., The Refined-Language Approach To Compiling For Parallel
         Supercomputers, Ph.D. Dissertation, Polytechnic University, June 1987.
[Ell85]  Ellis, J. R., Bulldog: A Compiler for VLIW Architectures, 1985 ACM
         Doctoral Dissertation Award, MIT Press, 1986.
[HwB84]  Hwang, K., Briggs, F.A., Computer Architecture and Parallel Processing,
         McGraw Hill Book Company, 1984.
[Int86]  Intel Corporation, 80386 programmer's reference manual, 1986, pp. 11-6.
[Rad83]  Radin, G., "The 801 Minicomputer," IBM Journal of Research and
         Development, May 1983, pp. 237-246.
[Smi82]  Smith, A.J., "Cache Memories," Computing Surveys, Vol. 14, No. 3,
         September, 1982, pp. 473-530.
[Spi77]  Spirn, J., Program Behavior: Models and Measurements, Elsevier-North
         Holland, N.Y., 1977.

