Processor memory traffic characteristics for on-chip cache by Rathja, Roy C. et al.
AN ABSTRACT OF THE THESIS OF
(Jeremy) Yui Luen Ho for the degree of Master of Science in
Electrical and Computer Engineering, presented on April 16th 1992.
Title: Processor Memory Traffic Characteristics for On-chip Cache
Abstract approved: Redacted for Privacy
Roy C. Rathja
The motivation of this research is to study different cache designs for on-
chip caches that improve processor performance and at the same time
minimize the degradation to system performance caused by an increase in
the processor memory traffic. As VLSI technology advances we can have
bigger and more complex on-chip caches that could not have been possible
a few years ago. Results derived from on-chip caches and performance
issues are basically similar to off-chip caches. In this study, we will
concentrate on single level on-chip caches though there are many
interesting issues relating system performance, memory traffic and
multi-level caches.
Processor Memory Traffic Characteristics for
On-chip Cache
by
(Jeremy) Yui Luen Ho
A THESIS
submitted to
Oregon State University
in partial fulfillment of
the requirements for the
degree of
Master of Science
Completed April 16 1992
Commencement June 1992
APPROVED:
Redacted for Privacy
Professor of Electrical and Computer Engineering in charge of major
Redacted for Privacy
Head of Department of Electrical and Computer Engineering
Redacted for Privacy
Dean of Graduate School
Date thesis is presented: April 16th 1992
Typed by (Jeremy) Yui Luen Ho for (Jeremy) Yui Luen Ho
Acknowledgements
It is my greatest pleasure and honor to acknowledge the help and
guidance of Dr. Roy Rathja who has been my major professor, adviser and
friend. He could always spare time for his students in spite of his ever
hectic schedule. He has guided me through each stage of my thesis work. I
would like to thank him for his time, patience, concern, guidance and
interest in my work.
I would also like to thank Dr. Shih-Lien Lu for his advice and
counsel. It is a great help to know him. I would also like to thank Otto
Gygax and the ECE support for giving me a chance to work and learn
with them.
Professor Tim Budd, Dept. of Computer Science, has kindly agreed
to be my minor professor and I would like to thank him. My thanks are
also due to Prof. Charles Drake for agreeing to be in my graduate
committee.
I am grateful to my friends in Corvallis, especially my roommates
and my close friend Shuang Li for their patience, love, and support
throughout my graduate studies in Oregon State University.
And finally, I would like to thank my parents, brother and my
sister-in-law for their everlasting support, encouragement, and affection.
TABLE OF CONTENTS
Chapter 1: Introduction
1.1 Background
1.1.1 The Performance Metrics
1.2 The Problem
1.3 The Motivation
1.4 The Approach and Thesis Overview
Chapter 2: An Overview of Cache Design Parameters and Memory Traffic
2.1 An Introduction to Cache Memory Design Parameters
2.2 Cache Organization
2.2.1 Cache Size and Block Size
2.2.2 Instruction Cache
2.2.3 Data Cache
2.2.4 Data and Instruction Cache
2.3 Placement Algorithms
2.4 Coherency Techniques and Write Policies
2.4.1 Write Through
2.4.2 Write Back
2.5 Cache Management
2.5.1 Demand Fetching and Prefetching
2.5.2 Replacement Algorithms
Chapter 3: Review of Past Work and Our Motivation and Approach
3.1 Review of Past Work
3.2 Motivation
3.3 Our Approach
3.3.1 Why on-chip Cache?
3.3.2 Trace Driven Simulation
3.3.3 Current Tracing Techniques
3.3.4 Trace Description
Chapter 4: Simulations, Observations and Results
4.1 The System Model
4.2 Simulations, Observations and Discussions
4.2.1 Cache Size
4.2.2 Associativity
4.2.3 Block Size
4.2.4 Subblock
4.2.5 Write-Back and Write-Through
4.2.6 Multiprogram Traces
4.3 Simulation Results
Chapter 5: Summary, Conclusions and Future Work
5.1 Summary and Conclusions
5.2 Future Work
References
Appendix
LIST OF FIGURES
2.1 Split or Unified Cache System
2.2 Fully Associative Mapping
2.3 Direct Mapping
2.4 Address Format Under Direct Mapping
2.5 Set Associative Mapping
2.6 Address Format Under Set Associative Mapping
3.1 Our Cache System
4.1 Miss Ratio Vs Size (single)
4.2 Traffic Ratio Vs Size (single)
4.3 Miss Ratio Vs Associativity (single)
4.4 Traffic Ratio Vs Associativity (single)
4.5 Miss Ratio Vs Subblock (single)
4.6 Traffic Ratio Vs Subblock (single)
4.7 Miss Ratio Vs Blocksize (single)
4.8 Traffic Ratio Vs Blocksize (single)
4.9 Miss Ratio Vs Blocksize (prefetch,single)
4.10 Traffic Ratio Vs Blocksize (prefetch,single)
4.11 Miss Ratio Vs Subblock (single)
Subblock Prefetch-wrap-around
4.12 Traffic Ratio Vs Subblock (single)
Subblock Prefetch-wrap-around
4.13 Miss Ratio Vs Subblock (single)
Subblock Load-forward-prefetch
4.14 Traffic Ratio Vs Subblock (single)
Subblock Load-forward-prefetch
4.15 Miss Ratio Vs Subblock (single)
Subblock Missprefetch
4.16 Traffic Ratio Vs Subblock (single)
Subblock Missprefetch
4.17 Miss Ratio Vs Subblock (single,multi)
Subblock Tagged-prefetch
4.18 Traffic Ratio Vs Subblock (single,multi)
Subblock Tagged-prefetch
4.19 Miss Ratio Vs Subblock (single,multi)
Subblock Always-prefetch
4.20 Miss Ratio Vs Subblock (single)
Write-Through
4.21 Traffic Ratio Vs Subblock (single)
Write-Through
4.22 Miss Ratio Vs Blocksize (single)
Write-Through
4.23 Traffic Ratio Vs Blocksize (single)
Write-Through
4.24 Miss Ratio Vs Blocksize (prefetch,single)
Write-Through
4.25 Traffic Ratio Vs Blocksize (prefetch,single)
Write-Through
LIST OF APPENDIX FIGURES
A.1 Miss Ratio Vs Blocksize (prefetch,multi)
A.2 Traffic Ratio Vs Blocksize (prefetch,multi)
A.3 Miss Ratio Vs Subblock (multi)
Subblock Load-forward-prefetch
A.4 Traffic Ratio Vs Subblock (multi)
Subblock Load-forward-prefetch
A.5 Miss Ratio Vs Subblock (multi)
Subblock Missprefetch
A.6 Traffic Ratio Vs Subblock (multi)
Subblock Missprefetch
A.7 Miss Ratio Vs Blocksize (multi)
A.8 Traffic Ratio Vs Blocksize (multi)
A.9 Miss Ratio Vs Cachesize (multi)
A.10 Traffic Ratio Vs Cachesize (multi)
A.11 Miss Ratio Vs Associativity (multi)
A.12 Traffic Ratio Vs Associativity (multi)
Chapter 1
Introduction
1.1 Background
Computer systems are fundamentally composed of three basic units: a
Central Processing Unit (CPU), which does the computing and processing
of data and instructions; a memory, which stores the instructions and data;
and an Input/Output (I/O) system. When a program is executed, the CPU
repeatedly retrieves instructions from memory, fetches any operands that
are specified, performs the operations and possibly writes results back to
the main memory system. The memory unit has the task of storing all the
information that a processor needs over the whole period of operation. It is
desirable for the memory to have a large capacity for problem solving.
However, as memory becomes bigger, it also becomes much slower. Therefore
it is impractical for a memory unit to have only one level of hierarchy.
Moreover, in most practical computer systems the CPU's rate of
executing instructions and processing data far outstrips the main memory's
rate of providing them. To narrow the gap of this mismatch, most modern
computers provide caches [7,30,31]. Caches are small, fast memories that
are physically and conceptually closer to the CPU. Their function is to
provide the instructions and data needed by the CPU at a rate more in line
with the CPU's demands. Only when the cache cannot provide the
necessary data and/or instructions, which is called a miss, will the necessary
information be requested from the main memory.
The success in reducing the mean time required for the CPU to fetch
an instruction or datum relies on a high probability that the requested datum
is contained in the cache. The expectation that instructions and data that are
used currently will be referenced again soon is called temporal locality.
Spatial locality refers to the likelihood that two items adjacent in main
memory will be needed within a short span of time of each other [13].
1.1.1 The Performance Metrics
Caches are successful with computer systems because programs
generally exhibit good spatial and temporal locality; high probabilities of data
use and reuse based on current and recent activity. The frequency with
which the cache does not hold the information needed is called the miss
ratio. As a result of a miss, the information has to be fetched from main
memory [14,17,18]. While this is taking place, the CPU generally sits idle
as it does not have the information that it needs to execute the sequential
instructions. Therefore the higher the miss ratio, the more frequently the
CPU idly waits for instructions and data to be fetched from memory and the
longer the mean or average execution time.
The average memory access time with a cache can be modeled
[9] as
$$T_{av} = (1 - M)\,T_c + M\,T_m$$
where $M$ is the miss ratio, $T_c$ is the access time of the cache on a hit,
and $T_m$ is the total time it takes to access main memory on a miss. The term
$(1 - M)$ is the hit ratio, $H$.
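As a worked illustration of the access-time model above, the following minimal Python sketch evaluates it for a few miss ratios. The function name and the timings (a 1-cycle cache hit and a 10-cycle main memory access) are assumptions chosen for illustration and are not values taken from this study.

```python
def average_access_time(miss_ratio, t_cache, t_memory):
    """Return T_av = (1 - M) * T_c + M * T_m for the given miss ratio M."""
    if not 0.0 <= miss_ratio <= 1.0:
        raise ValueError("miss ratio must lie between 0 and 1")
    return (1.0 - miss_ratio) * t_cache + miss_ratio * t_memory


# Hypothetical timings: 1-cycle hit, 10-cycle miss (assumed, not measured).
for m in (0.02, 0.05, 0.10):
    print(m, average_access_time(m, t_cache=1.0, t_memory=10.0))
```

With these assumed timings, a miss ratio of 0.10 gives an average access time of 1.9 cycles, which shows how quickly a modest miss ratio erodes the benefit of a fast cache.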
Hit ratio and miss ratio are convenient measures of cache efficiency
which directly affect processor performance. However, it must be
remembered that the objective is not necessarily to optimize cache efficiency,
or even processor performance alone. Usually a more global goal such as
optimizing system performance is desired.
Minimizing the cache miss ratio may, in fact, run counter to this goal
of improving system performance. For example, prefetching (bringing
blocks into the cache before they are demanded by the processor) and large
blocks tend to decrease the miss ratio, but may also increase system bus
traffic [18,21]. In a demand-paged virtual memory system, paging I/O for
one process proceeds concurrently with the execution of another process. If
the system bus bandwidth is insufficient to meet the requirements of both the
cache and the I/O devices, performance will degrade. System bus bandwidth
is an especially critical resource in a multiprocessor system [3].
Consequently, another very important performance metric is the bus
traffic ratio. It is defined as:
BTR = (total traffic on system bus) / (total demand fetches)
We can view BTR as the ratio of memory bus traffic in a system with a
cache to that without a cache [14]. The total traffic on the system bus
depends on the number of misses, the write back policy and the I/O traffic.
The total demand fetches are the total number of requests the processor
makes to the memory hierarchy. The memory hierarchy consists of both the
cache and the main memory. In our study we will not be taking into account
the traffic activities due to Direct Memory Access (DMA) devices on the
system bus which bypass the cache. We can make this assumption because
I/O traffic is usually a small fraction of processor accesses.
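The definition of the bus traffic ratio can be illustrated with a small calculation. The sketch below is only an example under stated assumptions: traffic is counted in words, every demand fetch requests one word, each miss transfers a whole block, and each write back of a modified block also transfers a whole block. The function names and the numbers are hypothetical.

```python
def bus_traffic_words(misses, write_backs, block_words):
    """Words moved on the bus under the assumed model:
    one full block per miss plus one full block per write back."""
    return (misses + write_backs) * block_words


def bus_traffic_ratio(traffic_words, demand_fetches):
    """BTR = (total traffic on system bus) / (total demand fetches)."""
    if demand_fetches <= 0:
        raise ValueError("demand_fetches must be positive")
    return traffic_words / demand_fetches


# Hypothetical run: 1000 one-word demand fetches, 50 misses,
# 10 write backs, 4-word blocks. DMA traffic is ignored, as in the text.
traffic = bus_traffic_words(misses=50, write_backs=10, block_words=4)
print(traffic, bus_traffic_ratio(traffic, demand_fetches=1000))
```

In this hypothetical case the bus carries 240 words for 1000 processor requests, so the ratio is 0.24. A larger block may lower the miss count, but it also multiplies the words moved per miss, which is why the miss ratio and the traffic ratio have to be examined together.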
1.2 The Problem
The traffic ratio becomes a more pressing issue if the memory bus is
the bottleneck, either because the single processor is too fast for the bus or
because there are multiple processors on the same bus. In a microprocessor
based system with a shared bus, the traffic capacity of the bus limits the
number of microprocessors that can be used [9,20,21]. Most cache designs
proposed attempt to improve processor performance by reducing the miss
ratio. Unfortunately, by doing so, the result may be an increase of the
processor's memory traffic [9,18]. System performance cannot be really
improved if there is a substantial increase in the memory traffic.
1.3 The Motivation
The motivation of this research is to study different cache designs for
on-chip caches that improve processor performance and at the same time
minimize the degradation to system performance caused by an increase in the
processor memory traffic. As VLSI technology advances we can have
bigger and more complex on-chip caches that would not have been possible a
few years ago. Results and performance issues for on-chip caches
are basically similar to those for off-chip caches. In this study, we will
concentrate on single level on-chip caches though there are many interesting
issues relating system performance, memory traffic and multi-level caches.
1.4 The Approach and Thesis Overview
To improve the performance of a processor, we have to study how each
of the different design choices affects the miss ratio and the traffic ratio of a
cache. The design parameters of a cache are the total cache size, the line size
(block size), the mapping algorithm, split (instruction/data) vs unified, the fetch
and prefetch algorithm used and the write policy used by the cache. Each of these
affects the miss ratio as well as the traffic ratio.
Studies of a similar nature [9,18,20,21] have been done but they did
not specifically discuss memory traffic issues relating to on-chip cache
designs. They are mainly concerned with bus traffic on a network of single
board computers [21], or they do not provide a comprehensive study of the
design choices which can improve system performance by reducing miss
ratio and the penalty of increasing memory traffic [14,18]. Therefore the
central goal of this research is to quantify and to characterize the traffic ratio
as we try to improve the performance of an on-chip cache by varying the
cache design parameters.
Making the best choices and selecting the best designs for an on-chip
cache depends very much on the workload being studied [17]. To simulate
this 'real workload', program address traces were collected and a cache
simulator was modified.
Trace-driven simulation experiments are used in these studies for
several reasons. First, such simulations are repeatable and allow cache
design parameters to be varied so that effects can be isolated. They are
cheaper than hardware monitoring and do not require access to, or the
existence of, the machine being studied. Simulation results can be obtained
in many situations where analytic model solutions are intractable without
questionable simplifying assumptions. Further, there does not currently exist
any generally accepted model for program behaviour. Workloads in trace-
driven simulation are represented by samples of real workloads and contain
complex embedded correlations that synthetic workloads often lack. Lastly, a
trace-driven simulation is guaranteed to be representative of at least one
program in execution [9].
The University of California at Berkeley has provided a cache
simulator that has been developed over a period of a few years for cache
studies. It is called Dinero [22,30]. The design of the cache simulator,
Dinero, is based on the RISC processor R3000 that is designed and
manufactured by MIPS Inc. [8,29]. Using this cache simulator with a set of
8 program traces, we studied how the different cache parameters affect the
memory traffic (the traffic ratio) and the performance of the cache
(the miss ratio). The nature of the traces will be discussed in
Chapter 3 of this thesis under the title Trace Description.
Chapter 2
An Overview of Cache Design Parameters and
Memory Traffic
2.1 An Introduction to Cache Memory Design Parameters
Cache memories are small, high speed memories employed to hold the
segments of main memory that are currently in use. Cache memory design
aims to make the slow, large main memory appear as a fast memory to the
CPU.
Optimizing the design of a cache memory generally has four aspects
[18] :
1. Maximizing the hit ratio (the probability of finding a memory
reference's target in the cache).
2. Minimizing the access time (time to access information in the cache).
3. Minimizing the delay due to a miss.
4. Minimizing the overhead of updating main memory, especially to
maintain multi-cache consistency.
In our study we will only be concerned with issues 1, 2 and 4, as issue
3 has more to do with the implementation and technology of caches. Cache
design has several aspects which include general cache organization, cache
placement, cache management and cache replacement.
2.2 Cache Organization
2.2.1 Cache Size and Block Size
Size is the dominant cache parameter in terms of its effect on both cost
and potential performance enhancement. Therefore, it should come as no
surprise that implemented machines have caches with large variations in size
depending upon the targeted application (i.e., from microprocessors
incorporating 256 bytes to mainframes equipped with 128 kbytes) [3].
However, sheer size alone does not make an efficient cache. It is the other
parameters which determine how well this valuable resource actually
performs [17].
Block size (also commonly referred to as line size) is one of the
more visible elements associated with cache design. A block is the amount of
information transferred between main memory and the cache per transaction.
Large blocks may enhance the exploitation of locality by obtaining more data
at once. However, small blocks are likely to contain less unneeded
information. Therefore the option of breaking blocks into subblocks will
be studied.
Typically, caches with large blocks require less storage and logic for
management purposes, but demand additional bandwidth from the memory
system [15]. Another trait that can be associated with enlarging the block size
is an increased latency of the bus in responding to high priority requests
(since transmission of the entire block may have to complete before bus
ownership can be relinquished). When the bus is not wide enough
to transfer an entire block in one operation, multiple transfers are required.
Burst mode transactions are characterized by the transmission of a single
address followed by multiple data transfers. Traditionally, this has been
implemented via interleaved memory modules and a proprietary bus.
However, the task is being eased by special DRAM access modes (nibble,
page, static column decode) and support from emerging 32 bit bus standards
(Futurebus, Multibus II, VMEbus) [24,26]. In some cases, the presence of
burst mode will cause bus traffic to no longer be a monotonically increasing
function of block size and thereby provide an opportune situation for
minimization. This occurs when increasing the block size improves the hit
rate enough to compensate for any bus cycles wasted on transfers of
unneeded information.
2.2.2 Instruction Cache
An instruction cache (I. cache) holds a fixed number of blocks of
instructions. All instruction fetch references are checked to see if the
requested instruction word is in the cache. If that is the case, there is a "cache
hit", and the instruction word is fetched from the cache, immediately
decoded, and then executed. If the requested word is not present in the
cache, then there is an instruction "cache miss". The block/sub-block
containing that word is fetched from memory and is also put in the cache.
Studies [13] have shown that typical programs spend most of their execution
time in a few small routines or in tight loops. Therefore, if these routines are
captured in a fast cache, they can be executed directly from the cache. This
method decreases execution time, since the CPU does not have to wait a long
time for instructions to come from the slower main memory. The processor's
external bus activity is reduced, providing a larger effective memory bus
bandwidth. Thus, if the instruction fetches hit in the cache, a significant
performance improvement results. Instruction cache design for modern
programming languages is simpler than data cache (D. Cache) design, or
design of both instruction and data cache (I&D cache), because storing and
writing into instruction cache locations is disallowed.
2.2.3 Data Cache
A data cache is used to store frequently used data, that is, to avoid the
"CPU-data-memory" bottleneck. In general, the locality of data is not as
good as that of instructions. Therefore, not many computer systems have
adopted a D. cache alone. The locality of data can sometimes be improved by
using an intelligent compiler to rearrange data.
2.2.4 Data and Instruction Cache
Systems that provide cache memory for both instruction and data
references are said to have an I&D cache. An I&D cache can be implemented in
two ways: unified-cache and split-cache. In a unified cache both
instruction and data references are stored in the same physical cache. In
split-cache systems, the cache is physically divided into two parts, each with
independent controllers. One part is used for instructions and the other for
data. Split I&D cache makes the design of the I. cache easier, since its
content does not have to be modified. Furthermore, the I. cache can be tailored to
specific referencing patterns found in fetching instruction streams. A separate
D. cache can also be tailored to data reference patterns, possibly resulting in
an optimum design for both. Split I&D cache may eliminate the conflict
between data and instruction access in a pipelined architecture. This issue is
dependent on the overall CPU organization. By having separate I and D caches
we have approximately twice the cache bandwidth [28, 30]. In our
simulation studies, we will only be looking into split I and D caches as they
are becoming more popular for on-chip caches as the real estate available
in a VLSI chip increases.
[Figure: block diagram of a microprocessor system with a unified or split cache. Legend: Iu = Instruction Unit, Eu = Execution Unit, Mmu = Memory Management Unit, M = Memory (primary), SB = System Bus.]
Fig. 2.1 Split or Unified Cache System
Fig. 2.1 shows a unified or split on-chip cache in a microprocessor system.
2.3 Placement Algorithms
Another extremely visible parameter of cache design is the placement
algorithm. Placement algorithms provide a mechanism for mapping main
memory addresses into cache locations. The placement algorithm can be
fully associative, direct mapped or set associative depending on
whether the mapping for a given main memory address can be placed in any
location in the cache, in just one location or in a set of locations [7].
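In a fully associative mapping scheme, a main memory block i can
be mapped to any cache block j, where
$$0 \le i \le M - 1 \quad \text{and} \quad 0 \le j \le N - 1$$
[Figure: main memory blocks 0 to M-1, each of which may be mapped to any of the cache memory blocks 0 to N-1.]
Fig. 2.2 Fully Associative Mapping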
From figure 2.2, it is apparent that the main memory has M
blocks and the cache is divided into N blocks. To determine which block of
main memory is stored into the cache, a tag is required for each cache block.
More formally:
Tag(j) = address of memory block stored in cache
block j
Suppose $M = 2^m$ and $N = 2^n$; then m and n bits are required to specify the
address of a main memory block and a cache block, respectively. Since a main memory block can
be mapped to any cache block, the entire m bits of a main memory block
address has to be used as a tag. Since there are N cache blocks, N tags are
needed. These tags can be either stored in the cache memory itself or
separately stored in an associative memory called the tag directory.
In this scheme, when the CPU generates an address, the main
memory block is extracted (usually the high-order m bits) and is then
associatively compared with all N tags stored in the tag directory for a match.
If a match occurs, the corresponding cache block number is retrieved, and
the cache is accessed for the required data. If the associative search fails,
then the main memory is accessed for the required data.
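The tag directory search described above can be sketched in a few lines of Python. This is a minimal illustration, not the simulator used in this study: the class and method names are hypothetical, and the sequential loop stands in for what an associative memory does in parallel.

```python
class FullyAssociativeCache:
    """Tag directory for N cache blocks; tags[j] plays the role of Tag(j)."""

    def __init__(self, num_blocks):
        self.tags = [None] * num_blocks  # None marks an empty cache block

    def lookup(self, block_number):
        """Compare the block number against every tag.
        Return the matching cache block j, or None on a miss."""
        for j, tag in enumerate(self.tags):
            if tag == block_number:
                return j
        return None

    def place(self, block_number, j):
        """Store main memory block `block_number` in cache block j."""
        self.tags[j] = block_number


cache = FullyAssociativeCache(num_blocks=4)
cache.place(37, 2)       # any cache block may hold main memory block 37
print(cache.lookup(37))  # index of the cache block holding block 37
print(cache.lookup(5))   # block 5 is not cached
```

Because the whole block number serves as the tag, any main memory block can occupy any cache block, at the price of comparing against all N tags on every access.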
The principal advantages of this method are its great flexibility and that
the address translation process can be performed quickly because of the
high-speed tag directory. However, the high cost associated with a tag
directory limits the liberal implementation of this idea.
In direct mapping, a main memory block i is always mapped into
the cache block i mod N. For this reason, this method is known as
congruent mapping. This is illustrated in figure 2.3.
[Figure: main memory blocks 0 to M-1; main memory block i maps to cache memory block i mod N among cache blocks 0 to N-1.]
Fig. 2.3 Direct Mapping
If $N = 2^n$ and $M = 2^m$, then i mod N will be in the range of 0 to $2^n - 1$.
This means that the low-order n bits of the binary representation
corresponding to the main block i give the cache block number. This is shown in
figure 2.4 below.
[Figure: the m-bit main memory block number is divided into a tag of m-n bits (high-order) and a cache block number of n bits (low-order).]
Fig. 2.4 Address Format Under Direct Mapping
Figure 2.4 shows that the high-order m-n bits can be used as a tag to
determine if a main block is stored in the cache memory. When the CPU
generates an address, the low-order n bits of the main memory block number
field are used as the index to the tag directory, and the tag stored here is
compared with the tag field of the specified main memory block number. If
there is a match, the cache is accessed; otherwise the main memory is
accessed. In the event of a cache miss, the incoming main memory block i
always replaces the cache block i mod N because it cannot be mapped into
any other cache block. This kind of mapping is therefore not flexible.
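The address split under direct mapping can be made concrete with a short Python sketch. The function name and the example block numbers are assumptions for illustration only; the arithmetic follows the address format of figure 2.4.

```python
def split_direct_mapped(block_number, n_bits):
    """Split a main memory block number into (tag, cache block number)
    for a direct-mapped cache with N = 2**n_bits blocks."""
    index = block_number & ((1 << n_bits) - 1)  # low-order n bits = i mod N
    tag = block_number >> n_bits                # remaining high-order m-n bits
    return tag, index


# Hypothetical cache with N = 8 blocks (n = 3).
print(split_direct_mapped(45, 3))  # 45 = 0b101101
print(split_direct_mapped(13, 3))  # 13 = 0b001101
```

Blocks 13 and 45 share the same low-order three bits, so both map to cache block 5 and differ only in their tags. Each would evict the other on a miss, which is the inflexibility noted above.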
Set associative mapping is the compromise between fully associative
and direct-mapped. In this case, the cache blocks are divided into N/S sets so
there are S cache blocks per set. In this approach, a main memory block i will
always be mapped to the cache set i mod (N/S). However, within the
set, the block can be placed anywhere. Therefore, fully associative mapping
occurs within a set. A conceptual view of this is shown in figure 2.5.
If $M = 2^m$, $N = 2^n$, and $S = 2^s$, then the set number will be in the range of 0 to
N/S - 1, and n - s bits are required to specify it.
Since i mod (N/S) is also in the range of 0 to N/S - 1, the low-order n
- s bits of the main memory block address directly specify the cache set
number. Therefore, the high-order m - (n - s) bits of the main memory block
address are treated as tag bits as shown in figure 2.6.
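[Figure: main memory blocks 0 to M-1; main memory block i maps to cache set i mod (N/S) among sets 0 to N/S - 1.]
Fig. 2.5 Set Associative Mapping
[Figure: the m-bit main memory block number is divided into a tag of m-n+s bits (high-order) and a field of n-s bits (low-order), labelled Cache Block Number.]
Fig. 2.6 Address Format Under Set Associative Mapping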
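The set associative address format generalizes the direct-mapped one, as the following Python sketch suggests. The function name and the example values are hypothetical; only the bit arithmetic is taken from the description above.

```python
def split_set_associative(block_number, n_bits, s_bits):
    """Split a main memory block number into (tag, set number) for a cache
    with N = 2**n_bits blocks arranged as sets of S = 2**s_bits blocks."""
    set_bits = n_bits - s_bits                        # log2(N/S)
    set_number = block_number & ((1 << set_bits) - 1) # i mod (N/S)
    tag = block_number >> set_bits                    # high-order m-(n-s) bits
    return tag, set_number


# Hypothetical cache: N = 8 blocks, S = 2 blocks per set, so 4 sets.
print(split_set_associative(45, n_bits=3, s_bits=1))
# s = 0 reduces to direct mapping; s = n reduces to fully associative mapping.
print(split_set_associative(45, n_bits=3, s_bits=0))
print(split_set_associative(45, n_bits=3, s_bits=3))
```

Setting s to 0 gives one block per set and reproduces direct mapping, while setting s equal to n gives a single set and reproduces fully associative mapping, with the whole block number used as the tag.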
2.4 Coherency Techniques and Write Policies
Coherency (or consistency) is the task of ensuring that all requests
for data are satisfied with a correct and up-to-date copy of the desired
information. This is a problem that must be dealt with when multiple,
independently modifiable copies of the same logical entity exist. And the
incorporation of the cache memory in a system results in such a situation.
Fortunately, this topic has been well researched and many innovative
solutions have been devised [4,6,23,10,27].
Our intent is not to thoroughly analyse coherency mechanisms in
general but to focus attention on approaches which are especially suitable for
microprocessor based designs. In particular, the architecture under
evaluation consists of one or more processors, each with a private cache and
sharing a global main memory via a common system bus. We broadly
classify the techniques receiving consideration as follows: (1) write
through (2) write back. Write through and write back are the two basic
memory update strategies. With write through, all stores are immediately
transmitted to main memory. This greatly simplifies the consistency problem
since main memory always contains an up-to-date copy of cached
information. The drawback is that all write operations must now be viewed
as misses with respect to processor stalls and system bus traffic. However
these disadvantages can be overcome. Smith has shown that a moderate
degree of buffering can alleviate processor stalls and Sequent has utilized a
high bandwidth system bus to successfully accommodate twelve processors
[5,25]. Write-back caches are less suitable for multiprocessor systems as
explained below.
2.4.1 Write Through
Write through ensures that main memory is updated whenever the cache is,
by writing to memory each time the cache is written. But it
does not ensure that the cache is updated when main memory is updated. Main
memory can become inconsistent with cache due to DMA activity, I/O
operations and writes by other processors. The most common solution is to
have each cache watch the system bus for write operations directed toward
resident blocks. When detected, the target entry may be either invalidated or
updated to reflect the new value. In order to reduce interference with normal
processor accesses, the tag RAMs are often duplicated and one set devoted
to the monitoring operation.
When using write through a decision must still be made concerning
what action to take in the event of a write miss. The referenced block
could be brought into the cache (sometimes referred to as allocation on
write) or only updated in main memory. While fetching the block might
improve the hit rate, it may actually increase bus traffic and stall time since
the frequency of references to the block is not known and the bus traffic
encountered is the same as that which would occur if the block were
allocated on a demand basis.
Another related design trade-off involves how the cache is treated
upon write hits. The cache may be updated simultaneously with main
memory or the referenced block might simply be invalidated.
2.4.2 Write Back
Write back reduces bus traffic by eliminating the constraint of
immediately transmitting writes to main storage. Since the contents of main
memory do not necessarily reflect the most recent cache updates, a more
sophisticated algorithm is needed to maintain coherency. Therefore write
back (also called copy back) is difficult to realise in multiprocessing
systems, where other
processors have to be immediately informed of any datum change. As write
back uses a "dirty" bit to mark blocks that have been modified, it needs extra
logic to implement the "dirty" bits, which is not necessary in write through.
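The difference in bus traffic between the two update strategies can be illustrated with a deliberately simplified Python sketch. It is not the model used in our simulations: the cache is assumed to be large enough that no block is evicted, all dirty blocks are flushed once at the end, and the function name and trace are hypothetical. Note also that a write-through transaction carries a single store, whereas a write-back transaction carries a whole block.

```python
def bus_writes(trace, policy):
    """Count write transactions sent to main memory.
    `trace` is a list of (op, block) pairs with op 'r' or 'w'."""
    if policy == "write-through":
        # Every store is transmitted to main memory immediately.
        return sum(1 for op, _ in trace if op == "w")
    if policy == "write-back":
        # A store only sets the dirty bit; each dirty block is written
        # to memory once (here: at the final flush, since nothing is evicted).
        dirty = set()
        for op, block in trace:
            if op == "w":
                dirty.add(block)
        return len(dirty)
    raise ValueError("unknown policy: " + policy)


trace = [("w", 3), ("w", 3), ("r", 3), ("w", 3), ("w", 7)]
print(bus_writes(trace, "write-through"), bus_writes(trace, "write-back"))
```

For this trace, write through issues four memory writes while write back issues two, because the three stores to block 3 are merged behind a single dirty bit.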
2.5 Cache Management
Algorithms to manage the cache operation are implemented in
hardware or in software. These include demand fetching, prefetching
and replacement policies.
2.5.1 Demand Fetching and Prefetching
A fetching policy is the mechanism that decides when to move
data from main memory to the cache, and which data to move. Fetching
policies can be classified as either demand fetching or prefetching.
A demand fetching policy moves information from the main
memory to the cache when information is requested by the processor.
Therefore, the processor has to wait until the requested data arrive from
main memory. All cache memory systems must support demand fetching,
which is invoked when a miss occurs in the cache memory.
Prefetching makes use of idle cycles to transfer data to the cache.
There are two approaches: static prefetching, which is done at compile
time, and dynamic prefetching, which is done at run time [21]. Dynamic
prefetching usually takes the form of one block (or transfer unit) look ahead
cache (i.e., fetching block i+1 when there is a reference to block i). Static
prefetching involves predicting the most likely needed block to move into the
cache.
In our study, we will look at only dynamic prefetching. There are
several different prefetching methods. They are always-prefetch,
miss-prefetch, tagged-prefetch, load-forward-prefetch and
wrap-around-prefetch. Always-prefetch prefetches after every demand
reference. Miss-prefetch prefetches after every demand miss.
Tagged-prefetch prefetches after the first demand miss to a (sub)-block.
Load-forward-prefetch (sub-block placement only) works like always-prefetch
within a block, but it will not attempt to prefetch sub-blocks in other blocks.
Wrap-around-prefetch (sub-block placement only) works like always-prefetch
within a block, except for references near the end of a block, where the
prefetch wraps around to the beginning of the current block [22]. All of the
prefetching policies just mentioned will
only be studied with subblocks. By using subblocks and prefetching we can
combine the miss ratio benefits of a larger block size with the low bus traffic
of a subblock.
Program and data references within a cache block exhibit a forward
bias. A program typically branches to a random location within a cache
block, proceeds sequentially forward, and then branches again. Data
references also tend to proceed forward because of processing of arrays,
character strings, and individual variables whose storage is defined by the
programmer in order of use [21].
2.5.2 Replacement Algorithms
When a miss occurs in the cache, one of the entries of the set must be
selected for replacement. (In the direct mapped cache there is only one
choice and in the fully associative cache there are n choices, where n is the
size of the cache). Two common replacement algorithms that are at the
opposite ends of the cost/performance spectrum are RANDOM and LRU
(Least Recently Used). RANDOM is easy to implement. The entry to
be replaced is chosen based on some random event; for example, the lower
bits of the real-time clock. LRU (Least Recently Used) maintains an ordering
of the entries (either by a list or by matrix of relationships) in the order of
most recently used to least recently used. When an entry is referenced it is
moved to the most recently used position. And when an entry is needed for
replacement, the least recently used entry is selected.
LRU is complicated to implement if the set size is larger than two. We
will only be using LRU in our simulations as studies [17,18] have shown
that LRU and RANDOM show the same level of performance.
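The two replacement algorithms can be illustrated with a minimal sketch of a
single cache set. This is not the simulator used in this study; the class
names LRUSet and RandomSet and the helper miss_count are hypothetical, and a
seeded pseudo-random generator stands in for the "random event" (such as
the low bits of a real-time clock) mentioned above.

```python
import random
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement; holds at most `ways` block tags."""

    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()  # least recently used entry comes first

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then installed)."""
        if tag in self.tags:
            self.tags.move_to_end(tag)     # becomes the most recently used entry
            return True
        if len(self.tags) >= self.ways:
            self.tags.popitem(last=False)  # evict the least recently used entry
        self.tags[tag] = None
        return False


class RandomSet:
    """One cache set with RANDOM replacement (seeded generator is an assumption)."""

    def __init__(self, ways, seed=0):
        self.ways = ways
        self.tags = []
        self.rng = random.Random(seed)

    def access(self, tag):
        if tag in self.tags:
            return True
        if len(self.tags) >= self.ways:
            self.tags[self.rng.randrange(self.ways)] = tag  # overwrite a random entry
        else:
            self.tags.append(tag)
        return False


def miss_count(cache_set, tags):
    """Replay a sequence of block tags through one set and count the misses."""
    return sum(0 if cache_set.access(t) else 1 for t in tags)
```

For example, miss_count(LRUSet(2), [1, 2, 1, 3, 2]) counts four misses: the
reference to block 3 evicts block 2, which is the least recently used entry
at that point. RandomSet needs no ordering state at all, which is why it is
the cheaper of the two to implement.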
Chapter 3
Review of Past Work and Our Motivation and
Approach
3.1 Review of Past Work
An excellent survey article by Alan Smith [18] thoroughly and
systematically analyzes the influence of various cache design parameters on
performance using trace-driven simulation. The cache aspects analyzed
include cache fetch algorithm, placement algorithm, line size, replacement,
main memory update, cold-start versus warm-start effects, I/O, split
instruction/data cache, virtual versus real addressing, cache size, and
bandwidth issues. The study used a large number of traces of PDP-11 and
IBM 360/370 series of computers.
Goodman [20] argued that in a real system, especially in a VLSI
implementation, and in multiprocessors, the required bus bandwidth is a
critical performance issue. He showed ways of using caches to reduce this
bandwidth and proposed using the traffic ratio, defined as the ratio of
cache-to-memory traffic to processor-to-cache traffic, as a performance
metric. An important conclusion was that block-size variation must be
carefully assessed not only in terms of the miss rate but also in terms of
traffic ratio.
Clark [34] presented a set of thorough measurements of caches in real
working systems. Clark measured the performance of the VAX-11/780
cache for real workloads. Besides helping in the design of future systems,
these results also helped calibrate and validate previous simulation cache
models. The reported miss rates were higher than the miss rates predicted by
earlier trace-driven simulation studies. The degradation was attributed to the
presence of operating-system references, realistic multitasking, and I/O
activity. The failure to capture these effects was recognized as a key
drawback of trace-driven simulation and exposed the need for better traces
and analysis methods.
Smith's paper [17] on workload selection for cache studies has an
excellent discussion of the relative merits and demerits of trace-driven
simulation. The paper shows that poor workload selection can lead to
severely distorted results. The emphasis is on the choice of realistic traces of
large programs that do not fit trivially into the cache. The paper also presents
a suite of cache results using traces from a wide variety of computer systems
including the IBM 370, the IBM/91, the DEC VAX, the Zilog 8000, the
CDC 6400, and the Motorola 68000, which could be used by a computer
architect to design a machine.
In recent times, microprocessor cache memories have become increasingly
prevalent, introducing a new set of trade-offs [4,41]. Hill and Smith [9] also
evaluated microprocessor on-chip cache memories to study the impact of
various cache parameters on the miss ratio and traffic ratio. Their results
indicate that transferring units smaller than the block size (called sub-block
placement or sector placement) results in better overall utilization of silicon
real estate. Prefetching techniques such as load forward (fetching from the
current word to the end of the block) are shown to be more advantageous.
3.2 Motivation
As mentioned briefly in the introduction, the motivation of this study
is to quantify and characterize the processor memory traffic of different on-
chip cache designs. The cache system that we shall study is modeled as
shown in Figure 3.1.
[Figure omitted: block diagram of the cache system, with a unified or split
cache.]
Iu Instruction Unit
Eu Execution Unit
Mmu Memory Management Unit
M Memory (primary)
SB System Bus
Fig. 3.1 Our Cache System
We have chosen this microprocessor system because it is one of the
most frequently encountered [11]. We are only interested in how each of the
cache designs will influence the memory traffic by studying the change in the
traffic ratios. As defined before, traffic ratio is the ratio of memory bus
traffic in a system with a cache to that without a cache [9]. We will be
looking at how cache design choices that improve the miss ratio, which is
one design goal mentioned, may cause overall system performance to fall
due to traffic ratio increases. We will be proposing some cache designs that
improve the miss ratio and at the same time minimize the increase in traffic
ratio. As mentioned in section 3.1, the discussion in the literature of traffic
ratios for cache designs is usually brief. This is surprising since it is such
an important performance metric. Previous studies also lack good traces that
include operating systems and multitasking, and this motivates a more
thorough study of miss ratios and traffic ratios with these traces.
3.3 Our Approach
We will be using Trace-Driven Simulation and a set of 8 program
traces to do all the simulation studies. The description of each simulation will
be discussed in the next chapter under Simulations, Observations and
Results. The cache simulator was introduced in section 1.4. Each of the
traces will be described in the last section of this chapter. Different issues
relating to cache modeling, simulation and how traces are collected are
discussed in the next few sections. They are very important topics as they
form part of the approach taken by this study. The reason why on-chip cache
is studied is also explained in the next section.
3.3.1 Why On-chip Cache?
Advances in integrated circuit density are permitting the single chip
implementation of features, functions and performance enhancements
beyond those of basic sixteen bit machines that we had a few years ago.
Processors that are now being designed include not only full 32-bit
architecture instruction sets, but they also have sufficient area for
performance enhancements such as buffering, pipelining, and cache
memories [30,32]. Due to a lack of real estate on the chip, we have to make
the best use of the chip area, and on-chip caches are one of the best choices.
As mentioned, caches are a time-tested mechanism for improving memory
system performance, by reducing access time and memory traffic through the
exploitation of spatial and temporal locality.
On-chip caches may differ slightly from traditional caches. Initially
these on-chip caches will be small (32 to 8 kbytes) because the limited chip
area must be allocated among the instruction set implementation, the cache,
and other possible enhancements. The two microprocessors that have
dominated the microprocessor market, the i80486 and the MC68040, have only
8 kbytes of cache [11]. As fabrication techniques improve, we can have more
room for on-chip caches.
3.3.2 Trace Driven Simulation
Cache miss rates can be derived by one of three methods: (1)
hardware measurement, (2) analytical models and (3) trace-driven
simulation (TDS) [1,2,39]. Hardware measurement, an expensive
technique, involves instrumenting an existing system and observing the
performance of the cache. This scheme is inflexible because the cache
parameters cannot be easily varied. Analytical models and trace-driven
simulation do not have this drawback, although they have their own set of
disadvantages.
Analytical models of caches estimate cache performance quickly, at
the cost of accuracy. Mathematical models can give more insight into the
behaviour of caches than other experimental techniques. In addition, models
can be used to suggest useful ways of improving cache performance by
changing the cache organization or the program structure after studying
program-cache interactions.
Trace-driven simulation (TDS) is perhaps the most popular
method for cache performance evaluation. TDS evaluates a model of a
proposed system using previously recorded address traces as the external
stimuli. Address traces are streams of addresses (usually of address space)
generated during the execution of computer programs. TDS involves
studying the effects of varying the input trace and model parameters on the
behaviour of the model outputs. Chief among its advantages are flexibility,
accuracy, and ease of use. Being a software technique, TDS does not require
expensive hardware support. The experiments are repeatable and the same
data can be used to compare multiple cache strategies. For the past several
years trace-driven simulation has been the main type of cache performance
estimation [9,14,18,32].
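The idea of trace-driven simulation can be made concrete with a minimal
sketch that replays an address trace through a direct-mapped cache and
reports the miss ratio and the traffic ratio. This is an illustration only,
not the simulator used in this study. The function name simulate, the
(kind, address) trace format and the default parameters are assumptions, as
are the simplifications that every miss fetches one whole block, that
write-backs are ignored, and that a system without a cache would move one
word per reference. The trace must be non-empty.

```python
def simulate(trace, cache_bytes=2048, block_bytes=32, word_bytes=4):
    """Replay (kind, address) references through a direct-mapped cache.

    Returns (miss_ratio, traffic_ratio). Assumptions of this sketch: reads
    and writes both allocate on a miss, and dirty write-backs are not counted.
    """
    num_blocks = cache_bytes // block_bytes
    tags = [None] * num_blocks        # one stored tag per cache block frame
    misses = 0
    for _kind, addr in trace:
        block = addr // block_bytes   # block number in main memory
        index = block % num_blocks    # direct-mapped placement
        tag = block // num_blocks
        if tags[index] != tag:
            misses += 1
            tags[index] = tag         # demand fetch of the whole block
    refs = len(trace)
    miss_ratio = misses / refs
    # Bus traffic with the cache: one block per miss.
    # Bus traffic without a cache: one word per reference (assumption).
    traffic_ratio = (misses * block_bytes) / (refs * word_bytes)
    return miss_ratio, traffic_ratio


# A hypothetical trace: 64 sequential word reads covering 8 blocks.
sequential = [("r", addr) for addr in range(0, 256, 4)]
print(simulate(sequential))
```

Because the same recorded trace can be replayed with different values of
cache_bytes and block_bytes, the experiment is repeatable and the designs
can be compared on identical input, which is the main attraction of the
method. Note that two blocks that map to the same frame and are referenced
alternately miss every time, and the traffic ratio then exceeds one.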
3.3.3 Current Tracing Techniques
The usefulness of trace-driven simulation depends on the integrity of
the traces used to drive the simulations. The importance of obtaining traces
that provide accurate cache performance predictions motivates us to mention
some of the methods conventionally used to collect traces and the method
used to collect the traces in this study. Typical tracing
schemes include: hardware monitors that watch the address bus for
memory transactions; software simulators that can generate address
traces by interpretively executing an instruction stream; some hardware
assisted methods such as the VAX T-bit [33] technique, which traps
every instruction into the operating system when the T-bit is set to enable
recording all memory references; and analytical program behaviour models
that can generate synthetic reference streams.
A hardware monitor is a device that plugs onto a working
backplane bus to record all bus transactions of a working computer. An
example of a hardware monitor is described by Clark [34]. Often these
monitors record only counts of events rather than the actual addresses
themselves, because they have limited memory.
An example of a built-in tracing mechanism has been mentioned above.
By using the T-bit facility that is provided in the VAX processor architecture
we can cause a trap to be taken at the beginning of each instruction. These
traps are intercepted by the operating system kernel which transfers control
to a user-tracing process, whereupon the address of the instruction can be
directly recorded. Data references can be determined by interpretively
executing the instruction.
We can also generate traces by using software simulators that can
interpretively execute programs to generate a sequence of addresses.
Examples include MILS: Mips Instruction Level Simulator [35]; and
TRACER [36]. TRACER is a program that generates address traces for
programs executing on a VAX computer under the Berkeley 4.2 VAX/UNIX
operating system. In this method, the simulation model of the architecture
interpretively executes instructions and writes a stream of virtual addresses.
Generative analytical program models, such as the Independent
Reference Model (IRM) and the LRU Stack Model [37] can be used to create
synthetic address traces. Usually these traces are only used to provide first
order insights into the operation of the system.
When a tracing method is unable to record the complete address
stream of the running machine, it introduces a distortion into the generated
trace. Many techniques suffer from omission distortion, the failure to
capture some of the addresses of the running system. Common omission
distortions are: no record of addresses (data or instruction) generated by the
operating system, and no tracing of multiple tasks. These distortions are
common because it is quite difficult to write a simulator that takes into
account all the operating system features such as I/O, interrupts, context
switching, and time sharing. Even using hardware assisted tracing this
information is hard to obtain. On most machines, interrupt-handler activity is
not interruptible. Thus, any method that relies on some operating system
support will not be able to trace interrupt-handler activity [1,2].
Another limitation of some tracers is small trace sample size. This
problem arises when the tracing mechanism records addresses until a buffer
is filled up, then stops tracing until the buffer is written to disk or tape. This
problem is most severe with hardware monitors that can only record 1000 to
10000 addresses before filling up. The addresses in each sample are
coherent, but two successive samples may be completely unrelated. Many
simulation studies show a transient behaviour at the beginning of each
sample. If the sample size is too small, startup distortion can unrealistically
bias the conclusions [38].
A few techniques capture an address stream that is unique to a
particular hardware implementation and does not represent other
implementations. For example, an address trace generated by a hardware
monitor on the memory bus captures no data about references that hit in a
cache, and a cache bus address trace may capture redundant instruction-
buffer prefetches. This bias causes distortion, and only sometimes can it be
compensated for by the programs that use the traces.
There is a new way of obtaining traces, proposed by researchers at
Stanford [1], that tries to relieve some of these problems. It is called Address
Tracing Using Microcode (ATUM). The basic idea behind ATUM is to do the
tracing "below" the operating system, in microcode. By making minor
changes to the existing microcode of a machine, a trace of all addresses that
are touched by the processor can be stashed away in a reserved area of main
memory, and periodically written to disk or tape. These traces represent the
addresses generated by the processor with perfect fidelity [1].
Microcode tracing is applicable to any machine where modifications to
the microcode are possible. Addresses are generated by appropriate
microcode routines for macroinstruction fetches and data accesses. At this
level, the addresses directly correspond to the addresses that the architecture
specification of the machine requires. The addresses are not tainted by
implementation-specific resources such as prefetch buffers, caches, or bus
sizes. Recording these addresses as they are generated produces undistorted
traces.
ATUM is independent of the operating system. With the same
microcode, all the operating systems that run on the machine can be traced.
Even interrupt-handler execution is visible because the microcode used is
independent of the interrupt status of the processor. A full multitasking
workload can be traced. Thus, there is no operating system omission
distortion and no multitasking omission distortion. As no additional
hardware is required, ATUM is hence cost effective. Sample sizes of a
million are feasible. Since full addresses can be recorded, there is no
granularity distortion. By recording the instruction-stream and data-stream
addresses as each instruction is interpreted, there is no implementation
distortion. By running 10x slower than normal, there is some time distortion
introduced by ATUM for I/O interrupts, but it is not as severe as that of
software simulation running 100x or 1000x slower than normal.
3.3.4 Trace Description
We have a total of 8 traces. Five of the traces are taken from VMS
operating system applications. They are IVEX, DECO, LISP, MUL2 and
MUL8. IVEX is a sample taken from a DEC program that checks the
interconnect in a VLSI chip. DECO is a trace of DECSIM, a behavioural
simulator at DEC, simulating some cache hardware. LISP consists of LISP
runs of BOYER (a theorem prover). MUL2 and MUL8 are traces taken from
multiprogramming samples to study the impact of process switching. The
processes active in MUL2 are a FORTRAN compile of LINPACK and the
microcode address allocator. Those in MUL8 include two FORTRAN
compiles of LINPACK and a program called 4x1x5, the microcode address
allocator, a SPICE run, a PASCAL compile of a microcode parser, and
LINPACK, a numerical benchmark JACOBI, a string search in a file and
MACRO, an assembly level compile. The five traces just mentioned are
ATUM traces that were obtained from Stanford. Besides these five ATUM
traces, we have three benchmark traces. The benchmarks are GCC, TEX and
SPICE. GCC is a C compiler and the traces are samples of it. TEX is a word
processing program. SPICE is a program that analyses circuits. These three
traces were also obtained from Stanford for use in this study.
Chapter 4
Simulations, Observation and Results
4.1 The System Model
The entire cache design space is incredibly diverse. At any one time,
cache researchers must restrict themselves to a portion of the overall design
space [31]. By consistently using a smaller number of base scenarios, we
can more thoroughly examine a small region of the overall design space.
Selecting the base systems so that they are representative of a common or
interesting class of machines makes the results valuable despite their limited
domain. Only two base scenarios are used in the simulation experiments that
follow: one with a single split I/D cache and a second with the same cache
but with a different write policy.
Both models have a Harvard organization. In the first model the split I
and D caches are 2 kbytes each, organized as 32 bytes per block, direct-
mapped using write-back with LRU replacement. The second model also
uses a split cache but with a write-through policy.
4.2 Simulations, Observations and Discussions
4.2.1 Cache Size
To see how the size of the cache affects the miss and traffic ratios, we
have varied the cache size from 1k to 64k bytes. As can be seen in Fig. 4.1,
Fig. 4.2 and Fig. 4.20, when the cache becomes larger both the miss and
traffic ratios fall. This is reasonable: with bigger caches, capacity
misses are reduced. To have smaller miss and traffic ratios we need a larger
cache. In Appendix A, results from simulations using multiprogram traces
show similarly that larger caches reduce both the miss and traffic ratios.
Two very important questions when selecting a cache design are how
large should the cache be and what kind of performance can we expect. The
cache size is usually dictated by a number of criteria having to do with the
cost and performance of the machine. The cache should not be so large that it
represents an expense out of proportion to the added performance, nor
should it occupy an unreasonable fraction of the physical space within the
processor. A very large cache may also require more access circuitry, which
may increase the access time.
Aside from the warnings given, one can generally assume that the
larger the cache the lower the miss ratio and the better the performance, as
shown in the results above and also in other studies [14,17,18]. The issue is
then one involving cache size and miss ratio. This is a very difficult problem,
as mentioned in Smith [17,18], since the cache miss ratio varies with the
workload and the machine architecture. As shown in our own simulation
results, all of the traces have different, distinct miss and traffic ratios as we
vary the cache size.
4.2.2 Associativity
In order to locate an element in the cache, it is necessary to have some
function which maps the main memory address into a cache location, or to
search the cache associatively, or to perform a combination of the two. The
placement algorithm determines the mapping function from main memory
address to cache location.
The most commonly used form of placement algorithm is set-associative
mapping. It involves organizing the cache into S sets of E elements as
explained in section 2.3. Given a memory address r(i), a function f will map
r(i) into a set s(i), so that f(r(i)) = s(i). If S becomes one, then the cache
becomes a fully associative memory. The problem is that the large number of
blocks in a cache would generally make a fully associative cache memory
both slow and very expensive. Conversely, if E becomes one, in an
organization known as direct-mapping, there is only one element per set.
Since the mapping function f is many to one, the potential for conflict in this
latter case is quite high: two or more currently active blocks may map into the
same set. Generally, the conflict and the miss ratio decline with increasing E
(as S*E remains constant), while the cost and access time increase. E is
the level of associativity.
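The mapping from a memory address to a set can be sketched as follows. The
text above does not fix a particular function f; this sketch assumes the
common bit-selection mapping, in which the set is the block number modulo S.
The function names organization and locate are hypothetical.

```python
def organization(cache_bytes, block_bytes, ways):
    """Return (S, E): the number of sets and the elements per set."""
    num_blocks = cache_bytes // block_bytes
    return num_blocks // ways, ways


def locate(addr, num_sets, block_bytes):
    """Split an address into (tag, set index, byte offset within the block).

    The set index plays the role of s(i) = f(r(i)); bit selection is an
    assumption of this sketch, not a mapping prescribed by the text.
    """
    block = addr // block_bytes
    set_index = block % num_sets
    tag = block // num_sets
    offset = addr % block_bytes
    return tag, set_index, offset


# A 2 kbyte cache with 32-byte blocks, as in the base model of this study.
print(organization(2048, 32, 1))   # direct-mapped: E = 1
print(organization(2048, 32, 64))  # fully associative: S = 1
print(locate(0x1234, 64, 32), locate(0x1234 + 2048, 64, 32))
```

In the direct-mapped organization the two addresses in the last line fall
into the same set with different tags, so they compete for the single element
of that set. With a larger E and the same S*E they could both stay resident.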
In Fig. 4.3 and Fig. 4.4 the level of associativity is varied from 1,
which is a direct-mapped cache, to 16. We found that the miss and traffic
ratios do not improve beyond a set associativity of 4. This is similar to some
studies reported [16,17]. Both the Intel i486 and Motorola MC68040 have 4-
way set associative caches. The reason, as explained above, is that beyond
4-way set associative mapping the miss ratio does not improve significantly
and more complicated circuitry is needed.
4.2.3 Block Size
As the cache block size increases the miss ratio falls, as shown in Fig.
4.7. But the traffic ratio also increases (Fig. 4.8). It can be seen that a block
size of 64 bytes has a lower miss ratio for a certain traffic ratio than the
smaller block sizes represented. We cannot have caches in which blocks are
too big because big block sizes will increase the access time per demand
fetch. This is why it may be feasible to have subblocks instead of just having
blocks as the unit for demand fetching.
Prefetching in cache design has its advantages. This is shown in Fig.
4.9 and Fig. 4.10. The miss ratio improves significantly for smaller block
sizes but for bigger block sizes it levels off. The reason is that if we
prefetch smaller blocks there is less tendency to prefetch data or
instructions that we do not need. This is investigated further using subblocks
with prefetching in the next subsection. The disadvantage of prefetching is
the increase in traffic ratio. This is shown in Fig. 4.10 and Fig. 4.8. The
increased traffic ratio results since prefetching fetches data and/or
instructions indiscriminately. Even if the datum or instruction is not requested
by the processor, it is fetched, thus increasing the overhead.
We should note also that the miss ratio when using prefetching may
not be lower than the miss ratio for demand fetching. The problem here is
cache memory pollution [18]; prefetched lines may pollute memory by
expelling other lines which are more likely to be referenced. Smith
mentioned [21] that the major factor in determining whether prefetching is
useful was the line size. He said that lines of 256 or fewer bytes (such as are
commonly used in caches) generally result in useful prefetching; larger
blocks made prefetching ineffective. The reason for this is that a prefetch to a
large block brings in a great deal of information, much or all of which may
not be needed, and removes an equally large amount of information, some of
which may still be in use. This is why our simulation results for miss ratio
for prefetching level off and remain steady even with bigger block sizes.
Therefore to find a particular block size for a certain cache size we only need
to find the block size at the knee of the curve that plots miss ratio vs block
size.
There is an interesting observation that we can find in Fig. 4.7 to Fig.
4.10 for the LISP trace. The plots show that the miss and traffic ratios for
LISP fall in the beginning, after which they rise. The reason is that LISP
programs do not exhibit very good spatial locality due to their data structures.
Another observation is that block sizes of 2 and 4 are too small for our cache
design as the traces are taken from a 32-bit (4-byte) machine. This is
reflected in Fig. 4.7 and Fig. 4.8 where the miss ratio remains at the same
level in the beginning.
4.2.4 Subblock
We can see in Fig. 4.5 that the miss ratio falls as we have bigger
subblocks. The reason is that program and data references within a cache
block exhibit a forward bias. A program typically branches to a random
location within a cache block, proceeds sequentially forward, and then
branches again. Data references also tend to proceed forward because of
processing of arrays, character strings and individual variables whose
storage is defined by the programmer in order of use [21]. The simulation
results show that subblocks reduce the traffic ratio.
We considered five types of prefetching with subblocks in this study:
(1) always-prefetch, (2) missprefetch, (3) tagged-prefetch, (4) load-forward-
prefetch, and (5) wrap-around-prefetch. Always-prefetch means that on
every memory reference, access to subblock i (for all i) implies a prefetch
access for subblock i + 1. Missprefetch implies that a reference to subblock i
causes a prefetch to subblock i + 1 if and only if the reference to subblock i
itself was a miss. For tagged-prefetch we associate with each subblock a
single bit called the tag, which is set to one whenever the subblock is
accessed by a program. It is initially zero and is reset to zero when the
subblock is removed from the cache. Any subblock brought to the cache by a
prefetch operation retains its tag of zero. When a tag changes from 0 to 1
(i.e., when the subblock is referenced for the first time after prefetching or
is demand-fetched), a prefetch is initiated for the next sequential subblock.
Load-forward involves fetching the target subblock and the subsequent
subblocks within the same block. Subblock wrap-around-prefetch works
like always-prefetch within a block, except that when a reference falls near
the end of a block the prefetch references wrap around within the current
block.
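The first three policies differ only in the condition that triggers a
prefetch of subblock i + 1, which the following minimal sketch makes explicit.
It is an illustration rather than the simulator used in this study. The
function name run is hypothetical, the cache is assumed to be unbounded (no
replacement), and a prefetch is counted only when subblock i + 1 is not
already present. The block-bounded policies, load-forward and wrap-around,
are omitted.

```python
def run(policy, refs):
    """Replay subblock numbers; return (demand misses, prefetches issued)."""
    resident = {}  # subblock number -> tag bit (1 = referenced by the program)
    misses = prefetches = 0
    for i in refs:
        hit = i in resident
        first_use = resident.get(i, 0) == 0  # tag is about to change 0 -> 1
        if not hit:
            misses += 1                      # demand fetch of subblock i
        resident[i] = 1
        if policy == "always":
            want = True                      # on every reference
        elif policy == "miss":
            want = not hit                   # only when subblock i missed
        elif policy == "tagged":
            want = first_use                 # first reference after arrival
        else:
            raise ValueError(policy)
        if want and (i + 1) not in resident:
            resident[i + 1] = 0              # prefetched subblocks keep tag 0
            prefetches += 1
    return misses, prefetches


for policy in ("always", "miss", "tagged"):
    print(policy, run(policy, range(6)))
```

On a purely sequential reference stream, always-prefetch and tagged-prefetch
take a single demand miss, while missprefetch misses on every second
subblock, because a reference that hits on a prefetched subblock does not
trigger a further prefetch.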
Fig. 4.11 to Fig. 4.19 show the different fetching and prefetching
policies that we can use with subblocks to try to improve the miss ratio and
minimize the degradation in performance due to an increase in the traffic
ratio. As can be seen, every subblock fetching policy reduces the miss and
traffic ratios over a design which does not use subblock fetching. Among the
different subblock fetching policies, subblock-load-forward-prefetch and
subblock-wrap-around have similar results for miss and traffic ratios.
Subblock-missprefetch has better miss and traffic ratios. But subblock-
prefetch-always and subblock-tagged-prefetch have even better miss ratios,
though their traffic ratios also increase over subblock-missprefetch. The
reason is that subblock-prefetch-always and subblock-tagged-prefetch tend to
prefetch data and instructions indiscriminately.
Two of these prefetch algorithms were tested by Smith [21]: always-
prefetch and missprefetch. It was found that always-prefetch reduced the
miss ratio by as much as 75 to 80 percent for large cache memory sizes,
while increasing the traffic ratio by 20 to 80 percent. Missprefetch was much
less effective; it produced only one half, or less, of the decrease in miss ratio
produced by always-prefetch. The traffic ratio increased by a much smaller
amount, typically 10 to 20 percent. In [18], Smith also tested the tagged-
prefetch algorithm. It was found that tagged-prefetch was as effective as
always-prefetch in reducing the miss ratio. Missprefetch was less than half
as good as always-prefetch or tagged-prefetch in reducing the miss ratio.
These studies by Smith were only done for blocks, but we can see a
similarity in their usefulness for subblocks as shown by our simulation
results.
4.2.5 Write-Back and Write-Through
All simulations, except those with plots shown in Fig. 4.28 to Fig.
4.33, have write-back as the write policy. From the simulations done, we see
that the miss ratio is higher for write-through than write-back. We have only
repeated a few of the simulations with write-through. Simulations are chosen
on the basis of illustrating the characteristics of write-through caches. The
design parameters that are chosen are subblock size, block size and block-
prefetch. From the simulations, we see that write-through has a higher traffic
ratio than write-back for smaller block sizes but a lower one for bigger blocks.
This is shown in Fig. 4.28 to Fig. 4.33. As reported in most literature
[9,17,18] write-through tends to have a higher miss and traffic ratio than
write-back. For a machine which uses write-through, by which memory is
written to on every store instruction, the write frequency is usually just the
frequency in the trace of stores to memory. If the machine uses write-back,
however, the frequency of writes to memory is the miss ratio times the
probability that a line to be pushed is dirty. As the block or subblock sizes
become bigger, bigger blocks will be pushed when they become dirty, thus
resulting in an increase in traffic ratio.
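The two write frequencies described above can be turned into a rough estimate
of write traffic per memory reference. The sketch below is a
back-of-the-envelope model, not a result of our simulations. The function
names are hypothetical, the numbers in the example are invented for
illustration, and it assumes that write-through sends one word per store
while write-back sends one whole block per dirty push.

```python
def write_through_traffic(store_fraction, word_bytes=4):
    """Bytes written to memory per reference: every store goes to memory."""
    return store_fraction * word_bytes


def write_back_traffic(miss_ratio, dirty_probability, block_bytes=32):
    """Bytes written to memory per reference: a miss may push a dirty block."""
    return miss_ratio * dirty_probability * block_bytes


# Hypothetical workload: 10% of references are stores, 5% miss ratio,
# and half of the replaced blocks are dirty.
print(write_through_traffic(0.10))
for block_bytes in (8, 32):
    print(block_bytes, write_back_traffic(0.05, 0.5, block_bytes))
```

With these invented numbers write-back moves fewer bytes than write-through
when the block is 8 bytes and more when it is 32 bytes. In a real cache the
miss ratio and the dirty probability also change with the block size, so the
model only illustrates the direction of the effect.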
4.2.6 Multiprogram Traces
Typically, a program executes for a period of time before an
interruption (for example I/O) of some type invokes the supervisor mode.
The supervisor mode eventually relinquishes control of the processor to
some user process, perhaps the same one as was running most recently. If it
is not the same user process, the new process probably does not find any
blocks of its address space in the cache, and starts immediately with a
number of misses. If the most recently executed process is restarted, and if
the supervisor interruption has not been too long, some useful information
may still remain.
The effect of the task-switch interval on the miss ratio cannot be easily
estimated [18]. In particular, the effect depends on the workload and on the
cache size. We observe that the proportion of cache misses due to task
switching increases with increasing cache size, even though the absolute
miss ratio declines. This is because a small cache has a large inherent miss
ratio (since it does not hold the program's working set) and this miss ratio is
only slightly augmented by task-switch-induced misses.
We have used 2 sets of multiprogram traces. The same set of
simulations performed with single program traces was repeated with these
traces. The simulation results are shown in Appendix A. The results show
no distinct difference between single and multiprogram traces. This is
because of the difficulty of analysing miss and traffic ratios for multiprogram
traces as explained in the previous paragraph. But if we compare the results
of the two multiprogram traces, MUL2 and MUL8, we see that the trace that
contains the eight processes has higher miss and traffic ratios for all the
simulations done. This shows that when designing caches,
multiprogramming is an important aspect. In the summary and conclusion
section, we will discuss some possible solutions to the problem of high
cache miss ratios due to task switching.
4.3 Simulation Results
[Plots omitted. Each figure plots the single program traces CC, SPICE, TEX,
LISP, IVEX and DECO.]
Fig. 4.1 Miss Ratio Vs Size (single) [x-axis: Size (kbytes)]
Fig. 4.2 Traffic Ratio Vs Size (single) [x-axis: Size (kbytes)]
Fig. 4.3 Miss Ratio Vs Associativity (single) [x-axis: Associativity]
Fig. 4.4 Traffic Ratio Vs Associativity (single) [x-axis: Associativity]
Fig. 4.5 Miss Ratio Vs Subblock (single) [x-axis: Subblock (bytes)]
Fig. 4.6 Traffic Ratio Vs Subblock (single) [x-axis: Subblock (bytes)]
Fig. 4.7 Miss Ratio Vs Blocksize (single) [x-axis: Block (bytes)]
Fig. 4.8 Traffic Ratio Vs Blocksize (single) [x-axis: Block (bytes)]
Fig. 4.9 Miss Ratio Vs Blocksize (prefetch, single) [x-axis: Block (bytes)]
Fig. 4.10 Traffic Ratio Vs Blocksize (prefetch, single) [x-axis: Block (bytes)]
Fig. 4.11 Miss Ratio Vs Subblock (single), Subblock Prefetch-wrap-around
[x-axis: Subblock (bytes)]
Fig. 4.12 Traffic Ratio Vs Subblock (single), Subblock Prefetch-wrap-around
[x-axis: Subblock (bytes)]
Fig. 4.13 Miss Ratio Vs Subblock (single), Subblock Load-forward-prefetch
[x-axis: Subblocks (bytes)]
Fig. 4.14 Traffic Ratio Vs Subblock (single), Subblock Load-forward-prefetch
[x-axis: Subblocks (bytes)]
0.4
0.3 -
o
0.2 -
0.1 -
0.0
0
Miss Ratio Vs Subblock (single)
subblock missprefetch
10 20
Subblock(bytes)
Fig. 4.15Miss Ratio Vs Subblock (single)
Subblock Missprefetch
51
single
program
traces
single
program
traces
CC
SPICE
TEX
LISP
IVEX
DECO
--D CC
SPICE
TEX
LISP
IVEX
DECO1.0
0.8
0.6
0.4
0.2
0.0
0
Traffic Ratio Vs Subblock (single)
subblock missprefetch
10
Subblock(bytes)
20
Fig. 4.16Traffic Ratio Vs Subblock (single)
Subblock Missprefetch
0.5
0.4
0.3
0.2
0.1
0.0
Miss Ratio Vs Subblock (single,multi)
subblock tagged prefetch
0 10
Subblock (bytes)
20
52
single
program
traces
-0- (:)C
-- SPICE
TEX
LISP
IVEX
DECO
program traces
- 0- CC
0 SPICE
is TEX
LISP
IVEX
0- DECO
MULT2
MULT8
Fig. 4.17Miss Ratio Vs Subblock (single,multi)
Subblock Tagged-prefetch1.2 -/.
1.0 -
0.8 -
0.6 -
0.4 -
0.2 -
0.0
Traffic Ratio Vs Subblock (single, multi)
subblock tagged-prefetch
53
program traces
0
i
1 0
Subblock (bytes)
20
_a_A
.....ar....
Fig. 4.18Traffic Ratio Vs Subblock (single,multi)
Subblock Tagged-prefetch
0.4 -
0.3 -
0.2
0.1 -
0.0
0
Miss Ratio Vs Subblock (single, multi)
subblock always-prefetch
CC t
SPICE_t
TEX_t
LISP _t
IVEX_t
DECO_t
MULT2_t
MULT8_t
program traces
1
1 0
Subblock (bytes)
_.._._
CI-
-A--A--
Fig. 4.19Miss Ratio Vs Subblock (single,multi)
Subblock Always-prefetch
CC
SPICE
TEX
LISP
IVEX
DECO
MULT2
MULT8Miss Ratio Vs Subblock (single)
write-through
5 4
program traces
0.0
0
I (
10 20
Subblock (bytes)
Fig. 4.20Miss Ratio Vs Subblock (single)
Write-Through
0.0
Traffic Ratio Vs Subblock (single)
write-through
....(3..
CC
SPICE
TEX
LISP
IVEX
DECO
single
program
traces
0 10
Subblock (bytes)
Fig. 4.21Traffic Ratio Vs Subblock (single)
Write-Through
0--
CC t
SPICE_t
TEX_t
LISP_t
IVEX_t
DECO_t0.4
0.3
0.2
0.1
0.0
Miss Ratio Vs Blocksize (single)
write-through
I I
20 40
Block(bytes)
60 80
Fig. 4.22Miss Ratio vs Blocksize (single)
Write-Through
Traffic Ratio Vs Blocksize (single)
write-through
2-'
I I
20 40
Block(bytes)
1
60 80
Fig. 4.23Traffic Ratio Vs Blocksize (single)
Write-Through
55
single
program
traces
0 CC
* SPICE
It TEX
40 LISP
IVEX
D--- DECO
single
program
tracesa-
--4--
-CI--..._
1:I
CCt
SPICE_t
TEX_t
LISP _t
IVEX_t
DECO_tMiss Ratio Vs Blocksize (prefetch,single)
write-through
0.4-4
0 20 40
Block (bytes)
60 80
56
single
program
traces
CC
SPICE
TEX
LISP
IVEX --aDECO
Fig. 4.24Miss Ratio vs Blocksize (prefetch,single)
Write-Through
4
3
Traffic Ratio Vs Blocksize (prefetch,single)
write-through
2
1
0
0 20 40
Block (bytes)
60(
80
single
program
traces0 CC _I
----4,---SPICE_t
TEX_t
LISP t
IVEX_t
DECO_t
Fig. 4.25Traffic Ratio Vs Blocksize (prefetch,single)
Write-Through57
Chapter 5
Summary, Conclusions and Future Work
5.1 Summary and Conclusions
The purpose of this study is to characterise and quantify processor
memory traffic for on-chip caches. The goal is to discuss cache designs that
improve processor performance without degrading system performance
through an increase in the processor memory traffic.
Generally having a bigger cache will improve both the miss and
traffic ratios and therefore performance. But a bigger cache will also cost
more. As usual, there is a trade off between performance and cost. Besides,
large caches because of their physical size and logic complexity also increase
the access time to the cache. This may lower the performance gain from the
improved miss and traffic ratios. A possible solution to this problem is to
build a two-level cache, in which the smaller, faster level is on the order of 4
kbytes and the larger, slower level is on the order of 64-512 kbytes [18,31].
Although the miss ratio from the small cache would be fairly high, the
shorter cycle time and decreased miss penalty would yield an overall
improvement in performance.
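The two-level argument can be checked with the usual average access time calculation. The sketch below is illustrative only: the function names are hypothetical and every cycle count and miss ratio is an assumed value, not a measurement from this study.

```python
def single_level_time(hit_time, miss_ratio, memory_time):
    """Average access time (cycles) with one cache in front of main memory."""
    return hit_time + miss_ratio * memory_time


def two_level_time(l1_hit, l1_miss_ratio, l2_hit, l2_local_miss_ratio, memory_time):
    """Average access time (cycles) with a small L1 backed by a larger L2.

    `l2_local_miss_ratio` is the fraction of L1 misses that also miss in L2.
    """
    l1_miss_penalty = l2_hit + l2_local_miss_ratio * memory_time
    return l1_hit + l1_miss_ratio * l1_miss_penalty


# Assumed numbers: one large, slow cache (2-cycle hit, 2% misses) versus a
# small fast L1 (1-cycle hit, 10% misses) backed by an L2 (4-cycle hit, 20%
# local misses). Main memory takes 30 cycles in both cases.
big_single = single_level_time(hit_time=2, miss_ratio=0.02, memory_time=30)
two_level = two_level_time(l1_hit=1, l1_miss_ratio=0.10,
                           l2_hit=4, l2_local_miss_ratio=0.20, memory_time=30)
```

With these assumed values the single large cache averages about 2.6 cycles per access and the two-level arrangement about 2.0, even though the first-level miss ratio is five times higher. Different assumptions can of course reverse the ordering.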
Small block sizes have a number of advantages. The access time for
smaller blocks from main memory to cache is obviously shorter than that for
a large block. A high-performance machine may use fetch bypass with
bigger blocks to reduce the latency for large blocks. The smaller block is less
likely to contain unneeded information. The data width of main memory
should usually be at least as wide as the block size, since it is desirable to
transmit an entire block in one main memory cycle time. Main memory width
can be expensive, and smaller blocks minimize this problem.
Large block sizes, too, have a number of advantages. If more
information in a block is actually being used, fetching it all at one time (as
with a bigger block) is more efficient. With bigger blocks, the number of
blocks in the cache is smaller, so there are fewer logic gates and fewer
storage bits required to keep and manage address tags and replacement
status. A larger block size permits fewer elements/set in the cache which
minimizes the associative search logic.
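The saving in tag storage can be estimated directly. The following sketch assumes a set-associative cache with power-of-two sizes, 32-bit addresses and a single status bit per block; these parameters and the function names are assumptions chosen for illustration.

```python
def log2_exact(n):
    """Return log2(n) for a positive power of two, else raise ValueError."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def tag_overhead_bits(cache_bytes, block_bytes, ways,
                      address_bits=32, status_bits=1):
    """Bits of tag and status storage needed in addition to the data array."""
    n_blocks = cache_bytes // block_bytes
    n_sets = n_blocks // ways
    # The address splits into tag | set index | byte offset within the block.
    tag_bits = address_bits - log2_exact(n_sets) - log2_exact(block_bytes)
    return n_blocks * (tag_bits + status_bits)


small_blocks = tag_overhead_bits(cache_bytes=8192, block_bytes=16, ways=2)
large_blocks = tag_overhead_bits(cache_bytes=8192, block_bytes=64, ways=2)
```

For this assumed 8 kbyte two-way cache, 16-byte blocks need 10752 bits of tag and status storage, while 64-byte blocks need 2688 bits. The tag width is the same in both cases (20 bits), so the overhead falls in proportion to the number of blocks.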
By using bigger block sizes we can improve the miss ratio, but this
results in an increase in the traffic ratio. It is inevitable that bigger blocks
make processing a miss somewhat slower. To correct this problem we use
smaller units of memory transfer called subblocks. This reduces the traffic
ratio, as discussed in the previous chapter. If we also use fetching
policies like load-forward or miss-prefetching with subblocks, we can
improve the miss and traffic ratios further. Subblock-missprefetch, which
prefetches a subblock after every miss, is the best design for subblocks as it
reduces the miss and traffic ratios simultaneously.
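The interaction between block size, subblocks and the two ratios can be reproduced with a very small model. The sketch below is not the simulator used in this study. It assumes a direct-mapped, read-only cache with one valid bit per subblock, 4-byte references, and a traffic ratio defined as bytes fetched divided by bytes referenced. The `prefetch_next` option is one possible reading of a miss-prefetch policy (fetch the following subblock of the same block on every miss) and is an assumption, as are the two synthetic traces.

```python
def simulate(addresses, cache_bytes, block_bytes, subblock_bytes,
             prefetch_next=False, word_bytes=4):
    """Return (miss ratio, traffic ratio) for a direct-mapped subblock cache."""
    n_blocks = cache_bytes // block_bytes
    subs = block_bytes // subblock_bytes
    tags = [None] * n_blocks
    valid = [[False] * subs for _ in range(n_blocks)]
    misses = 0
    fetched = 0
    for addr in addresses:
        block_no = addr // block_bytes
        index = block_no % n_blocks
        sub = (addr % block_bytes) // subblock_bytes
        if tags[index] != block_no:
            # New block in this frame: replace the tag, invalidate all subblocks.
            tags[index] = block_no
            valid[index] = [False] * subs
        if not valid[index][sub]:
            misses += 1
            wanted = [sub, sub + 1] if prefetch_next else [sub]
            for s in wanted:
                if s < subs and not valid[index][s]:
                    valid[index][s] = True
                    fetched += subblock_bytes
    n = len(addresses)
    return misses / n, fetched / (word_bytes * n)


# Synthetic traces (assumptions): a sequential sweep over 4 kbytes, and a
# stride that touches one word in each of 64 conflicting 32-byte blocks.
sequential = list(range(0, 4096, 4))
strided = [i * 32 for i in range(64)] * 4

if __name__ == "__main__":
    for name, trace in (("sequential", sequential), ("strided", strided)):
        print(name, "full block  ", simulate(trace, 1024, 32, 32))
        print(name, "8B subblock ", simulate(trace, 1024, 32, 8))
        print(name, "8B + prefetch", simulate(trace, 1024, 32, 8, prefetch_next=True))
```

On the strided trace every reference misses regardless of the transfer unit, so moving from whole 32-byte blocks to 8-byte subblocks cuts the traffic ratio from 8.0 to 2.0 without changing the miss ratio. On the sequential trace subblocks alone raise the miss ratio from 0.125 to 0.5, and prefetching the next subblock brings it back to 0.25 at the same traffic ratio of 1.0.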
We also investigated the different write policies. As reported in other
studies [17,18], write through has a higher miss and traffic ratio than write
back. Write back almost always results in less main memory traffic, since write
through requires a main memory access on every store, whereas write back
only requires a store to main memory if the swapped out line has been
modified. Write back, however, generally results in the entire line being
written back rather than just one or two words, as would occur for each write
memory reference under write through. If write through is used, main memory
always contains an up-to-date copy of all information in the system. When
there are multiple processors in
the system, main memory can serve as a common and consistent storage
place. Otherwise, either the cache must be shared or a complicated directory
system must be employed to maintain consistency. Write back has more
complicated cache logic, as it requires a dirty bit to determine when to copy a
block back. In addition, arrangements have to be made to perform the write
back before the fetch (on a miss) can be completed. Therefore, besides
performance, we have to consider the complexity involved in using either write
back or write through for the cache design.
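The difference in write traffic between the two policies can be seen in a minimal model. The sketch below counts only bytes written to main memory by a direct-mapped cache. It ignores fetch traffic, assumes 4-byte stores, and flushes dirty blocks at the end of the trace. The function name and both store traces are hypothetical, and the second trace is deliberately adversarial.

```python
def write_traffic(stores, cache_bytes, block_bytes, policy, word_bytes=4):
    """Bytes written to main memory for a list of store addresses."""
    if policy == "write-through":
        # Every store goes to main memory as a single word.
        return word_bytes * len(stores)
    if policy != "write-back":
        raise ValueError(f"unknown policy: {policy}")
    n_blocks = cache_bytes // block_bytes
    tags = [None] * n_blocks
    dirty = [False] * n_blocks
    written = 0
    for addr in stores:
        block_no = addr // block_bytes
        index = block_no % n_blocks
        if tags[index] != block_no:
            if dirty[index]:
                written += block_bytes   # copy the whole modified block back
            tags[index] = block_no
        dirty[index] = True
    written += block_bytes * sum(dirty)  # final flush of remaining dirty blocks
    return written


# Assumed traces: repeated stores to one 16-byte block, and single-word
# stores to 256 different blocks that keep evicting each other.
hot = [0, 4, 8, 12] * 100
sparse = [i * 64 for i in range(256)]

hot_wt = write_traffic(hot, 1024, 16, "write-through")
hot_wb = write_traffic(hot, 1024, 16, "write-back")
sparse_wt = write_traffic(sparse, 1024, 16, "write-through")
sparse_wb = write_traffic(sparse, 1024, 16, "write-back")
```

With good store locality (the `hot` trace) write back writes 16 bytes where write through writes 1600. When each block receives only one store before being replaced (the `sparse` trace), write back copies a full 16-byte block for every 4-byte store and writes 4096 bytes against 1024, which is the whole-line cost described above.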
Multiprogramming traces have been used in all the simulations
and they do not show any significant difference from single program traces.
But when we compare the two multiprogramming traces, Mul2 and Mul8,
we find that the simulations for Mul2 have lower traffic and miss ratios.
Mul2 is the trace that contains two processes while Mul8 contains
eight. Therefore multiprogramming is an aspect of cache design that we
cannot ignore. Some solutions proposed by Smith [18] to reduce the miss
ratio due to task switching are: (1) having a bigger cache so that several
programs can remain in it simultaneously, (2) lengthening the task-switch
interval and (3) modifying the scheduling algorithm in order to give
preference to a task likely to have information in the cache.
A very important observation and conclusion is that different
programs have different trace characteristics. This is shown in the variations
in miss and traffic ratios for each design parameter. An interesting example is
the LISP trace discussed in the previous chapter. Therefore it is important to
know which programs will be run on the machine so that a
cache suitable for them can be designed. Usually, this is not possible.
Instead, a set of programs called benchmark
programs is selected and used as a representative workload for the
analysis of cache designs.
5.2 Future Work
It would be both interesting and important to continue this study by
investigating the performance metrics for level two caches. This is especially
important as we realise that we cannot continue to have bigger level one
caches because of the slower response from bigger caches as mentioned.
In this study we have not investigated the two components of the traffic
ratio: the instruction and data traffic ratios. This would be another important
area, as it would show how the different design parameters affect data and
instruction traffic separately.
This study can also be extended by changing the base model so that
we have a unified cache with a different size or placement algorithm and
observe how this base model affects the performance metrics. As the demand
for performance increases, the area of multiprocessor memory traffic on the
system bus will become increasingly important. Therefore, using the
knowledge gained in this study, we can extend it to the study of cache designs
for multiprocessors.
References
[1] Agarwal, A., Sites, R., Horowitz, M., "ATUM: A New
Technique for Capturing Address Traces Using Microcode," Proc.
13th Symp. on Computer Architecture, IEEE/ACM, Tokyo, Japan,
June 1986.
[2] Agarwal, A., "Analysis of Cache Performance for Operating Systems
and Multiprogramming," Ph.D. Thesis, Stanford University, Computer
Systems Laboratory, May 1987.
[3] Cadell A., William K., Furrokh C., Fays B., "Cache Memory
Performance in a UNIX Environment," Computer Architecture News,
1986.
[4] Briggs F., Hwang K., "Computer Architecture and Parallel
Processing," McGraw-Hill, 1984.
[5] Fielland G., Rodger D., "32-Bit Computer System Shares Load
Equally Among Up to 12 Processors," Electronic Design, Sep 6
1984.
[6] Kaplan K., Winder R., "Cache-based Computer Systems,"
Computer, Mar 1973.
[7] Hamacher C., Vranesic Z., Zaky S., "Computer Organization,"
McGraw-Hill, 1984.
[8] Horowitz, M., Chow, P., "The MIPS-X Microprocessor," Proceedings
of the WESCON-85, San Francisco, Calif., November 1985.
[9] Hill, M.D., Smith A., "Experimental Evaluation of On-Chip
Microprocessor Cache Memories," Proc. 11th Int. Symp. Comp.
Arch., Jun 1984.
[10] Censier L., Feautrier P., "A New Solution to Coherence Problems in
Multicache Systems," IEEE Transactions on Computers, Dec 1978.
[11] Milenkovic, M., "Microprocessor Memory Management Units," IEEE
Micro, April 1990, pp. 70-85.
[12] Przybylski, S., Horowitz, M., Hennessy J.,
"Performance Tradeoffs in Cache Design," Proceedings of the 15th
Annual International Symposium on Computer Architecture, June
1988, pp. 290-298.
[13] Denning P., "On Modeling Program Behaviour," Proc. Spring Joint
Computer Conference, AFIPS Press, 1972.
[14] Smith, A.J., "Cache Memory Design: An Evolving Art," IEEE
Spectrum, Dec 1987, pp. 40-44.
[15] Smith, A.J., "Line (Block) Size Choice for CPU Cache Memories,"
IEEE Transactions on Computers C-36, 9 (Sep 1987), 1063-1075.
[16] Smith, A.J., "A Comparative Study of Set Associative Memory
Mapping Algorithms and Their Use for Cache and Main Memory,"
IEEE Transactions on Software Engineering SE-4, 2 (March 1978),
121-130.
[17] Smith, A.J., "Cache Evaluation and the Impact of Workload Choice,"
Proceedings of the 12th International Symposium on Computer
Architecture, Boston, June 17-19, 1985, pp. 64-73.
[18] Smith, A.J., "Cache Memories," ACM Computing Surveys, Vol. 14,
No. 3, September 1982, pp. 473-530.
[19] Smith, J.E., Pleszkun A.R., Katz R.H., Goodman, J.R., "PIPE: A
High Performance VLSI Architecture," Proceedings of the IEEE
International Workshop on Computer Systems Organizations, New
Orleans, La., March 1983.
[20] Goodman J.R., "Using Cache Memory to Reduce Processor Memory
Traffic," Proc. Tenth International Symp. on Computer Architecture,
pp. 124-131, Stockholm, Sweden, June 1983.
[21] Smith A.J., "Sequential Program Prefetching in Memory
Hierarchies," Computer, vol. 11, no. 12, pp. 7-21, December 1978.
[22] Hill M.D., "Misc Reference Manual Pages for DineroIII," pp. 3-4.
[23] Pohm A., Agrawal O., Monroe R., "The Cost and Performance
Tradeoffs of Buffered Memories," Proceedings of IEEE, Aug 1975.
[24] Rosenberg R., Keller E., "Multiprocessing 32-Bit Buses Are Starting
To Blossom," Electronics, March 22 1984.
[25] Smith A., "Characterizing the Storage Process and Its Effect on the
Update of Main Memory by Write Through," Journal of the ACM,
Jan 1979.
[26] "MOS Memory Data Book," Texas Instruments, 1984.
[27] Yen W., Fu K., "Analysis of Multiprocessor Cache Organizations
with Alternative Main Memory Update Policies," Proc. 8th Int.
Symp. Comp. Arch., May 1981.
[28] Edenfield R., Laakso P., Ledbetter W. Jr., "Advances in the 68040
Memory Subsystem," Electronic Engineering, Feb 1990.
[29] Kane, G., "MIPS RISC Architecture," Prentice-Hall, Englewood Cliffs,
N.J., 1988.
[30] Hennessy J.L., Patterson D., "Computer Architecture: A Quantitative
Approach," Morgan Kaufmann Publishers Inc., 1990.
[31] Przybylski S. A., "Cache and Memory Hierarchy Design," Morgan
Kaufmann Publishers Inc., 1990.
[32] Smith J. E., Goodman J. R., "Instruction Cache Replacement Policies
and Organizations," Proceedings of the 10th Annual Symposium on
Computer Architecture, May 1983.
[33] VAX-11 Architecture Reference Manual. Digital Equipment
Corporation, Bedford, Ma., 1982.
[34] Clark D. W., "Cache Performance in the VAX-11/780," ACM
Transactions on Computer Systems, Feb 1983.
[35] Chu C.Y., "MILS: MIPS Instruction Level Simulator," September
1985. Computer Systems Laboratory, Stanford University.
[36] Henry R. R., "Tracer-Address and Instruction Tracing for the VAX
Architecture," University of California, Berkeley, November 1984.
[37] Spirn J. R., "Program Behavior: Models and Measurements,"
Operating and Programming Series, Elsevier, N.Y., 1977.
[38] Strecker W. D., "Transient Behaviour of Cache Memories," ACM
Trans. Comput. Syst., vol. 1, pp. 281-293, Nov. 1983.
[39] Laha S., Patel J. H., Iyer R. K., "Accurate Low-Cost Methods for
Performance Evaluation of Cache Memory Systems," IEEE Trans.
Comput., vol. 37, pp. 1325-1336, Nov. 1988.
[40] Aho A. V., Denning P. J., Yamamura M., Chow Y., Mak P., "32-bit
Processor Chip Integrates Major System Functions," Electronics, pp.
113-119, July 1983.
[41] MacGregor D., Mothersole D., Moyer B., "The Motorola MC68020,"
IEEE Micro, pp. 101-118, August 1984.
Appendix
Simulation Results Using Multiprogram Traces
[Figures A.1 to A.12 plot miss ratio or traffic ratio for the two multiprogram traces, mult2 and mult8.]
Fig. A.1 Miss Ratio Vs Blocksize (prefetch, multi). x-axis: Block (bytes)
Fig. A.2 Traffic Ratio Vs Blocksize (prefetch, multi). x-axis: Block (bytes)
Fig. A.3 Miss Ratio Vs Subblock (multi), Subblock Load-forward-prefetch. x-axis: Subblock (bytes)
Fig. A.4 Traffic Ratio Vs Subblock (multi), Subblock Load-forward-prefetch. x-axis: Subblock (bytes)
Fig. A.5 Miss Ratio Vs Subblock (multi), Subblock Missprefetch. x-axis: Subblock (bytes)
Fig. A.6 Traffic Ratio Vs Subblock (multi), Subblock Missprefetch. x-axis: Subblock (bytes)
Fig. A.7 Miss Ratio Vs Blocksize (multi). x-axis: Block (bytes)
Fig. A.8 Traffic Ratio Vs Blocksize (multi). x-axis: Block (bytes)
Fig. A.9 Miss Ratio Vs Cachesize (multi). x-axis: Cache (kbytes)
Fig. A.10 Traffic Ratio Vs Cachesize (multi). x-axis: Cache (kbytes)
Fig. A.11 Miss Ratio Vs Associativity (multi). x-axis: Associativity
Fig. A.12 Traffic Ratio Vs Associativity (multi). x-axis: Associativity