Simulation Study of Snoopy Cache Coherence Protocols by Chung, In-Suk
A SIMULATION STUDY OF SNOOPY CACHE
COHERENCE PROTOCOLS
By
IN-SUK CHUNG
Bachelor of Science
Oklahoma State University
Stillwater, Oklahoma
1993
Submitted to the Faculty of the
Graduate College of the
Oklahoma State University
in partial fulfillment of
the requirements for
the Degree of
MASTER OF SCIENCE
May, 1996
OKLAHOMA STATE UNIVERSITY
A SIMULATION STUDY OF SNOOPY CACHE
COHERENCE PROTOCOLS
Thesis Approved:
Thesis Adviser
Dean of the Graduate College
ACKNOWLEDGEMENTS
I sincerely thank my graduate adviser Dr. K. M. George for the guidance, help
and time he has given me toward the completion of my thesis work. His tenacity and
hard work inspired me to venture into the advanced aspects of this work. I would like to
express my sincere thanks to Dr. G. E. Hedrick for his direction and leadership. Without
the encouragement and help he has given me, the completion of this work would have
been impossible. I also sincerely thank Dr. J. P. Chandler for serving on my committee.
His suggestions have helped me to improve the quality of this work.
My respectful thanks go to my parents Mr. Lee-June Chung and Mrs. Boon-Ok
Kim for all the love and support they have given me throughout my life. I also thank
all other members of my family for the love, encouragement and confidence they have
given me.
I would also like to express my gratitude to all those people who have contributed
by giving many valuable suggestions.
TABLE OF CONTENTS
Chapter
1. INTRODUCTION
2. LITERATURE REVIEW
   2.1. Directory Cache Coherence Protocols
      2.1.1. Limited Directory Protocol
      2.1.2. Chained Directory Protocol
   2.2. Snoopy Cache Coherence Protocols
      2.2.1. Write-Invalidate Protocols
      2.2.2. Word Invalidate Protocol (WIP)
      2.2.3. Read Broadcast
3. A HYBRID WORD INVALIDATE/READ BROADCAST PROTOCOL (HWRP)
   3.1. The Hybrid Write Invalidate/Read Broadcast Protocol Detailed Description
4. SIMULATION MODEL
   4.1. Multiprocessor Model
      4.1.1. The Cache Controller Process
      4.1.2. The Snoop Controller Process
      4.1.3. The Bus Process
   4.2. Workload Model
5. DISCUSSION OF SIMULATION RESULTS
   5.1. Impact on Miss Ratio with Varying Parameters
   5.2. Impact on Miss Ratios with A Larger Cache Size
   5.3. Bus Utilization
6. CONCLUSION
REFERENCES
LIST OF TABLES
Table
1. Snoopy Cache Coherence Protocols
2. Summary of Cache Block States
3. Timing for Fundamental Bus Operations
4. Summary of Bus Cycle Costs
5. Summary of Parameters and Ranges
LIST OF FIGURES
Figure
1. Cache configuration after reading two words by P0 and P1
2. Cache configuration after writing a word in X' by P0 (write-through cache)
3. Cache configuration after writing a word in X' by P0 (write-back cache)
4. Cache configuration after cache C1 and cache C2 request a data block of a location X in memory
5. Cache configuration after cache C3 requests a data block of a location X in memory
6. Cache configuration after cache C1 requests a data block of a location X in memory
7. Cache configuration after cache C2 requests a data block of a location X in memory
8. Cache configuration after reading four words from memory
9. Cache configuration after a Write (x -> X') on one word in a block of P2's cache
10. Cache configuration after a Read on four words in a block of private caches of P1, P2 and P3
11. Cache configuration after a Write on word (x -> X) in a block of P1 (write-back cache)
12. Cache configuration for WIP
13. Cache configuration for HWRP
14. Cache configuration after a Write on word (y -> Y) in a block of P1's cache by P1 (write-back cache)
15. Cache configuration for WIP
16. Cache configuration for HWRP
17. Cache configuration for WIP after a read request of P (z -> Z) in a block of P1's cache by P1
18. Cache configuration for WIP after a read request of P3
19. Cache configuration for HWRP after a read request of P3
20. Cache configuration after the whole valid block
21. Cache configuration after the whole valid block is reloaded
22. Cache configuration after the write hit by P1
23. Cache configuration after the write hit by P1 on the block in the IW1 state
24. Cache configuration after the write hit by P3
25. Cache configuration after the write hit by P3 on the block in the IW2 state
26. A diagram of multiprocessor model
27. Ratio of Invalidation Misses for Both Protocols
28. Ratio of Total Misses for Both Protocols
29. Ratio of Invalidation Misses for Both Protocols
30. Ratio of Shared Misses for Both Protocols
31. Ratio of Invalidation Misses for Both Protocols
32. Ratio of Shared Misses for Both Protocols
33. Ratio of Invalidation Miss for HWRP
34. Total Miss Ratio for HWRP
35. Ratio of Invalidation Misses for WIP
36. Total Miss Ratio for WIP
37. Total Bus Cycles
38. Total Bus Cycles
1. INTRODUCTION
Shared-memory multiprocessors have provided a cost-effective solution to the
problem of increased computing power and speed because they use relatively low-cost
microprocessors interconnected with shared memory modules. But shared-memory
multiprocessors face three problems: memory contention, communication
contention, and latency. These problems all contribute to increased memory access time
and hence slow down the processors' execution speeds [19].
Cache memories have served as a significant way to reduce the average memory
access time. The main memory traffic from each processor is determined by the success
of the cache memory in satisfying memory requests without main memory operations
[17]. In shared-memory multiprocessors, all processors with private caches are limited
in their performance by cache access time. Accordingly, cache memory performance is
one of the most significant factors in achieving high machine performance.
Private caches in shared-memory multiprocessors are essential to reduce the
average time to access main memory and to decrease bus congestion [12]. But
shared-memory multiprocessor systems introduce a cache coherence problem because multiple
caches could have different copies of the same memory block if one of the processors has
modified its copy. A system of caches is said to be coherent if all copies of a main
memory location in multiple caches remain consistent when the contents of that memory
location are modified [3].
As an example, consider the cache coherence problem that can be caused by the
sharing of writable data. Three figures illustrate the cache coherence
problem in shared-memory multiprocessors. We assume that X' and Y' refer to the
cached copies of X and Y in a shared memory. If P0 and P1 each read the two words X and Y
from the shared memory, then both caches hold
consistent copies of X and Y. Figure 1 shows caches and shared memory in a coherent
state.
[Figure: shared memory holds X = 100 and Y = 200; the caches of processors P0 and P1,
attached to the bus, each hold X' = 100 and Y' = 200.]
Figure 1. Cache configuration after reading two words by P0 and P1
Depending on the memory update policy used in the cache, the cache level may
also be inconsistent with respect to main memory. A write-through policy maintains
consistency between main memory and cache. If P0 writes 300 into X' in P0's cache,
then the copies of X' in the two caches become inconsistent, whereas the copies in
P0's cache and memory are consistent. A read of a word in X' by P1 will not return the
latest value. Figure 2 shows an inconsistent state between P0's cache and P1's cache.
[Figure: shared memory holds X = 300 and Y = 200; P0's cache holds X' = 300 and
Y' = 200; P1's cache holds X' = 100 and Y' = 200.]
Figure 2. Cache configuration after writing a word in X' by P0 (write-through cache).
However, a write-back policy does not maintain such consistency between main
memory and cache at the time of the write. The memory is updated eventually, when the
modified data in the cache are replaced or invalidated. If P0 writes 300 into X' in P0's
cache, then the copies in the two caches are inconsistent. Also, the copies in P0's cache
and memory are inconsistent. Figure 3 depicts the inconsistent state of
the caches and memory for the write-back policy.
[Figure: shared memory holds X = 100 and Y = 200; P0's cache holds X' = 300 and
Y' = 200; P1's cache holds X' = 100 and Y' = 200.]
Figure 3. Cache configuration after writing a word in X' by P0 (write-back cache).
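The three configurations in Figures 1 to 3 can be reproduced with a few lines of Python. This is only an illustrative sketch, not the simulator used in this thesis: the class name `System`, the function `run`, and the dictionary-based caches are assumptions made for the example, and no coherence protocol is applied.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    """Two private caches over one shared memory, with no coherence protocol."""
    policy: str  # "write-through" or "write-back" (hypothetical labels)
    memory: dict = field(default_factory=lambda: {"X": 100, "Y": 200})
    caches: list = field(default_factory=lambda: [{}, {}])

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:            # read miss: load from memory
            cache[addr] = self.memory[addr]
        return cache[addr]

    def write(self, cpu, addr, value):
        self.caches[cpu][addr] = value   # update the local copy only
        if self.policy == "write-through":
            self.memory[addr] = value    # memory is updated at write time


def run(policy):
    s = System(policy)
    for cpu in (0, 1):                   # Figure 1: both processors read X and Y
        s.read(cpu, "X")
        s.read(cpu, "Y")
    s.write(0, "X", 300)                 # P0 writes 300 into its copy of X
    # (P0's copy, P1's copy, memory copy) of X
    return s.caches[0]["X"], s.caches[1]["X"], s.memory["X"]


print(run("write-through"))  # (300, 100, 300), as in Figure 2
print(run("write-back"))     # (300, 100, 100), as in Figure 3
```

In both runs P1 still reads the stale value 100 from its own cache, which is the coherence problem the protocols in this thesis address.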
The cache coherence problem has attracted considerable attention in recent
years. Much research in both universities and industry has
been devoted to this problem, resulting in a number of proposed solutions.
Write-invalidate protocols, one of the many solutions to the cache coherence
problem, allow multiple readers of a shared block, but only one writer at a time [21].
Write-invalidate protocols maintain coherency by requiring a writing processor to
invalidate all other cached copies of the same data before updating its own copy. It
can then perform the current write, and any subsequent writes, without further invalidation
requests. Write-invalidate protocols have two main sources of bus-related coherency
overhead. The first is the invalidation request for shared data in each cache. The second
is the cache misses that occur when processors need to reference invalidated data. These
misses, called invalidation misses, result from an invalidation requested by another
processor prior to the cache access. Invalidation requests and invalidation misses are
recognized as the main obstacles to achieving high performance with write-invalidate
protocols [21].
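The two sources of overhead can be made concrete with a small counting model. The sketch below is an assumption-laden illustration rather than any specific published protocol: it works at block granularity, keeps memory up to date on every write so that a miss can always be served from memory, and issues an invalidation request on every write that does not already hold the block exclusively. The class name `WriteInvalidateBus` and its fields are hypothetical.

```python
class WriteInvalidateBus:
    """Minimal write-invalidate model that counts coherency overhead."""

    def __init__(self, n_cpus):
        self.memory = {}
        # One dict per cache: addr -> {"valid", "exclusive", "value"}
        self.caches = [{} for _ in range(n_cpus)]
        self.invalidation_requests = 0
        self.invalidation_misses = 0

    def _hit(self, cpu, addr):
        line = self.caches[cpu].get(addr)
        if line is not None and not line["valid"]:
            # The block is present but was invalidated by another writer.
            self.invalidation_misses += 1
        return line is not None and line["valid"]

    def read(self, cpu, addr):
        if not self._hit(cpu, addr):
            for other in self.caches:    # any exclusive holder becomes a sharer
                if addr in other:
                    other[addr]["exclusive"] = False
            self.caches[cpu][addr] = {"valid": True, "exclusive": False,
                                      "value": self.memory.get(addr, 0)}
        return self.caches[cpu][addr]["value"]

    def write(self, cpu, addr, value):
        hit = self._hit(cpu, addr)
        if not (hit and self.caches[cpu][addr]["exclusive"]):
            self.invalidation_requests += 1      # broadcast on the bus
            for i, other in enumerate(self.caches):
                if i != cpu and addr in other:
                    other[addr]["valid"] = False
        self.caches[cpu][addr] = {"valid": True, "exclusive": True,
                                  "value": value}
        self.memory[addr] = value                # simplification: write-through


bus = WriteInvalidateBus(2)
bus.read(0, "X")
bus.read(1, "X")
bus.write(0, "X", 1)   # first write: one invalidation request
bus.write(0, "X", 2)   # second write: no bus traffic
value = bus.read(1, "X")   # P1 re-reads: one invalidation miss
print(value, bus.invalidation_requests, bus.invalidation_misses)  # 2 1 1
```

The second write by the same processor proceeds without a request, while the later read by the other processor is charged as an invalidation miss.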
In this thesis, we propose a Hybrid Word Invalidate/Read Broadcast protocol
(HWRP) to reduce the invalidation misses, which are an important performance issue for
write-invalidate protocols. The hybrid word invalidate/read broadcast protocol is an
extension of the Word Invalidate Protocol presented by Tomašević and Milutinović [21],
with one major difference: it uses a read broadcast mechanism [16] which can
simultaneously update invalid copies while a data item is transferred on the bus as a
response to a read miss request. We also study the effectiveness of the new
scheme using simulation. The rest of the thesis is organized as follows. A
literature review of cache coherency protocols is presented in Section 2. In Section
3, we present the hybrid word invalidate/read broadcast protocol (HWRP). Section 4
presents the simulation model, and the results are analyzed in Section 5. We conclude
in Section 6.
2. LITERATURE REVIEW
Basically, all solutions to the cache coherence problem can be classified in two
large groups: software-based and hardware-based [19].
In the software-based approach, most solutions depend on the actions of
the programmer, compiler, or operating system in handling the cache coherence
problem. Several software-based protocols have been proposed where memory blocks
are tagged as cacheable or noncacheable depending on the access pattern to shared data.
Read-only or non-shared data can always be cached, but shared read-writable data can
never be cached, to prevent the existence of inconsistent cached data. An advantage of
software-based schemes is that they are generally less expensive than their
hardware counterparts, although they may require considerable hardware support [22]. A
disadvantage is that they all suffer from a high cache miss ratio for shared read-writable
data structures simultaneously accessed by several processors [18]. Software-based
solutions are not considered further in this thesis.
In the hardware-based approach, cache coherence protocols can be divided into
two large groups: directory-based protocols and snoopy-based protocols. Directory
protocols are appropriate for multiprocessors with general interconnection networks.
The directory protocols are characterized by the existence of some kind of global table or
directory that stores the information concerning the current location and state of shared
blocks. Unlike the directory protocols, snoopy protocols are suitable for multiprocessors
with a shared bus. Snoopy protocols differ substantially from directory protocols for
general networks because, first, they depend on each snoopy cache controller observing
the bus transactions of all other processors in the system and then taking appropriate actions
to maintain consistency, and second, the state of each block in the system is encoded in a
distributed way among all cache controllers [3].
2.1 Directory Cache Coherence Protocols
A directory protocol is characterized by the existence of some kind of global table
or directory that stores the information concerning the current location and state of
shared blocks [22]. Directories can be organized in different ways, and it is the
responsibility of the centralized controller to take appropriate actions to preserve
coherence by sending directed individual messages to known locations, avoiding
broadcasts [22]. In directory protocols, coherence is predominantly delegated to a centralized
controller that implements the algorithm which moves data into and out of the cache
memory and the cache directory. The centralized controller checks the directory and
issues the necessary commands for data transfer between memory and caches, or between
caches themselves. It is also responsible for keeping status information up-to-date, so
every local action that can affect the global state of the block must be reported to the
centralized controller. Besides the global directory maintained by the centralized controller,
the private caches store some local state information about cached blocks.
Directory methods generally suffer from significant memory overhead for tag
storage, so newer solutions try to avoid this problem by introducing a limited
number of pointers in the directory or by employing distributed directories in the form of
linked lists [22]. Since the directory is a critical system resource, frequent
directory accesses can seriously degrade system performance.
The directory methods can be divided into three groups: full-map directory,
limited directory, and chained directory schemes. The limited directory and chained
directory schemes are reviewed in this thesis.
2.1.1 Limited Directory Protocol
The limited directory protocol is designed to solve the directory size problem,
that is, the significant memory overhead for tag storage.
A directory protocol can be classified as Dir_i X using the notation from Agarwal
[1]. The symbol i stands for the number of pointers, and X is either NB for a scheme
with no broadcast or B for one with broadcast. A full-map scheme without broadcast is
represented as Dir_N NB. A limited directory protocol that uses i < N pointers is denoted
by Dir_i NB. The limited directory protocol is similar to the full-map directory, except in
the case when more than i caches request read copies of a particular block of data.
Figure 4 shows the cache configuration after cache C1 and cache
C2 request a copy of location X in a memory system with a Dir_2 NB protocol. In
this case, we can view the two-pointer directory as a two-way set-associative cache of
pointers to shared copies. If cache C3 requests a copy of location X, the memory
module must invalidate the copy in either cache C1 or cache C2. This process of pointer
replacement is sometimes called eviction. Since the directory acts as a set-associative
cache, it must have a pointer replacement policy that requires no extra memory overhead
[5]. In Figure 5, the pointer to cache C3 replaces the pointer to cache C2.
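The eviction step can be sketched as follows. The class `LimitedDirectoryEntry` and the helper `pointer_bits` are hypothetical names introduced for this example. The victim choice (always the last pointer slot) is an assumption picked only because it reproduces Figure 5; a real design could use any policy that needs no extra state. The storage comparison assumes each pointer is a binary cache identifier, while a full-map entry keeps one presence bit per cache.

```python
class LimitedDirectoryEntry:
    """Directory entry for one block under a Dir_i NB scheme (sketch)."""

    def __init__(self, i):
        self.i = i            # number of pointers available
        self.pointers = []    # identifiers of caches holding a copy

    def add_sharer(self, cache_id):
        """Record a read copy; return the evicted cache id, or None."""
        if cache_id in self.pointers:
            return None
        if len(self.pointers) < self.i:
            self.pointers.append(cache_id)
            return None
        # All pointers in use: invalidate one copy and reuse its pointer.
        # Assumed policy: replace the last slot (matches Figure 5).
        evicted = self.pointers[-1]
        self.pointers[-1] = cache_id
        return evicted


def pointer_bits(n_caches, i):
    """Bits per entry for i pointers, each wide enough to name any cache."""
    return i * (n_caches - 1).bit_length()


entry = LimitedDirectoryEntry(2)
first = entry.add_sharer("C1")
second = entry.add_sharer("C2")
third = entry.add_sharer("C3")       # exceeds i = 2, so one copy is evicted
print(third, entry.pointers)         # C2 ['C1', 'C3']
print(pointer_bits(64, 2), "bits versus 64 presence bits for a full map")
```

With 64 caches and two pointers, the entry needs 12 pointer bits instead of 64 presence bits, which is the storage saving that motivates the scheme.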
The limited directory protocol is very storage efficient and expandable for a growing
number of processors without any further modification. State information is distributed
over memory or cache modules, which reduces contention. Furthermore, the presence
flag vector stores the residency of copies, eliminating the need for the search associated
with the full-map directory scheme.
[Figures 4 and 5: directory entry in shared memory M with valid bits (V) and pointers
to the caches holding data block X.]
M: shared memory
C: cache memory
X: data block
P: pointer
V: valid bit
D: dirty bit
CT: chain terminator
N: number of processors
Figure 4. Cache configuration after cache C1 and cache C2 request a data block of a
location X in memory.
Figure 5. Cache configuration after cache C3 requests a data block of a
location X in memory.
2.1.2 Chained Directory Protocol
Another way to ensure scalability of directory schemes with respect to tag storage
efficiency is the introduction of the chained directory scheme. Importantly, this
approach does not limit the number of cached copies. Entries in such a directory are
organized in the form of linked lists, where all caches sharing the same block are chained
through pointers into one list. Unlike the limited directory approach, a chained directory
scheme is spread across the individual caches. The entry in main memory is used only
to point to the head of the list and to keep the block status.
[Figures 6 and 7: directory entry in shared memory M pointing to the head of a chain of
caches holding data block X; each cache entry has a valid bit (V) and a pointer (P).]
M: shared memory
C: cache memory
X: data block
P: pointer
V: valid bit
D: dirty bit
CT: chain terminator
N: number of processors
Figure 6. Cache configuration after cache C1 requests a data block of a
location X in memory.
Figure 7. Cache configuration after cache C2 requests a data block of a
location X in memory.
Requests for the block are issued to the memory, and subsequent commands from the
memory controller are usually forwarded through the list using the pointers. The
chained directory can be organized in the form of either singly-linked lists or
doubly-linked lists.
Suppose there are no shared copies of location X. If cache C1 reads location X,
the memory sends a copy to cache C1, along with a chain termination (CT) pointer, as shown in
Figure 6. The memory also keeps a pointer to cache C1. Subsequently, when cache C2
reads location X, the memory sends a copy to cache C2, along with the pointer to cache
C1. The memory then keeps a pointer to cache C2, as shown in Figure 7. By repeating this step, all
of the caches can cache a copy of location X.
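The read sequence just described amounts to inserting at the head of a singly-linked list. The sketch below follows that description for a single block; the class name `ChainedDirectory`, the use of `None` as the chain terminator, and the string cache identifiers are assumptions made for illustration.

```python
class ChainedDirectory:
    """Singly-linked chained directory for one block X (sketch)."""

    CT = None  # chain terminator (assumed representation)

    def __init__(self):
        self.head = self.CT   # memory keeps only a pointer to the list head
        self.next = {}        # pointer stored in each cache: cache -> next cache

    def read(self, cache_id):
        """A cache reads X: it receives the old head pointer and becomes the head."""
        if cache_id in self.next:
            return            # already holds a copy
        self.next[cache_id] = self.head
        self.head = cache_id

    def sharers(self):
        """Walk the chain from the head, as forwarded commands would."""
        chain, node = [], self.head
        while node is not self.CT:
            chain.append(node)
            node = self.next[node]
        return chain


d = ChainedDirectory()
d.read("C1")                  # Figure 6: memory -> C1 -> CT
d.read("C2")                  # Figure 7: memory -> C2 -> C1 -> CT
print(d.head, d.sharers())    # C2 ['C2', 'C1']
```

Memory stores one pointer regardless of how many caches share the block, at the cost of walking the chain when a command must reach every copy.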
Although the chained protocols are more complex than the limited directory
protocols, they are still scalable in terms of the amount of memory used for the
directories. The main advantage of chained directory schemes is their scalability, while
performance is almost as good as in full-map schemes [5].
2.2 Snoopy Cache Coherence Protocols
In snoopy cache coherence protocols [8,10,14,15,16,20,21], cache
coherence is maintained by the local cache controllers, each of which keeps local state
information and watches all coherency transactions on the bus. All the transactions for
a currently shared block must be broadcast to all other caches to maintain cache
coherence. Local cache controllers are able to snoop on the bus and to recognize the
actions and conditions for a coherence violation [22]. Actions are taken to preserve
cache coherence according to the protocol used. The snoopy cache coherence protocols
are divided into two groups according to their write policy: write-invalidate protocols
and write-update protocols. In write-invalidate protocols, a processor invalidates all
other cached copies of shared data and can then update its own without further bus
operations [6]. Unlike write-invalidate protocols, write-update protocols follow a
distributed write approach that allows the existence of multiple copies with write
permission. The word to be written to a shared block is broadcast to all caches, and
caches containing that block can update it. Write-update protocols usually employ a
special bus line for dynamic detection of the sharing status of a cache block. While
invalidation misses are effectively eliminated by write-update protocols, their major
disadvantage is the extraneous network traffic caused by the updates that now have to be
propagated to all caches having copies of a block [9]. Table 1 summarizes
the actions that various snoopy cache coherence protocols take to maintain cache coherence.
Table 1. Snoopy Cache Coherence Protocols
WI: Write-Invalidate Protocol
WU: Write-Update Protocol
For each protocol, the entries give the actions on a Read Miss, a Write Hit, and a Write Miss.
Write-Once (WI)
  Read Miss:
    If another cache is the owner of the missed block,
    • The owner writes the block back to main memory and supplies the block to the
      requesting cache.
    • The requesting cache sets its local state to Valid.
    If main memory is the owner of the missed block,
    • The block comes from memory.
    • All caches with a copy of the block set their state to Valid.
  Write Hit:
    If the block is in state Dirty or in state Reserved,
    • Write to the block and update the local state to Dirty.
    If the block is in state Valid,
    • Write to the block and update main memory with the new data.
    • A Write-Inv consistency command is broadcast to all caches, invalidating their
      copies.
    • Update the local state to Reserved.
  Write Miss:
    Like a read miss, the block always comes from the owner.
    If another cache is the owner of the missed block,
    • The owner writes the block back to main memory and supplies the block to the
      requesting cache.
    • Send a Read-Inv consistency command which invalidates all cached copies.
    • The requesting cache sets its local state to Dirty.
Synapse N+1 (WI)
  Read Miss:
    If another cache is the owner of the missed block,
    • The owner writes the block back to main memory.
    • The owner updates the local state to Invalid.
    • The requesting cache must then send an additional miss request to get the
      block from main memory.
    If main memory is the owner of the missed block,
    • The block comes from main memory.
    • The loaded block state is always set to Valid.
  Write Hit:
    If the block is in state Dirty,
    • Write to the block and update the local state to Dirty.
    If the block is in state Valid,
    • The procedure is identical to a write miss since there is no invalidation signal.
  Write Miss:
    Like a read miss, the block always comes from memory.
    If another cache is the owner of the missed block,
    • It must first be written to memory by the owner.
    • All other caches with copies change their state to Invalid.
    • The block in the requesting cache is loaded in state Dirty.
Berkeley (WI)
  Read Miss:
    If another cache is in Dirty state or in Shared-Dirty state,
    • The owner must supply the block directly to the requesting cache and set its
      local state to Shared-Dirty.
    • The requesting cache sets its local state to Shared-Dirty.
    If main memory has Dirty copy,
    • The block comes from main memory.
    • The loaded block state is set to Valid.
  Write Hit:
    If the block is in state Dirty,
    • Write to the block and update the local state to Dirty.
    If the block is in state Valid or in state Shared-Dirty,
    • Send an invalidation signal to system bus before the write is allowed to proceed.
    • All other caches invalidate their copies upon matching the block address.
    • Update the local state to Dirty.
  Write Miss:
    Like a read miss, the block comes directly from the owner.
    If another cache is the owner of the missed block,
    • All other caches with copies change their state to Invalid.
    • The block in the requesting cache is loaded in state Dirty.
Illinois (WI)
  Read Miss:
    If another cache is the owner of the missed block,
    • The owner supplies the block directly to the requesting cache, updates main
      memory with the dirty copy and sets its local state to Shared.
    • The requesting cache sets its local state to Shared.
    If another cache has a Shared or Valid-Exclusive copy,
    • The owner supplies the block directly to the requesting cache and sets its
      local state to Shared.
    • The requesting cache sets its local state to Shared.
    If main memory is the owner of the missed block,
    • The block comes from main memory.
    • The loaded block state is set to Valid-Exclusive.
  Write Hit:
    If the block is in state Dirty or in state Valid-Exclusive,
    • Write to the block and update the local state to Dirty.
    If the block is in state Shared,
    • Send an invalidation signal to system bus before the write is allowed to proceed.
    • All other caches invalidate their copies upon matching the block address.
    • Update the local state to Dirty.
  Write Miss:
    Like a read miss, the block comes directly from the owner.
    If another cache is the owner of the missed block,
    • All other caches with copies change their state to Invalid.
    • The block in the requesting cache is loaded in state Dirty.
RB (WI)
  Read Miss:
    If another cache is the owner of the missed block,
    • The owner interrupts the bus read and performs its own bus write.
    • Updates memory to the correct value.
    • The bus read will be retried immediately.
    • All the caches update with the correct value from the bus read and change into
      state Read.
    If main memory is the owner of the missed block,
    • The block comes from main memory.
    • The loaded block state is set to Read.
  Write Hit:
    If the block is in state Local (Dirty),
    • Write to the block and update the local state to Local.
    If the block is in state Read,
    • A write updates the block and a bus write is generated.
    • The cache state is set to Local.
    • The bus write updates the memory and at the same time causes all other caches
      to change into state Invalid.
  Write Miss:
    Like a read miss, the block comes directly from the owner.
    If another cache is the owner of the missed block,
    • A write updates the block and a bus write is generated.
    • The cache state is set to Local.
    • The bus write updates the memory and at the same time causes all other caches
      to change into state Invalid.
RWB (WI)
  Read Miss:
    If another cache is the owner of the missed block,
    • The owner interrupts the bus read and performs its own bus write, updating
      memory to the correct value.
    • The bus read will be retried immediately.
    • All the caches update with the correct value from the bus read and change into
      state Read.
    If main memory is the owner of the missed block,
    • The block comes from main memory.
    • The loaded block state is set to Read.
  Write Hit:
    If the block is in state Local,
    • Write to the block and update the local state to Local.
    If the block is in state Read,
    • The first write to a shared block updates that block and broadcasts the new
      value to all other caches sharing that block.
    • The requesting state is changed to F, but the state of all other caches with a
      copy of that block remains Read.
    • A subsequent write is carried out and broadcasts an invalidate signal, so that all
      other caches invalidate their copies upon matching the block address.
  Write Miss:
    If another cache is the owner of the missed block,
    • A bus write is generated, the cache value is updated to this new value, and the
      new value is broadcast to all other caches sharing that block.
    • The requesting state is changed to F, but the state of all other caches with a
      copy of that block remains Read.
    • A subsequent write is carried out and broadcasts an invalidate signal, causing all
      other caches to enter state Invalid.
WIP (WI)
  Read Miss:
    If another cache is the owner of the missed block,
    • If a block is not in cache or invalidated, the owner will supply a whole valid
      block to the requesting cache. The requesting cache sets its local state to
      Unmod-Shared.
    • If a missed block is in state IW1, the owner will supply only a valid word, not a
      whole valid block, to the requesting cache. The requesting cache sets its local
      state to Unmod-Shared.
    • If a missed block is in state IW2, the owner will supply only a valid word, not a
      whole valid block, to the requesting cache. The requesting cache sets its local
      state to IW1.
    If main memory is the owner of the missed block,
    • The whole valid block comes from main memory.
    • The loaded block state is set to Unmod-Exclusive.
  Write Hit:
    If the block is in state Mod-Exc or Unmod-Exc,
    • Write to the block and update the local state to Mod-Exc.
    If the block is in state Mod-Shd or Unmod-Shd,
    • Send an invalidation signal to system bus before the write is allowed to proceed.
    • All other caches invalidate their copies upon matching the block address.
    • Update the local state to Mod-Shd.
    If the block is in state IW1 or IW2,
    • The whole valid block will be reloaded before the write takes place, as in the
      case of a write miss.
    • Send an invalidation signal to system bus before the write is allowed to proceed.
    • All other caches invalidate their copies upon matching the block address.
    • If the invalidated block is in state IW1, update the state to IW2.
    • If the invalidated block is in state IW2, update the state to INV (fully
      invalidated).
    • If the invalidated block is in the other states, update the state to IW1.
  Write Miss:
    Like a read miss, the block comes directly from the owner.
    If the missed block is in state INV, not in cache, or in state IW2,
    • It will be loaded in the same way as when a read miss occurs, and then the write
      follows.
    • Update the local state to Mod-Exc or Mod-Shd.
    If the missed block is in state IW1,
    • Send an invalidation signal to system bus before the write is allowed to proceed.
    • All other caches invalidate their copies upon matching the block address.
    • Update the local state to Mod-Shd.
Firefly (WU)
  Read Miss:
    If another cache is the owner of the missed block,
    • The owner will supply the block directly to the requesting cache and update
      main memory. The requesting cache sets its local state to Shared.
    If other caches have a Shared copy,
    • The other caches with a shared copy supply the block to the requesting cache.
      The requesting cache sets its local state to Shared.
    If main memory is the owner of the missed block,
    • The block comes from main memory.
    • The loaded block state is set to Valid-Exclusive.
  Write Hit:
    If the block is in state Dirty or Valid-Exclusive,
    • Write to the block and update the local state to Dirty.
    If the block is in state Shared,
    • The other caches (including the memory copy) with a shared copy are updated.
    • The resulting state is Shared.
    • If sharing has ceased, then the next state is Valid-Exclusive.
  Write Miss:
    Like a read miss, the block comes directly from the owner.
    If another cache is the owner of the missed block,
    • The other caches (including the memory copy) with a shared copy are updated.
    • The resulting state is Shared.
Dragon (WU)
  Read Miss:
    If another cache is the owner of missed block,
    • That cache supplies the data to the requesting cache. The requesting cache
      sets its block state to Shared-Dirty.
    If main memory is the owner of missed block,
    • The block comes from main memory.
    • Any cache with a Valid-Exclusive or Shared-Clean copy raises the
      SharedLine and sets its local state to Shared-Clean.
    • The requesting cache loads the block in state Shared-Clean if the
      SharedLine is high; otherwise, it is loaded in state Valid-Exclusive.
  Write Hit:
    If the block is in state Dirty or Valid-Exclusive,
    • Write to the block and update the local state to Dirty.
    If the block is in state Shared,
    • The other caches (including memory copy) with shared copy are updated.
    • The resulting state is Shared-Clean and raises the SharedLine, indicating
      that the data are still shared.
    • By observing this line on the bus, the cache performing the write can
      determine whether other caches still have a copy and hence whether further
      writes to that block must be broadcast.
    • If the SharedLine is not raised, the block state is changed to Dirty; else
      it is set to Shared-Dirty.
  Write Miss:
    Like a read miss, the block comes directly from the owner.
    If another cache is the owner of missed block,
    • That cache supplies the data to the requesting cache. The requesting cache
      sets its block state to Shared-Dirty.
    • Other caches with copies set their local state to Shared-Clean.
    • Upon loading the block, the requesting cache sets the local state to Dirty
      if SharedLine is not raised.
    • If the SharedLine is high, the requesting cache sets the state to
      Shared-Dirty and performs a single-word bus write to broadcast the new
      contents.
In Section 3, we propose a hybrid word invalidate/read broadcast approach to
reduce invalidation misses. The hybrid word invalidate/read broadcast protocol is
built on write-invalidate protocols, specifically the Word Invalidate Protocol and
Read Broadcast. Therefore, the remainder of this chapter focuses on write-invalidate
protocols to show how the hybrid word invalidate/read broadcast protocol is related
to them.
2.2.1 Write-Invalidate Protocols
Figure 8 and Figure 9 illustrate the basic operation of write-invalidate protocols.
Figure 8 shows that the copies in three caches are consistent. Starting from Figure 8,
if P2 tries to write the data (x → X') in the block of its private cache, then P2
sends an invalidation request on the shared bus to invalidate all other cached
copies. The caches of the other processors monitor the bus through the snoop portion
of their cache controllers. When they detect an address match, they invalidate the
entire cache block containing the address. Figure 9 shows that the other two caches
invalidate their entire block upon matching the block address.
In the introduction, we mentioned the bus-related coherency overheads of
write-invalidate protocols for maintaining cache coherency: invalidation requests and
[Figure 8 diagram: shared memory holding the block (w, x, y, z) and the private
caches of processors P1, P2, ..., Pn, each holding a copy (w, x, y, z), connected
by a shared bus.]
Figure 8. Cache configuration after reading four words from the memory
[Figure 9 diagram: P2's cache holds (w, X', y, z); the entire block in each of the
other caches is marked invalid (I) after the invalidation request on the shared
bus.]
Figure 9. Cache configuration after writing (x → X') on one word in a block of P2's
cache.
invalidation misses. Write-invalidate protocols suffer from memory-access penalties due
to invalidation misses and invalidation requests [9]. Two main factors can increase
invalidation requests and invalidation misses in write-invalidate protocols: severe
inter-processor contention and a large block size. Severe inter-processor contention for
an address produces more invalidations; the invalidations interrupt all processors' use of
the data and increase the number of invalidation misses [6]. The overhead of
maintaining cache coherency can be made worse by a larger block size, because
contention can occur for any address in the block. Therefore, the probability that
the block will be actively shared increases. So, increasing the block size cannot reduce
invalidation misses [19]. Consequently, the additional invalidation requests and
invalidation misses increase bus utilization. Reducing the number of invalidation
requests and the number of invalidation misses is the most important performance issue
for write-invalidate protocols.
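The effect of block size on contention can be illustrated with a minimal sketch. The helper names `block_index` and `same_block` and the word-addressed memory layout are assumptions made for illustration only; they do not come from the simulator used in this study.

```python
def block_index(word_address, block_size_words):
    """Index of the cache block that contains the given word address."""
    return word_address // block_size_words


def same_block(addr_a, addr_b, block_size_words):
    """True when two different word addresses fall in the same block.

    Under whole-block invalidation, a write to one of them then
    invalidates the cached copy of the other as well.
    """
    return (addr_a != addr_b
            and block_index(addr_a, block_size_words)
            == block_index(addr_b, block_size_words))


# Words 4 and 7 contend for one block when a block holds four words,
# but not when a block holds two words.
print(same_block(4, 7, block_size_words=4))
print(same_block(4, 7, block_size_words=2))
```

With the larger block, a write by one processor to word 4 would invalidate another processor's copy of word 7 even though that word was not updated.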
2.2.2. Word Invalidate Protocol (WIP)
An invalidation of a word in a block usually causes all other words in the block to
be invalidated. When other processors subsequently reread these addresses,
additional read misses are incurred because the block is fully invalidated. The
overhead is paid even when a processor reads an address that was not updated. To
protect useful valid data from unnecessary invalidation, Tomašević and Milutinović [21]
introduced the Word Invalidate Protocol as an enhancement of the write-invalidate
protocol that minimizes the overhead of invalidating an entire block.
The WIP (Word Invalidate Protocol) differs from the other write-invalidate
protocols, which usually invalidate a whole block, because WIP invalidates only one
word in a block. Each time a processor updates a word in a block, it sends a request
to the other processors sharing that block to invalidate only the requested word, not
the full block [21]. If a processor tries to read or write an invalidated word in its
cache, then a read miss or a write miss occurs. Only the valid word is reloaded
instead of the full block. After the reload, the block is partially recovered if it
had two invalid words, or fully recovered if the reloaded word was its only invalid
word.
WIP uses a pollution point, which is the maximum number of invalid words allowed in
a block. Tomašević and Milutinović examined the influence of different pollution
points on WIP performance by simulating WIP versions with 1, 2, and 3 allowed invalid
words, for a block size of four words. They concluded that the version with two
allowed invalid words is the most appropriate pollution point for WIP. Under the
pollution point, the allowed number of invalid words within the block (degree of
pollution) may be just one or two. After reaching the pollution point,
any subsequent invalidation request invalidates the whole block.
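The pollution point rule can be sketched as follows. This is only an illustration: the function name `invalidate_word`, the per-word list of valid flags, and the choice to ignore a request for a word that is already invalid are assumptions, not details taken from [21].

```python
def invalidate_word(valid, word, pollution_point=2):
    """Apply one word invalidation request to a block's valid flags.

    valid            -- list of booleans, one flag per word in the block
    word             -- index of the word named in the request
    pollution_point  -- allowed number of invalid words in the block
    Returns a new list of valid flags.
    """
    if not valid[word]:
        # Assumption: a request for an already-invalid word changes nothing.
        return list(valid)
    if valid.count(False) >= pollution_point:
        # Pollution point already reached: invalidate the whole block.
        return [False] * len(valid)
    updated = list(valid)
    updated[word] = False
    return updated


block = [True, True, True, True]      # four-word block, fully valid
block = invalidate_word(block, 1)     # one invalid word
block = invalidate_word(block, 2)     # two invalid words
block = invalidate_word(block, 3)     # third request: whole block invalid
print(block)
```

With a pollution point of two, the three successive requests correspond to a block with one invalid word, a block with two invalid words, and a fully invalidated block.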
In [21], Tomašević and Milutinović compared WIP with the Berkeley protocol, which
is the best representative of write-invalidate protocols. They show that WIP issues
fewer invalidation requests than the Berkeley protocol. WIP's selective
invalidations save useful data in the cache from being wasted. Consequently, WIP has a
higher hit ratio for shared references than the Berkeley protocol. WIP avoids some
unnecessary invalidations and achieves better data utilization [22]. Tomašević and
Milutinović [22] demonstrated that WIP has better data utilization and lower bus traffic
than the other write-invalidate protocols that use whole-block invalidation. The most
important factor behind the better data utilization and lower bus traffic is the
reduction of invalidation misses.
2.2.3 Read Broadcast
Read Broadcast, presented by Rudolph and Segal [16] and evaluated by Eggers
and Katz [6], is an extension for snooping protocols that utilizes the broadcast nature of
the bus. Under read broadcast, when a cache issues a bus read miss, the bus read will
fetch the data stored in the memory. However, the owner of the missed cache block
interrupts the bus read miss and performs its own bus write to update memory with the
correct data. The original bus read is then retried immediately. The caches of the other
processors monitor the bus through their snoop controllers. When the snoop controller
of another cache detects on the bus a block address that matches one of its
invalidated blocks, it updates the invalidated block with the data from the bus.
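The snooping side of read broadcast can be sketched as below. The model is hypothetical: each cache is a dictionary from block address to data, `None` marks an invalidated block whose address tag is still present, and the function name `broadcast_read` is invented for this illustration. The interrupted bus read and the memory update by the owner are not modeled.

```python
def broadcast_read(caches, requester, address, data):
    """Deliver a block placed on the bus by a bus read.

    The requester loads the block, and every other cache that holds an
    invalidated copy of the same block address takes the data as well.
    Returns the ids of the caches that were updated.
    """
    refreshed = []
    for cache_id, cache in enumerate(caches):
        is_requester = cache_id == requester
        holds_invalidated_copy = address in cache and cache[address] is None
        if is_requester or holds_invalidated_copy:
            cache[address] = data
            refreshed.append(cache_id)
    return refreshed


caches = [{0x40: None}, {}, {0x40: None}, {0x80: None}]
print(broadcast_read(caches, requester=1, address=0x40, data="wxyz"))
```

Cache 3 holds an invalidated copy of a different block, so it ignores the transfer, while caches 0 and 2 are refilled without issuing their own read misses.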
3. A HYBRID WORD INVALIDATE/READ BROADCAST PROTOCOL (HWRP)
Word Invalidate Protocol may be classified as "event broadcasting", whereas in
Read Broadcast, events and data values are broadcast. By combining these two
classes of protocol, we propose HWRP (Hybrid Word Invalidate/Read Broadcast
Protocol). HWRP is a modification of WIP with Read Broadcast capability for a
further reduction in invalidation misses.
Under HWRP, if a processor tries to read or write an invalidated word or an
invalidated block in the cache, then a read miss request or a write miss request is
broadcast on the bus, and the owner of the missed word or the missed block puts
either a valid word or a valid block on the bus. When the Snoop Controllers of the
other caches detect a read or write bus operation for the block's address, those
holding an invalidated word or an invalidated block update it with the data from the
bus. Like WIP, HWRP uses the idea of ownership. The cache that has the block in state
MOD-SHD or MOD-EXC is the owner of that block. If a block is not owned by any cache,
memory is the owner.
Under read broadcast, on a bus read miss, the owner of the missed cache block
interrupts the bus read miss and performs its own bus write, updating memory with the
correct value. Unlike read broadcast, HWRP does not need to interrupt the bus read miss
and perform the bus write to update memory, because HWRP uses direct cache-to-cache
transfers if a cache is the owner of a missed cache block.
The HWRP uses the same seven states that WIP uses. The seven states for cached
blocks are given in Table 2.
Table 2. Summary of Cache Block States

State        Description
INV          Block does not contain a valid word.
IW1          Block has only one invalid word.
IW2          Block has only two invalid words.
UNMOD-EXC    Unmodified-Exclusive. No other cache has this block. Words in
             the block are consistent with main memory.
UNMOD-SHD    Unmodified-Shared. Some other caches may have this block.
MOD-SHD      Modified-Shared. This block is owned, but it cannot be updated
             without informing the other caches. Its data must be given to any
             requesting cache and flushed back to main memory.
MOD-EXC      Modified-Exclusive. This block is owned and unique. Therefore,
             data can be updated locally. Its data must be given to any
             requesting cache and flushed back to main memory.
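The seven states and the ownership rule (a cache holding the block in MOD-SHD or MOD-EXC is the owner; otherwise memory is the owner) could be encoded as in the following sketch. The names `BlockState`, `OWNER_STATES`, and `find_owner` are illustrative assumptions and are not taken from the simulator.

```python
from enum import Enum


class BlockState(Enum):
    INV = "INV"              # no valid word
    IW1 = "IW1"              # one invalid word
    IW2 = "IW2"              # two invalid words
    UNMOD_EXC = "UNMOD-EXC"
    UNMOD_SHD = "UNMOD-SHD"
    MOD_SHD = "MOD-SHD"
    MOD_EXC = "MOD-EXC"


# A cache owns a block only while it holds it in a modified state.
OWNER_STATES = frozenset({BlockState.MOD_SHD, BlockState.MOD_EXC})


def find_owner(states):
    """Return the id of the cache that owns the block.

    states -- the state of the block in each cache, indexed by cache id
    Returns None when no cache owns the block, i.e. memory is the owner.
    """
    for cache_id, state in enumerate(states):
        if state in OWNER_STATES:
            return cache_id
    return None


print(find_owner([BlockState.IW1, BlockState.MOD_SHD, BlockState.IW1]))
print(find_owner([BlockState.UNMOD_SHD, BlockState.UNMOD_SHD]))
```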
The operation of the HWRP protocol is specified for all possible situations as follows:
Read Hit
• Upon a read hit, no coherence action is necessary because a read hit is defined
as a read access to a valid block or to a valid word within a partially valid
block.
Read Miss
Case 1) The block is not in cache.
• A request is made to the owner for the block.
• The owner puts the valid block on the bus.
• If the block is shared by any other cache, caches update that block with the
value from the bus. If the state of the owner is MOD-SHD or MOD-EXC,
then set the state of the cache as MOD-SHD.
• If the state of owner is MOD-EXC, change it to MOD-SHD.
• If memory is the owner, then the state of all sharing caches is set to
UNMOD-SHD. If no other cache shares the block, then set the state of the
requesting cache to UNMOD-EXC.
Case 2) The block is in cache with state INV.
A. If a cache is the owner of the missed block,
• The owner of the missed block accesses its own cache memory to provide a
valid block to any requesting cache for a bus read miss request.
• The owner of the missed block puts the valid block on the bus.
• If the block shared by the other caches has been invalidated fully, the snoop
controller of each cache accesses its own cache memory to update the
invalidated block with the valid block from the bus upon matching the block
address.
• The updated block state of the other caches is set to MOD-SHD state, after
reloading the valid block from the bus.
B. If main memory is the owner of the missed block,
• The block comes from main memory.
• The loaded block state is set to UNMOD-EXC.
Case 3) The block is in cache with state IW1 or IW2.
A. If a cache is the owner of the missed word,
• The owner of the missed word accesses its own cache memory to provide a
valid word to any requesting cache for a bus read miss request.
• The owner of the missed word puts the valid word on the bus.
• If the word in the block shared by the other caches has been invalidated, the
snoop controller of each cache updates the invalidated word with the valid
word from the bus upon matching the block address.
• If the updated block state of the other caches is IW1, the updated block state
of the other caches is set to MOD-SHD state because the updated blocks are
recovered fully.
• If the updated block state of the other caches is IW2, the updated block state
of the other caches is set to IW1 state because the updated blocks are
recovered partially.
B. If main memory is the owner of the missed word,
• The word comes from main memory.
• The loaded block state is set to UNMOD-SHD, if the other caches are sharing
the same block.
• The loaded block state is set to UNMOD-EXC, if the other caches are not
sharing the same block.
Write Hit
If the block is in state MOD-EXC or UNMOD-EXC,
• Write to the block without an invalidation request and update the local state
to MOD-EXC.
If the block is in state MOD-SHD or UNMOD-SHD,
• A word invalidation request will be issued on the bus before the write is
allowed to proceed.
• All other caches sharing the block invalidate the corresponding word in their
block upon matching the block address.
1. If the block containing the invalidated word is in state MOD-SHD or in state
UNMOD-SHD, it will be changed to the state IW1.
2. If the block containing the invalidated word is in state IW1 and a word
invalidation request tries to invalidate one of the valid words of the block,
the block in state IW1 is changed to the IW2 state.
3. If the block containing the invalidated word is in state IW2 and a word
invalidation request tries to invalidate one of the valid words of the block,
the block in state IW2 will be fully invalidated (IW2 → INV).
• Update the local state to MOD-SHD.
If the block is in state IW1 or IW2,
• The whole valid block will be reloaded before the write takes place.
• Like read miss, if the snoop controllers of the other caches detect a bus read
request upon matching the block address, update invalidated words in block
with data from the bus.
• Send a word invalidation request to system bus before the write is allowed to
proceed. All other caches invalidate a word in their block upon matching the
block address.
• Update the local state to MOD-SHD.
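The write hit rules for a fully valid local block can be summarized as a small transition table. The sketch below is illustrative only: the names `SNOOP_ON_WORD_INVALIDATE` and `write_hit` are assumptions, states are written as plain strings, the invalidated word is assumed to be valid in every sharing cache, and the IW1/IW2 case (which first reloads the block) is deliberately left out.

```python
# Next state of a sharing cache when it snoops a word invalidation request
# that hits one of its valid words.
SNOOP_ON_WORD_INVALIDATE = {
    "MOD-SHD": "IW1",
    "UNMOD-SHD": "IW1",
    "IW1": "IW2",
    "IW2": "INV",
}


def write_hit(local_state, sharer_states):
    """Return (new_local_state, new_sharer_states) after a write hit."""
    if local_state in ("MOD-EXC", "UNMOD-EXC"):
        # Exclusive copy: write locally, no invalidation request is needed.
        return "MOD-EXC", list(sharer_states)
    if local_state in ("MOD-SHD", "UNMOD-SHD"):
        # Shared copy: a word invalidation request goes out before the write.
        updated = [SNOOP_ON_WORD_INVALIDATE.get(s, s) for s in sharer_states]
        return "MOD-SHD", updated
    raise ValueError("write hit to IW1/IW2 needs a block reload first")


print(write_hit("UNMOD-SHD", ["UNMOD-SHD", "UNMOD-SHD"]))
print(write_hit("MOD-SHD", ["IW2", "IW2"]))
```

Repeated write hits by the same owner move the sharing caches through IW1 and IW2 to INV.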
Write Miss
Case 1) The block is not in cache.
A. If a cache is the owner of the missed block,
• The write missed block will be loaded in the same way as when a read miss
occurs.
• Send a word invalidation request to system bus before the write is allowed to
proceed.
• All other caches sharing the block invalidate the corresponding word in their
block upon matching the block address.
• Update the local state to MOD-SHD.
B. If main memory is the owner of the missed block,
• The block comes from main memory.
• Send a word invalidation request to system bus before the write is allowed to
proceed.
• The loaded block state is set to MOD-SHD, if the other caches are sharing the
same block.
• The loaded block state is set to MOD-EXC, if the other caches are not sharing
the same block.
Case 2) The block is in cache with state INV and being shared.
A. If a cache is the owner of the missed block,
• The write missed block will be loaded in the same way as when a read miss
occurs.
• Send a word invalidation request to system bus before the write is allowed to
proceed.
• All other caches sharing the block invalidate the corresponding word in their
block upon matching the block address.
• Update the local state to MOD-SHD.
B. If main memory is the owner of the missed block,
• The block comes from main memory.
• The loaded block state is set to MOD-EXC.
Case 3) The block is in cache with state IW1.
A. If a cache is the owner of the missed word,
• Only the write missed word will be loaded in the same way as when a read
miss occurs.
• Send a word invalidation request to system bus before the write is allowed to
proceed.
• All other caches sharing the block invalidate the corresponding word in their
block upon matching the block address.
• Update the local state to MOD-SHD.
B. If main memory is the owner of the missed word,
• Only the write missed word comes from main memory.
• Send a word invalidation request to system bus before the write is allowed to
proceed.
• The loaded block state is set to MOD-SHD, if the other caches are sharing the
same block.
• The loaded block state is set to MOD-EXC, if the other caches are not sharing
the same block.
Case 4) The block is in cache with state IW2.
A. If a cache is the owner of the missed word,
• The whole valid block, not the missed word, will be reloaded in the same
way as when a read miss occurs to a block in state INV.
• Send a word invalidation request to system bus before the write is allowed to
proceed.
• All other caches sharing the block invalidate the corresponding word in their
block upon matching the block address.
• Update the local state to MOD-SHD.
B. If main memory is the owner of the missed word,
• The whole valid block, not the missed word, comes from main memory.
• Send a word invalidation request to system bus before the write is allowed to
proceed.
• The loaded block state is set to MOD-SHD, if the other caches are sharing the
same block.
• The loaded block state is set to MOD-EXC, if the other caches are not sharing
the same block.
3.1 The Hybrid Write Invalidate/Read Broadcast Protocol Detailed Description
This section provides some figures to describe how HWRP and WIP differ in
maintaining cache coherency. Only misses due to invalidation are
considered. Assume that P1, P2 and P3 are sharing the same data block as shown in
Figure 10. The block state in the cache of each processor is UNMOD-SHD.
[Figure 10 diagram, WIP and HWRP: shared memory holds (w, x, y, z); the caches of
P1, P2 and P3 each hold (w, x, y, z) with all valid bits 0. Legend: 0 = valid bit
in cache, 1 = invalid bit in cache.]
Figure 10. Cache configuration after a read on four words in a block of private caches of
P1, P2 and P3. Copies in all three caches are consistent.
[Figure 11 diagram, WIP and HWRP: valid bits are 0 0 0 0 in P1's cache and 0 1 0 0
in the caches of P2 and P3. Legend: 0 = valid bit in cache, 1 = invalid bit in
cache, - = invalidated word in memory.]
Figure 11. Cache configuration after a write on word (x → X) in a block by P1 (write-
back cache). The word "x" in a block of P2's cache and P3's cache is
invalidated. The block state in P2's cache and P3's cache is changed to IW1.
The block state in P1's cache is MOD-SHD.
Figure 11 shows the cache configuration after P1 modifies "x" in a block of P1's cache.
The Cache Controller of P1 sends an invalidation request to the shared bus. The Snoop
Controllers of the other processors monitor the bus, and the Snoop Controllers of P2
and P3 invalidate only that particular word upon matching the address. The state of the
block in P2's cache and P3's cache is changed from UNMOD-SHD to IW1. The state of the
block in P1's cache is changed from UNMOD-SHD to MOD-SHD.
Read Miss to A Block in The IW1 State
In Figure 11, if P3 tries to read a word "x" from the block in P3's cache, then a
read miss occurs because the word is already invalidated by P1. A read miss request is
broadcast on the bus and then P1 (the owner of the missed block) puts the word "X" on
the bus.
On a read miss of WIP, WIP updates only a word (x → X) of the block in P3's
cache. The state of the block in P3's cache is changed from IW1 to UNMOD-SHD.
On a read miss of HWRP, the Snoop Controllers of P2's cache and P3's cache
update an invalidated word (x → X) with data from the bus when the Snoop Controllers
of P2's cache and P3's cache detect a read bus operation for the block's address.
Therefore, the state of the block in P2's cache and P3's cache is changed from IW1 to
UNMOD-SHD because there is no invalid word in the block of P2's cache and P3's cache.
Figure 12 and Figure 13 show the difference in cache configuration between
WIP and HWRP on read miss to a block in the IWI state.
[Figure 12 diagram, WIP. Legend: 0 = valid bit in cache, 1 = invalid bit in cache,
- = invalidated word in memory.]
Figure 12. Cache configuration for WIP. The block state in P2's cache is IW1. The
block state in P3's cache is UNMOD-SHD. The block state in P1's cache
is MOD-SHD.
[Figure 13 diagram, HWRP: all valid bits in the caches of P1, P2 and P3 are 0.
Legend: 0 = valid bit in cache, 1 = invalid bit in cache, - = invalidated word in
memory.]
Figure 13. Cache configuration for HWRP. The block state in P2's cache is UNMOD-SHD.
The block state in P3's cache is UNMOD-SHD. The block state in
P1's cache is MOD-SHD.
From Figure 11, if P1 requests one more write to a valid word of a block in P1's
cache, then an invalidation request will be issued on the bus to invalidate one of the valid
words of the block in P2's cache and P3's cache. If an invalidation request tries to
invalidate one of the valid words of the block in the IW1 state, the block in P2's cache
and P3's cache will enter the IW2 state. The block in P1's cache stays in the MOD-SHD
state. Figure 14 illustrates the cache configuration after a write on a valid word
(y → Y) of the block in P1's cache.
[Figure 14 diagram, WIP and HWRP. Legend: 0 = valid bit in cache, 1 = invalid bit
in cache, - = invalidated word in memory.]
Figure 14. Cache configuration after a write on a word (y → Y) in a block of P1's cache
by P1. The word "y" of the block in P2's cache and P3's cache is invalidated.
The block state of P2's cache and P3's cache is changed to IW2. The block
state of P1's cache is MOD-SHD.
In the configuration shown in Figure 14, if P3 tries to read either "x" or "y"
from the block in P3's cache, then a read miss occurs because the words have already
been invalidated by P1. Assume that the read-missed word is "y". A read miss request is
broadcast on the bus and then P1 (the owner of the missed block) puts the word "Y" on
the bus.
Read Miss to A Block in IW2 State
On a read miss of WIP, WIP updates only the requested word (y → Y) of the block
in P3's cache. The state of the block in P3's cache is changed from IW2 to IW1.
On a read miss of HWRP, P2's cache gets the word "Y" from the bus when the
Snoop Controller of P2's cache detects a read bus operation for the block's address.
The invalidated word "y" is then updated to "Y". Therefore, the state of the block in
P2's cache is changed from IW2 to IW1, since the number of invalidated words in the
block is reduced from two to one. Like WIP, HWRP updates the requested word
(y → Y) of the block in P3's cache. The state of the block in P3's cache is changed from
IW2 to IW1.
Figure 15 and Figure 16 show the difference in cache configuration between WIP
and HWRP on a read miss to a block in the IW2 state.
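The difference between the two protocols on a word read miss can be reproduced with a short sketch over per-word valid flags. Everything here is an illustrative assumption rather than the thesis simulator: the function `read_miss_word`, the list-of-lists layout (one list of flags per cache), and the `hybrid` switch that selects HWRP behavior.

```python
def read_miss_word(valid_bits, requester, word, hybrid):
    """Revalidate `word` after a read miss issued by cache `requester`.

    valid_bits -- one list of per-word valid flags for each cache (modified
                  in place)
    hybrid     -- False models WIP: only the requester reloads the word.
                  True models HWRP: every snooping cache whose copy of the
                  word is invalid also takes it from the bus.
    Returns the number of invalid words left in each cache's block.
    """
    for cache_id, bits in enumerate(valid_bits):
        if cache_id == requester or (hybrid and not bits[word]):
            bits[word] = True
    return [bits.count(False) for bits in valid_bits]


def figure_14_configuration():
    # P1 owns a fully valid block; "x" and "y" are invalid in P2 and P3.
    return [
        [True, True, True, True],
        [True, False, False, True],
        [True, False, False, True],
    ]


# P3 (index 2) read-misses on word "y" (index 2).
print(read_miss_word(figure_14_configuration(), 2, 2, hybrid=False))
print(read_miss_word(figure_14_configuration(), 2, 2, hybrid=True))
```

Under WIP, P2 keeps two invalid words while P3 drops to one; under HWRP both P2 and P3 are left with a single invalid word, which is what saves P2 a later read miss on "y".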
[Figure 15 diagram, WIP.]
Figure 15. Cache configuration for WIP. The block state in P2's cache is IW2. The
block state in P3's cache is IW1. The block state in P1's cache is MOD-SHD.
[Figure 16 diagram, HWRP.]
Figure 16. Cache configuration for HWRP. The block state in P2's cache is IW1. The
block state in P3's cache is IW1. The block state in P1's cache is MOD-SHD.
From the configuration shown in Figure 14, if P1 tries to update a word
(z → Z) in the block of P1's cache, then processor P1 sends an invalidation request
on the bus to invalidate one of the valid words of the block in P2's cache and in P3's
cache. The block in P2's cache and the block in P3's cache are fully invalidated
(IW2 → INV). The processor P1 updates a word (z → Z) in the block of P1's cache.
Figure 17 shows the cache configuration after a write (z → Z) on a word in the block of
P1's cache.
[Figure 17 diagram, WIP and HWRP.]
Figure 17. Cache configuration after a write on a word (z → Z) in a block of P1's cache
by P1. The entire block in P2's cache and P3's cache is fully invalidated.
The block state in P1's cache is MOD-SHD.
Read Miss or Write Miss to A Block in INV State
In Figure 17, if P3 tries to read or write any word from the block in P3's cache,
then a read miss or a write miss occurs because the words are fully invalidated. A read
miss request or a write miss request is broadcast on the bus, then P1 (the owner of the
missed block) puts a whole valid block on the bus.
On a read miss of WIP or on a write miss of WIP, the whole valid block in P1's
cache is reloaded into a block in P3's cache. In the case of a write miss, the entire valid
block should be reloaded before the write takes place. On a read miss, the state of the
block in P3's cache is changed from INV to UNMOD-SHD. But on a write miss, the
state of the block in P1's cache is changed from MOD-SHD to IW1 because a word of the
block in P1's cache is invalidated by an invalidation request issued by P3, which then
writes a word in the reloaded whole valid block. The owner of that block is changed from
P1's cache to P3's cache. Therefore, the state of the block in P3's cache will be MOD-SHD.
The block in P2's cache still stays in the INV state.
On a read miss of HWRP or on a write miss of HWRP, the Snoop Controllers of
P2's cache and P3's cache with a fully invalidated block catch the entire valid block from
the bus when the Snoop Controllers of P2's cache and P3's cache detect a read bus
operation for the block's address. The invalidated block is then reloaded. On a read
miss, the state of the block in P3's cache is changed from INV to UNMOD-SHD. The
state of the block in the cache of P2 is changed from INV to UNMOD-SHD. On a write
miss, the state of the block in P1's cache and P2's cache is changed from MOD-SHD to
IW1 because a word in the block of P1's cache and P2's cache is invalidated by an
invalidation request issued by P3. After the invalidation of a word in the block of P1's
cache and in the block of P2's cache, P3 updates a word in the reloaded valid block. The
owner of that block is changed from P1's cache to P3's cache. Therefore, the state of
the block in P3's cache will be MOD-SHD.
Figure 18 and Figure 19 show the difference in cache configurations between
WIP and HWRP on a read miss of a block in the INV state. Figure 20 and Figure 21 show
the difference in cache configurations between WIP and HWRP on a write miss of a
block in the INV state.
[Figure 18 diagram, WIP on Read Miss.]
Figure 18. Cache configuration for WIP after a read request of P3. The whole valid
block is reloaded into a block in P3's cache. The block state in P1's cache
is MOD-SHD. The block state in P2's cache is still in INV state. The
block state in P3's cache is UNMOD-SHD.
[Figure 19 diagram, HWRP on Read Miss.]
Figure 19. Cache configuration for HWRP after a read request of P3. The whole valid
block is reloaded into a block in P2's cache and P3's cache. The block
state in P1's cache is MOD-SHD. The block state in P2's cache and P3's
cache is changed to UNMOD-SHD from INV.
[Figure 20 diagram, WIP on Write Miss.]
Figure 20. Cache configuration after the whole valid block is reloaded into a block in
P3's cache and then P3 writes from a word "X" to a word "K" in the block
of P3's cache. The block state in P1's cache is changed from MOD-SHD to
IW1. The block state in P2's cache is still in INV state. The block state in
P3's cache is changed from INV to MOD-SHD.
[Figure 21 diagram, HWRP on Write Miss.]
Figure 21. Cache configuration after the whole valid block is reloaded into a block in
P2's cache and P3's cache and then P3 updates from "X" to "K" in the block
of P3's cache. The block state in P1's cache is changed from MOD-SHD to
IW1. The block state in P2's cache is changed from INV to IW1. The block
state in P3's cache is changed from INV to MOD-SHD.
Write Hit on IW1 State
There is a difference in cache configurations between WIP and HWRP on a write
hit of a block in the IW1 state. Assume that P1 tries to update a valid word (Y → H) in a
block of P1's cache from Figure 21.
In WIP, the whole valid block is reloaded into a block of P1's cache from P3's
cache (the owner of the block) before the write takes place. The write is delayed until an
invalidation signal can be sent on the bus to invalidate the word in the block of all other
caches with the same word. Therefore, the block state in P3's cache is changed from
MOD-SHD to IW1. The block state in P2's cache is changed from IW1 to IW2. The
block state in P1's cache is changed from IW1 to MOD-SHD after the word is changed
from "Y" to "H". P1's cache is the new owner of that block.
In HWRP, the Snoop Controllers of P1's cache and P2's cache catch the entire
valid block from the bus when the Snoop Controllers of P1's cache and P2's cache detect
a read bus operation for the block's address. The entire valid block is then reloaded
into the block of P1's cache and the block of P2's cache from P3's cache (the owner of
the block) before the write takes place. The write is delayed until an invalidation signal
can be sent on the bus to invalidate the word in the block of all other caches with the
same word. Therefore, the block state in P3's cache is changed from MOD-SHD to IW1. The
block state in P2's cache is changed from UNMOD-SHD to IW1. The block state in P1's
cache is changed from IW1 to MOD-SHD after the word is updated from "Y" to "H".
P1's cache is the new owner of that block.
Figure 22 and Figure 23 show the difference in cache configurations between
WIP and HWRP on a write hit of a block in the IW1 state.
[Figure 22 diagram, WIP.]
Figure 22. Cache configuration after the write hit by P1 on the block in the IW1 state
and then P1 updates from a word "Y" to a word "H" in the block of P1's
cache. The block state in P1's cache is changed from IW1 to MOD-SHD.
The block state in P2's cache is changed from IW1 to IW2. The block state
in P3's cache is changed from MOD-SHD to IW1.
[Figure 23 diagram, HWRP.]
Figure 23. Cache configuration after the write hit by P1 on the block in the IW1 state
and then P1 updates from a word "Y" to a word "H" in the block of P1's
cache. The block state in P1's cache is changed from IW1 to MOD-SHD.
The block state in P2's cache is changed from UNMOD-SHD to IW1. The
block state in P3's cache is changed from MOD-SHD to IW1.
Write Hit on IW2 State
There is a difference in cache configurations between WIP and HWRP on a write hit of
a block in the IW2 state. Assume that P3 tries to update a valid word (z → M) in a block
of P3's cache from Figure 14.
In WIP, the entire valid block is reloaded into only a block of P3's cache from
P1's cache (the owner of the block) before the write takes place. The write is delayed
until an invalidation signal can be sent on the bus to invalidate the word in the block of
all other caches with the same word. Therefore, the block state in P1's cache is changed
from MOD-SHD to IW1. The block state in P2's cache is changed from IW2 to INV.
The block state in P3's cache is changed from IW2 to MOD-SHD after the word is
changed from "z" to "M". P3's cache is the new owner of that block.
In HWRP, the Snoop Controllers of P2's cache and P3's cache catch the entire
valid block from the bus when they detect
a read bus operation for the block's address. The whole valid block is then reloaded
into a block of P2's cache and P3's cache from P1's cache (the owner of the block) before
the write takes place. On a miss, a block must be chosen for replacement. If the chosen
block is owned, then it is written to memory. The requested block is then read in
UNMOD-EXC state and is updated. The final state of the entry becomes MOD-EXC.
Therefore, the block state in P1's cache is changed from MOD-SHD to IW1. The block
state in P2's cache is changed from UNMOD-SHD to IW1. The block state in P3's cache
is changed from IW2 to MOD-SHD after the word is changed from "z" to "M". P3's
cache is the new owner of that block. Figure 24 and Figure 25 show the different cache
configurations in WIP and in HWRP on a write hit of a block in the IW2 state.
[Figure 24 diagram (WIP on Write Hit): shared memory, bus, caches, and processors]
Figure 24. Cache configuration after the write hit by P3 on the block in the IW2 state
and then P3 updates from "z" to "M" in the block of P3's cache. The block
state in P1's cache is changed from MOD-SHD to IW1. The block state in
P2's cache is changed from IW2 to INV. The block state in P3's cache is
changed from IW2 to MOD-SHD.
[Figure 25 diagram (HWRP on Write Hit): shared memory, bus, caches, and processors]
Figure 25. Cache configuration after the write hit by P3 on the block in the IW2 state
and then P3 updates from "z" to "M" in the block of P3's cache. The block
state in P1's cache is changed from MOD-SHD to IW1. The block state in
P2's cache is changed from UNMOD-SHD to IW1. The block state in P3's
cache is changed from IW2 to MOD-SHD.
In order to assess the effect of HWRP on invalidation misses, we have performed
a simulation study. HWRP is compared against WIP.
4. SIMULATION MODEL
In shared-memory multiprocessor systems, write-invalidate protocols
pay a high cache miss penalty due to the invalidation misses incurred in maintaining
cache coherence. Reducing invalidation misses is therefore a significant factor in
achieving higher performance, because it produces a
corresponding decline in the cache miss ratio. The simulation presented here is designed
with these factors taken into consideration.
In order to simulate HWRP and WIP, we use a simulation model driven by a
synthetic workload model, rather than by actual traces, to obtain quantitative measures.
Actual traces could be created, but they would be as artificial as the method that we
have employed [2]. Although synthetic traces are artificial in nature, sometimes they can
be more useful than real traces [21]. Carefully varying appropriate parameters in a
flexible synthetic model is a more convenient way to evaluate the performance of
simulated solutions than using real traces, which are influenced by the particular conditions
under which they are collected. The first step in the simulation model is the definition of
a basic multiprocessor model.
4.1. Multiprocessor Model
The simulated multiprocessor model is organized into two main modules: the
processor module and the bus module. The bus module is unique in the system and contains
only one process, while the processor module is replicated according to the size of the
multiprocessor system. The processor module has three main processes: a process for
each processor, a cache controller process, and a snoop controller process. Figure 26
shows a diagram of the simulated multiprocessor model.
[Figure 26 diagram: a system bus connected to replicated processor modules, each with a snoop controller, a cache controller, a cache memory, and a processor]
Figure 26. A diagram of the multiprocessor model.
Data lines are solid and control lines are broken.
When a processor generates a memory request, it sends the memory request to the cache
controller process. The cache services a memory request from its processor by
determining whether the requested block is present or absent. If the requested block is
present in the cache, the request can be serviced without a bus transaction; in that case, the
cache sends the processor a command to continue. If a bus transaction is required, a bus
request is generated and inserted into the service queue of the bus. In that case, the cache sends the
processor a command to continue only upon completion of the bus transaction.
The cache can also receive commands from the bus process relating to actions
that must be performed on blocks of which it has copies. Such commands are detected
through the snoop controller process and have higher priority for service by the cache
than processor memory requests. In a multiprocessor system, this is equivalent to
matching a block address on a bus transaction and halting the service of processor
requests to take action. After that action is completed, the cache is free to respond to
processor requests.
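The service order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Cache`, its fields, and the tuple results are hypothetical, a snooped invalidation is simplified to dropping the whole block, and presence in the cache is treated as sufficient to avoid a bus transaction.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Cache:
    """Hypothetical cache front end: snoop commands are served before processor requests."""
    resident: set = field(default_factory=set)         # block addresses currently held
    snoop_cmds: deque = field(default_factory=deque)    # commands detected by the snoop controller
    cpu_reqs: deque = field(default_factory=deque)      # pending processor memory requests

    def step(self, bus_queue):
        """Serve one pending item and report what happened."""
        if self.snoop_cmds:                       # higher priority than processor requests
            kind, block = self.snoop_cmds.popleft()
            if kind == "invalidate":              # simplification: drop the whole block
                self.resident.discard(block)
            return ("snoop", block)
        if self.cpu_reqs:
            block = self.cpu_reqs.popleft()
            if block in self.resident:            # present: processor may continue at once
                return ("continue", block)
            bus_queue.append(block)               # absent: processor waits for the bus
            return ("bus_request", block)
        return ("idle", None)
```

With one snoop command and two processor requests pending, successive calls to `step` serve the snoop command first, then the hit, and finally place the missing block in the bus queue.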
A more detailed description of the cache controller process, the snoop controller
process, and the bus process is provided in the following subsections.
4.1.1 The Cache Controller Process
The cache controller's behavior depends on its processor's request, whether the
data is in the cache, and the state of the cache entry on a hit. When a processor read
results in a cache hit, the appropriate word is provided to the processor. Upon a miss, a
miss request is broadcast over the bus to all caches and to main memory. If a missed
block is in the INV state, it will be obtained from the cache-owner, if one exists, or from
memory. The block will be loaded in one of the unmodified states (UNMOD-EXC¹ or
UNMOD-SHD²), depending on which of them supplies the missed block.
The procedure for a processor write to a block in the cache is as follows. The
write can be performed locally without access to the bus if the block is in one of the
exclusive states (UNMOD-EXC or MOD-EXC). For both cases, the final state is MOD-EXC.
If the state of the hit block is IW1, IW2, UNMOD-SHD or MOD-SHD, then the
cache controller must issue an invalidation request on the bus to invalidate the word in
the block of all other caches holding the same word. An IW1, IW2 or INV state indicates that the
snoop controller invalidated the block in response to detecting an invalidation request
from another processor, after the cache controller had initially detected a hit.
¹ If the missed block is supplied by main memory, then the block will be loaded in the UNMOD-EXC state.
² If the missed block is supplied by the cache-owner, then the block will be loaded in the UNMOD-SHD state.
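The cache controller rules above can be summarized as a small decision function. The following Python sketch is illustrative only: the function names are hypothetical, and the MOD-SHD result for a write hit on an UNMOD-SHD or MOD-SHD block is an assumption extrapolated from the IW1 and IW2 examples in Figures 22 through 25, where the writing cache becomes the new owner.

```python
EXCLUSIVE = {"UNMOD-EXC", "MOD-EXC"}
NEEDS_INVALIDATION = {"IW1", "IW2", "UNMOD-SHD", "MOD-SHD"}


def read_miss_load_state(supplied_by_cache_owner):
    """State in which a missed block is loaded, following the two footnotes."""
    return "UNMOD-SHD" if supplied_by_cache_owner else "UNMOD-EXC"


def write_hit(state):
    """Return (final_state, bus_request) for a processor write hitting a block in `state`."""
    if state in EXCLUSIVE:
        return ("MOD-EXC", None)             # local write, no bus access
    if state in NEEDS_INVALIDATION:
        # Assumption: the writer ends as owner in MOD-SHD, as in the IW1/IW2 examples.
        return ("MOD-SHD", "invalidation")
    raise ValueError(f"write hit is not defined for state {state!r}")
```

For example, `write_hit("UNMOD-EXC")` yields `("MOD-EXC", None)`, while `write_hit("IW2")` yields `("MOD-SHD", "invalidation")`.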
4.1.2 The Snoop Controller Process
The snoop controller process monitors the bus for bus read requests, bus write
requests and invalidation requests. If the snoop controller processes of all caches
other than the requesting cache detect an invalidation request on the bus, then they access
their own cache memory to invalidate the word in the block of their own caches. If the
snoop controller process of an owner cache observes a read request on the bus, then
it accesses its own cache memory to provide the owned block for the bus read request.
Moreover, when the snoop controller processes of all other caches with the same block
detect a read bus operation for the block's address, they access their own cache
memory to update the invalidated data or an entire invalid block with data from the bus.
After updating an invalidated word in a block with data from the bus, if the
block state is IW1, the snoop controller processes of the caches holding the updated word or the updated
block access their own cache memory to change the block's state to UNMOD-SHD. If
the block state is IW2, the snoop controller process of each of the other caches accesses its
cache memory to change the block's state to IW1. If the block state is INV, the snoop
controller process of each of the other caches accesses its cache memory to change the block's
state to UNMOD-SHD or UNMOD-EXC after reloading an entire valid block from the
bus. The snoop controller's actions are a function of the system bus request, whether it hits or
misses in its cache, and the state of the block.
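The snoop-side state changes can be collected into lookup tables. The sketch below is a hedged reading of the text, not the simulator's code: the invalidation transitions are taken from the worked examples in the captions of Figures 22 through 25 and assume each invalidation hits a word that is still valid; the read-broadcast transitions follow the paragraph above, with INV mapped to UNMOD-SHD as an assumption, since the text also allows UNMOD-EXC without saying when. All names are hypothetical.

```python
# Next state of a snooping cache's copy, keyed by the observed bus operation.
TRANSITIONS = {
    # Word invalidation seen on the bus (from the Figure 22-25 examples).
    "invalidation": {
        "UNMOD-SHD": "IW1",
        "MOD-SHD": "IW1",
        "IW1": "IW2",
        "IW2": "INV",
    },
    # HWRP read broadcast: invalidated data is refreshed from the bus.
    "read": {
        "IW1": "UNMOD-SHD",
        "IW2": "IW1",
        "INV": "UNMOD-SHD",   # assumption: the text also permits UNMOD-EXC
    },
}


def snoop_next_state(state, bus_op):
    """Look up the next state; raises KeyError for cases this sketch does not model."""
    return TRANSITIONS[bus_op][state]
```

Under these tables a block that has lost two words (IW2) needs two read broadcasts to return to UNMOD-SHD, which mirrors the stepwise recovery described in the text.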
4.1.3 The Bus Process
The bus process receives service requests of five types (read miss, write hit, write
miss, invalidation miss and invalidation signal) from all caches. The cache controller
process generates one of the five types of requests to the bus process, which serves the
incoming requests in the order of arrival. Information about the ongoing bus transaction is
sent to all snoop controller processes.
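A first-come, first-served bus process of this kind might look as follows. This is a minimal sketch under assumptions: the class `Bus`, its method names, and the representation of snoop controllers as plain callables are all hypothetical.

```python
from collections import deque

REQUEST_TYPES = {
    "read miss", "write hit", "write miss", "invalidation miss", "invalidation signal",
}


class Bus:
    """Hypothetical bus process: one FIFO service queue, broadcast to every snooper."""

    def __init__(self, snoopers):
        self.queue = deque()
        self.snoopers = list(snoopers)   # callables standing in for snoop controllers

    def request(self, cache_id, kind, block):
        if kind not in REQUEST_TYPES:
            raise ValueError(f"unknown request type: {kind!r}")
        self.queue.append((cache_id, kind, block))

    def serve_next(self):
        """Serve the oldest request and inform all snoop controllers about it."""
        if not self.queue:
            return None
        transaction = self.queue.popleft()
        for notify in self.snoopers:
            notify(transaction)
        return transaction
```

Passing `seen.append` for a list `seen` as the only snooper is enough to check that transactions are delivered in arrival order.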
We use the communication cost per memory reference as our basic metric. This
cost is the number of cycles that the bus is busy while serving one of the five types of
requests. We refer to this metric as bus cycles per memory reference. The bus cycle
cost per reference depends on which of the five types of requests is served. The bus cycle costs
used in the simulation model are adopted from the examples considered in [2]. The costs
related to bus transactions are summarized in Table 3 and Table 4.
Table 3. Timing for Fundamental Bus Operations
Bus Operation           Bus Cycles
Send address            1
Transfer 1 data word    1
Invalidate              1
Wait for Memory         2
Wait for Cache          1
Table 4. Summary of Bus Cycle Costs
Access Type                      Total Bus Cycle Costs
Memory access                    7
Cache access                     6
Write back                       4
Invalidate                       1
Write update to another cache    2
In the simulated multiprocessor model, a memory access costs 7 cycles: 1 cycle to
send the address, 2 cycles to wait for the memory access, and 4 cycles to get four words.
An access from another cache costs 6 cycles, one cycle less than the memory access,
because the cache access wait is only one cycle. A write-back costs 4 cycles. While the
write into memory is taking place, the bus need not be held. A write update to other
caches requires 2 bus cycles: 1 cycle to send an address and 1 cycle to update an
invalidated word. Invalidations cost one cycle. The data transfer width of the bus is
assumed to be one word (32 bits).
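The arithmetic behind Table 4 can be checked directly from the primitives in Table 3. In the sketch below, the constant and function names are hypothetical; the write-back cost is copied from Table 4 rather than derived, because the text does not break it down into primitives.

```python
# Primitive bus timings from Table 3 (in bus cycles).
SEND_ADDRESS = 1
TRANSFER_WORD = 1
INVALIDATE = 1
WAIT_FOR_MEMORY = 2
WAIT_FOR_CACHE = 1
WORDS_PER_BLOCK = 4   # a block is four words, moved one word per cycle

COSTS = {
    "memory access": SEND_ADDRESS + WAIT_FOR_MEMORY + WORDS_PER_BLOCK * TRANSFER_WORD,
    "cache access": SEND_ADDRESS + WAIT_FOR_CACHE + WORDS_PER_BLOCK * TRANSFER_WORD,
    "write back": 4,   # taken directly from Table 4, not derived here
    "invalidate": INVALIDATE,
    "write update": SEND_ADDRESS + TRANSFER_WORD,
}


def bus_cycles_per_reference(event_counts, references):
    """Total bus-busy cycles divided by the number of memory references."""
    total = sum(COSTS[event] * count for event, count in event_counts.items())
    return total / references
```

For instance, 10 memory accesses and 30 invalidations over 1000 references occupy the bus for 100 cycles, or 0.1 bus cycles per memory reference.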
4.2 Workload Model
The choice of workload model is a critical point because the performance of
cache coherence protocols depends heavily on the characteristics of the workload. The
workload model selected is similar to the one developed in [3,21]. The simulation
parameters and ranges used are summarized in Table 5.
The memory reference stream of each processor is divided into two distinct
classes: a reference stream to private blocks and a reference stream to shared blocks. Each
time a memory reference is called for, the processor generates a reference to a shared
block with probability shared and a reference to a private block
with probability 1 - shared. Similarly, the probability that the reference is a read is read and
the probability that it is a write is 1 - read. If the request is to a private block, it is a hit
with probability hit and a miss with probability 1 - hit.
With probability 1 - shared, the reference is to a private block. References to private
blocks do not affect cache coherence: they do
not create invalidation traffic, nor do they degrade the hit ratio of the other caches.
The most important parameter for these references is the hit ratio for private blocks.
With probability shared, the reference is for a shared block. A reference to
shared block i is made according to a probability distribution³ p_i for i = 1, ..., Ns.
Therefore, the probability that a reference is a write to shared block i is
shared * p_i * (1 - read).
³ p_i = 1 / Ns, where Ns is the number of shared blocks.
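The reference generator described in this section can be sketched as follows. This is an illustrative Python version, not the thesis's simulator: the function names and tuple layout are hypothetical, and shared blocks are numbered from 0 rather than 1 for convenience.

```python
import random


def generate_reference(rng, shared, read, hit, n_shared):
    """Draw one synthetic memory reference from the workload model."""
    is_shared = rng.random() < shared
    op = "read" if rng.random() < read else "write"
    if is_shared:
        block = rng.randrange(n_shared)          # uniform choice: p_i = 1 / Ns
        return ("shared", op, block)
    outcome = "hit" if rng.random() < hit else "miss"
    return ("private", op, outcome)


def shared_write_probability(shared, read, n_shared):
    """Probability that a reference is a write to one particular shared block."""
    return shared * (1.0 / n_shared) * (1.0 - read)
```

As a worked value, with shared = 5%, read = 85% and Ns = 32 (all within the ranges of Table 5), the probability that a given reference is a write to one particular shared block is 0.05 * (1/32) * 0.15, roughly 2.3 in 10,000.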
Table 5. Summary of Parameters and Ranges
Parameters                                   Ranges
Probability of shared references (shared)    2% - 5%
Read probability (read)                      70% - 85%
Hit ratio for private blocks (hit)           95% - 98%
Word size                                    Four bytes
Block size                                   Four words
Cache size                                   2 - 10 Kbytes
Memory Mapping Method                        Fully Associative
Number of private blocks (No)                1024
Number of shared blocks (Ns)                 16 - 64
Number of processors                         2 - 32
Number of references per processor           10000
The parameters and ranges shown in Table 5 are adopted from the examples mentioned in
[3,21]. All references to shared blocks in our model include a block number generated
by a pseudorandom number generator. To service a shared block request, the cache
determines from a directory whether the requested block is present or absent, and
whether a bus request must be generated.
If a cache miss occurs, either for a shared block or for a private block, a block
must be ejected to make room for the new block. The probability that a shared block is
selected is equal to the percentage of blocks in the cache that are shared blocks at that
point in time. If the selected block is private and is modified, it needs to be written back
to main memory. If a shared block is chosen for replacement, one of those present in the
cache is chosen at random. The state of that particular block determines whether or not it
is to be written back.
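The replacement rule can be illustrated with a short sketch. Several details are assumptions rather than statements from the text: the function name and argument layout are hypothetical, private blocks are modeled only by their count, and a shared victim is written back when it is in a modified state (MOD-EXC or MOD-SHD).

```python
import random

WRITE_BACK_STATES = {"MOD-EXC", "MOD-SHD"}   # assumption: modified copies are written back


def choose_victim(rng, shared_blocks, n_private_in_cache):
    """Pick a block to eject.

    shared_blocks maps each resident shared block number to its state;
    n_private_in_cache is the number of resident private blocks.
    """
    total = len(shared_blocks) + n_private_in_cache
    if total == 0:
        raise ValueError("cache is empty")
    # A shared block is selected with probability equal to its share of the cache.
    if rng.random() < len(shared_blocks) / total:
        block = rng.choice(sorted(shared_blocks))    # one resident shared block at random
        return ("shared", block, shared_blocks[block] in WRITE_BACK_STATES)
    # How the modified status of a private victim is tracked is not modeled here.
    return ("private", None, None)
```

In a cache holding only shared blocks the shared branch is always taken, and in a cache holding only private blocks it is never taken, which gives two easy sanity checks.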
5. DISCUSSION OF SIMULATION RESULTS
In this section, we analyze the behavior of each protocol to demonstrate the effect
of various parameters on the cache miss ratio and bus traffic. The results of the simulation
runs are presented as figures. Each figure shows the
result obtained with the indicated parameter values for both schemes (WIP and HWRP)
from two to thirty-two processors. The ratio of invalidation misses⁴
is a significant factor for comparison between WIP and HWRP because the
objective of this research is to reduce the additional invalidation misses caused by
invalidation in write-invalidate protocols.
5.1 Impact on Miss Ratios with Varying Parameters
Figure 27 and Figure 28 demonstrate that the ratio of invalidation misses and the total
miss ratio for both protocols increase as the write ratio increases. The increase is due to
the increased number of invalidations caused by more writers to shared blocks. The
increased number of invalidations is responsible for a subsequent rise in invalidation
misses [7]. The total miss ratio of each protocol increases in step with the increased
proportion of invalidation misses within total misses. From Figure 27, we see that
HWRP has a lower invalidation miss ratio than WIP.
Consequently, HWRP has a lower total miss ratio.
⁴ The invalidation miss ratio is the number of invalidation misses divided by the total number of cache misses.
[Figure 27 plot: x-axis, Number of Processors; series, HWRP and WIP]
Figure 27. Ratio of Invalidation Misses for Both Protocols
[Figure 28 plot: x-axis, Number of Processors; series, HWRP and WIP]
Figure 28. Ratio of Total Misses for Both Protocols
Figures 29 through 32 illustrate the invalidation miss ratios and shared miss
ratios⁵ for both protocols, testing the impact of handling shared blocks efficiently by
changing only the number of shared blocks. On the invalidation miss ratio, Figure 29 and
Figure 31 show that both protocols have a higher invalidation miss ratio at tighter
sharing (32 shared blocks) than at looser sharing (64 shared
blocks). At tighter sharing, the number of processors contending for a shared block
address is relatively high. Therefore, the shared data has a higher probability of being
referenced or invalidated and, consequently, of being referenced via invalidation misses.
On the shared miss ratio, Figure 30 and Figure 32 show that both protocols have a
lower shared miss ratio at tighter sharing than at looser sharing. At
looser sharing (64 shared blocks), there is a lower probability that a shared block is
referenced by its own cache or by the other caches. Therefore, there are fewer cache hits
on the shared blocks because each shared block is not accessed very often.
From these four figures, we see that the read broadcast approach in HWRP yields
a lower miss ratio in the handling of shared data, since it
leads to preserving the valid, frequently shared data.
⁵ The shared miss ratio is the number of misses to shared data divided by the total number of references to shared data.
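The two ratios defined in the footnotes are simple quotients of event counters. The helper below is a hypothetical sketch of how such counters might be turned into the plotted metrics; the function names and the choice to return 0.0 for an empty denominator are assumptions.

```python
def _ratio(numerator, denominator):
    """Quotient that returns 0.0 instead of failing when nothing was counted."""
    return numerator / denominator if denominator else 0.0


def invalidation_miss_ratio(invalidation_misses, total_misses):
    """Invalidation misses divided by the total number of cache misses."""
    return _ratio(invalidation_misses, total_misses)


def shared_miss_ratio(shared_misses, shared_references):
    """Misses to shared data divided by the total references to shared data."""
    return _ratio(shared_misses, shared_references)
```

For example, 30 invalidation misses out of 120 cache misses give an invalidation miss ratio of 0.25.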
[Figure 29 plot: series, HWRP and WIP]
Figure 29. Ratio of Invalidation Misses for Both Protocols
[Figure 30 plot: x-axis, Number of Processors; series, HWRP and WIP]
Figure 30. Ratio of Shared Misses for Both Protocols
invalidation miss ratio of WIP, even though the invalidation miss ratio of each protocol
increases slightly as cache size increases.
The number of invalidation misses for both protocols is inversely proportional to
the number of block replacements. At small cache sizes, the number of block
replacements is relatively high. As cache size increases, the percentage of block
replacements drops. Shared data tends to remain in the cache for a longer period of time,
has more opportunity to be invalidated, and, consequently, is referenced via invalidation
misses. The number of invalidation misses should be higher with each successively
larger cache, approximately by the percentage decrease in block replacements. Note that
the greater the number of processors contending for an address, the greater the number of
invalidation misses [7]. Consequently, the lower invalidation miss ratio in HWRP is due
to improved cache hits resulting from the approach of updating invalidated blocks, which leads
to preserving the valid, frequently shared data.
[Figure 33 plot: x-axis, Cache Size (Kbytes)]
Figure 33. Ratio of Invalidation Misses for HWRP
[Figure 34 plot: cache sizes from 2K to 10K]
Figure 34. Total Miss Ratio for HWRP
[Figure 35 plot: cache sizes from 2K to 10K]
Figure 35. Ratio of Invalidation Misses for WIP
[Figure 36 plot: cache sizes from 2K to 10K]
Figure 36. Total Miss Ratio for WIP
Figure 34 and Figure 36 show the total cache miss ratio for each protocol. We see that
HWRP has a lower total cache miss ratio than WIP, in proportion to its reduction in
invalidation misses. In HWRP, this reduction in invalidation misses is
contributed by the read broadcast [16] mechanism, which updates an invalidated block or
an invalidated word when snoop controllers detect a read or write bus operation for the
block's address.
5.3 Bus Utilization
The critical system bottleneck in a single-bus, shared-memory multiprocessor is
the bandwidth of the system bus [7]. Write-invalidate protocols have two main sources
of bus-related coherency overhead. The first is the invalidation request for shared data in
each cache. The second is the invalidation misses caused by invalidation requests.
Consequently, reducing the number of invalidation misses produces a corresponding
decline in the bus traffic for write-invalidate protocols. As discussed in the previous
section, HWRP always has a lower total cache miss ratio than WIP, resulting from
its reduction in invalidation misses. Figure 37 and Figure 38 show that HWRP
has a lower number of bus cycles for bus operations since read misses and invalidation
misses for shared blocks are relatively infrequent.
[Figure 37 plot: x-axis, Number of Processors; series, HWRP and WIP]
Figure 37. Total Bus Cycles
[Figure 38 plot: x-axis, Number of Processors; series, HWRP and WIP]
Figure 38. Total Bus Cycles
In contrast, the number of bus cycles in WIP is higher because of a larger percentage
of invalidation misses. As mentioned in [2], the cost of a write update is assumed to be
much lower than the cost of an invalidation and a subsequent miss. For bus
utilization, the most important consequence of HWRP is the effect of its lower miss ratio.
6. CONCLUSION
Since the purpose of a cache is to speed up access to data, cache misses are the
main hindrance to obtaining better performance in a cache memory system. In
shared-memory multiprocessor systems, write-invalidate protocols pay a high cache
miss penalty due to the invalidation misses incurred in maintaining cache coherence.
Since invalidation misses play such a large role in cache and bus performance,
coherency protocols that can reduce them are desirable. In this thesis, we presented the
Hybrid Word Invalidate/Read Broadcast Protocol (HWRP) to further reduce invalidation
misses. We have studied the effects of cache coherency on the miss ratios of both
protocols (HWRP and WIP) and on the bus traffic between the caches. Through
simulation experiments, we demonstrated that HWRP has a lower invalidation miss ratio than WIP.
In HWRP, the reduction in invalidation misses produces a corresponding decline in total
miss ratio and bus utilization. Consequently, eliminating invalidation misses leads to a
potentially better utilization of data already fetched into the cache and achieves a higher hit
ratio. Therefore, the solution proposed in this thesis can be expected to improve
performance as compared with other write-invalidate protocols.
REFERENCES
[1] A. Agarwal. An evaluation of directory schemes for cache coherence. In
Proceedings of the 15th Annual International Symposium on Computer
Architecture, pages 280-289, 1988.
[2] A. Agarwal, R. Simoni, J. Hennessy and M. Horowitz. An evaluation of directory
schemes for cache coherence. In Proceedings of the 15th Annual International
Symposium on Computer Architecture, pages 280-289, 1988.
[3] J. Archibald and J. L. Baer. Cache coherence protocols: evaluation using a
multiprocessor simulation model. ACM Transactions on Computer Systems,
4(4):273-298, November 1986.
[4] M. L. Censier and P. Feautrier. A new solution to coherence problems in
multicache systems. IEEE Transactions on Computers, 27(12):1112-1118,
December 1978.
[5] D. Chaiken, C. Fields, K. Kurihara and A. Agarwal. Directory-based cache
coherence in large-scale multiprocessors. IEEE Computer, 23(6):49-58, June
1990.
[6] S. J. Eggers and R. H. Katz. Evaluating the performance of four snooping
cache coherency protocols. In Proceedings of the 16th Annual International
Symposium on Computer Architecture, pages 2-15, 1989.
[7] S. J. Eggers and R. H. Katz. The effect of sharing on the cache and bus
performance of parallel programs. In Proceedings of the 3rd International
Conference on Architectural Support for Programming Languages and Operating
Systems, pages 257-270, 1989.
[8] S. J. Eggers and R. H. Katz. Implementing a cache consistency protocol. In
Proceedings of the 12th Annual International Symposium on Computer
Architecture, pages 276-283, 1985.
[9] D. Fredrik and P. Stenstrom. Using write caches to improve performance
of cache coherence protocols in shared-memory multiprocessors. Journal of
Parallel and Distributed Computing, 26(2):193-210, April 1995.
[10] J. R. Goodman. Using cache memory to reduce processor-memory traffic. In
Proceedings of the 10th Annual International Symposium on Computer
Architecture, pages 124-131, 1983.
[11] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative
Approach. San Mateo, California: Morgan Kaufmann Publishers Inc., 1993.
[12] D. J. Lilja and C. P. Yew. Improving memory utilization in cache coherence
directories. IEEE Transactions on Parallel and Distributed Systems,
4(10):1130-1146, October 1993.
[13] D. J. Lilja. Cache coherence in large-scale shared-memory multiprocessors:
issues and comparisons. ACM Computing Surveys, 25(3):303-338, September
1993.
[14] M. E. McCreight. The dragon computer system, an early overview. In
Proceedings of the NATO Advanced Study Institute on Microarchitecture of
VLSI Computers, Urbino, Italy, July 1984.
[15] M. S. Papamarcos and J. H. Patel. A low-overhead coherence solution for
multiprocessors with private cache memories. In Proceedings of the 11th Annual
International Symposium on Computer Architecture, pages 348-354, 1984.
[16] L. Rudolph and Z. Segall. Dynamic decentralized cache schemes for MIMD
parallel processors. In Proceedings of the 11th Annual International Symposium
on Computer Architecture, pages 340-347, 1984.
[17] A. J. Smith. Design of CPU cache memories. In Proceedings of IEEE
TENCON '87, pages 30.2.1-30.2.10, 1987.
[18] P. Stenstrom. A cache consistency protocol for multiprocessors with multistage
networks. In Proceedings of the 16th Annual International Symposium
on Computer Architecture, pages 407-415, 1989.
[19] P. Stenstrom. A survey of cache coherence schemes for multiprocessors. IEEE
Computer, 23(6):12-24, June 1990.
[20] C. P. Thacker and L. C. Stewart. Firefly: a multiprocessor workstation. IEEE
Transactions on Computers, 37(8):909-920, August 1988.
[21] M. Tomasevic and V. Milutinovic. A simulation study of snoopy cache
coherence protocols. In Proceedings of the 25th Hawaii International Conference
on System Sciences, pages 427-436, 1992.
[22] M. Tomasevic and V. Milutinovic. The Cache Coherence Problem in Shared
Memory Multiprocessors: Hardware Solutions. Los Alamitos, California: IEEE
Computer Society Press, 1993.
VITA
In-Suk Chung
Candidate for the Degree of
Master of Science
Thesis: A SIMULATION STUDY OF SNOOPY CACHE
COHERENCE PROTOCOLS
Major Field: Computer Science
Biographical Data:
Personal Data: Born in Seoul, Korea on September 23, 1964,
the son of Lee-June Chung and Boon-Ok Kim
Education: Graduated from Hwan Ii High School, Seoul, Korea, 1983;
received Bachelor of Science in Computer Science from Oklahoma
State University, Stillwater, Oklahoma in 1993. Completed the
requirements for the Master of Science degree in Computer
Science at Oklahoma State University in May 1996.
