A VLSI design for an efficient multiprocessor cache memory by Luo, Xiao




A VLSI DESIGN FOR AN EFFICIENT
MULTIPROCESSOR CACHE MEMORY
BY
Xiao Luo
A thesis
submitted to the School of Graduate Studies
in partial fulfillment of the requirements for
the degree of Master of Science
National Library of Canada
Canadian Theses Service
The author has granted an irrevocable non-exclusive licence allowing the
National Library of Canada to reproduce, loan, distribute or sell copies of
his/her thesis by any means and in any form or format, making this thesis
available to interested persons.
The author retains ownership of the copyright in his/her thesis. Neither the
thesis nor substantial extracts from it may be printed or otherwise reproduced
without his/her permission.
Abstract
This thesis proposes a cache memory, used for a 32-bit processor system, which
consists of four components: the Directory, Line Replacement Unit (LRU), Cache
Memory, and Control Unit. An 8-way set-associative mapping method is employed
in the directory. The Line Replacement Unit is based on the least recently used
line replacement algorithm. The cache memory unit has a capacity of 8K bytes,
32 bytes in each line, and it is directly accessible to 1, 2, 3, or 4 bytes
(one word) at once by the associated processor. This cache memory is designed
for a multiple processor system as well as a single processor system; a
write-through algorithm and an updating algorithm are combined to keep the
information in main memory consistent with that of the cache and to make the
multicaches coherent. The hit ratios are predicted to be over 95 percent. A
two-phase clock of 10ns is employed to pipeline this cache, and it can turn out
a result in 20ns during read operations without line misses. This cache is
implemented in a single chip, and is designed so that it is possible to build
cache systems of various sizes using these chips, without decreasing the system
speed. This cache memory has been laid out as a single integrated circuit using
3 micron NTCMOS technology, and its electrical and logical behavior has been
simulated.
Acknowledgments
First of all, I would like to express my sincere gratitude to my advisor Dr.
Paul Gillard for his supervision, guidance, suggestions, encouragement, and
patience, which have greatly helped me to complete this thesis.
I would like to thank Dr. Wlodek Zuberek, who has contributed his support and
advice toward the development of this thesis, and the other staff members of
the Department of Computer Science, who have provided me with their technical
support. Especially the support and patience of my wife was of great help.
I am very grateful to the Department of Computer Science and School of Graduate
Studies, Memorial University of Newfoundland, for providing me with an
opportunity for graduate studies and financial support in the form of a
fellowship and teaching assistantship during my study.
To the Memory of My Father and to My Mother
Contents
1 INTRODUCTION
2 BASICS OF CACHE MEMORY
2.1 Overview of the Memory Hierarchy
2.2 The Concept of Cache Memory
2.3 The Basic Structure of Cache Memory
2.4 The Line Size Choice
2.5 A Survey of Cache Design
3 IMPLEMENTATION OF THE CACHE ALGORITHMS
3.1 Cache Design Parameters
3.2 The Structure of the Cache Memory
3.3 The Address Space Mapping
3.4 The Set-associative Mapping
3.5 Implementation of the Directory
3.5.1 The Line Slot of the Directory
3.5.2 The Address Register
3.5.3 The 32-bit Decoder for Set Selection
3.5.4 The Line Number Generator
3.6 The Line Replacement Unit
3.6.1 One LRU Cell
3.6.2 One Bit of the LRU Cell
4 THE MEMORY AND CONTROL UNITS
4.1 Structure of the Memory
4.1.1 The Cache Memory Register
4.1.2 The 256-bit Row Decoder
4.1.3 The Cache Memory
4.1.4 The Data Bus Control Circuit
4.2 The System Control Unit
4.2.1 The Regular Read/Write Operations
4.2.2 The Update Operations
4.2.3 The Miss Operations
5 USE IN A MULTIPROCESSOR SYSTEM
5.1 The Coherence Solution Strategy
5.1.1 The Protocols between the Bus and the Cache
5.1.2 The Protocols between the Processor and the Cache
5.2 External Interface
5.2.1 The Interface Signals
5.2.2 The Timing Operations
5.3 Consideration of the System Bus and Main Memory
5.4 Simulations of the Cache-based Multiprocessor
5.5 Testing the Cache Memory
6 CONCLUSIONS
List of Figures
1 A Von Neumann Computer Organization
2 A Typical Memory Hierarchy
3 The Basic Cache Memory Structure
4 The Set-associative Mapping
5 The Directory
6 Simulation of the Tag Array
7 The Directory Tag
8 The Tag Bit and Valid Bit of the Directory
9 One Bit of the Address Register
10 The D-type Rising-edge-triggered Flip-flop
11 The 32-bit Directory Decoder
12 The 16-bit Decoder
13 Simulation of the 32-bit Decoder
14 The Line Number Generator
15 Layout of the Line Number Generator
16 Simulation of the Line Number Generator
17 Structure of the LRU Unit
18 The One-shot Circuit
19 Structure of an LRU Cell
20 An Example of the LRU Algorithm
21 Layout of the LRU Cell
22 One Bit of the LRU Cell
23 Simulation of the LRU Unit
24 Structure of the Cache Memory
25 One Bit of the Memory Register
26 The Register/Counter Logical Circuit
27 Layout of the Memory Column Control
28 Simulation of the Memory Column Control
29 Circuit of the Transfer Decomposer
30 The 256-bit Memory Decoder
31 Simulation of the 256-bit Row Decoder
32 The Memory Arrays
33 Four Bits of the Memory
34 The Data Bus Control Circuit
35 One Bit of the Gate Circuit
36 The Gate Control Logic
37 Layout of the Gate Logic
38 Simulation of the Gate Logic
39 The Bus Write/Read Operation Control
40 Simulation of the Write Control
41 Simulation of the Read Control
42 The Single following Pulse Producer
43 The Clock Pulse Generator
44 A Timing Diagram for the Clock Pulse Generator
45 The Circuit of the Bus Update Watcher
46 The Update-write Generator
47 The Miss Circuit
48 The Circuit for the Bus Control Signal Generator
49 The Update Request Clear Circuit
50 The Read Valid Circuit for Read Miss
51 A Typical Cache-based Multiprocessor System
52 Communication between the Cache and Bus for a Write Operation
53 Communication between the Cache and Bus for a Line Miss
54 Communication for an Update Operation
55 The Processor Subsystem
56 Pin Functions
57 A Timing Diagram for Read/Write Operations
58 A Timing Diagram for Read Operations with a Line Miss
59 A Timing Diagram for Write Operations with a Line Miss
60 The N-User 1-Server Bus Arbiter
61 The Arbitration Unit
62 The Shared Main Memory Partitioned into 8 Modules
63 Simulations for the Multiprocessor System
64 The Testing Circuit for Shifting-out Cache Line Addresses
List of Tables
1 The Design Target Miss Ratios
2 The Relevant Cache-mapping-type Ratio
3 The Truth Table Relating the "Hot Code" and the Binary Code
4 Time Delay for the Directory
5 The Gate Control Functions
1 INTRODUCTION
In 1945, John Von Neumann made proposals for a digital electronic computer
structure. In his proposals, the basic logical structure of a digital computer
system has the following characteristics:
1. It has an input medium, by means of which an essentially unlimited number
of operands or instructions may be entered.
2. It has a storage, from which operands or instructions may be obtained and
into which results may be entered, in any desired order.
3. It has a calculating unit, capable of carrying out arithmetic and logical
operations on any operands taken from storage.
4. It has an output medium, by means of which an essentially unlimited number
of results may be delivered to the users.
5. It has a control unit, capable of interpreting instructions obtained from
memory or storage, and capable of choosing between alternate courses of action
on the basis of computed results.
In general, a computer which meets the criteria defined as the Von Neumann
structure is organized as shown in Fig. 1. Although the components of the five
parts of the basic structure and the technologies used may vary widely, the
functions of the parts may be clearly identified in virtually any digital
computer.
Figure 1: A Von Neumann Computer Organization
Memory is the source of all information, data, and instructions, flowing to and
from the four other parts. The data and instructions are stored in the memory
cells, each of which is associated with a location, or address. The cells can
be accessed by other parts of the computer by means of these addresses.
The main functions of input and output, as indicated by their names, are to
derive information from and to deliver the results and other information to the
outside world. They also have two subsidiary functions, buffering and data
conversion. The buffering function provides an interface and synchronization
between the processing part of the computer and the outside world. The
conversion function can convert the data type in the processing unit into forms
used outside the computer system.
The processing part of a computer, referred to as the arithmetic-logic unit,
implements the various arithmetic and logic operations on operands obtained
from the memory. The results, after these operations, are typically stored back
in the memory.
The control unit obtains instructions from the memory, decodes them, and,
depending on their meanings, sends the appropriate control signals to other
parts of the computer so that the desired operations will be accomplished. It
also makes decisions about what action must be taken after receiving the
results of various tests on data made by the arithmetic-logic unit. The
combination of the arithmetic-logic unit and control unit is known as the
central processing unit, or processing element in the case of multiple
processor systems.
Until the last two decades, almost all electronic digital computer systems
used this Von Neumann architecture. Even when the underlying architectures of
the computer systems began to contain a limited amount of parallelism (such as
in the CDC 6600, for example) it was generally concealed from the users. In
this period, the demand for higher speed, larger storage, and more reliable
computer systems was rapidly increasing because large scale computation
applications were visualized. The demand was such that, despite many
technological advances in electronics, uniprocessor systems proved to be
inadequate for the most highly computationally intensive problems, since the
point had been reached where communication delays between switching elements or
integrated circuits play a dominant role in the speed of the computation.
Therefore, new ways had to be found to meet these requirements. The general
approach is based on parallelism, implying that computer architectures will
have to depart from the strict Von Neumann concept.
Parallelism in various forms had already appeared in computers produced during
the 1960's, and has proved to be an effective approach. In this context,
parallelism does not only mean the replication of logic but also has other
meanings. For example, a uniprocessor using a pipelined instruction unit and a
pipelined arithmetic unit, as well as the implementation of multiple programs
executed "simultaneously", all imply concepts of parallelism. Therefore
parallelism in a computer system presently has three meanings:
1. Time interleaving
2. Resource replication
3. Resource sharing
Time interleaving introduces a time factor into the concept of parallelism.
That is, several process steps are interleaved in time, each using a part of
the same hardware at different times. In this case, it is not necessary to have
a replication of hardware to increase the performance of a computer system.
Pipelining is an example of time-interleaving.
Resource replication is the replication or addition of hardware units which
can operate simultaneously on a problem, thereby attaining extra processing
power through replication of logic, rather than relying solely on fast
individual gates and small dimensions to reduce logic delay in order to obtain
high speed. Multiple processes using the same hardware in some time-slice order
are an example of resource sharing.
Since parallelism was introduced into computer architecture, various parallel
computer architectures such as vector processors, pipelines, array processors,
as well as multiprocessor architectures, have been developed and used to handle
large quantities of data simultaneously and concurrently with high performance.
I/O processors have been used for input and output to speed up communication
between the processing elements and external storage or users. Thus, in
general, parallelism includes not only simultaneity but also concurrency. The
former means that two or more events occur at the same time and the latter
means that two or more events occur within a given interval of time.
On the other hand, memory has been organized in different ways in order to
obtain access speeds compatible with that of processing elements and to have a
larger capacity. In general, there are two basic approaches: one is to organize
the memory as a memory hierarchy, the other is to decompose the memory into
several modules shared by the processors in the system.
These kinds of computer architectures are, more or less, not strictly Von
Neumann structures; indeed, the multiple processor systems in particular have
quite different characteristics.
2 BASICS OF CACHE MEMORY
Throughout the history of electronic computers, whenever developments have
taken place in computer systems which increase processor speed, there is
corresponding pressure to have the memory match this speed and, at the same
time, increase its capacity. Therefore, performance improvements in computers
have been associated with improvements in memory capacity and speed. Although
both processors and main memory systems have been improved by steadily
developing technologies and novel architectures, there has been a persistent
mismatch between the speed of processors and that of main memory. That is, the
main memory is slow relative to the processors. The memory system limits the
speed at which input data can be delivered to a processor and the results
received from the processor. This is the so-called Von Neumann bottleneck.
Hence there has been a constant need for steady improvements to main memory
subsystems for high overall system performance. Approaches of interest toward
improving memory speed and capacity have been the following [2, 5, 7]:
1. Memory hierarchies and virtual memory
2. Cache memories
3. Development of larger and faster memory chips
4. Memory interleaving
2.1 Overview of the Memory Hierarchy
In order to improve the performance of computer systems, especially single
processor systems, there are two approaches to speed up a memory system with a
large capacity. One is to develop a higher speed memory system with a larger
capacity, the other is to partition a memory system into an efficient memory
hierarchy consisting of several levels of subsystems with various speeds and
sizes [2, 3, 5]. The first approach seems more straightforward and simple: to
have a fast one-level memory with a large capacity. However, even with
improvements in technology, a fast memory system with a large capacity is still
very expensive, so that it is necessary to use slower memory at a lower cost to
create a memory system with a large enough capacity. In order to give the
memory subsystem an adequate effective speed, the memory subsystem can be
organized as a hierarchical memory system. This kind of memory system can be
matched to both the speed and size requirements of the high-speed processor at
relatively low cost. A typical hierarchical memory structure is depicted in
Fig. 2. The top level of the memory hierarchy (near the processor) has the
fastest speed but also the highest cost. Therefore, the capacity of this level
is made smaller to decrease system costs. For the lower levels, the speed of
the subsystem decreases while the capacity increases. At the level on the
bottom of the memory hierarchy, the memory subsystem possesses the largest
capacity, but slowest speed, with lowest cost per word stored. In this memory
hierarchy, each level is directly connected to the immediately higher level.
That is, each memory subsystem can directly communicate with the immediately
higher or lower subsystem in the hierarchy. For example, the processors can
directly communicate with the first-level memory, e.g. register array or cache
memory; and similarly the first-level subsystem can communicate with the
second-level one, as shown in Fig. 2, and so on. Generally the top-level
subsystem, such as cache memory, is used to attempt to bridge the speed gap
between the processors and the lower level subsystem, while the lower level
memory subsystems are employed to enlarge the capacity of the whole memory
system.
Figure 2: A Typical Memory Hierarchy
2.2 The Concept of Cache Memory
The concept of cache memory was proposed by Wilkes [1965] in a brief article in
which he described a system that contained two kinds of main memory: one was
conventional, and the other was unconventional high-speed memory called at that
time slave memory, now called cache memory. In 1968, the first real cache
memory was implemented on the IBM 360/85. Since then, use of cache memory has
rapidly increased on a wide range of computer systems, initially on mainframes,
then on minicomputers, and today even on microcomputers.
Cache memory, a relatively small, high speed random access memory, is designed
for transparently bridging the speed gap between the CPU and main memory, since
it typically has a speed compatible with that of the CPU. This means that a
cache memory in a cache-based system is invisible and not directly accessible
to users or even to system operators. Typically, the speed of cache memory is
five to ten times faster than that of main memory. Using this kind of memory
hierarchy, the computer may seem to have a one-level memory with the capacity
of the slow main memory and the speed of the cache memory [2].
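The claim that the hierarchy behaves almost like a single fast memory can be
illustrated with the usual weighted-average model of access time. The sketch
below is only an illustration under stated assumptions: the function name
effective_access_time, the 20 ns and 200 ns timings, and the simplification
that a miss costs exactly one main memory access are all hypothetical and are
not taken from the thesis.

```python
def effective_access_time(hit_ratio, t_cache_ns, t_main_ns):
    """Average access time seen by the processor in a two-level hierarchy.

    Assumes a hit costs t_cache_ns and a miss costs t_main_ns (a deliberate
    simplification; a real miss also pays for the line transfer).
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * t_cache_ns + (1.0 - hit_ratio) * t_main_ns


# Hypothetical timings: main memory ten times slower than the cache.
for h in (0.80, 0.95, 0.99):
    t = effective_access_time(h, 20.0, 200.0)
    print(f"hit ratio {h:.2f}: {t:.1f} ns")
```

With a high hit ratio the average stays close to the cache speed, which is the
sense in which the processor appears to see a one-level memory.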
The idea of the cache memory, similar to the primary-secondary virtual memory,
is to duplicate the active portions of a lower speed memory in a high speed,
but smaller, memory. Only the data most likely to be needed in the near future
by the CPU reside in the cache, and obsolete data are automatically replaced by
the newly requested data. In general, the speed of the cache memory is matched
to the maximum data rate of the processor so that the processor can access data
in the cache without delay, whenever the data requested by the processor are
found in the cache. If the requested data are not in the cache, a cache miss
occurs, and a request is made to the main memory for transfer of the requested
data to the cache. If the data currently reside in the main memory, they are
transferred to the cache immediately. If they are not, but are in the secondary
memory, a request is issued to bring the requested data from the backing
storage. Therefore, when the required references to the memory can be captured
by the cache, speed is not degraded. Otherwise, the performance will be
degraded by the time required to transfer data from the main memory to the
cache.
The use of cache memories in modern computer systems is based on the locality
of memory references, both spatial and temporal [7, 9]. Spatial locality refers
to the property that memory accesses over a short period of time tend to be
clustered in space. This type of behavior can be expected based on the common
knowledge of typical program behavior: related data items (variables, arrays,
etc.) are usually stored together and instructions are mostly executed
sequentially. Temporal locality refers to the property that references to a
given location are typically clustered in time. This type of behavior can be
expected from program loops in which both data and instructions are reused.
Therefore, use of a cache memory in a computer system can minimize the
interconnection network traffic between the processor and main memory and speed
up the system, since the access delay of the memory system and the frequency of
references to the slower main memory are highly reduced.
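The two kinds of locality can be made concrete with a toy address trace. The
sketch below assumes an idealized cache that never evicts a line, so the only
misses are first touches of a line; the function name hit_ratio and the three
traces are hypothetical illustrations, and the 32-byte line size is the one
quoted in the abstract.

```python
def hit_ratio(addresses, line_size=32):
    """Fraction of references that fall in a line touched earlier.

    Models an idealized cache with unlimited capacity, so every miss is a
    first reference to a line.
    """
    resident = set()
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // line_size
        if line in resident:
            hits += 1
        else:
            resident.add(line)
    return hits / total if total else 0.0


sequential = [4 * i for i in range(256)]   # consecutive 4-byte words
looped = sequential * 4                    # the same words reused in a loop
strided = [32 * i for i in range(256)]     # one word per line, never reused

print(hit_ratio(sequential))  # spatial locality: 7 of every 8 words hit
print(hit_ratio(looped))      # temporal locality raises the ratio further
print(hit_ratio(strided))     # no locality: every reference misses
```

Sequential access benefits from spatial locality alone, the repeated loop adds
temporal locality, and the strided trace shows what happens when neither is
present.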
2.3 The Basic Structure of Cache Memory
The capacity of cache memory is far smaller than that of main memory; that is,
the address space of cache memory is far smaller than the address space of main
memory. Therefore cache memory requires an address mapping mechanism to
translate the main memory addresses, at a high speed, into the cache memory
address where the copies of data in the main memory reside. Also, because the
most active portions in the main memory are copied in the cache memory, if the
cache memory is full and the associated processor needs data not in the cache
memory, some of the data in the cache will be replaced with the newly requested
data from the main memory. There must exist an algorithm which can predict that
the data to be replaced will not be used in the near future. Since the speed of
the cache memory is the key factor in cache memory design, this kind of
algorithm must be implemented in hardware. Hence, the basic structure of a
cache memory should have at least three basic hardware components: an address
mapping mechanism, a data replacement unit, and storage for the data in the
cache.
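As a rough illustration of what an address mapping mechanism has to compute,
the sketch below splits a 32-bit byte address into a tag, a set index, and a
byte offset. The 32-byte line size comes from the abstract; the figure of 32
sets is an inference (8K bytes divided by 32-byte lines and 8 ways), and the
field order (tag in the high bits, offset in the low bits) and the name
split_address are assumptions made for this example, not the thesis's stated
layout.

```python
LINE_SIZE = 32   # bytes per line, as given in the abstract
NUM_SETS = 32    # assumed: 8K bytes / 32-byte lines / 8 ways

OFFSET_BITS = LINE_SIZE.bit_length() - 1   # 5 bits select a byte in the line
SET_BITS = NUM_SETS.bit_length() - 1       # 5 bits select a set


def split_address(addr):
    """Split a 32-bit byte address into (tag, set_index, byte_offset)."""
    addr &= 0xFFFFFFFF
    offset = addr & (LINE_SIZE - 1)
    set_index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_index, offset


tag, set_index, offset = split_address(0x12345678)
print(hex(tag), set_index, offset)
```

Under these assumptions the directory only needs to compare the tag against
the entries of one selected set, which is what keeps the translation fast
enough to be done in hardware.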
The basic functions of a cache memory can generally be described as follows:
Each reference from the processor to a memory location is presented to the
cache memory. The cache first searches the directory of the address mapping
mechanism to see if the requested data reside in the cache memory. If the
requested data are in the cache, the data are operated on to satisfy the
processor immediately without disturbing the main memory. If the data are not
resident in the cache, a cache miss occurs which will cause the transfer of the
new data from the main memory to the cache. Then the requested data can be
referenced by the processor. Before transferring a new line to the cache, some
data has to be removed from the cache memory to make room for the new. Which
old data in the cache will be discarded is determined by the data replacement
unit. Therefore, the cache-replacement decision directly affects the
performance of the cache. A good replacement algorithm can make the cache have
a somewhat higher performance than a bad algorithm.
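The sequence just described (directory search, hit or miss, and replacement of
an old line on a miss) can be sketched as a small software model. This is only
a behavioral illustration: the class name SetAssociativeCache is hypothetical,
the default of 32 sets is inferred from the 8K-byte, 8-way, 32-byte-line
figures in the abstract, and the least recently used policy is modeled with an
ordered dictionary rather than with the hardware the thesis develops.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Behavioral model of a set-associative directory with LRU replacement."""

    def __init__(self, num_sets=32, ways=8, line_size=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One ordered directory per set; the oldest tag comes first.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Reference one byte address; return True on a hit."""
        line = addr // self.line_size
        index = line % self.num_sets
        tag = line // self.num_sets
        directory = self.sets[index]
        if tag in directory:
            directory.move_to_end(tag)      # now the most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(directory) >= self.ways:
            directory.popitem(last=False)   # discard the least recently used
        directory[tag] = None               # bring the new line in
        return False


# Tiny configuration so the replacement decision is easy to follow.
cache = SetAssociativeCache(num_sets=1, ways=2, line_size=32)
print([cache.access(a) for a in (0, 32, 0, 64, 32)])
```

In the small example the third reference hits, the fourth evicts the line at
address 32 because it is the least recently used, and the fifth therefore
misses again.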
Since a cache memory has a high speed compatible with that of the associated
processor, all the algorithms of a cache memory have to be implemented in
hardware. Therefore, the designers of a cache memory have to consider not only
how to implement its functions but also how to implement these functions with
practical hardware.
Traditionally, a cache system is built with a single cache for both data and
instructions. This cache is called a unified cache, in which case the CPU's
components have only one cache unit to refer to for both instructions and data.
The associated processor shares the same cache for data and instructions, which
makes more efficient use of a limited resource and lowers the average miss
ratios. Also a cache system can be split into two separate caches: one for
data, and the other for instructions. One of the major advantages to splitting
data and instructions into two separate caches is that conflicts between
simultaneous instruction fetches and data reads and writes are eliminated [9].
2.4 The Line Size Choice
The performance of most computers depends strongly on the quality of the cache
design and the way in which it is implemented. Therefore, cache design is a
very significant part of computer system design. In order to design a
high-performance cache memory, there are several choices to be made and
parameters to be set. Designers have to make decisions about the algorithms
(fetch, placement, etc.), about the best sizes (cache size, line size, etc.),
and about the ways of address mapping and maintaining consistency among several
caches in a multiprocessor. Designers also have to make tradeoffs in setting
these parameters, e.g. cache size, line size, the set-associativity, and so on.
Each of these parameters affects cache performance; choosing different
parameters produces different cache performance.
The cache line size is a very important parameter that strongly affects the
cache performance, especially the cache miss ratio [11]. Many surveys of cache
memory and/or memory hierarchy performance have been made for high performance
systems. In these surveys, the cache line size choice, with the overall cache
size, has been shown to strongly affect the cache miss ratio. Smith suggested
in [9] the line size giving the minimum miss ratio for a given cache memory
capacity. He also indicated that the minimum number of elements per set in
order to obtain an acceptable miss ratio is 4 to 8. Beyond 8, the miss ratio is
likely to decrease very little. After a great number of simulations, Smith [11]
presented practical values for the miss ratio as a function of cache size and
line size, which are listed in Table 1. The Design Target Miss Ratios (DTMR)
shown in Table 1 are proposed for unified caches, instruction caches, and data
caches, respectively. The DTMR provide designers with a reference to implement
a variety of new systems. They can be used to estimate the performance impact
of certain design choices. The models of cache memories for the DTMR assume
demand fetch, copy-back caches with an LRU replacement algorithm. They are also
fully associative for address mapping,
Miss Ratio

Cache Type: Unified
Size                           Line Size (bytes)
(bytes)        4         8        16        32        64       128
32         0.717     0.556     0.5       0.75
64         0.686     0.488     0.4       0.48      0.72
128        0.674     0.467     0.35      0.33      0.428     0.686
256        0.643     0.42      0.3       0.258     0.276     0.386
512        0.596     0.39      0.27      0.216     O.Wi      0.257
1024       0.473     0.309     0.21      0.102     0.137     U.I{j!
2048       0.405     0.258     0.17      0.121     0.098     0.093
4096       0.329     0.193     0.12      0.082     0.059     0.05
8192       0.232     0.135     0.08      0.05      0.033     0.025
16384      0.182     0.103     0.06      0.036     0.023     0.016
32768      0.124     0.07      0.04      0.024     0.014     0.009

Cache Type: Instructions
Size                           Line Size (bytes)
(bytes)        4         8        16        32        64       128
32         0.725     0.478     0.33      0.247
64         0.674     0.438     0.3       0.22'.!   0.191
128        0.615     0.397     0.27      0.197     O.UH      0.15;
256        0.592     0.373     0.25      0.177     0.138     0.129
512        0.562     0.348     0.23      0.159     0.119     0.108
1024       0.504     0.308     0.20      0.134     0.098     fl .lnH
2048       0.391     0.234     0.15      0.095     0.068     0.051
4096       0.271     0.161     0.1       0.00:1    0.043     0.032
8192       0.172     0.1       0.06      0.037     0.024     0.016
16384      0.148     0.085     0.05      0.029     0.018     0.012
32768      0.091     0.052     0.03      0.017     0.01      0.007

Cache Type: Data
Size                           Line Size (bytes)
(bytes)        4         8        16        32        64       128
32         0.731     0.611     0.$       0.715
64         0.66      0.515     0.45      0.495     0.694
128        0.561     0.412     0.35      0.351     0.467     0.677
256        0.47      0.337     0.28      0.2n      0.326     0.456
512        0.345     0.246     0.2       O.WI      0.215     0.282
1024       0.283     0.211     0.16      0.138     0.14      0.161
2048       0.256     0.169     0.12      0.094     0.083     0.089
4096       0.247     0.153     0.1       0.07      O .O~1    0.048
8192       0.211     0.129     0.08      0.053     0.039     0.032
16384      0.161     0.097     0.06      0.039     0.026     0.019
32768      0.108     0.065     0.04      0.0'.!5   0.017     0.012

Table 1: The Design Target Miss Ratios

CACHE TYPE ADJUSTMENTS
Cache Type                   Ratio of Miss Rate      Ratio of Miss Rate
                             to Direct Mapping       to Full Associative
direct-mapped                1.00                    1.515
two-way set-associative      0.78                    1.182
four-way set-associative     0.70                    1.061
eight-way set-associative    0.67                    1.015
full-associative             0.66                    1.000

Table 2: The Relevant Cache-mapping-type Ratio
except for those with 4 and 8 byte line sizes, which are 1-way set-associative.
The cache miss ratio is also related to the mapping methods used. There are
three mapping methods: direct-mapped, S-way set-associative, and fully
associative. These are described in the next chapter. Values in Table 2 express
the relative ratios of miss rates based on both the direct-mapped and full
associative mapping methods. These cache type adjustments originally are from
[30]. They are based on the direct-mapped method, and are expanded to be used
for those based on the full-associative method. Since the miss ratios shown in
Table 1 are based on the full associative model, in order to estimate the
actual miss ratio of other systems, the miss ratio can be obtained by
multiplying the given miss ratio found in Table 1 by the corresponding relevant
cache-mapping-type ratio from column three, labeled Ratio of Miss Rate to Full
Associative, of Table 2.
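The adjustment described above is a single multiplication, which the following
sketch works through. The factors are the "Ratio of Miss Rate to Full
Associative" column of Table 2 as read here, and the example entry (0.05 for an
8192-byte unified cache with 32-byte lines) is taken from Table 1; the names
FULL_ASSOC_RATIO and adjusted_miss_ratio are hypothetical, and the choice of
this particular entry is only an illustration, not a statement of how the
thesis derives its own predicted hit ratio.

```python
# Ratio of miss rate relative to a fully associative cache (Table 2).
FULL_ASSOC_RATIO = {
    "direct-mapped": 1.515,
    "two-way": 1.182,
    "four-way": 1.061,
    "eight-way": 1.015,
    "full-associative": 1.000,
}


def adjusted_miss_ratio(dtmr, mapping):
    """Scale a fully associative design target miss ratio to another mapping."""
    return dtmr * FULL_ASSOC_RATIO[mapping]


# Table 1, unified cache, 8192 bytes, 32-byte lines: DTMR = 0.05.
for mapping in ("direct-mapped", "eight-way", "full-associative"):
    estimate = adjusted_miss_ratio(0.05, mapping)
    print(f"{mapping}: miss ratio {estimate:.5f}, hit ratio {1 - estimate:.5f}")
```

For this entry the eight-way estimate differs from the fully associative
figure by only 1.5 percent of the miss ratio, while direct mapping raises it by
roughly half.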
2.5 A Survey of Cache Design
Since IBM Corporation introduced the first commercial cache memory in its
System 360/85 to bridge the speed gap between the processor and main memory,
various cache memories have been employed in different types of computers to
achieve higher performance. A number of approaches have been used for
developing high performance cache memories. Although the operation of a typical
cache memory seems relatively simple in concept, implementation of a realistic
cache memory is quite complex, involving many factors which influence cache
performance. These factors involve internal factors such as cache capacity,
line size, address mapping strategy, fetch algorithm, placement algorithm,
replacement algorithm, as well as the swapping algorithm, and external factors
or system factors: processor organization, hierarchical memory organization, as
well as the interconnection network, such as the system bus. For
supercomputers, synchronization is a more serious problem since at least two or
more processors are embedded in the system. Therefore, attempting to evaluate
cache performance exactly in a realistic computer system is quite difficult. We
can, however, use approximate models for evaluation of cache behavior and
performance.
Cache performance can be described with reference to two aspects [9]: cache
miss rate and access time. The first aspect is cache access time, the time
required for the processor to get information from or store information into the
cache. Cache access time depends not only on the design itself but also on the
technology used in the cache design. Therefore, the effect of design changes on
access time is difficult to predict without specifying the circuit technology
used. The second aspect is the miss ratio of the cache memory, the fraction of
all memory references that attempt to access data which are not resident in the
cache memory. In general, every cache miss makes the processor wait until the
desired data can be received. The miss ratio is related not only to how the
cache design affects the number of misses, but also to how the machine design,
including hardware and software, affects the number of cache references (main
memory references). For example, the cache miss ratio depends on the program
locality implied by software and the amount of information (one word, two words,
etc.) obtained by the processor at a cache reference.
Many computer systems (almost all modern supercomputer and large computer
systems) have cache memories of various designs to bridge the speed gap between
processor and main memory in order to improve system performance. This section
presents a survey of cache memories and their performance in several typical
cache-based computer systems.
A high-speed cache memory was employed in the IBM System 370 Model 168.
The cache was available in a size of either 8K or 16K bytes. The 8K-byte cache
memory had a cycle time of 80 ns (the same as the machine cycle time) for
accessing 4-byte data. It was organized into 64 sets as a 4-way set-associative
cache. The write-through scheme was used for updating the main memory. The average
miss ratio was about 7 percent [27], and the miss ratio prediction, according to the
DTt-IIl, iJlfi.3perce nt
The IBM 3033 has a 64K-byte cache memory for both instructions and data
with a 57 ns cycle time. This large, high-speed cache memory is one of the main
reasons for the high performance of the 3033. This cache is organized into 64
sets as a 16-way set-associative cache. The line size of the IBM 3033 is 64
bytes. The write-through policy is also employed in the IBM 3033. In this system,
the main memory is divided into 8 modules so that main memory can transfer a
line by interleaving [5].
The VAX-11/780 is a 32-bit high-performance minicomputer first introduced
by DEC in 1978. Its cache has 8K byte capacity organized into 512 sets, two lines
per set, and 8 bytes (4 bytes per word) in each line [5]. For the cache memory of
the VAX-11/780, a distinction is made between a read miss and a write miss. If
there is a read miss, the required line has to be retrieved from the main memory
and written into the data cache. If the two lines in the given set are full, some
sort of line replacement strategy has to be employed to determine which line is
swapped with the newly required line. The VAX-11/780 cache memory uses a random
replacement strategy as its policy for updating the line. If there is a miss
caused by a write operation, only the referenced location of the main memory is
updated. This data cache uses a buffered write-through policy. The miss ratio of
the VAX-11/780 was measured to be about 13.05 percent [31], and it is also
estimated to be about 13 percent by the DTMR.
Today cache memories have been integrated with their corresponding
microprocessors on a single chip, giving so-called on-chip cache memories. The
Z80000 microprocessor produced by Zilog in 1985 includes a 256 byte on-chip cache
memory which is organized into 16 lines, 16 bytes each, as a fully associative
cache. The maximum clock frequency for the Z80000 is 25 MHz, and when the Z80000
fetches from its cache, only one system clock cycle is required [28]. The least
recently used (LRU) line replacement algorithm is used to choose the line to be
replaced by the new one from the main memory when a line miss occurs. The
write-through algorithm is used in this cache for its writing strategy. When
there is a miss caused by a write operation, only the main memory is updated.
This cache
1In.<; a miss ra tio er as per cent for a no burst t ransfer mode and 12 perc ent for 1'1
u 1Il'1lLtrausfue 1I101ic 1:.1 !1]. It is predict ed to hnve, n ll a IlU ifiCI[-Cilc!W, n mlss rati o of
:mpCI' cent uslng ti le J)TMH.
A cache memory has also been applied to the Balance multiprocessor system
introduced by Sequent Computer Systems Inc. in 1988 [37]. This multiprocessor
system can pool up to thirty 32-bit processors with a shared main memory. A
subsystem in this system is composed of an NS32032 microprocessor, an NS32081
floating-point unit, and an NS32082 paged virtual memory management unit,
produced by National Semiconductor. In addition, each subsystem has an 8K-byte
two-way set-associative cache memory to achieve high performance while
minimizing bus traffic. In this cache, with a 50 ns cycle time, there are 512
sets, two lines each, and 8 bytes per line. The write-through policy is employed
to keep all the copies in the system consistent. Whenever there is a write
request from one processor in the system, this request with the corresponding
address is sent to update stale data in the shared memory while it is broadcast
to all the caches to see if there are any copies of the data to be updated. If
so, the corresponding cache controller
invalidates the affected hue. T he miss rat io of a single-thread cache memory is IS
pe r cen t [31], while the predict ed miss ra tio Iron, the DTMR is IS.!.!pe r cent.
Since the cache miss ratio is very dependent on the programs that execute on
the cache-based systems, and the models in [9] are ideal (in general, a real cache
memory is more complicated, and there are more factors to be considered), we can
see that our design target miss ratios are slightly higher than seen in the
simulations described above, and close to those from measured results, such as
for the VAX-11/780. This lends some credibility to the use of the DTMR as a
reasonable estimator of cache performance, as noted in [11]. Thus, the set of
design target miss ratios is very useful for the design and implementation of a
possibly new cache architecture. We can also see that the line sizes of the
systems discussed above seem too small. A larger line size provides a lower miss
ratio under a fixed cache size. It is clear that caches using set-associativity
have a lower miss ratio than those using the direct-mapped method. Another
problem is that the above systems which use set-associativity have a small set
size, which affects the cache miss ratios. In addition, almost all existing cache
memories are implemented in either multi-chip or on-chip configurations. In the
case of chip sets, several chips, including a cache controller and several
high-speed static RAM chips, are used to build a cache memory. This kind of cache
memory is designed for special processors and has a fixed cache size. Such caches
do not have much flexibility; for example, the cache size cannot be changed after
the cache controller is designed, and they have a longer delay time between the
cache controller and the RAM chips. An on-chip cache does not have a delay
penalty due to interconnection between the chips of a multi-chip cache memory,
but on-chip caches have the same problem of inflexibility as the multi-chip cache
memories. In addition, this kind of cache in general has only a small capacity
using today's technology, which leads to a higher miss ratio.
Using VLSI technology, we can make tradeoffs to design a novel cache memory
chip with little delay penalty by eliminating the wire-connection delay between
the cache controller and the cache data memory. Multiple uniform cache chips can
be used to build cache systems of various sizes, associated with one processor.
This cache system can be used as a traditional unified cache for both
instructions and data, or as a separate instruction or data cache.
3 IMPLEMENTATION OF THE CACHE
ALGORITHMS
3.1 Cache Design Parameters
Typically, a cache memory system can capture well over 90 percent of all
references to main memory. Optimization of the cache design parameters is very
important to decrease the cost/performance ratio for high-performance cache
memories.
Optimizing the design of cache memory has four aspects [9]:
1. maximizing the hit ratio
2. minimizing the access time to cache data
3. minimizing delay due to a cache miss
4. minimizing the overhead of updating main memory and maintaining cache
coherence
In addition, for cache memories in multiprocessor systems, care has to be taken
to maximize bus and shared memory bandwidth and to minimize the bus bandwidth
required by each processor in order to maximize the system performance. There are
also trade-offs which depend on the technology used to implement the cache; for
example, between hit ratio and access time.
There are many factors to be considered during cache design which affect system
performance. Parameters for cache design are classified into intrinsic and
extrinsic parameters [5]. Effective memory speed and cost are two intrinsic
parameters. Extrinsic parameters, such as hit ratios, control algorithms, etc.,
are selected based on the results of experimental data and simulation, and are
variables which must be considered for the system design.
Of all the considerations which are related to cache memory, the following are
the main ones considered during design, since cache performance is sensitive to
changes concerning these aspects:
1. Fetch policies
2. Mapping policies
3. Replacement policies
4. Swapping policies
5. Hit ratio and access time
6. Cache memory capacity
7. Line size
8. Cache data path width
9. Main memory organization
Fetch algorithms are used to determine when the system brings information
into the cache memory. In general, the major fetch algorithms are demand-fetch
and prefetch. Under the demand-fetch algorithm, a line is fetched only if it is
needed. The prefetch algorithm, on the other hand, gets information before it is
needed. Therefore, the prefetch algorithm is based on some kind of prediction
about which line will be used next. It must be designed carefully if the machine
performance is to be improved rather than degraded [9]. Implementation of a
prefetch algorithm is usually more complicated than demand fetch.
Mapping policies are used to translate the logical address space to the real
address space. Efficient address translation schemes should accomplish address
translation in such a way as to minimize the apparent access time. Information
generally is obtained from the cache associatively; a larger associative memory
is more expensive and slower. Hence, there must be some trade-off of
associativity during cache design, in terms of the design and technologies that
are employed. A mapping such that any of the lines in main memory can be mapped
into any line slot in cache memory is called a full-associative mapping. That
is, a line of main memory may be mapped into any location of the cache memory.
Typically, the length of a line in cache memory is the same as that in main
memory. If the cache memory is full and there is a miss, the requested line can
be transferred from main memory into any line slot of cache memory, in a manner
depending on the replacement policy employed. Thus this mapping provides the
minimum probability of line slot contention and the largest hit ratio for a
given problem. However, having one comparator per address tag makes it very
difficult and costly to implement, especially in a large cache memory.
A direct-mapped cache has only one comparator, which is connected to all the
address tags in cache memory. Each time, only one address tag can be selected
to compare with the address from the processor. This mapping is a many-to-one
mapping. That is, any given line in main memory can reside logically only in one
specified line slot in cache memory. A direct-mapped cache memory mandates
a fixed replacement policy; if there is a line miss, both the cache tag and the
corresponding line are replaced with the requested main memory address and its
line. This mapping has the highest probability of cache memory slot contention
since there is a fixed replacement scheme. Furthermore, it generally has a
relatively low hit ratio. Unlike the full-associative mapping, it is quite
simple and easy to implement.
A third mapping method is S-way set-associative mapping, which is a hybrid
of the direct-mapped and full-associative methods. An S-way set-associative cache
has multiple sets which are selected by direct mapping, and there are S line
slots in each set which can be simultaneously compared with the address from the
processor. In this mapping system, there are S comparators, one comparator for
each "WAY". Set-associative mapping has a reasonable implementation complexity
and hit ratio. Increasing the cache size of a set-associative cache gives a
greater hit ratio than increasing the depth of a direct-mapped system. On the
other hand, increasing the number of sets, or ways, of a set-associative cache
memory also gives a greater hit ratio. Hence many high-performance cache
memories, especially large scale caches, adopt the set-associative mapping
mechanism as a compromise between complexity and performance. More details about
S-way set-associative mapping are given in the next chapter.
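
The lookup in an S-way set-associative directory can be sketched as below. This
is a software illustration only, under assumed names (`lookup`, `directory`) and
an assumed toy geometry of 2 sets and 4 ways; in hardware the S tag comparisons
are made simultaneously by S comparators, whereas the loop here is sequential.

```python
def lookup(sets, set_index, tag):
    """Return the way holding `tag` in the selected set, or None on a miss.

    `sets` is a list of sets; each set is a list of S stored tags, with None
    marking an empty line slot. The set is chosen by direct mapping
    (`set_index`), then the tag is compared against every way of that set.
    """
    for way, stored_tag in enumerate(sets[set_index]):
        if stored_tag is not None and stored_tag == tag:
            return way
    return None


# Hypothetical 2-set, 4-way directory with one resident line.
directory = [[None] * 4 for _ in range(2)]
directory[1][2] = 0x3A
```

Setting S to 1 in this sketch gives the direct-mapped case (one comparison per
lookup), and using a single set gives the full-associative case.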
An optimal replacement policy would predict the line in cache memory (or in a
given set) which will be used furthest in the future and which consequently
should be discarded when the cache memory (or a given set in cache memory) is
full and a cache miss occurs. This policy would keep the data in the cache
optimized for the highest hit ratio and the maximum system throughput. However,
this optimal replacement policy cannot be implemented since it requires a
prediction of the future behavior of the running programs. Therefore, some
approximation has to be made. There are three types of practical replacement
algorithms commonly used for cache memory systems to approximate this function:
first-in first-out (FIFO), random, and least recently used (LRU) line
replacement. The FIFO algorithm is based on the principle that the line brought
in first is predicted to be the line in cache memory (or in a given set) that
will be used furthest in the future, and this line is replaced by the new one
from main memory. This algorithm does not reflect the program locality very well,
since the first line may be used frequently, but it is easy to implement. The
random scheme uses a random number from a random number generator as the line
number of the line which is replaced by a new one whenever there is a
replacement need. A cache memory employing this algorithm typically has a low hit
ratio since this algorithm is not able to reflect the program locality. The least
recently used line replacement algorithm, which looks backward (at the past), is
usually able to reflect the program locality well since it is based on
historical line usage. That is, the least used line in the recent past is
replaced by the requested line from main memory. Since this algorithm requires
more stored information about the past, it is more difficult to implement in
hardware, especially in a large scale cache memory. A variation, an
approximation of the LRU algorithm, can be used to simplify the hardware
implementation. This variation is based on the fact that if a line has not been
referenced over a certain time period, it is less likely to be needed next than
lines in cache memory (or in a given set) that have been referenced in that
period.
More details of the least recently used line replacement algorithm are described
in the next section. No single best algorithm exists among the practical
replacement algorithms [8]. Each algorithm, compared with the others, is better
for particular classes of problems and poorer for other classes. However, in
general, the LRU algorithm is clearly the best choice for most applications,
since it is based on historical line usage (the recent past appears to be a good
estimate of the near future), it works well, and it increases the hit ratio when
the number of lines is increased.
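
The LRU policy for a single set can be sketched in software as follows. This is
a minimal illustration, not the hardware design of this thesis: the class and
function names are assumptions, and an ordered dictionary stands in for the
usage history that a hardware LRU unit would keep.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with a fixed number of line slots and LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        # Keys are line tags, ordered from least to most recently used.
        self.lines = OrderedDict()

    def access(self, tag):
        """Reference `tag`; return True on a hit and False on a miss."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # now the most recently used line
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)   # discard the least recently used line
        self.lines[tag] = None
        return False


def count_hits(ways, references):
    """Count hits for a sequence of line tags referenced in one set."""
    cache_set = LRUSet(ways)
    return sum(cache_set.access(tag) for tag in references)
```

For the assumed reference sequence 1, 2, 1, 3, 2, a set with two line slots
gives one hit, while a set with three line slots gives two, which illustrates
how the hit ratio under LRU can rise as the number of lines is increased.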
Swapping algorithms are designed for transferring a new line from the main
memory to the cache when the requested information is not in the cache.
Typically, there are two kinds of swapping algorithms: write-through and
copy-back. In the write-through scheme, a processor write to cache memory is
immediately written through to main memory as well. Therefore, the information in
both cache memory and main memory is always consistent. Furthermore, in a
multiprocessor environment, it can handle multiple-cache coherence in an easy
way. Unlike the write-through scheme, the copy-back scheme (in the absence of a
line miss) only updates the copies of requested data in cache memory without
disturbing main memory. Whenever there is a line miss, cache memory copies the
line to be overwritten back to main memory before transferring the requested
line to cache memory. This can reduce traffic between cache memory and main
memory. However, it requires more complicated logic, and there is a coherence
problem between cache memory and main memory, and potentially between multiple
caches in a multiprocessor system. In contrast, the write-through method has
higher traffic between cache memory and main memory, since write operations vary
from 10 percent to 30 percent of total references, depending on the processor
architecture and the particular set of applications. The average percentage of
write operations in [9] is 16.
The hit ratio for a cache memory is defined as the probability, or the fraction
of times, that a memory request is found in cache memory. If we define the
probability of all the references to memory as 1, the miss ratio of cache memory
is (1 - hit ratio). The hit ratio for a cache memory is one of the most
important factors for the performance evaluation of cache memory. Other important
factors affecting the cache performance are the access time for the cache
memory, including the time to search the directory, and the cache memory cycle
time, which is defined as the time the processor takes to access information in
cache memory. The access time of a cache memory is affected not only by the
architecture, or design (including all the algorithms and parameters selected in
cache design and the implementation of the algorithms in hardware), but also by
the technology adopted (bipolar, CMOS, etc.).
The cache capacity is usually dictated by many factors having to do with the
system cost and performance. In general, a large cache capacity can produce
a higher hit ratio, and in turn better performance. However, there are some
limitations on cache size, beyond which cache memory either has a high cost or
its performance decreases due to the long access time.
The line size of cache memory is one of the most important parameters which
sensitively affect cache performance. There are a number of trade-offs for a
reasonable line size in terms of architecture and technology. Using VLSI
technology, a larger line size is preferred because it achieves a lower miss
ratio without much extra cost. But if it is too large, it increases line transfer
time and, in turn, decreases system speed even if the hit ratio is increased.
The choice also depends on the data path width between cache and main memory.
The cache data path width must be considered during the design process since it
directly determines the time required when a line is transferred from main
memory to cache memory. From the point of view of performance, the cache data
path should be as wide as possible. It is clear, however, that the cache data
path is expensive. Doubling the path width means doubling the number of lines in
and out of the cache and all the associated circuitry. The path width is
critically important to caches implemented using VLSI technology because of the
limited number of I/O pins on a chip. Hence, a trade-off of the cache data path
width has to be made during the cache design to achieve a reasonable
cost/performance ratio.
Although the use of cache memory in computer systems can greatly reduce
direct references to main memory, memory traffic is still a very significant
performance factor, especially in a multiprocessor system. Memory traffic
consists of two components: fetch traffic and write-through or copy-back traffic.
The fetch traffic arises from the transfer of data from the main memory to the
cache, while the write-through or copy-back traffic is from the cache to the
main memory. The fetch traffic can be obtained by multiplying the miss ratio by
the line size to get traffic in bytes/reference. The write-through traffic can
be calculated by multiplying the write ratio (the ratio of writes to total
references) by the number of bytes per write operation. Similarly, the copy-back
traffic can be determined by multiplying the miss ratio by the line size, since
a line miss causes an existing line in the cache to be written into the main
memory before the requested missing line is transferred to the cache. For
evaluation of a cache-based multiprocessor system with a single bus, the bus
utilization can be used to estimate the memory traffic. The bus utilization is
defined as the ratio of time spent doing useful work to the total run time of
the bus.
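
The traffic estimates above are simple products and can be sketched as below.
The function names are assumptions made for this illustration, and the sample
inputs (a 5 percent miss ratio, 32-byte lines, a 16 percent write ratio, 4 bytes
per write) are assumed values, not measurements from this thesis.

```python
def fetch_traffic(miss_ratio, line_size_bytes):
    """Bytes per reference moved from main memory to the cache."""
    return miss_ratio * line_size_bytes


def write_through_traffic(write_ratio, bytes_per_write):
    """Bytes per reference written through from the cache to main memory."""
    return write_ratio * bytes_per_write


def copy_back_traffic(miss_ratio, line_size_bytes):
    """Bytes per reference copied back, with one line written per line miss."""
    return miss_ratio * line_size_bytes


# Assumed example inputs.
total_write_through = fetch_traffic(0.05, 32) + write_through_traffic(0.16, 4)
total_copy_back = fetch_traffic(0.05, 32) + copy_back_traffic(0.05, 32)
```

With these assumed inputs, the write-through total is about 2.24 bytes per
reference and the copy-back total about 3.2; the comparison depends entirely on
the miss ratio, line size, and write ratio chosen.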
Since decreasing memory traffic or transfer time during a line miss can increase
the system performance, optimization of the organization of both the main memory
and the interconnection network is a key factor for high system performance and
low cost. For the interconnection network, a wide data path can reduce the
transfer time, but the cost is much higher. On the other hand, if the main
memory is made up of several modules which can operate independently, traffic
can be reduced because more than one module can be busy writing at one time.
Furthermore, if the modules can transfer different words in a line by
interleaving, transferring a line from main memory to cache only takes one main
memory cycle. Thus, the main memory bandwidth can be increased while the
transfer time is greatly decreased.
3.2 The Structure of the Cache Memory
During the design of the cache memory described here, the algorithms and
parameters used have been selected carefully, and a number of trade-offs between
them have been made in order to achieve high performance. The cache memory system
described here is implemented as a single chip. Furthermore, this implementation
allows a cache of variable capacity (larger than the capacity of a single cache
chip) by using several of the cache memory chips. The single-chip cache memory
described here has a capacity of 8K bytes because of silicon area limitations
for the 3 micron CMOS technology. The word size for this cache memory is 32 bits
since the cache is designed for a 32-bit computer system. A word is not
necessarily the smallest unit that the processor can access. The processor can
directly access 1, 2, 3, or 4 bytes from the cache. This provides more
flexibility to computer systems in which the cache is used. It also allows for
the possibility that this cache can be used in 16-bit computer systems, provided
the control signals for the cache can be connected with those of the processor
with reasonable additional logic. Two clock phases, CK1 and CK2, are employed to
pipeline this system. Each of the clock phases has a minimum cycle period of 36
nanoseconds (derived from simulation) in which the associated processor can read
an instruction or data from the cache.
Figure 3: The Basic Cache Memory Structure
Fig. 3 depicts the structure of this cache memory. It is composed of four basic
components as follows: the Address Translation Function or Directory, the Line
Replacement Unit (LRU), the Cache Memory, and the Control Unit.
During CK1, the address from the processor is latched in the address register
of the cache, and then it is sent to the directory to see if the line containing
the requested data is in the cache. If so, the line number generated by the line
number generator, the set number from the address register, and the word offset
in the line are all combined to form a word address for the cache memory and
latched into the cache memory register. In addition, the proper byte(s) can be
accessed by the processor using both the two least significant bits of the
address from the address register and two function bits from the processor,
which will be described later. Meanwhile, the LRU unit is updated to indicate
that the line referenced is the most recently used one in the given set. During
CK2, a read/write operation is done. If the requested data is not in the cache,
a line miss occurs. The LRU unit is asked to send the least recently used line
number in the specified set to the directory, and the directory uses this number
to locate its corresponding line slot in the specified set of the directory.
Then the contents of this slot are replaced with the group number and the line
number in the address register. After replacement, the line number generator
gives the line number corresponding to this cell to the memory register to
transfer the requested line from the main memory to the cache. To obtain the
line of information from main memory, a line miss signal is sent to main memory.
After the cache receives a "bus use" grant from the system bus controller, the
processor is forced to be idle during the transfer. There are 8 words (32 bytes)
to be transferred from the main memory to the cache memory during a line miss,
which would normally take a long time and in turn decrease system performance.
In order to reduce the line transfer time, the main memory may be partitioned
into several modules, or "interleaved" (in this case, the memory should be
partitioned into 8 modules). Whenever there is a line miss, the 8 words of the
requested line can then be sent to the cache memory almost simultaneously, with
each word coming from a separate module of the main memory. Thus, the transfer
time can be greatly decreased. The main memory organization will be discussed
later in chapter 6.
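
The effect of partitioning main memory into 8 modules can be illustrated with a
short sketch. It assumes low-order interleaving, in which consecutive 4-byte
words are placed in consecutive modules; this placement rule and the function
names are assumptions for illustration and are not taken from the memory
organization discussed in chapter 6.

```python
WORDS_PER_LINE = 8   # 32-byte line, as described above
BYTES_PER_WORD = 4


def module_for_word(byte_address, modules=8):
    """Module holding the word at `byte_address` under low-order interleaving."""
    return (byte_address // BYTES_PER_WORD) % modules


def modules_for_line(line_base_address, modules=8):
    """Modules supplying the 8 words of the line starting at `line_base_address`."""
    return [
        module_for_word(line_base_address + i * BYTES_PER_WORD, modules)
        for i in range(WORDS_PER_LINE)
    ]
```

Under this assumption, every word of a line-aligned 32-byte line comes from a
different module, so the 8 modules can be accessed in the same memory cycle.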
3.3 The Address Space Mapping
Since the cache, as the fastest part of the memory hierarchy, is much smaller
than the main memory, there has to be a mapping function between the cache
address space and that of the main memory. As discussed previously, the
direct-mapped method is the simplest to implement, but it has the highest miss
ratio of the three mapping methods. Therefore, it is not used in this
application. The fully-associative method requires one comparator per line slot
in the directory. This is costly to implement in a large scale cache memory.
Also, it may introduce an extra tag-search delay and make the search logic
complicated. The set-associative method, a hybrid of the direct-mapped method and
the fully-associative method, is used in this mapping mechanism. It involves
organizing the cache memory into N sets of S lines per set. When N becomes one,
the cache is a fully-associative cache in which a single set holds all S lines.
If S becomes one, the organization of the cache is the direct-mapped cache
memory, with N sets each consisting of a single line. Since an S-way
set-associative cache allows any one of the S lines in a referenced set to be
replaced on a line miss, this flexibility usually gives a lower miss ratio
without the complexity of a fully-associative cache. Therefore, it is a
compromise between complexity and performance.
Figure 4: The Set-associative Mapping
3.4 The Set-associative Mapping
The principle of set-associative mapping is shown in Fig. 4. The cache memory is
divided into 2^q sets with 2^s line slots in each, and the sizes of the sets and
lines in the cache memory are the same as those in the main memory. Furthermore,
the main memory is partitioned into several groups, and the size of each group
is equal to the size of the cache memory. Hence, each group contains 2^q sets.
Each set slot in the cache memory must be shared by several sets of the main
memory. For example, in Fig. 4, the first set in the cache memory is assigned to
hold the sets 1, 1 + 2^q, 1 + 2 x 2^q, ... of the main memory, the second set is
assigned to hold the sets 2, 2 + 2^q, 2 + 2 x 2^q, ..., and so forth. Lines
within a set of the main memory are associatively mapped into any of the 2^s
line slots in the corresponding set of the cache memory. That is, any set in the
main memory can only be directly mapped to a specific set of the cache memory,
and lines in a set are associatively mapped into any of the 2^s line slots in
the corresponding set. Sets from different groups can be intermixed within the
cache memory; therefore not all the sets of a given group need to be
simultaneously resident in the cache memory. Similarly, lines in the sets from
different groups which are mapped into the same set of the cache memory can also
be intermixed within that set of the cache memory. E.g., line 1 of set 1 of
group 1 can be assigned to line slot 2 of set 1 in the cache memory, and line 2
of set 1 of group 2 can also reside in line slot 1 of set 1 at the same time.
In the Fig. 'I, we nUl see 1Iiata memory addr ess is 11ivilled lnto Ionr IIlll'ls: 1/1/
represents t it.?grou p number, 'lis t ile set 1l1l1ll11Cr, 5 is tllc line 1l11111hl'Tatll l () is till'
:IG
word olrsel within a Iinc
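The address partition described above can be illustrated with a short sketch. The field widths (19 bits for nd, 5 bits for the set number, 3 bits for s, 5 bits for the offset within a 32-byte line) follow the figures quoted elsewhere in this chapter; the packing order of the fields within the 32-bit word and the names split_address and tag_of are assumptions made only for illustration.

```python
from typing import NamedTuple

# Field widths taken from the text: 19-bit group number (nd), 5-bit set
# number (q, 32 sets), 3-bit line number (s), 5-bit offset (32-byte lines).
ND_BITS, Q_BITS, S_BITS, OFFSET_BITS = 19, 5, 3, 5


class AddressFields(NamedTuple):
    nd: int      # group number
    q: int       # set number
    s: int       # line number within the set
    offset: int  # byte offset within the line


def split_address(addr):
    """Split a 32-bit address into (nd, q, s, offset).

    ASSUMPTION: the fields are packed high-to-low as nd | q | s | offset.
    """
    if not 0 <= addr < (1 << 32):
        raise ValueError("address must fit in 32 bits")
    offset = addr & ((1 << OFFSET_BITS) - 1)
    s = (addr >> OFFSET_BITS) & ((1 << S_BITS) - 1)
    q = (addr >> (OFFSET_BITS + S_BITS)) & ((1 << Q_BITS) - 1)
    nd = addr >> (OFFSET_BITS + S_BITS + Q_BITS)
    return AddressFields(nd, q, s, offset)


def tag_of(fields):
    """Return the 22-bit directory tag: nd concatenated with s (nd+s)."""
    return (fields.nd << S_BITS) | fields.s
```

Under this assumed layout, the set number selects a directory row and the 22-bit value returned by tag_of is what a line slot would store and compare.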
3.5 Implementation of the Directory
In general, increasing the degree of associativity of a cache memory can decrease the
miss ratio of a cache. In order to obtain high performance, an 8-way set-associative
mapping is employed in this directory to achieve a high hit rate without the extra
delay penalty while searching the directory. This directory can map the main
memory address space (32 bit) to the cache memory address space (13 bit) in a
maximum of 16 nanoseconds, including the delay of the LRU unit, as determined by
simulation. Fig. 5 shows the organization of the address mapping directory with
set-associative mapping. When the processor requests a read/write operation, the
logical address is mapped into the cache memory address by searching the directory.
This directory has a tag array of 32 sets with 8 line slots in each set. Each set is
represented by a row of the tag array and the 8 line slots in a given set are indicated
by columns of the tag array. Therefore, this directory is an 8-way set-associative
directory in which each column represents one way.
Figure 5: The Directory
There is a MATCH signal for each column of slots. If the signal is set, the
requested line slot is in this column of slots. Among the 8 MATCH signals,
each of which connects to a column of the tag array, there is only one MATCH
signal valid at any time since only one slot may be selected. All the MATCH
lines are connected to a line number generator. The line number generator can
translate the ith column number, at which the corresponding MATCH signal is
valid, into a binary number i forming the requested line number for the cache
memory. That is, the slot at the position in the ith row and the jth column can
produce the line number j of the jth line of the ith set of the cache memory via the
line number generator. In each line slot, the group number is concatenated with
the line number (nd+s) of a logical address, which indicates that the specified line
of the main memory resides in the cache line indicated by the line slot. Whenever
a requested set is selected through the 32-bit directory decoder after an address is
latched into the address register, the contents of the 8 line slots in the selected set
are simultaneously compared with nd+s of the address register. If the contents of
any one of the 8 line slots are the same as nd+s, then the requested data are in the
line of the selected set, and the corresponding MATCH signal should become valid
to make the line number generator produce the requested line number for the cache
memory. A HIT flag is also generated by the line number generator to indicate
that the requested data are in the cache. The line number is combined with the
set number and the word offset in the line from the address register to form the
required word address of the cache memory. Meanwhile, the HIT flag informs the
LRU unit to update the records in the selected set. Otherwise, a MISS flag is set
to indicate that the requested data are not in the cache, at which time there are
three tasks to be done:
• Invoke the LRU replacement unit to find the least recently used line in the
selected set of the cache. This line will be replaced by a new one when the
requested data are available from the main memory.
• Inform the CPU that it must be idle during the line replacement.
• Request the main memory to transfer the required line to the cache.
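The search and the miss handling listed above can be sketched behaviourally as follows. This is a software illustration only, not the circuit: the class Directory, the function access, and the pick_victim callback (standing in for the LRU replacement unit) are hypothetical names, and stalling the CPU and fetching the line from main memory are not modelled.

```python
NUM_SETS, NUM_WAYS = 32, 8  # 32 sets of 8 line slots, as in the text


class Directory:
    def __init__(self):
        # tags[set][way] holds a 22-bit nd+s value; None models a cleared
        # valid bit (slot cannot match anything).
        self.tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

    def search(self, set_number, tag):
        """Compare all 8 slots of the selected set; return (flag, way)."""
        row = self.tags[set_number]
        matches = [way for way, stored in enumerate(row) if stored == tag]
        if matches:
            return "HIT", matches[0]
        return "MISS", None

    def replace(self, set_number, way, tag):
        """Overwrite one line slot with the new nd+s value."""
        self.tags[set_number][way] = tag


def access(directory, set_number, tag, pick_victim):
    """Search the directory; on a miss, replace the slot chosen by pick_victim.

    pick_victim(set_number) -> way is a HYPOTHETICAL stand-in for the LRU unit.
    """
    flag, way = directory.search(set_number, tag)
    if flag == "HIT":
        return flag, way
    victim = pick_victim(set_number)          # task 1: ask the LRU unit
    # Tasks 2 and 3 (idle the CPU, request the line from main memory)
    # are hardware actions and are intentionally left out of this sketch.
    directory.replace(set_number, victim, tag)
    return flag, victim
```

A first access to an empty set reports a miss and fills the slot returned by pick_victim; repeating the same access then reports a hit in that slot.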
Fig. 6 shows a simulation for the tag array of the directory for the case where
a line miss occurs during a read/write operation, and the address residing in slot i
of row j is replaced with that in the address register. Signals B0 to B21 represent
the nd+s from the address register. SELj is a signal from the decoder to select
row j of the directory. WORDi is a signal from the LRU unit to update the
line slot in row j and column i of the tag array during a line miss, and HITi
is the ith MATCH line of the directory in Fig. 5, indicating whether or not the
corresponding column is matched. After the tag array is reset by RES, a miss
(a low HITi) is produced since the jth row of tags selected by SELj is empty. The
signal WORDi from the LRU unit updates the slot in row j and column i of the
directory. After a delay of 6 nanoseconds, the HITi signal becomes high to indicate
that the updating has been finished. When the signals on B0 - B21 are changed to
a new address and the new address is not found in the directory, the HITi
signal becomes low after a 3 ns delay.
3.5.1 The Line Slot of the Directory
As described before, the directory is composed of line slots. Each of the line slots
is used to store the group number and line number (nd+s) of a given main memory
address, which indicates that the corresponding line from the main memory is in
the cache memory. For each line slot, there is a 22-bit built-in comparator to be
used for parallel comparison of its contents with the nd+s bits of a main memory
address. The organization of a line slot (or line tag) is shown in Fig. 7. There
are 22 bits for nd+s (19 bits for nd and 3 bits for s) and one valid bit. If the
valid bit is reset, the content of this slot cannot be compared with nd+s and the
directory simply sets a line miss. Otherwise, if ROWSEL from the inverted driver
connected to the 32-bit decoder is at logical 0, the contents of the line slot are
compared with nd+s to see if the requested data reside in the cache. If all the
bits of a line slot match the nd+s on the bit-line pairs BIT0 through BIT21 (each a
true line and its complement), the MATCHi signal for the line slot becomes high to
turn on the N-type device so that the COLMATCH signal is pulled low. If any bit
of the line slot does not match the corresponding bit of nd+s, the MATCHi signal
for the line slot remains low so that COLMATCH is high. If there is a line miss
(there is no line slot matching nd+s in a selected set) and the line slot in Fig. 7 is
chosen by the LRU unit as the least recently used line slot, the LRU selection signal
LRUSEL becomes high. Since at this time ROWSEL turns on the pass gate, this
line slot is replaced with nd+s so that the missing line will be transferred from the
main memory into this cache line.
[Figure 6: simulation waveforms of the tag array; traces hit, hitn, b21 to b0, seln, sel, wordn, word]
Figure 7: The Directory Tag
Figure 8: The Tag Bit and Valid Bit of the Directory ((a) A Tag Bit of the Directory)
Fig. 8 (a) shows the circuit for one line slot bit. If the ROWSEL line, which
is connected to one output of the 32-bit decoder through an inverted driver, is at
logical 1 (meaning this line slot is not selected by the decoder), the MATCHi line
remains low so that no comparison can be done in any case (it can be considered
to mean "not-match"). If ROWSEL becomes low, this cell becomes a normal
content addressable memory cell. For a comparison operation, DATA is applied
on the BIT line and its complement on the complementary BIT line. If the datum
matches the value in the bit, then the match transistor A remains turned off so that
MATCHi is high for this bit. In a line slot, all the MATCHi lines are cascaded
together. If any bit of the line slot does not match the value on the bit input, the
match transistor for this bit pulls down the MATCHi line of this line slot, previously
precharged by the validation bit, indicating that this slot does not match the nd+s.
For a replacement operation, if there is no slot match at all in the selected set, a
line miss has occurred. The LRU unit determines which line slot is to be updated
with nd+s by asserting the LRUSEL line for the corresponding line slot in the set
selected by ROWSEL. As shown in Fig. 8 (a), when the LRUSEL line is asserted,
the values on BIT and its complement change the state of this bit. After a change
of the contents of this line slot, the contents of the line slot always match nd+s.
Thus, COLMATCH becomes low to cause the line number generator to create a
miss line number to transfer the requested line from the main memory to this cache
line. Note that all the COLMATCH lines in one column of the directory are
wire-ORed. This means that there is only one MATCH line for each column of the
directory. If any one of the COLMATCH signals in a column switches to a low
value, the MATCH line of this column becomes low to indicate that one of the line
slots in this column matches the nd+s. The exact position of the line slot in this
column is located by the 32-bit directory decoder. For this 8-way set-associative
directory, there are 8 MATCH lines connecting to the line number generator, which
produces in turn the corresponding line number for the cache memory.
Fig. 8 (b) demonstrates the function of the valid bit. The signal BIT is set to
logical 1 while its complement is set to logical 0. After applying a reset signal to
the RESET line, the input of inverter B becomes high to discharge the MATCHi
line so that the line slot cannot be compared with the nd+s. If the LRUSEL
line is asserted, the input of inverter B becomes low so that the MATCHi line
is charged to make this slot active for comparison when ROWSEL is low. (This
means that the set to which this slot belongs is selected.) In this case, the line slot
is valid for searching.
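The behaviour of one line slot (22 tag bits plus a valid bit) can be summarised in a small model. This is a logic-level abstraction under stated assumptions: signals are treated as plain booleans rather than as the active-low ROWSEL and precharged MATCH lines of the circuit, and the class and method names are hypothetical.

```python
class LineSlot:
    """Behavioural model of one directory line slot (tag + valid bit)."""

    TAG_BITS = 22  # nd+s: 19 bits of nd and 3 bits of s

    def __init__(self):
        self.valid = False  # cleared valid bit: slot never matches
        self.tag = 0

    def reset(self):
        """Model the RESET line: invalidate the slot."""
        self.valid = False

    def match(self, row_selected, tag):
        """True only if the row is selected, the slot is valid,
        and all 22 stored bits equal the presented nd+s bits."""
        if not (row_selected and self.valid):
            return False
        mask = (1 << self.TAG_BITS) - 1
        return ((self.tag ^ tag) & mask) == 0

    def write(self, row_selected, lru_selected, tag):
        """Model a replacement: the slot is overwritten only when both
        the row select and the LRU select are asserted; this also
        sets the valid bit so the slot takes part in later searches."""
        if row_selected and lru_selected:
            self.tag = tag & ((1 << self.TAG_BITS) - 1)
            self.valid = True
```

The model makes the two roles of the valid bit explicit: a reset slot can never report a match, and a slot becomes searchable only after it has been written through an LRU-selected replacement.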
3.5.2 The Address Register
The address register is used to latch addresses from either the associated processor
or the system address bus in the case of a multiprocessor system. Fig. 9 shows
one bit of the address register. During the ALE (Address Latch Enable) period
for the processor (the period when the address becomes stable), the address from
the processor, imposed on inputs Ai of the register, is latched in this register.
When the cache receives a search interrupt SEARCHINT from other caches, the
clock pulse generator of the control unit produces a latch pulse. During this pulse,
the address imposed on the Ai lines from the other cache through the system address
bus is latched in this register for an update operation. After being updated, the
[words lost in the source] a pulse generated by the clock pulse generator. At
initialization, the RES signal from the processor resets the register. This register is
composed of 32 D-type rising-edge-triggered flip-flops. Fig. 10 (a) shows the logic
circuit for this D-type flip-flop. Note that the delay until an address is valid is
merely the signal propagation time in the D flip-flop. Fig. 10 (b) is the timing
diagram for the D-type flip-flop.
Figure 9: One Bit of the Address Register
Figure 10: The D-type Rising-edge-triggered Flip-flop
Figure 11: The 32-bit Directory Decoder
3.5.3 The 32-bit Decoder for Set Selection
An address decoder is an essential component for set selection in the set-associative
directory. A decoder has n inputs and 2^n outputs. One and only one output will
have a value of logical 1 for each combination of input values. In principle, a
one-level decoder could be built for any number of inputs using 2^n gates with n
inputs. Unfortunately, in practice, the fan-in limitations and propagation delay
require that a large decoder be organized into a multilevel network.
This 32-bit decoder is used to decode the 5 bits of the set number from the address
register. The inputs of the decoder are the 5-bit set number and its complement
directly from the address register. The outputs have 32 bits and at any time only
one bit has a value of logical 1. The decoder consists of a 16-bit decoder and
32 2-input NOR-gates as shown in Fig. 11. A 2-input NOR-gate is preferred for
the last stage in the multi-level network to allow fast rise time (WI.). Address bits
A0 - A3 and their complements are imposed on the input lines of the 16-bit
decoder, while address bit A4 and its complement are each sent to 16 of the 2-input
NOR-gates at the second stage of the decoder. Outputs D0 to D31 of the 32-bit
decoder are sent to the directory to select the corresponding set. The 16-bit decoder
is illustrated in Fig. 12 (a), where 4-input NOR-gates are employed. The 4-input
NOR-gate is implemented with pseudo-nMOS logic as shown in Fig. 12 (b).
There is only a single p-type transistor in the circuit, with the gate connected to
Vss. Use of pseudo-nMOS technology can decrease the area and delay time of the
decoder since the number of slow cascaded p-devices is reduced from four to
one. The transistor sizes are ratioed carefully to ensure correct logic function and
high speed. A simulation for the 32-bit decoder is shown in Fig. 13, in which the
time delay is approximately one nanosecond.
Figure 12: The 16-bit Decoder
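The two-level organization can be checked with a gate-level sketch: a first stage of sixteen 4-input NOR functions on A0 - A3 and their complements, followed by thirty-two 2-input NOR functions that combine each first-stage result with A4 or its complement. The inversion of the first-stage outputs before the second stage is an assumption made so that the NOR stages compose correctly; the function names are hypothetical.

```python
def decode16(a3_a0):
    """First stage: 4-to-16 decode built from 4-input NOR behaviour.

    NOR gate k receives A_j where bit j of k is 0 and the complement of A_j
    where bit j of k is 1, so its output is 1 only when a3_a0 == k.
    """
    outputs = []
    for k in range(16):
        inputs = [((a3_a0 >> j) & 1) ^ ((k >> j) & 1) for j in range(4)]
        outputs.append(int(not any(inputs)))  # NOR of the four inputs
    return outputs


def decode32(set_number):
    """Second stage: 32 two-input NOR gates produce D0..D31 (one-hot)."""
    low = decode16(set_number & 0xF)
    a4 = (set_number >> 4) & 1
    outputs = [0] * 32
    for k in range(16):
        # ASSUMPTION: the first-stage output is inverted before it
        # reaches the second-stage NOR gates.
        n = 1 - low[k]
        outputs[k] = int(not (n or a4))             # NOR(n, A4): sets 0..15
        outputs[16 + k] = int(not (n or (1 - a4)))  # NOR(n, not A4): sets 16..31
    return outputs
```

For every 5-bit set number exactly one of the 32 outputs is 1, which is the one-and-only-one property stated above.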
3.5.4 The Line Number Generator
The circuit for the line number generator is illustrated in Fig. 14. The function
of the line number generator is to translate the line numbers of the cache memory
from "one-hot code" (or hot code), composed of 8 MATCH lines from the directory,
into binary code. In a "hot code" number, there is at most one bit at logical 1;
Table 3 is the truth table for the 8-bit "hot code" and its corresponding binary
code. Each bit of the complementary "hot code" corresponds to one of 8 match lines
ColMatch0-ColMatch7 from the directory. If all 8 bits of the "hot code" are zeros,
then none of the match lines is at logical 0. In this case, the line number generator
produces a LINEMISS. If any one of the ColMatch signals is logical 0 after
comparison, the generator turns out the corresponding line number of the cache
memory in 3-bit binary code on the output lines LINE0 to LINE2; meanwhile,
HIT is asserted. This binary line number is latched into the line register of the
memory register during the ALE period immediately following. The layout of this
circuit is shown in Fig. 15. A simulation for this circuit is shown in Fig. 16, in
which the time delay for a valid signal is seen to be about 2 nanoseconds.
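The translation performed by the line number generator can be sketched as a function on the 8 active-low ColMatch values. The list indexing (element i is ColMatch i) and the function name are illustrative assumptions; the sketch models the logic only, not the circuit of Fig. 14.

```python
def line_number_generator(col_match):
    """Translate 8 active-low ColMatch lines into (hit, line_number).

    col_match[i] == 0 means column i matched. Returns (False, None) for a
    line miss, otherwise (True, i) where i is the 3-bit line number.
    """
    if len(col_match) != 8:
        raise ValueError("expected 8 ColMatch values")
    matched = [i for i, level in enumerate(col_match) if level == 0]
    if not matched:
        return False, None  # LINEMISS: no match line is at logical 0
    if len(matched) > 1:
        # The directory guarantees at most one matching column per set.
        raise ValueError("more than one column matched")
    return True, matched[0]


def line_bits(line_number):
    """Format a line number as the 3-bit binary code of Table 3."""
    return format(line_number, "03b")
```

With only ColMatch5 low, the function reports a hit on line 5, whose binary code is 101.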
From the simulations, the total delay before a valid cache address is turned out
by the directory for both the miss and hit situations is shown in Table 4. (The
abbreviation LNG stands for the Line Number Generator.) If the requested data
reside in the cache, the required delay of the directory is about 14 ns (Decoder +
Tag Array [comp.] + LNG + LRU [update]). Since the updating operations of the
LRU unit can overlap with the directory search for the next read/write operation
(note that the cache is pipelined), the actual delay can be less than calculated above.
If there is a line miss, the total delay is at least 16 ns (Decoder + Tag Array [comp.]
+ 2 LNG + LRU [sendback] + Tag Array [update]).
Figure 13: Simulation of the 32-bit Decoder
Figure 14: The Line Number Generator
Figure 15: Layout of the Line Number Generator
Figure 16: Simulation of the Line Number Generator
Hot Code            Binary Code
0 0 0 0 0 0 0 0     Line Miss
0 0 0 0 0 0 0 1     0 0 0
0 0 0 0 0 0 1 0     0 0 1
0 0 0 0 0 1 0 0     0 1 0
0 0 0 0 1 0 0 0     0 1 1
0 0 0 1 0 0 0 0     1 0 0
0 0 1 0 0 0 0 0     1 0 1
0 1 0 0 0 0 0 0     1 1 0
1 0 0 0 0 0 0 0     1 1 1
Table 3: The Truth Table Relating the "Hot Code" and the Binary Code
Operations For A Line Hit              Operations For A Line Miss
Component Name     Delay Time (ns)     Component Name     Delay Time (ns)
Decoder                  1             Decoder                  1
Tag Array  comp.         3             Tag Array  comp.         3
           update        -                        update        6
LRU        sendback      -             LRU        sendback      2
           update        8                        update        -
LNG                      2             LNG                      4
Total                   14             Total                   16
Table 4: Time Delay for the Directory
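The two totals can be re-derived from the component delays. The values below are the ones read from Table 4 and the surrounding text (the line number generator contributes 2 ns once on a hit and twice on a miss); the dictionary and function names are illustrative only.

```python
# Component delays in nanoseconds, as listed for the hit and miss paths.
HIT_PATH = {
    "Decoder": 1,
    "Tag Array (comp.)": 3,
    "LNG": 2,
    "LRU (update)": 8,
}
MISS_PATH = {
    "Decoder": 1,
    "Tag Array (comp.)": 3,
    "LNG (used twice)": 2 * 2,
    "LRU (sendback)": 2,
    "Tag Array (update)": 6,
}


def total_delay(path):
    """Sum the serial component delays of one path, in nanoseconds."""
    return sum(path.values())
```

Summing the entries gives 14 ns for the hit path and 16 ns for the miss path, matching the totals quoted above; the overlap of LRU updating with the next search is not modelled.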
3.6 The Line Replacement Unit
In a cache system, one of the related problems is to predict which sets of addresses
already buffered in the cache memory will be needed farthest in the future, because
it would then be possible to determine the optimum line to be replaced by a new one
from the main memory. Since this algorithm is based on future knowledge of the
program's behavior, it cannot be realized in a practical cache memory. Therefore,
some approximation must be made to this ideal. In the cache system described
here, the least recently used line replacement (LRU) algorithm is employed. Under
this strategy, the line to which any memory references were made the longest time
ago is replaced by a new one. This algorithm is based on the assumption that the
line which was referenced the longest time ago is the least likely to be used in
the near future; it relies on the temporal locality of reference characteristic of most
programs.
The unit shown in Fig. 17 consists of 32 LRU cells which can be selected
individually by the 32-bit directory decoder. It is organized into four rows, with 8
LRU cells per row. During initialization, the RES signal resets all the LRU cells.
In general, both DELAY and WRITETHRU are high so that all the N-devices
connecting to these signals are closed while all the P-devices are open. Under this
condition, the HIT signal can be propagated through the N-devices directly to
affect all the HIT(d)'s. If there is a signal HIT from the line number generator after
searching the directory during a read/write operation (the requested data reside
in the cache), one of the 32 LRU cells selected by the 32-bit decoder is updated by
8 signals from the directory, each one corresponding to one line number slot of a
given set. (The outputs of this LRU cell are locked by a low MISS.) Meanwhile,
the HIT signal also passes the nMOS pass-transistor to the gates of both devices A
and B since WRITETHRU is high. The internal wires connecting the outputs of
8 pairs of devices A and B to the outputs of the LRU cells remain low regardless
of the status of the ALLZERO's shown in subfigure I of Fig. 17, since the N-devices
B are closed to discharge the wires, making all 8 output wires to the directory high
after the inverters. Thus, the directory is prevented from updating. If
HIT becomes low after the directory is searched (that means a line miss occurs),
the MISS signal is high so that the LRU cell selected by the directory decoder can
send the least recently used line slot number of a given set to the directory, while
the corresponding HIT(d) signal is low to prevent the LRU cell from updating.
Meanwhile, the low signal HIT switches devices A on and B off through the N-pass
transistor controlled by WRITETHRU. The 8 internal wires are charged by
devices A immediately. These wires are connected to one LRU cell selected by SEL
from the decoder. As shown in subfigure I, the transistors controlled by the result
of ANDing the signals MISS and SEL are now closed so that the 8 internal wires
can be used to evaluate the values on the 8 ALLZERO's of the given LRU cell. Only
one of the 8 ALLZERO's in the LRU cell is low to indicate that the corresponding
line is the least recently used line in the selected set of the cache. So, the
corresponding internal wire is inverted to update the associated line slot, and the
remaining wires keep their line slots in the given set of the directory unchanged.
Note that transistors A and B are balanced with those shown in subfigure I so that
the operations are reliably completed in minimum time.
Figure 17: Structure of the LRU Unit
In Fig. 17, it is seen that if DELAY is low, the unit is prevented from updating
by cutting off the HIT signal. Also, if WRITETHRU becomes low, the unit is
prevented from both updating itself and changing line slots in the directory by
holding the outputs to the directory high until the signal WRITETHRU rises.
DELAY is designed to handle the situation that, during a read/write operation,
the LRU unit may otherwise be updated incorrectly before the directory turns out a
valid result for HIT. This is because at the beginning of an operation the directory
decoder selects both a given LRU cell and a corresponding row of the tag array.
During the period that the tag array searches for a line slot in the specified row in
which the requested address resides (the line number generator has not turned out
a valid HIT for this operation and the signal HIT stands at a high level at this
time), the status of the LRU cell may be changed by the invalid signal HIT. For
instance, assume that, before ALE is asserted, the output of the 32-bit decoder is
8 and HIT is logical 1. When ALE is asserted, a new address whose set number is
10 is latched in the address register and broadcast to the decoder immediately. The
output of the decoder (here it is 10) is sent simultaneously to both the tag array,
to see if the requested data reside in the cache, and the LRU unit, either to update
the corresponding LRU cell if the requested data are in the corresponding set, or to
send the least recently used line slot number to the directory if the requested data
are missing. Because searching the directory takes more time than propagating
the signal from the 32-bit decoder to the LRU unit, the LRU cell corresponding to
set 10 is updated before the line number generator turns out a new result for HIT,
since the old HIT still remains effectively at logical 1. This results in an error!
The signal DELAY is used to prevent the LRU unit from being updated in this way.
As shown in Fig. 17, when DELAY is high, all the N-pass transistors connected
to DELAY through an AND-gate are closed, while the P-pass transistors are open,
to maintain all the HIT(d)'s the same as the signal HIT from the line number
generator. Once DELAY becomes low, the N-type devices are open and the paths
to the HIT(d)'s are cut off, while the MISS signals remain the complement of HIT.
Meanwhile, the P-type devices are closed to charge the inverters so that the HIT(d)'s
are discharged via the inverters to prevent the LRU cells from being updated. After
a period during which the line number generator produces a valid result, DELAY
changes to logical 1 so that the LRU unit can correctly operate depending on the
valid signal HIT. Thus, correct updating of the LRU unit during a read/write
operation is ensured. The valid period of the signal DELAY is an inverted
4-nanosecond pulse, which is long enough for a search of the directory and for a
valid result of the HIT signal to become stable.
The DELAY signal can be produced by a one-shot circuit, as shown in Fig. 18,
which can produce a narrow pulse from a wide pulse. When the input Vi of the
circuit is at logical 0, the inverter precharges capacitor C through resistor R.
Meanwhile, because the 2-input NAND-gate is locked by Vi, the output of the
NAND-gate remains high so that the output Vo is high. Whenever Vi is changed
from 0 to 1, since the NAND-gate input connecting to the capacitor still remains
high, the output of the NAND-gate causes Vo to become low immediately. After
the capacitor C is discharged below the threshold of the NAND-gate, the output
of the NAND-gate becomes high regardless of Vi so that the output Vo is pulled up
to logical 1 by the buffer. The period of the inverted pulse at the circuit output is
determined by R and C. Resistor R is formed by a polysilicon wire while capacitor
C is formed by an N-device gate. In order to produce the DELAY signal, ALE is
imposed on Vi, and the output of this circuit is a 4-nanosecond pulse of DELAY.
Figure 18: The One-shot Circuit
The WRITETHRU signal is used to ensure that the status of the LRU unit
remains unchanged during an update operation required by other caches in a
multiprocessor environment. That is, whenever there is an update request from other
caches, the cache must change neither the contents of the directory (if the data
to be updated are not found in the cache) nor the LRU unit (if the data are in the
cache). For an update operation, if the requested data are found in the cache, the
cache only updates the data without changing records in the LRU unit; otherwise,
the cache does nothing for this request. Hence, a signal WRITETHRU from the
miss circuit is used to handle this situation. Whenever there is an update request,
the miss circuit produces the WRITETHRU signal. The signal WRITETHRU
locks the HIT signal to prevent records in the LRU unit from modification. This
operation is similar to that of DELAY. Meanwhile, it is also used to lock the path
to the directory to prevent the directory from changing if the requested data are
not found in the cache. That is, during an update request, the WRITETHRU
signal becomes low to lock the nMOS pass transistor and turn on the P-type
device to charge the gates of transistors A and B. Thus, P-devices A are opened
while N-devices B are closed to discharge the 8 internal wires so that the outputs
to the directory remain high to prevent the directory from being modified during
WRITETHRU.
The output structure of the LRU unit is organized in this way because of the
output delay time. Since the internal wire propagation delays caused by the
distributed resistance-capacitance product are large and the capacitive load on each
wire is heavy (each wire is connected to all the 32 LRU cells), simulations show
that the use of standard CMOS design techniques cannot obtain high speed.
3.6.1 One LRU Cell
Fig. 19 shows one of 32 LRU cells, which is selected by the 32-bit decoder. Each
cell corresponds to a set of the cache. An LRU cell is an 8 x 8 binary matrix in
which there are no bits on the diagonal. Both the row number and the column
number represent the line number (indicated by 0 to 7 from the directory in
Fig. 19) in the set selected by the 32-bit decoder. Each matrix corresponds to a
set in the directory. Changes in the state of a matrix are controlled by the update
control circuit of the LRU cell. When HIT(d) from the directory is high, the
requested line resides in the set selected by SEL from the decoder. The control
circuit simultaneously updates the states of the row and the column indicated by
the line number. Assuming that the ith line in the selected set is requested and
the line is in the cache, all the bits of the ith row of the corresponding matrix are
changed to 1 to record the fact that the corresponding line was the last one used,
while all the bits in the ith column of the matrix are reset to 0 to decrease
the number of 1's of the other rows. The number of 1's in a row represents how
recently the corresponding line has been used. The largest number, all ones,
indicates that the corresponding line is the most recently used while the smallest
number, all zeroes, is the least recently used. In each row, 7 bits are compared. If
and only if all the bits in a row are reset to logical 0, the corresponding ROWMATCH
line is high. In turn it makes the corresponding signal ALLZERO(i) of this row low
when both SEL from the 32-bit decoder and MISS from the line number generator
are logical 1, while the other ALLZERO's remain high.
Figure 19: Structure of an LRU Cell
For example, Fig. 20 shows an LRU cell which is a 4 x 4 matrix. Both the rows
and the columns are numbered from 0 to 3, which represent the line numbers. The
initial status is shown in Fig. 20 (a); line 1 is the most recently used one since the
number (here it is 3) of 1's in row 1 of the binary matrix is the largest while line
0 is the least recently used one because the first row is all-zeroes. After reference
to line 2, the matrix is updated as shown in Fig. 20 (b); line 2 becomes the most
recently used one by setting row 2 full of ones while line 0 still remains at the last
position in order after resetting column 2 full of zeroes. Similarly in Fig. 20
(c), after reference to line 0, the order of the lines changes to 0, 2, 1 and 3. Now
line 0 becomes the most recently used one after row 0 is set full of ones while line 3
becomes all-zeroes after column 0 is written full of zeroes to decrease the numbers
of 1's of the rows except row 0.

Figure 20: An Example of the LRU Algorithm
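The matrix update described above can be sketched in a few lines of Python. This is
only a behavioural model, not the CMOS circuit: the class name LRUMatrix and its
methods are hypothetical, and the sketch assumes that clearing column i also clears
the diagonal bit of row i, which is what gives row 1 exactly three 1's in the 4 x 4
example.

```python
class LRUMatrix:
    """Bit-matrix LRU bookkeeping for one set (behavioural sketch only)."""

    def __init__(self, n_lines):
        self.n = n_lines
        self.bits = [[0] * n_lines for _ in range(n_lines)]

    def reference(self, i):
        # Set row i to all ones, then clear column i (assumed to include bit [i][i]).
        self.bits[i] = [1] * self.n
        for row in self.bits:
            row[i] = 0

    def order(self):
        # Most recently used first: rank lines by the number of 1's in their row.
        return sorted(range(self.n), key=lambda r: sum(self.bits[r]), reverse=True)

    def least_recently_used(self):
        return self.order()[-1]


cell = LRUMatrix(4)
for line in (0, 3, 2, 1):       # one access history that yields the state of Fig. 20 (a)
    cell.reference(line)
initial = cell.order()          # order 1, 2, 3, 0
cell.reference(2)
after_2 = cell.order()          # order 2, 1, 3, 0
cell.reference(0)
after_0 = cell.order()          # order 0, 2, 1, 3
```

The access history used to build the initial state is one of several that would give
the same order; it is chosen here only to reproduce the example.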
It is clear that the least recently used line in the selected set of the cache is the
one for which the row is entirely equal to 0 and the column is entirely equal to 1.
Therefore, if a line miss occurs in that set, the LRU cell specified by both SEL and
MISS sends the least recently used line slot number to the directory. In some cases
more than one row may be all-zeroes; for example, initially all the rows are zero
after RESET. This means that not all the lines in the selected set of the cache
memory are full. In this situation, the control logic, which is composed of NAND
gates in Fig. 19, is used to pass the line whose number is the smallest of all the
unused lines in the set as the least recently used line by pulling down its ALLZERO
signal. Whenever an output is low (meaning that the line-slot number
implied by this bit is the least recently used line slot number in the given set), it
locks all other outputs which follow by NANDing all the ROWMATCH signals
logically behind it. Fig. 21 shows the layout of an LRU cell.
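The priority rule among several all-zero rows can be illustrated with a small
behavioural sketch. The function names are hypothetical and the NAND chain of
Fig. 19 is replaced by a simple loop; the only property modelled is that the
lowest-numbered row with a high ROWMATCH drives its ALLZERO output low and
locks every row behind it.

```python
def allzero_outputs(rowmatch):
    """Return active-low ALLZERO outputs for a list of ROWMATCH values.

    Only the lowest-numbered all-zero row is driven low; later rows are locked.
    """
    outputs = []
    blocked = False                     # True once an earlier row has matched
    for match in rowmatch:
        drive_low = bool(match) and not blocked
        outputs.append(0 if drive_low else 1)
        blocked = blocked or bool(match)
    return outputs


def victim(rowmatch):
    """Line-slot number selected for replacement, or None if no row is all-zero."""
    outs = allzero_outputs(rowmatch)
    return outs.index(0) if 0 in outs else None


after_reset = allzero_outputs([True] * 8)   # every row is all-zero after RESET
```

With every row matching, only output 0 goes low, which corresponds to choosing
the smallest unused line slot in the set.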
3.6.2 One Bit of the LRU Cell
The logic of one bit of the LRU cell is illustrated in Fig. 22. The bit cell used in the
LRU cell is a variation of a static memory bit. If a signal imposed on ROWSEL
is high, inverter B is pulled down to 0 while inverter A is brought up to 1. Thus,
the MATCH signal becomes low by closing the MATCH transistor. On the other
hand, if the COLSEL line is asserted, the A inverter becomes low and B is high.
The MATCH transistor is open so that the signal MATCH of this bit is high.
Note that the signals ROWSEL and COLSEL must not be asserted at the same
time, to prevent this bit from being placed in an indeterminate state. In an LRU
cell, all the MATCH lines and all the ROWSEL lines of the LRU bits in a row
are connected in series, respectively, and all the COLSEL lines of the bits in a
column are connected in series. A row can be selected by Ii from the directory
when both HIT and SEL are high. If all the MATCH lines of the bits in a row
are high, implying that all the bits in the row are at logical 0, the ROWMATCH
line of this row becomes high to indicate that this row has been set to all-zeroes.
The cache line corresponding to this row may be the least recently used line (the
least recently used line in a given set is indicated by a low ALLZERO signal of
the given LRU cell), depending on whether any of the rows logically before this
one have high ROWMATCH signals.
Figure 22: One Bit of the LRU Cell
Fig. 23 shows a simulation result of the LRU unit. The signals I0 to I7 are the
8 input lines from the directory and O0 to O7 are the 8 output lines to the directory.
After the signal RESN is valid (low), the unit is reset. It can be seen in the
simulation in Fig. 23 that all the LRU unit outputs O1 to O7 are high except O0.
This means the cache line implied by the low O0 output is the least recently used
line after initialization, although at this time all the rows in a given cell are zero,
since the first row indicates the smallest line number in the given set. When HIT
is logical 1 (note that a low HIT signal implies that MISS is high) and I0 is
high, the first row of a given LRU cell is updated to all-ones. At this time, there are
no outputs on the output lines (from O0 to O7) of the LRU unit, which indicate
that a line miss occurs. During MISS, the outputs of the LRU unit are valid. In
this case, O1 becomes low, which indicates the first row has been updated and the
line indicated by the second row in the given cell is the least recently used line.
After the 3rd row is updated to all-ones during I2, when MISS is asserted again,
the least recently used line is still the line indicated by the second row. It can be
seen that at the second MISS the LRU unit produces a low output at O1. Then
after updating the second row by asserting I1, the LRU unit produces the least
recently used line, during the next MISS, implied by the 4th row whose output
is O3. From the simulation we can see that the delay time for a valid output is
about 3 nanoseconds. Note that the signal DELAYN in the simulation is used as
the result of ANDing DELAY and WRITETHRU to prevent the LRU unit from
being updated.

Figure 23: Simulation of the LRU Unit
In this chapter, cache algorithms are surveyed. CMOS implementations of the
algorithms selected for this cache design, such as the 8-way set-associative mapping
and the least recently used algorithms, are discussed. A bit-matrix method is used
to simplify the implementation of the LRU algorithm in CMOS. The functions of
the Directory and the LRU Unit have been verified by circuit simulation.
4 THE MEMORY AND CONTROL UNITS
One of the most important parts in a cache is the cache buffer, or data storage,
which is used to store the most up-to-date data. Its main function is similar to that
of a small, high speed random access memory. Another is the control unit which
determines the internal and external timing of the cache, controls the functions
implemented in the previous chapter, and provides the communication functions
required for a multiprocessor or uniprocessor environment. This chapter discusses
the design and implementation of both the cache buffer and the control unit used
in this project.
4.1 Structure of the Memory
Fig. 24 shows the structure of the cache memory in which a row represents one
line, with 8 words per line and 32 bits in each word. The row number is in the
range 0 to 255. An 8-word line size was chosen for this cache memory, assuming
that the associated main memory is partitioned into 8 modules which can transfer
a requested line to the cache by interleaving. This main memory organization is
more suitable for a multiprocessor system.
A cache memory address is divided into two parts; one part containing the set
and line numbers is used to select the specific line in the cache memory through
the memory decoder while another part (the offset) in the register/counter is used
to determine which word or byte(s) in the selected line is (are) accessed.
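The geometry described above can be checked with a short calculation. The sketch
below is illustrative only: the constant and function names are hypothetical, and
placing the set number in the high-order bits of the row number is an assumption,
since the text only states that 5 set bits and 3 line bits together select one of
256 rows.

```python
SET_BITS = 5      # 32 sets
LINE_BITS = 3     # 8 lines per set (8-way set-associative)
WORD_BITS = 3     # 8 words per line
BYTES_PER_WORD = 4


def row_number(set_no, line_no):
    # Assumed bit ordering: set number in the high bits, line number in the low bits.
    return (set_no << LINE_BITS) | line_no


rows = 1 << (SET_BITS + LINE_BITS)
capacity_bytes = rows * (1 << WORD_BITS) * BYTES_PER_WORD
```

The resulting capacity of 8192 bytes agrees with the 8K-byte cache size stated in
the abstract.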
Figure 24: Structure of the Cache Memory
In this system, the smallest element the processor can access is not a word but
a byte. There are 4 sizes of data which can directly be accessed by the processor:
one, two, three or four byte blocks. The size of the data to be accessed by
the processor is determined by a combination of the two least significant bits of
the requested data address, A0 and A1, and the two function bits, Y0 and Y1,
which come from the processor. During a processor read/write operation, when the
requested data reside in the cache memory, the cache will either send the requested
data to the data bus or store the data on the data bus into the cache, depending on
the specific operation of the processor. In this case, the register/counter performs
like a register. The word offset of the address register is latched in this special
register, and the requested word is selected via the column decoder of the memory.
Otherwise, the miss flag is set to indicate that the requested data are not in the
cache. During a line miss, the least recently used line in a given set of the cache
memory is overwritten with the requested line from the main memory. In this case,
the register/counter becomes a counter, controlled by the TRANSFER signal from
the main memory, to choose each of the 8 possible word offsets in the selected line
in increasing order (from 0 to 7). Each time a word offset is chosen,
one of the 8 words in the requested line from the main memory is written to the
corresponding location. The main memory sends the 8 words of the requested line, by
interleaving, for a miss request. Each word being transferred is accompanied by
a valid pulse of the TRANSFER signal, which is used to increment the counter
and to require the bus control circuit to pass a word on the data bus to the cache
memory during transfer.

Figure 25: One Bit of the Memory Register
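The line-fill sequence during a miss can be summarised by the following behavioural
sketch. It abstracts each valid TRANSFER pulse into one loop iteration; the function
name fill_line and the list used to stand for one cache row are hypothetical.

```python
WORDS_PER_LINE = 8


def fill_line(cache_row, incoming_words):
    """Write a requested line into cache_row, one word per TRANSFER pulse."""
    counter = 0                                   # register/counter cleared before the transfer
    for word in incoming_words:                   # one valid TRANSFER pulse per word
        cache_row[counter] = word                 # the word offset selects the location
        counter = (counter + 1) % WORDS_PER_LINE  # the counter advances after the write
    return cache_row


row = [None] * WORDS_PER_LINE
fill_line(row, [0x100 + i for i in range(WORDS_PER_LINE)])
```

Because the counter only advances after each write, the eight interleaved words
land at offsets 0 to 7 in the order in which the main memory sends them.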
4.1.1 The Cache Memory Register
The cache memory register is composed of two parts: one is the line memory
register and the other is the counter/register. Fig. 25 illustrates one bit of the line
memory register. A D-type flip-flop is employed. The signal for latching a cache
memory address into the memory register, including both the cache line register
and the register/counter, is also described in this figure. Latching a cache memory
address in the memory register occurs under two conditions. The first condition
is that during a read/write operation, the corresponding cache memory address
has to be latched in this register after searching the directory. If this condition is
satisfied, the ALE' signal produced by the clock pulse generator is asserted. The
second condition is that when there is an update request from other caches there are
pulses CK2' and CK2'' for this request. The CK2' signal is used to latch the cache
address corresponding to the requested data during the update operation. After
the operation, CK2'' returns the address residing in the memory register before the
operation. Whether or not the requested data reside in the cache is indicated by the
LINEMISS signal from the line number generator, after searching the directory
during the update request. Thus CK2' has to be ORed with the complemented
LINEMISS signal to form the latching signal for the update operation. Note that
LINEMISS is produced by the line number generator. It differs from the MISS
signal from the miss flag, which is used to inform the main memory that the data
requested by the associated processor are not in the cache.
The logic circuit for the counter/register is illustrated in Fig. 26. It consists
of 3 D-type falling-edge-triggered flip-flops which are organized as a synchronous
counter shown in Fig. 26 (a), but it can operate as both a counter and a register in
different situations. The logic function block for this circuit is depicted in Fig. 26
(b). The initializing signal, system RESET or COUNTCLR from the transfer
decomposer, resets this circuit. The word offset of an address on the circuit inputs
and their complements can be latched from the SET's and RES's of the D-type
flip-flops, respectively, when the LATCH signal is active; LATCH is created when
there is either a hit, during a read/write operation, or an update request from other
caches. The 3-bit outputs and their complements are sent to an 8-bit memory column
decoder implemented with eight 3-input NOR-gates, in a way similar to the 16-bit
[...] the data inputs of this circuit and TRANSWRT, from the transfer decomposer,
is imposed on CK of the counter after COUNTCLR resets the counter/register.
At this time, the counter/register behaves like a counter. When each pulse of the
TRANSWRT signal is imposed on CK of the counter, the outputs of the counter
are not changed until the falling edge of the pulse. Thus, when a word is transferred
into the memory during each pulse of the TRANSWRT signal, the corresponding
word offset selected by the memory column decoder cannot be changed, which
guarantees that the 8 words of the requested line are transferred into the correct
places in the memory. The layout of the memory column control, including the
counter/register and the 8-bit column decoder, is illustrated in Fig. 27 and its
simulation is shown in Fig. 28.

Figure 26: The Register/Counter Logical Circuit
Figure 27: Layout of the Memory Column Control

Figure 28: Simulation of the Memory Column Control

The circuit of the transfer decomposer is shown in Fig. 29 (a). Since the number
of pins on this chip is limited, the TRANSFER signal from the main memory
consists of two parts: the first narrow pulse is used to clear the counter, and the
following 8 wide pulses are used to write a line on the data bus into the cache
memory, one word per pulse. The decomposer divides the TRANSFER signal
into COUNTCLR, which clears the counter, and TRANSWRT, which drives the
counter from 0 to 7 and the bus driving circuit during a bus grant indicated by
BUSACK from the system bus controller. Initially, the D-type falling-edge-
triggered flip-flop is reset so that its complementary output Q̄ is logical 1. Q̄ is fed
back to the D input of the flip-flop. When the first pulse of the TRANSFER signal
(used for clearing the counter) enters the circuit, it is gated through NOR-gate A to
create a valid COUNTCLR signal since the other input of the gate is logical 0 at
this time, while TRANSWRT is invalid (low) because the 2-input NOR-gate B is
locked by Q̄. When the falling edge of the first pulse of TRANSFER passes gate A,
the flip-flop changes its state. A transition from 0 to 1 of its output Q locks NOR-gate
A; meanwhile, NOR-gate B becomes unlocked to allow the following pulses of the
TRANSFER signal to get through NOR-gate B to the output TRANSWRT.
After transfer of a requested line from the main memory to the cache memory, the
cache control unit produces an inverted pulse TRANSDONE to reset the D flip-
flop at the transition of BUSACK from 0 to 1. Fig. 29 (b) is the timing diagram
for operations of the transfer decomposer during the transfer of a requested line
from main memory.

Figure 29: Circuit of the Transfer Decomposer

Figure 30: The 256-bit Memory Decoder
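The routing performed by the transfer decomposer can be modelled at the level of
whole pulses. The sketch below is a behavioural abstraction, not a gate-level model:
the class TransferDecomposer, its methods, and the helper run_transfer are
hypothetical names, and each TRANSFER pulse is treated as a single event.

```python
class TransferDecomposer:
    """Routes the first TRANSFER pulse to COUNTCLR and later ones to TRANSWRT."""

    def __init__(self):
        self.q = 0                       # flip-flop reset: gate A open, gate B locked

    def pulse(self):
        """Route one TRANSFER pulse and return the output it appears on."""
        if self.q == 0:
            self.q = 1                   # falling edge of the first pulse flips the flip-flop
            return "COUNTCLR"
        return "TRANSWRT"

    def transdone(self):
        self.q = 0                       # TRANSDONE resets the flip-flop after the transfer


def run_transfer(decomposer, n_pulses=9):
    """Return the word offsets written by one narrow pulse plus eight wide pulses."""
    counter, offsets = None, []
    for _ in range(n_pulses):
        if decomposer.pulse() == "COUNTCLR":
            counter = 0                  # the narrow pulse clears the counter
        else:
            offsets.append(counter)      # word written while the pulse is high
            counter = (counter + 1) % 8  # counter changes on the falling edge
    decomposer.transdone()
    return offsets


decomposer = TransferDecomposer()
first_line = run_transfer(decomposer)
second_line = run_transfer(decomposer)
```

Running two transfers back to back shows that the reset by TRANSDONE is what
lets the next line start again with a COUNTCLR pulse.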
4.1.2 The 256-bit Row Decoder
This 256-bit decoder is used to decode 8 bits of both the set number and line
number from the memory register simultaneously, 5 bits for the set number and 3
bits for the line number. It can produce 256-bit outputs, but only one bit of all
the outputs is logical 0 at any given time, to be used to select one out of 256
row drivers of the memory. The decoder consists of two 16-bit decoders discussed
previously and 256 2-input NAND-gates as shown in Fig. 30. The simulation for
the 256-bit decoder in Fig. 31 only shows the first 32 outputs of this decoder. From
the simulation, there is about a 2 nanosecond delay for the decoder stage.
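The two-level structure of the row decoder can be sketched as follows. The function
names are hypothetical, and two details are assumptions made only for this sketch:
that the 8 address bits are split into a high and a low group of 4 bits each, and
that the 16-bit decoders deliver active-high one-hot outputs to the NAND gates.

```python
def decode16(value):
    # 4-to-16 decoder with active-high one-hot outputs (polarity assumed here).
    return [1 if i == value else 0 for i in range(16)]


def nand(a, b):
    return 0 if (a and b) else 1


def decode256(address):
    """Return 256 active-low row-select outputs for an 8-bit row address."""
    high = decode16((address >> 4) & 0xF)
    low = decode16(address & 0xF)
    # One 2-input NAND per row: the output is 0 only for the addressed row.
    return [nand(high[r >> 4], low[r & 0xF]) for r in range(256)]


outputs = decode256(0xA5)
```

Exactly one of the 256 outputs is at logical 0 for any address, matching the
behaviour stated for the decoder.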
4.1.3 The Cache Memory
A fast static memory is used in the cache memory unit rather than smaller but
slower dynamic memory, which also needs to be refreshed. Fig. 32 depicts the
organization of the memory. It is split into four memory arrays, two arrays in the
upper row and two in the lower row. Both the upper row and lower row arrays
are connected to the outputs, ROWSEL's, of the 256-bit row decoder through the
inverted row drivers. Intermediate buffers are used between two memory arrays in
the same row. When there is a ROWSEL signal active, two arrays in the same row
are selected simultaneously. Memory organized in this way can reduce delay time
and in turn increase the memory access speed. Fig. 33 shows any four adjoining
memory bits in an array along with their column selection circuits. The circuit for
column selection, one for each column of the memory, is quite simple; only two pass
transistors are connected to the BIT and complementary BIT lines of the memory
cells in that column, respectively. Note that 32 column selections (one word) are
driven simultaneously by one bit of the memory column decoder during access to
one word. Therefore, a total of 256 column selections are formed as eight groups by
connecting them to the 8 output bits of the memory column decoder, respectively.
Consequently, one memory array has two groups, each containing one word. Only
the cells selected by both row and column selections can be accessed. Fig. 33 shows
four static CMOS RAM cells. Each RAM cell consists of two inverters wired together
to make a flip-flop; they are connected by two NMOS pass transistors to the BIT and
complementary BIT lines, respectively. During a read operation, the conducting side
of the flip-flop pulls the precharged data line (BIT or its complement) toward ground
through the pass transistors while the other side remains high. Writing is
accomplished by forcing the value in the cell to be the same as that on the data lines.

Figure 31: Simulation of the 256-bit Row Decoder

Figure 32: The Memory Arrays

Figure 33: Four Bits of the Memory

Figure 34: The Data Bus Control Circuit
4.1.4 The Data Bus Control Circuit
The data bus control circuit is shown in Fig. 34. The data bus driving circuit has
32-bit dual-port input/output drivers. It is split into 4 components, each of which
can control access to one byte. The operation of each component is controlled by
a pair of read and write control signals (R̄i, W̄i). There are four pairs of control
signals, (R̄0, W̄0), (R̄1, W̄1), (R̄2, W̄2), and (R̄3, W̄3), generated by the gate circuit.
As mentioned previously, data in the memory can be accessed as one, two,
three or four bytes, respectively, using combinations of Y0, Y1, A0 and A1 (A0 and
A1 are the two least significant bits, bit 0 and bit 1, of the address register).
During normal read/write operations of the processor, both TRANSWRT from
the transfer decomposer and UPDATEWRT produced by the cache control unit
for updating data requested by other caches are not asserted (both of the signals
TRANSWRT and UPDATEWRT are logical 0). Therefore, signals Z0 - Z3
produced by the gate control circuit dominate the read/write operations of the data
bus control components via the gate circuit. There are four subcircuits producing
W̄i and R̄i in the gate circuit. Fig. 35 shows one subcircuit of the gate circuit.
A memory write operation is determined by the condition that there is either a
write operation required by the associated processor, a transfer operation during a
line miss, or an update operation caused by an update request from other caches.
If any of these operations is needed, the WRT signal from the 3-input OR-gate
is asserted so that if any Zi is logical 1, the corresponding write control bit
W̄i is gated out to control an 8-bit data bus driver to allow writing the data on
the external data bus into the cache memory. If there is a read request from the
processor, R̄i is valid, causing the corresponding bus drivers to pass the requested
data from the cache memory to the outside data bus. Note that the operations of
write and read are exclusive, so there is no case in which W̄i and R̄i are asserted at
the same time. The truth table for the signals Z0 to Z3 is shown in Table 5.

Figure 35: One Bit of the Gate Circuit

       INPUT              OUTPUT
  Y1  Y0  A1  A0     Z3  Z2  Z1  Z0
   0   0   0   0      0   0   0   0
   0   0   0   1      0   0   0   0
   0   0   1   0      0   0   0   0
   0   0   1   1      0   0   0   0
   0   1   0   0      0   0   0   1
   0   1   0   1      0   0   1   0
   0   1   1   0      0   1   0   0
   0   1   1   1      1   0   0   0
   1   0   0   0      0   0   1   1
   1   0   0   1      0   1   1   0
   1   0   1   0      1   1   0   0
   1   0   1   1      0   0   0   0
   1   1   0   0      1   1   1   1
   1   1   0   1      0   1   1   1
   1   1   1   0      1   1   1   0
   1   1   1   1      0   0   0   0

Table 5: The Gate Control Functions
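After simplifying the functions specified in Table 5, we have the output
functions, Z0 - Z3, as follows:

$$Z_0 = \ldots \qquad (1)$$
$$Z_1 = \ldots \qquad (2)$$
$$Z_2 = \overline{A_1} Y_0 Y_1 + A_0 \overline{A_1} Y_1 + \overline{A_0} A_1 Y_0 + \overline{A_0} A_1 Y_1 \qquad (3)$$
$$Z_3 = \overline{A_0} Y_0 Y_1 + \overline{A_0} A_1 Y_1 + A_0 A_1 Y_0 \overline{Y_1} \qquad (4)$$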
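As a consistency check, equations (3) and (4) can be evaluated against the rows of
Table 5. The sketch below is only an illustration: the dictionary TABLE_5 encodes
the table as read from the scanned page, keyed by (Y1, Y0, A1, A0) with values
(Z3, Z2, Z1, Z0), and the function names z2 and z3 are hypothetical.

```python
# Table 5 as read from the scan: (Y1, Y0, A1, A0) -> (Z3, Z2, Z1, Z0)
TABLE_5 = {
    (0, 0, 0, 0): (0, 0, 0, 0), (0, 0, 0, 1): (0, 0, 0, 0),
    (0, 0, 1, 0): (0, 0, 0, 0), (0, 0, 1, 1): (0, 0, 0, 0),
    (0, 1, 0, 0): (0, 0, 0, 1), (0, 1, 0, 1): (0, 0, 1, 0),
    (0, 1, 1, 0): (0, 1, 0, 0), (0, 1, 1, 1): (1, 0, 0, 0),
    (1, 0, 0, 0): (0, 0, 1, 1), (1, 0, 0, 1): (0, 1, 1, 0),
    (1, 0, 1, 0): (1, 1, 0, 0), (1, 0, 1, 1): (0, 0, 0, 0),
    (1, 1, 0, 0): (1, 1, 1, 1), (1, 1, 0, 1): (0, 1, 1, 1),
    (1, 1, 1, 0): (1, 1, 1, 0), (1, 1, 1, 1): (0, 0, 0, 0),
}


def z2(y1, y0, a1, a0):
    # Equation (3): sum of four product terms.
    na0, na1 = 1 - a0, 1 - a1
    return (na1 & y0 & y1) | (a0 & na1 & y1) | (na0 & a1 & y0) | (na0 & a1 & y1)


def z3(y1, y0, a1, a0):
    # Equation (4): sum of three product terms.
    na0, ny1 = 1 - a0, 1 - y1
    return (na0 & y0 & y1) | (na0 & a1 & y1) | (a0 & a1 & y0 & ny1)


mismatches = [key for key, (t3, t2, _, _) in TABLE_5.items()
              if (z3(*key), z2(*key)) != (t3, t2)]
```

An empty mismatch list means that the two product-of-sums expressions agree
with the Z3 and Z2 columns for all sixteen input combinations.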
These functions can be implemented efficiently by a PLA. The logic circuit for
the gate control functions implemented in a PLA is illustrated in Fig. 36. The inputs
to the circuit are the two least significant bits from the address register and the two
function bits, and their complements if needed. The outputs of the PLA circuit,
Z̄0 to Z̄3, are NANDed with the result of NORing TRANSWRT for transfer-write
of a requested line and UPDATEWRT for update-write, to produce the outputs
Z0 to Z3. During a write operation both TRANSWRT and UPDATEWRT are
at logical 0 so that the values of Z0 to Z3 are determined by the outputs of the
PLA circuit. If either TRANSWRT or UPDATEWRT is asserted, Z0 to Z3 are
set high to force the bus drivers to overwrite one word (32 bits) on the system data
bus into the cache memory. The layout of the gate control and gate circuit is shown
in Fig. 37, and Fig. 38 shows the simulation for these circuits.
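The NOR/NAND stage that follows the PLA can be expressed directly in code. This
sketch uses hypothetical function names and takes the four Z values encoded by the
PLA as its input; the complementing step stands for the PLA delivering its outputs
in complemented form.

```python
def nor(a, b):
    return 0 if (a or b) else 1


def nand(a, b):
    return 0 if (a and b) else 1


def byte_enables(pla_z, transwrt, updatewrt):
    """Return the final Z0..Z3 given the Z values encoded by the PLA."""
    normal = nor(transwrt, updatewrt)              # 1 only when neither override is asserted
    return [nand(1 - z, normal) for z in pla_z]    # the PLA supplies complemented outputs


processor_access = byte_enables([1, 0, 0, 0], 0, 0)
line_transfer = byte_enables([1, 0, 0, 0], 1, 0)
update_write = byte_enables([0, 0, 0, 0], 0, 1)
```

With both override signals low the PLA values pass through unchanged, and
asserting either one forces all four byte enables high so that a full word is written.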
Figure 36: The Gate Control Logic

Figure 37: Layout of the Gate Logic

Figure 38: Simulation of the Gate Logic

The data bus driving circuit is partitioned into four components, each controlling
8 bits. A component is used to control 8 bits of the data to be written into
or read out of the memory. Fig. 39 illustrates one bit of a component. There is a
write logic block and a read logic block for each bit of the data bus driver. The
write logic is shown in Fig. 39 (a). When W̄i is low and Wi is high, the value on the
DATA line is gated onto the BIT line and the complementary BIT line. Note that
the transistors in this circuit are large enough to write data into the memory at high
speed. On the other hand, the read logic shown in Fig. 39 (b) is more complicated
since the signals from memory cells are usually weak. Therefore, there have to be
amplifiers to increase the signal strength for the memory cells. This read circuit
is a two-stage amplifier which detects the state of a memory cell. The first stage
of the amplifier is a differential sense amplifier which can sense small differences
between voltage levels on the BIT and complementary BIT lines from the memory
and amplify them to provide very fast sensing. The second stage is an inverter which
provides further driving capacity and makes the rise time and fall time of the DATA
signal shorter. Note that the differential sense amplifier is evaluated while the READ
signal is active. In order to obtain a correct output rapidly, the sense amplifier is
precharged through transistor A to eliminate the charge-sharing effect when no
read operation is in progress. Simulations for both the write and read logic control of
a component are shown in Fig. 40 and Fig. 41. In Fig. 40, when WRITE is high and
READ is low, data on the DATA lines are gated to the BIT lines. The delay
time is about 3 nanoseconds. In Fig. 41, when WRITE is low and READ is high,
data on the BIT lines are quickly placed onto the DATA lines.

Figure 39: The Bus Write/Read Operation Control
Figure 40: Simulation of the Write Control

Figure 41: Simulation of the Read Control
4.2 The System Control Unit
The system control unit is used to control all the communications among the cache
memory and the associated processor, the main memory, as well as the multiple
caches in a multiprocessor system. Therefore it is the most important part of this
cache memory system. In terms of the communication operations, this unit is
logically partitioned into three parts: the normal read/write operation, the update
operation, and the miss operation. In this section, more details about the functions
of these three parts will be described.
4.2.1 The Regular Read/Write Operations
In order to control the different operations, several clock pulses have to be produced
from the clocks and ALE. A clock generator is used for this purpose. First, let us
discuss the circuit in Fig. 42 (a), the single pulse producer. This circuit produces
one complete pulse and its complement on the circuit outputs Q and Q̄ from a
sequence of pulses on IN after receiving an inverted pulse on CP. The behavior of
this circuit is as follows: Q̄ is initially at a high level since the 2-input NAND-gate
input connecting to the flip-flop through NMOS pass-transistor A is low regardless
of IN. When there is an inverted pulse on CP, the output of the flip-flop connected
to pass transistor A is set to logical 1 and it in turn enables the 2-input NAND-gate
since the pass transistor is closed by the high Q̄ signal at this time. If IN has a
transition from 0 to 1, Q̄ becomes low, and it in turn both locks pass transistor
A and resets the flip-flop. Although the flip-flop is reset, the NAND-gate input
connected to the pass transistor remains at logical 1 because the path to the input
of the NAND-gate is blocked by pass-transistor A. Therefore, the signal at IN
passes the NAND-gate to Q and Q̄. When the IN signal makes a transition from
1 to 0, the Q̄ signal changes to logical 1 from logical 0, which releases pass
transistor A so that the low output from the flip-flop is imposed on the NAND-gate
input via transistor A. Thus the 2-input NAND-gate is locked so that the following
pulses on IN cannot be propagated to outputs Q and Q̄. The timing diagram in
Fig. 42 (b) depicts the operation of this circuit.

Figure 42: The Single Following Pulse Producer
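The behaviour of the single pulse producer can be approximated on sampled
waveforms. The following sketch is a simplified behavioural model rather than a
description of the transistor circuit: the function name single_pulse is hypothetical,
the active-low CP input is represented as a list of samples, and the model assumes
that the pulse is started by a rising edge on IN after the circuit has been armed.

```python
def single_pulse(cp_n, inp):
    """Return Q samples: only the first complete IN pulse after arming is passed.

    cp_n -- samples of the active-low CP input (0 arms the circuit)
    inp  -- samples of the IN pulse train
    """
    armed = False       # flip-flop set by the inverted pulse on CP
    passing = False     # True while the selected IN pulse is being passed
    prev_in = 0
    q = []
    for cp, x in zip(cp_n, inp):
        if cp == 0:
            armed = True
        if armed and x == 1 and prev_in == 0:
            passing = True      # rising edge of the first IN pulse after arming
            armed = False       # the flip-flop is reset once the pulse starts
        if passing and x == 0:
            passing = False     # the falling edge ends the single pulse
        q.append(1 if passing else 0)
        prev_in = x
    return q


cp_n = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
inp = [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]
q = single_pulse(cp_n, inp)
```

In this trace the pulse already in progress before arming and the pulse after the
selected one are both suppressed; only the first full pulse following CP reaches Q.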
Whenever there is an update request from other caches via the system bus, the
update unit produces an UPDATE signal to inform the cache to update with the
data on the system bus. In this case, the clock generator produces several pulses
to accomplish the update operations. The circuit in Fig. 43 (a) is used to produce
a pair of pulses, CK1' and its complement, from CK1. At the beginning of CK1, the
circuit always checks if there is an UPDATE signal. If the UPDATE signal is found
to be low after this check, CK1' is set low while its complement is set high.
Otherwise, the high signal UPDATE is locked by a P-type pass-transistor on the
input of the NAND-gate so that during CK1 any change of the UPDATE signal cannot
affect the result of the circuit, and CK1 is gated out to create both CK1' and its
complement. Hence even if UPDATE is changed during CK1, the pulse from CK1
still passes the 2-input NAND-gate to make CK1' and its complement complete
pulses. Furthermore, the pulse generator also produces other control pulses for an
update operation using the pulse producers. Three pulse producers are employed to
produce the pairs of signals CK1'', CK2' and CK2'', each with its complement, by
imposing different signals on the inputs of the circuits, as shown in Fig. 43 (b), (c),
and (d), respectively.
CK1' is used to latch addresses from the system bus into the address register during
an update request; CK2' is to latch the corresponding cache addresses from the
directory into the memory register to update the requested data. CK1'' and CK2''
are used to return the addresses which resided in both the address register and
memory register before the update operation into the respective registers after the
update. The pulse producer also generates an ALE' pulse, as shown in Fig. 43 (e),
to latch a valid cache address from the directory into the memory register during
a normal processor access. In order to produce ALE', the complement of ALE is
imposed on CP of the pulse producer while CK2 is present at IN. Fig. 44 shows a
timing diagram for the pulse generator in which we can see the relations among
these signals.

Figure 43: The Clock Pulse Generator

Figure 44: A Timing Diagram for the Clock Pulse Generator
Note that all the inputs from the processor are controlled by the chip select (CS)
signal from the associated processor. The cache memory can be accessed only when
the cache is selected by the processor with CS. The update operations, however,
are not controlled by CS. This makes the cache memory used in a cache-based
computer system more flexible.
4.2.2 The Update Operations
When the cache is used in a multiprocessor system, there must be a "bus watcher"
to watch the system bus to see if there are any other caches requesting to update
copies of data. The circuit in Fig. 45 is designed for this purpose. If there is
an update request on the system bus, the bus update watcher must interrupt the
cache to update the data in the cache at the proper time. The proper time should
be when either an access by the processor to the cache is finished or the cache is
waiting for the system bus, in the cases of a write operation or a line miss. The
finish of an access can be detected by a high R/W signal, which can be obtained
by NORing W and R from the processor. A request from the cache for use of the
system bus can be made by BUSREQ from the bus control generator. Initially,
flip-flop A and flip-flop B are reset when SEARCH is at logical 0, and UPDATE
from the bus update watcher is at logical 0. (That means the circuit does not
work as long as SEARCH is at logical 0.) Flip-flop A is an R-S flip-flop while
flip-flop B is a falling-edge-triggered D-type flip-flop. Whenever there is an update
request from other caches, the SEARCHINT signal is gated in so that SEARCH
becomes high when BUSBUSY is low, which indicates that the cache memory is
not using the system bus. The output of the exclusive-or (XOR) gate has a level
transition from 0 to 1 since the value of the XOR-gate input F is logical 0. The
output of the XOR-gate passes through the 3-input AND-gate to both CP of
flip-flop B and the 2-input AND-gate since the other inputs of the 3-input AND-gate
(SEARCH and the inverted output of flip-flop B) are at logical 1. Since flip-flop B
is falling-edge-triggered, the state of the flip-flop is not changed at this time so
that the inverted output of flip-flop B remains high. The result of the 3-input
AND-gate is gated through the 2-input AND-gate to UPDATE if either a cache access
is finished or the cache needs to use the system bus (that is, either R/W or BUSREQ
is logical 1). Then, the UPDATE signal becomes valid. This means the update
operations can only be done when either the cache memory is not used by its
associated processor or the cache is waiting for use of the system bus. The UPDATE
signal may be fed back to clear the R-S flip-flop in the case that the data to be
updated are not found in the cache during the update operation, in which case
LINEMISS from the line number generator becomes valid. In turn, a change of the
XOR-gate output from 1 to 0 triggers the D flip-flop since the D input of flip-flop B
is always logical 1 (its inverted output transits to logical 0); and this causes the
UPDATE signal to become logical 0. In this case, the cache does nothing since the
data to be updated do not reside in the cache. If the data to be updated are in the
cache (LINEMISS is low), flip-flop A is not reset until CK2' is asserted; then UPDATE
is pulled low. When the SEARCHINT signal changes from 0 to 1, both flip-flops are
reset simultaneously, which means one update operation is finished. Fig. 45 (c) also
depicts a dual-direction switch which can either gate in an update request on
SEARCHINT from other caches, or gate out an update request on UPDATEREQ from the
bus control generator to other caches via the system bus. The dual-direction switch
consists of two transmission gates controlled by BUSBUSY and its complement from
the bus control generator. When BUSBUSY is high, UPDATEREQ passes one of the
transmission gates onto SEARCHINT while the other transmission gate locks
the path between SEARCHINT and SEARCH. When BUSBUSY is low, the
signal on SEARCHINT from the system bus passes the gate to SEARCH while
the path to UPDATEREQ is locked by the BUSBUSY lines. Fig. 45 (b) shows
a timing diagram for the bus update watcher. From the diagram we can see the
operation of this circuit for two cases: one when no updating occurs and the other
when updating occurs. In the first case, the data to be updated are not in the
cache. In this case, the UPDATE signal has a shorter valid period and is reset to
logical 0 by LINEMISS as shown in the timing diagram. In the second case, the
data reside in the cache memory. In this case, the period of the UPDATE signal
is longer and UPDATE is cleared by CK2'. The longer UPDATE signal is used
to produce an UPDATEWRT signal for writing the data on the system bus into
the cache memory.

Figure 45: The Circuit of the Bus Update Watcher
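
The gating condition described above can be restated as a small truth-function. This is only a behavioural sketch, not the circuit itself: the function names and the boolean, positive-logic encoding are assumptions made for illustration, and flip-flop timing is not modelled.

```python
def update_allowed(search: bool, access_finished: bool, bus_requested: bool) -> bool:
    """Return True when UPDATE may become valid.

    search          -- SEARCH is high (an update request was gated in)
    access_finished -- R/W is high (the processor is not accessing the cache)
    bus_requested   -- BUSREQ is asserted (the cache is waiting for the bus)
    """
    return search and (access_finished or bus_requested)


def update_pulse_kind(line_miss: bool) -> str:
    """Classify the UPDATE pulse by the result of the directory search."""
    if line_miss:
        # Data not in the cache: UPDATE is reset early by LINEMISS.
        return "short"
    # Data in the cache: UPDATE stays valid until CK2' and yields UPDATEWRT.
    return "long"
```

For instance, `update_allowed(True, False, False)` is false because the processor is still accessing the cache and no bus request is pending.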
The circuit in Fig. 46 (a) is used to produce the UPDATEWRT signal which
overwrites the data to be updated if the data reside in the cache. CK2' latches the
UPDATE signal into the rising-edge-triggered D flip-flop if the UPDATE signal has
a long period. The UPDATEWRT signal is locked for the duration of CK2' by
an inverter clocked by CK2', since the updating write has to wait until the
memory decoder finishes its operation. Meanwhile, the DATACONTROL signal
is sent off the chip to control the path to the system bus. CK2'' is used to reset
the flip-flop and in turn UPDATEWRT. The timing diagram of this circuit in
Fig. 46 (b) shows the operations described above.

Figure 46: The Update-Write Generator

Figure 47: The Miss Circuit
4.2.3 The Miss Operations
As discussed previously, there is a circuit which creates a MISS signal when there
is a line miss during a read/write operation. The MISS signal is produced by
the miss circuit shown in Fig. 47. In Fig. 47 (a), initially, a high WRITETHRU
signal forces the pass transistor to be closed so that the high-level active-low
LINEMISS signal from the line number generator (which means HIT is high) can be
imposed on an input of the flip-flop while another input of the flip-flop connected
to the transfer clear circuit remains high. At this time, the output MISS of the
circuit is low. If there is a HIT caused by a regular operation from the associated
processor and there are no update requests from other caches (in this case,
WRITETHRU is high), the state of the flip-flop remains unchanged. If there is a line
miss, the LINEMISS line is low so that the flip-flop sets its output MISS high. The
MISS signal is used to make a request that the use of the system bus be granted
to transfer a missing line from the main memory, and in turn the main memory
responds to this signal by sending the requested line, along with the TRANSFER
signal, to the cache after the system bus arbiter grants the cache the use of the
system bus by setting BUSACK of that cache valid (logical 0). At this time
TRANSDONE from the transfer clear circuit still remains high until BUSACK
makes a transition from 0 to 1, which indicates that the line transfer is finished.
Then the transfer clear circuit produces the TRANSDONE signal, a narrow pulse
a few nanoseconds wide, to reset the flip-flop so that the MISS signal changes from
1 to 0.
As we have seen, the MISS signal is used to inform the main memory of a
line miss. Note that not only can a read/write operation cause the line number
generator to produce a LINEMISS signal if the requested data are not found
in the cache, but an update request from other caches in the multiprocessor
system can also cause a LINEMISS signal to be generated. These two kinds of line
misses must be handled in different ways. The way the first situation is handled
has been discussed previously. For the second situation, absence of the data to be
updated in the cache should not cause any changes in the cache. Therefore, in this
case, the state of the miss flag should not be changed as a result of searching the
directory even though the search indicates the requested data do not reside in the
cache. On the other hand, if the data to be updated reside in the cache, the cache only
updates the requested data without changing the status of the cache. Thus, there
has to be a circuit to ensure the cache contents are not changed during an update
operation except for replacing the data to be updated in the cache if the requested
data are found. In order to handle this case, the circuit shown in Fig. 47 (b) is
employed to produce the WRITETHRU signal which prevents the miss flag and
other components of the cache from changing during an update operation. For the
miss flag, the WRITETHRU signal is used to lock the path between the flip-flop
input and LINEMISS so that the LINEMISS signal cannot change the MISS signal
during an update operation. As discussed previously, CK1' and CK1'' are produced
for an update operation. CK1' latches the address for the update into the address
register and CK1'' returns the address before updating into the address register.
Hence, the WRITETHRU signal must be at logical 0 to guarantee that the miss flag
is not changed during the period between the beginning of CK1' and the end of
CK1''. This circuit employs a rising-edge-triggered D flip-flop. The flip-flop is
triggered by the rising edge of CK1' while UPDATE is high. CK1'' is used to clear
the flip-flop before the next update request and to lock the pass transistor to
guarantee that the WRITETHRU signal remains unchanged until the end of CK1'' even
though the flip-flop is reset by CK1''. The operation of this circuit is shown in the
timing diagram of Fig. 47 (c). Note that the WRITETHRU signal becomes low at the
beginning of CK1' and returns to the high level at the end of CK1''.

Figure 48: The Circuit for the Bus Control Signal Generator
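
The interaction between the miss flag and WRITETHRU can be summarised by a minimal behavioural model. It is a sketch under simplifying assumptions: the class and method names are hypothetical, signals are treated as positive-logic booleans, and clock phases are collapsed into method calls.

```python
class MissFlag:
    """Behavioural model of the miss flag guarded by WRITETHRU."""

    def __init__(self) -> None:
        self.miss = False  # MISS output, low after reset

    def directory_result(self, hit: bool, writethru: bool) -> None:
        """Apply the outcome of a directory search.

        writethru is True during normal processor accesses and False while an
        update operation from another cache is being serviced.
        """
        if writethru and not hit:
            # A processor access missed: request a line transfer.
            self.miss = True
        # With writethru False the path from LINEMISS is locked, so a failed
        # search on behalf of another cache leaves the flag untouched.

    def transfer_done(self) -> None:
        """TRANSDONE pulse: the missing line has arrived, clear MISS."""
        self.miss = False
```

A failed directory search during an update (`writethru=False`) leaves `miss` unchanged, whereas the same search result during a processor access sets it.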
The bus control generator in Fig. 48 generates a number of signals used for
communication with the processor and system bus controller. In Fig. 48, R and
W come from the processor and are used to access the cache memory. R/W
is sent to the bus update watcher to check the update request on the system bus.
Only when both W and R are low is the signal R/W set high. When ALE from
the processor is asserted during a write operation, the W signal from the processor
is latched into flip-flop A (called the write flag). A request BUSREQ for use of
the system bus can only be made by either a write operation (known by the state
of the write flag) or a line miss (indicated by the miss flag), as long as flip-flop
B, the bus busy flag, is reset. In the case of a line miss, the active-low BUSBUSY
output is high while BUSBUSY is low, so that BUSREQ is low. The signal BUSREQ is
sent out to request the use of the system bus. Note that at this time the active-low
BUSBUSY is invalid (logical 1) since there is not a valid BUSACK signal from the
system bus controller. After the cache sends the signal BUSREQ to the system bus
controller, the bus controller responds to the cache request for the use of the system
bus by generating a low BUSACK signal to inform the cache that it can use the system
bus as soon as the system bus arbiter makes a grant to this cache. Once a valid
BUSACK signal (low) from the bus controller is received by the cache during CK2,
the bus busy flag latches BUSACK and in turn makes the active-low BUSBUSY low to
reply to the bus controller that the cache memory is using the system bus. Meanwhile,
the active-low BUSBUSY also eliminates the bus request made before by pulling up
BUSREQ. The high BUSBUSY signal can gate out the MISS signal from the miss flag to
form a new signal MISSEXT and/or the signal from the write flag to form
UPDATEREQ to the system bus. If only the miss flag is set, it means the cache
memory is doing a miss operation caused by a read operation. If only the write flag
is set, the cache is doing a write-through operation for a write operation. If both the
miss flag and the write flag are set, the cache is working on a miss operation caused
by a write operation. When the miss flag rises, the valid MISSEXT signal is sent
out to inform the system bus to satisfy the miss
operation during BUSBUSY; and when the write flag is held, the UPDATEREQ
signal causes a SEARCHINT signal to ask the main memory and all other caches
to update the data on the system data bus. Furthermore, there is another signal
CACHEBUSY to be sent out to inform the processor to be idle under certain
conditions. The conditions which make CACHEBUSY valid are the following:

1. to update other caches (UPDATEREQ),
2. to be updated by other caches (UPDATE),
3. to have a line-miss operation (MISS).

If the write flag is set by W, it has to be reset by C for the next operation after a
certain period, during which the cache does a write operation.

Figure 49: The Update Request Clear Circuit
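
The flag combinations described for the bus control generator can be collected into one small model. This is a hedged sketch: the class `BusControlFlags` and its method names are illustrative assumptions, every signal is represented as a positive-logic boolean (True means asserted), and the electrical polarity of the pins is ignored.

```python
from dataclasses import dataclass


@dataclass
class BusControlFlags:
    """Positive-logic view of the flags in the bus control generator."""

    write_flag: bool = False     # flip-flop A, set by W when ALE is asserted
    miss_flag: bool = False      # MISS from the miss circuit
    bus_busy_flag: bool = False  # flip-flop B, set once BUSACK has been latched
    update: bool = False         # UPDATE from the bus update watcher

    def bus_request(self) -> bool:
        # A request needs a write or a miss, and the bus must not be held yet.
        return (self.write_flag or self.miss_flag) and not self.bus_busy_flag

    def miss_ext(self) -> bool:
        # MISSEXT is MISS gated out while the cache holds the bus.
        return self.bus_busy_flag and self.miss_flag

    def update_req(self) -> bool:
        # UPDATEREQ is the write flag gated out while the cache holds the bus.
        return self.bus_busy_flag and self.write_flag

    def cache_busy(self) -> bool:
        # The three listed conditions: UPDATEREQ, UPDATE, or MISS.
        return self.update_req() or self.update or self.miss_flag

    def operation(self) -> str:
        if self.miss_flag and self.write_flag:
            return "miss caused by a write"
        if self.miss_flag:
            return "miss caused by a read"
        if self.write_flag:
            return "write-through"
        return "none"
```

Setting `bus_busy_flag` after a grant withdraws `bus_request()` and enables `update_req()`, which mirrors the sequence in the text.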
The circuit in Fig. 49 (a) is employed to produce a clear signal C to reset
the write flag. Initially the inverted output Q of the falling-edge-triggered D
flip-flop is at logical 1 to lock the 2-input OR-gate so that the output C is high. If
there is a high UPDATEREQ signal before the beginning of a pulse of CK2 (note
that UPDATEREQ can only begin to be high during CK2, see Fig. 48), the
UPDATEREQ signal remains on the D input of the flip-flop until the end of the
CK2 pulse even though the UPDATEREQ signal changes during the CK2 pulse.
The first pulse of CK2 passes the 2-input NAND-gate to produce an inverted pulse
on one input of the OR-gate while it is also imposed on the input CP of the flip-flop.
During this pulse, C is not changed since Q locks the OR-gate. The falling edge of
the first CK2 pulse latches the high UPDATEREQ signal into the flip-flop, which
makes Q switch from 1 to 0. Thus, after the first CK2 pulse, Q is low so that
the 2-input OR-gate now becomes active. During the second pulse from CK2, the
inverted pulse from CK2 passes the 2-input OR-gate to make the clear signal C
valid while CK2 is imposed on CP. The C signal resets the write flag to stop the
UPDATEREQ signal. The UPDATEREQ signal is changed from 1 to 0 to end
the update request on the system bus. Note that the low UPDATEREQ signal is
not reflected on the D input of the flip-flop since it does not get through the pass
transistor until the end of the second CK2 pulse, so that the C signal is complete.
At the end of the CK2 pulse, the flip-flop is reset by the low UPDATEREQ signal
to make Q logical 1, which locks the C signal again. The operation of this circuit
is shown in the timing diagram of Fig. 49 (b).

Figure 50: The Read Valid Circuit for Read Miss

Fig. 50 (a) depicts a special circuit called the read valid circuit which is used to
create a signal READVALID to inform the processor that the data on the data
bus are the data requested during the transfer of a missing line to the cache. The
processor can receive the data from the bus without reading the data from the cache
memory after transfer of the missing line. Thus the delay time for transferring a
missing line for a read operation is decreased. During the transfer of a missing line,
the counter/register operates as a counter. For each pulse of the TRANSFWRT
signal from the transfer decomposer, the counter increases by 1. The 3 bits of
the counter are compared with the corresponding bits, bit 2 to bit 4 inclusive, of
the address register simultaneously. If all 3 pairs of the comparator inputs are
matched, the output of the 3-input NAND-gate is logical 0 at point A. If the line
miss is caused by a read operation (READ is high), the inverted TRANSFWRT
signal arrives at point B. In this case, the READVALID signal is valid (low level).
This READVALID is sent to the associated processor, and when the processor
receives the READVALID, it reads the data on the data bus immediately. The
comparator consists of three complementary XOR-gates. Fig. 50 (b) shows the
logic circuit for a complementary XOR-gate. The output becomes high only if
two inputs of the circuit have the same values. The operation of the read valid
circuit is illustrated in Fig. 50 (c).
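
The comparison performed by the read valid circuit can be illustrated in a few lines. This sketch assumes a 32-byte line of eight 4-byte words, that the counter starts at 0, and that the word whose index equals the counter value is on the data bus during that pulse; the function names are hypothetical and the result is a positive-logic flag rather than the active-low pin level.

```python
from typing import List


def word_index(address: int) -> int:
    """Bits 2 to 4 of the address select one of eight words in a line."""
    return (address >> 2) & 0b111


def read_valid_pulses(address: int, is_read: bool) -> List[bool]:
    """For each of the eight transfer pulses, report whether READVALID
    would be asserted (True) for the requested address."""
    target = word_index(address)
    # READVALID needs a read-caused miss and a match on all three bit pairs.
    return [is_read and count == target for count in range(8)]
```

For address 0x1234 the word index is 5, so only the sixth pulse asserts the flag, and no pulse does when the miss was caused by a write.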
In this chapter, the memory and control unit are designed, implemented and
simulated. Both eliminating cache rewriting during a write miss and bypassing data
to the processor during a read miss reduce line-transfer time.
5 USE IN A MULTIPROCESSOR SYSTEM
In a large modern computer system where there are often several independent
processors with a shared memory, competition between interconnected processors
for access to the shared memory may become a serious problem since several of the
high-speed processing elements may try to reference the shared main memory at the
same time. The performance of such multiprocessor systems is limited by the speed
and bandwidth of the bus and the main memory. A key to efficient operation is to
reduce both network traffic and direct references to the main memory. The long
memory reference latency caused by the network can be greatly reduced by the local
memory for each processing element since the majority of references to the main
memory can be captured by a local memory such as a cache memory [9, 10]. Fig. 51
illustrates a typical cache-based multiprocessor system with a shared memory, in
which each processor has an attached cache memory. Although the use of caches in
a multiprocessor system can greatly reduce the bus traffic and speed up the system,
such a system can cause a coherence problem because multiple copies of data in
the shared main memory will likely reside in several different caches at the same
time.
5.1 The Coherence Solution Strategy
Since the use of multiple private cache memories can cause a cache coherence
problem, a reliable strategy must be found to keep data in the system coherent.

Figure 51: A Typical Cache-based Multiprocessor System

Many different solutions have been proposed for this problem [5, 8, 10, 12, 14]. A
memory system is coherent if the value returned from a read in the system reflects
exactly the last value written in the referenced address by any processing element.
There are two kinds of data incoherence [1]:
1. After the data in caches are updated by the processing elements, they are
not consistent with those in the main memory.
2. Multiple copies of a given line of data can exist in several caches: updating
any copy of this line by a processing element will cause the values in caches
associated with other processing elements to be obsolete.
To eliminate the first case, a write-through policy is chosen in this system to
keep the information between the main memory and caches consistent. Whenever
there is a write request for a given address, the copies of the requested data in
both the cache and the main memory are updated simultaneously with the new
value. This scheme has some advantages [5]: first, it can be implemented without
complicated logic. Second, constant updating of both the cache and the main
memory at every write request keeps the information in the main memory always
consistent with that in the caches. Hence, if there is a write request for an address
that is not in the cache, the system can simply transfer the requested line from the
main memory to the cache to satisfy the request of the processing element using
a replacement policy without writing back the old line before it is replaced with
the new one since the data in main memory are always clean. Therefore, it is an
effective way to handle this type of coherence problem in a multicache system with
a shared main memory.
In the second case, an updating algorithm is employed rather than invalidation.
That is, whenever there is a write request to a cache from the processor, this request
will be broadcast to all the caches in the system to cause each one to search for
any copies of the requested data. If there are any, they are updated while the copy
in the main memory is rewritten. Otherwise, nothing is done in the caches. The
major drawback is that it does not tend to minimize communication network and
main memory traffic caused by write operations, and it forces all the caches to do
update operations even if copies of the data to be updated do not reside in most of
the caches.
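
The write-through and updating behaviour described above can be sketched as a toy simulation. It is not the hardware design: the classes `Cache` and `System` are hypothetical, caches hold single words instead of 32-byte lines, capacity and replacement are ignored, and uninitialised memory is assumed to read as 0.

```python
class Cache:
    """A private cache holding word copies keyed by address."""

    def __init__(self) -> None:
        self.lines = {}

    def snoop_update(self, addr: int, value: int) -> None:
        # Update the local copy only if this cache already holds the address.
        if addr in self.lines:
            self.lines[addr] = value


class System:
    """Shared main memory plus one private cache per processor."""

    def __init__(self, n_processors: int) -> None:
        self.memory = {}
        self.caches = [Cache() for _ in range(n_processors)]

    def read(self, cpu: int, addr: int) -> int:
        cache = self.caches[cpu]
        if addr not in cache.lines:
            # Read miss: fetch from main memory, which is always up to date.
            cache.lines[addr] = self.memory.get(addr, 0)
        return cache.lines[addr]

    def write(self, cpu: int, addr: int, value: int) -> None:
        self.memory[addr] = value             # write-through to main memory
        self.caches[cpu].lines[addr] = value  # local copy holds the new value
        for i, other in enumerate(self.caches):
            if i != cpu:
                other.snoop_update(addr, value)  # broadcast update
```

After a write by processor 0, a cache that already held the address sees the new value without a memory access, while a cache that never read it stays empty.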
In this system, the write-through policy and updating are combined as an
algorithm, write-through with updating, to handle both coherence situations. Whenever
there is a write request from a processor, this request is broadcast to all the caches
to inform the caches to update the data being written, if applicable; meanwhile,
the main memory will receive the correct value for the data. The write-through
with updating is based on the expectation that, if the data are actively shared, the
caches that have copies of updated data will use the copies before they are purged.
Data can be classified as shared and unshared as well as readable and writable
[10, 15]. The data are defined as shared, including readable or writable variables, if
they currently reside in more than one cache, while the term unshared data means
the data can only reside in one cache at any time. Therefore, there are four kinds
of data:
1. Shared read/write data are the data which can be either read or written by
several processors at the same time, such as shared read/write variables.
2. Shared read-only data are those which can only be read by several processors,
such as shared read-only variables and instructions (assuming that programs
are not self-modifying).
3. Unshared read/write data, meaning the data can only be read or written by
one processor at any time.
4. Unshared read-only data are defined as those which can only reside in one
cache at any time.
This policy is more appropriate for the case where much of the shared data
(a number of caches share the same data) is to be processed concurrently among
processors. When a processor rewrites the shared data in its cache, the copies in all
other caches are updated immediately so that other associated processors do not
need to transfer the updated data from the main memory when they have to use the
updated copies. An example to illustrate that this policy is efficient is management
of the common shared queue. It is assumed that this queue with a semaphore exists
in each of several caches at the same time. When one of the processors intends to
update the queue, it first checks the semaphore to see if the queue is being used
by another processor. If the queue is not used, the processor sets the semaphore
in the corresponding cache. Meanwhile, this updated semaphore is broadcast to
update the semaphore in all the caches, to prevent the queue from being used by
any other caches at this time. After updating the queue, the processor resets the
semaphore. The reset semaphore is also broadcast to update those copies in other
caches. If the queue is used by any of the other caches, the processor must wait until
the semaphore in its cache is reset. Another example is the calculation of the sum
of products of two sequences of numbers: SUM = A_1 B_1 + A_2 B_2 + ... + A_N B_N,
and the corresponding program is as follows:
    SUM := 0;
    for i := 1 to N do
        SUM := SUM + A(i) * B(i);
Assume that there are N processors for this calculation, the variable SUM is
shared and there is a semaphore initialized. To execute this program concurrently
with N processors, first, all the processors compute products of two numbers (that
is, processor i calculates A(i) * B(i), 1 ≤ i ≤ N), and the results are stored in their
corresponding local variables. Then the intermediate results are added together by
serialization. Any one of the processors intending to add its intermediate result
into the global shared variable SUM must check the semaphore to see if it has been
set by another processor. If so, the processor must wait until SUM is released by
the operating processor by resetting the semaphore. Otherwise, the processor sets
the semaphore to prevent SUM from being updated by other requesting processors
at this time, and then adds its local intermediate result into the SUM variable.
The result of the addition is broadcast (caused by the write operation for the SUM
variable) to all other processors, updating their copies of SUM. Then the processor
resets the semaphore for the next sum operation to be done by one of the other
requesting processors. Finally, in all the caches, there are consistent copies of SUM
which may be used for the next parallel calculations. Unlike this updating policy,
the invalidating policy simply invalidates all the copies of SUM in other caches
as the operating processor writes the partial result into the copy of SUM in its
cache. Thus, when any other processors want to continue the following operation
of the sum, they have to transfer the correct partial result of SUM from the main
memory before doing the sum operation. Hence, in this case, the updating policy
is more efficient than the invalidating policy.
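
The serialized accumulation described above can be tried out with threads standing in for processors. This is a software analogy only, using Python's standard `threading.Semaphore`; the function name is hypothetical, and cache updating has no counterpart here because all threads share one memory.

```python
import threading


def parallel_sum_of_products(a, b):
    """Each worker computes one product locally, then adds it to the shared
    total while holding a binary semaphore."""
    total = 0
    semaphore = threading.Semaphore(1)  # initialized: SUM is free

    def worker(i):
        nonlocal total
        local = a[i] * b[i]   # intermediate result in a local variable
        with semaphore:       # set the semaphore, or wait until it is reset
            total += local    # serialized update of the shared SUM
        # leaving the with-block resets the semaphore for the next worker

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(a))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

Calling `parallel_sum_of_products([1, 2, 3], [4, 5, 6])` adds the three products 4, 10 and 18 in whatever order the workers acquire the semaphore.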

probability that each of the updated data in the caches can be used before they
are purged since low miss ratios indicate fewer purges of cache lines. Therefore, a
larger size and a higher set associativity of the cache are preferred for this policy.
On the other hand, this policy incurs the cost of updating all the caches for each
write operation, and only a few updated copies may be used by respective caches
before the lines containing the copies are removed for requested lines. The worst
case for this policy is that no updates are useful for other caches; this happens,
for example, when all the processors execute independently their own processes
without use of shared data.
5.1.1 The Protocols between the Bus and the Cache
In this cache system, there is a mechanism for the cache to communicate with
the system bus; an asynchronous single system bus is assumed. Generally, the
cache has to communicate with the system bus in three cases: the first is a write
operation in which the cache has to send the data to be written onto the system
bus to update both other caches and the main memory, the second is the transfer
of a missing line in which the missing line is transferred to the cache via the system
bus, and the third is an update request from another cache.

Figure 52: Communication between the Cache and Bus for a Write Operation

Fig. 52 shows a timing diagram for a write operation. In the diagram, all the
control signals are active low except those from the processor, like W, WRITE,
and ALE. When W is asserted, the processor is doing a write operation on the
cache, along with a valid address on the address bus of the cache. ALE latches the
address into the address register of the cache memory. After searching its directory,
the cache system makes a bus request BUSREQ for the use of the system bus to
the system bus controller. Meanwhile, the cache checks SEARCHINT to see
if there is an update request from any of the other caches during BUSREQ. If the
system bus controller grants the system bus to the cache via the bus arbiter, BUSACK
is set low, which removes the request BUSREQ. Now the cache sends out the bus busy
signal BUSBUSY, along with the address and data on the system bus, to reply to
the bus controller that the bus is being used. Since this cache combines the
write-through policy and an updating algorithm to simplify the control, it also signals
an update request SEARCHINT onto the system bus to have all other caches
do an update operation. After a two-cycle period, the cache removes the request
SEARCHINT, which makes the bus controller invalidate BUSACK. Invalidation
of BUSACK clears the signal BUSBUSY, and the cache informs the processor
that the write operation has finished, which will make the processor remove W
for the next operation. As soon as the bus controller receives a BUSBUSY signal, it
selects one bus request from a bus request queue by sending a valid BUSACK to
the selected cache.
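
The order of events in the write handshake can be captured as a small state machine. The state and event names below are assumptions chosen for illustration, not signal names from the design, and the sketch records only ordering, not timing.

```python
# (state, event) -> next state; comments note what the cache does on entry.
TRANSITIONS = {
    ("IDLE", "write_request"): "REQUESTING",    # assert BUSREQ
    ("REQUESTING", "busack_valid"): "OWNER",    # drop BUSREQ, assert BUSBUSY
                                                # and SEARCHINT, drive the bus
    ("OWNER", "two_cycles_elapsed"): "RELEASING",  # remove SEARCHINT
    ("RELEASING", "busack_invalid"): "IDLE",    # BUSBUSY cleared, processor
                                                # told the write has finished
}


def run_handshake(events, state="IDLE"):
    """Follow the events in order; an out-of-order event raises KeyError."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state
```

Running the four events in the order given in the text returns the cache to `IDLE`; presenting `busack_valid` before any write request is rejected.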
Communication between the cache and the system bus for a line miss is more
complicated than that for a write operation. Fig. 53 illustrates the communication
operation between the cache and the system bus for a line miss caused by a write
operation. When the cache receives a write request from the processor, it makes
a bus use request BUSREQ to the bus controller since the data do not reside in
the cache. After the cache detects BUSACK, it removes BUSREQ and sends out
BUSBUSY to the bus controller, requesting use of the system bus. Meanwhile,
it also gates out the updating request SEARCHINT to all other caches and the
main memory and the transfer request MISSEXT to the main memory. After the
update operation for a write request, the main memory issues the requested line,
accompanied by the TRANSFER signal, to the cache. As soon as the transfer
operation is finished, the main memory informs the bus controller so that the bus
arbiter clears BUSACK. The high BUSACK signal in turn removes the requests
BUSBUSY and MISSEXT. The cache will inform the processor to terminate
the write operation by making CACHEBUSY high. Note that there is no need
to update the requested data in the cache memory since the line, being transferred
from the main memory, already contains the updated data.

Figure 53: Communication between the Cache and Bus for a Line Miss

Figure 54: Communication for an Update Operation

If a line miss is caused by a read request, no SEARCHINT signal is required,
since no update operation is necessary. Only the line transfer operation is done,
as shown in Fig. 53. Note that the shared main memory typically consists of
interleaved modules so that a requested line can be transferred in a short time.
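
Which bus signals the cache drives in each case can be tabulated with a short function. This is a summary sketch of the protocol text, with a hypothetical function name; it lists signal names only and says nothing about their timing or polarity.

```python
def bus_signals(is_write: bool, hit: bool) -> set:
    """Return the bus signals the cache asserts for one processor access."""
    signals = set()
    if is_write:
        # Every write goes through to memory and is broadcast for updating.
        signals |= {"BUSREQ", "BUSBUSY", "SEARCHINT"}
    if not hit:
        # Every miss needs the bus and a line transfer from main memory.
        signals |= {"BUSREQ", "BUSBUSY", "MISSEXT"}
    return signals
```

A read hit returns the empty set, and a read miss omits `SEARCHINT`, matching the observation that no update operation is needed for reads.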
Communication between the cache and the bus for an update operation is shown
in Fig. 54. When the processor sends a read request R to the cache, along with
the requested address A1 on the address bus, the cache does the read operation.
After the read operation is finished (indicated by change of R from 1 to 0), the
cache allows the updating address on the system bus to reach the cache address
bus and latches this address into the cache address register for the directory search.
At the same time, the cache reads out the data D1 (corresponding to A1) during
the first pulse of the READ signal, since the cache is pipelined. During this time,
the cache informs the processor to wait for one cycle. In this cycle, the cache can
determine if the data to be updated are in the cache or not, and at the end of
the cycle the directory has finished and is ready for the next request. Therefore,
in the next cycle, the processor sends the second read request, and the directory
is searched for the second request while the cache memory unit is updating the
data requested for the update operation on the cache data bus if the updating data
reside in the cache. In the following cycle, the cache sends the data to satisfy the
second read request of the processor. Since the cache does not intend to use the
system bus for read operations, BUSREQ, BUSACK, as well as BUSBUSY remain
high during the update operation. Note that the address and the associated data
for the update operation are placed onto the cache address bus and data bus from
the system address bus and data bus, respectively, under the control of four signals
which will be discussed in the following section.
5.1.2 The Protocols between the Processor and the Cache
In order to communicate with the processor in some special cases, the cache has
a signal called CACHEBUSY to inform the processor, which is directly connected with
the cache, to be idle. The conditions for a valid signal are:
1. Occurrence of a line miss or an update request to other caches.
2. Updating of the cache.
3. Waiting for use of the system bus (the system bus is being used by another
cache).
Whenever any of those three conditions is true, CACHEBUSY is valid, which
makes the processor remain idle until CACHEBUSY is invalid. How long the
processor is idle depends on the particular condition; for example, updating the
cache needs at most two cycles.
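The three conditions above can be sketched as a small predicate. This is only an illustration: the function and argument names are hypothetical, and the active-low output level is taken from the pin description of CACHEBUSY in Section 5.2.1.

```python
def cache_busy_level(miss_or_update_sent: bool,
                     updating_cache: bool,
                     waiting_for_bus: bool) -> int:
    """Level driven on the CACHEBUSY pin (hypothetical helper).

    The pin is described as active low, so 0 means "valid"
    (processor must idle) and 1 means the processor may proceed.
    """
    valid = miss_or_update_sent or updating_cache or waiting_for_bus
    return 0 if valid else 1
```

For example, `cache_busy_level(False, True, False)` gives 0, so the processor idles while the cache is being updated.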
Because of the limited number of pins on a chip, this system has only 32 pins for
addresses and 32 pins for data. These pins are used by both the processor and the
system bus. Therefore, the addresses both from the processor for access operations
and from the bus for updating of the requested data have to be latched in the address
register of the cache at different times. This is realized by two bi-directional switch
arrays, BSA1 and BSA2, as shown in Fig. 14. Each array has two parts, part 1
for the data bus and part 2 for the address bus. BSA1 is used to control the path
from the cache bus (both the address bus and data bus) to the processor; BSA2
controls the path from the cache bus to the system bus. The control signals are
from the cache, and they are based on different conditions.
Usually BSA1 is on and BSA2 is off, since most of the time the cache
communicates with its processor. During a write operation to the cache, the data being
written are broadcast on the bus to update other caches and rewrite the main
Figure 55: The Processor Subsystem
memory after the cache system receives the BUSACK signal from the bus
controller, while SEARCHINT is sent onto the bus to inform other caches and the
main memory. BSA2 switches on at this time to gate the data and address to the
system bus. After the write operation, BSA2 returns to the off state. When there
is an update request on the bus, the system has to do the update operation. BSA1
switches off to cut off the path to the processor while BSA2 turns on to connect
the path to the bus for the update operation. After updating, the arrays return to
their original states.
If there is a line miss caused by a read operation and if the data is in the
main memory, the requested line is transferred from the main memory to the
cache, one word at a time by interleaving. In this case, BSA2 is on while BSA1
is off. Furthermore, once the word being transferred on the bus is the word needed by
the processor, the signal READVALID is generated by the cache to inform the
processor to take the data on the bus, instead of reading the
requested data from the cache after the full line has been transferred. If the line
miss is caused by a write operation and the cache is granted use of the bus, first
BSA2 switches on while BSA1 remains on, to gate the data directly onto the bus.
Both the other caches and the main memory are updated. Then BSA1 switches
off to cut off the path to the processor so that the updated line is only transferred
to the cache via BSA2, without updating the requested data in the cache after
transfer. Thus the delay for transferring a new line during a line miss can be
decreased for both read and write operations.
The four signals, ADDBUS1, ADDBUS2, DATABUS1, and DATABUS2,
control operations of the BSA1 and BSA2 described above. The ADDBUS1 signal
controls operations of the address bus of BSA1 while ADDBUS2 determines
operations of the address bus of BSA2. Also DATABUS1 is used to control the data
bus of BSA1, and the DATABUS2 signal is used to control the data bus of BSA2.
The timing diagrams for operation of the signals for communication between the
cache and the processor are shown in Figures 57, 58, and 59 in the next section.
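The switch settings described in this section can be collected into a lookup table. The sketch below is illustrative only: the phase names are hypothetical, the address and data parts of each array are assumed to switch together, and the state of BSA1 during a write broadcast on a hit is an assumption (the text only states it explicitly for the write-miss case).

```python
from enum import Enum


class Phase(Enum):
    PROCESSOR_ACCESS = 1      # normal access served for the processor
    WRITE_BROADCAST = 2       # written data gated onto the system bus
    UPDATE_FROM_BUS = 3       # update request arriving from another cache
    READ_MISS_TRANSFER = 4    # line transfer after a read miss
    WRITE_MISS_BROADCAST = 5  # first step of a write miss
    WRITE_MISS_TRANSFER = 6   # second step of a write miss


# (BSA1 on, BSA2 on) for each phase, as read from the text.
SWITCHES = {
    Phase.PROCESSOR_ACCESS: (True, False),
    Phase.WRITE_BROADCAST: (True, True),    # BSA1 state is an assumption
    Phase.UPDATE_FROM_BUS: (False, True),
    Phase.READ_MISS_TRANSFER: (False, True),
    Phase.WRITE_MISS_BROADCAST: (True, True),
    Phase.WRITE_MISS_TRANSFER: (False, True),
}


def control_signals(phase: Phase) -> dict:
    """Map a phase to the four control signals (True = switch on)."""
    bsa1, bsa2 = SWITCHES[phase]
    # Assumption: part 1 (data) and part 2 (address) of an array move together.
    return {"ADDBUS1": bsa1, "DATABUS1": bsa1,
            "ADDBUS2": bsa2, "DATABUS2": bsa2}
```

Since the chip provides separate address and data controls per array, a real controller may switch the two parts at different times; the table only captures the coarse on/off pattern.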
5.2 External Interface
To be used in a cache-based computer system, the cache needs to interface to other
components in the computer system, including the associated processor, system
bus, main memory, etc. This section contains a brief description of the cache I/O
signals and timing.
5.2.1 The Interface Signals
The cache's external interface has 86 signals as shown in Fig. 56. A summary of
the pin functions is given below:
A0 - A31, Address bus lines (input). During execution of the write/read
operation, these inputs are the address from the associated processor via part two of
the Bus Switch Array 1 (BSA1). During execution of an update operation, they
contain the address from other caches in the multiprocessor system through part
two of the Bus Switch Array 2 (BSA2).
D0 - D31, Data bus lines (3-state, bidirectional). These signals provide the
data path between the cache and the processor as well as the system bus. The
data bus can transmit and accept data using the dynamic bus sizing capabilities
of the cache memory; the dynamic data size may be one, two, three or four bytes,
depending on the data requirement. During execution of the write/read operation,
these inputs/outputs are the data from/to the associated processor via part 1 of
BSA1. During execution of an update operation, they contain the data to be
updated by this cache from other caches through part 1 of BSA2.
Figure 56: Pin Functions
Y0, Y1, Accessed data size (input). These inputs from the processor indicate the
number of bytes of the data being accessed in one processor access cycle (see the
previous section).
W, Write operation (input, active high). This signal indicates to the cache that
the operation is a write operation.
R, Read operation (input, active high). It is used to indicate a read operation.
Write, Write strobe (input, active high). This signal is used to write the data
on the data bus into the cache. If the write operation causes a line miss, this signal
does not appear.
Read, Read strobe (input, active high). This signal is used to read the data
requested by the processor from the cache. If the read operation causes a line miss,
this signal does not appear.
ALE, Address latch enable (input, active high). It indicates that the address
on the address bus is valid, and is used to latch the address into the address register
of the cache.
CK1, Clock phase 1 (input, active high). It is used to generate cache control
signals and pipeline the cache.
CK2, Clock phase 2 (input, active high). It is used to generate cache control
signals and pipeline the cache.
CS, Chip selection (input, active high). This signal is used to indicate whether this
cache is selected during processor operations. It is very useful when multiple cache
chips are used in a computer subsystem.
RESET, System reset (input, active low). It clears the internal logic of the
cache memory.
VDD, System power (input). It is a +5 volt power supply.
VSS, System ground (input).
BUSREQ, Bus request (output, active low). This output is asserted to indicate
that the cache requests use of the system bus.
BUSACK, Bus grant acknowledge (input, active low). This signal indicates
that the system bus is now granted for use by the cache.
BUSBUSY, Bus busy (output, active low). The output indicates to the system
bus controller that the cache is using the system bus.
SEARCHINT, Search interrupt (bidirectional, active low). If there is a write
operation, this signal is an output which informs the main memory and other
caches to update the data on the system bus. Otherwise, it is an input which is
checked by the cache to see if there is an update request from other caches in the
multiprocessor system.
MISSEXT, Line miss (output, active low). It indicates that the data
requested by the processor are not found in the cache and asks the main memory to
transfer the missing line to the cache.
TRANSFER, Transfer of a missing line (input, active high). This signal from
the main memory responds to the request for transfer of a missing line to the cache.
It is used to write the missing line into the cache.
CACHEBUSY, Cache busy (output, active low). This signal is used to inform
the processor to be idle when it is valid.
ADDBUS1, Address bus control 1 (output, active high). This signal is used
to control part 2 (for the address bus) of BSA1. BSA1 is employed to control the
cache bus path to the processor (see the previous section).
ADDBUS2, Address bus control 2 (output, active high). It is used to control
part 2 (for the address bus) of BSA2. BSA2 is employed to control the cache bus
path to the system bus.
DATABUS1, Data bus control 1 (output, active high). This signal is used to
control part 1 (for the data bus) of BSA1.
DATABUS2, Data bus control 2 (output, active high). This signal is used to
control part 1 (for the data bus) of BSA2.
TESTIN, Test data input (input, active high). It is used to shift out the cache
memory addresses for testing only. 8 pulses are input for each address.
TESTOUT, Test data output (output, active high). It is used to shift out the
cache memory addresses for testing only. The outputs are an 8-bit sequence of a
cache line address from bit 0 to bit 7. After one testing pulse input from TESTIN,
one bit of the cache line address can be observed on TESTOUT.
5.2.2 The Timing Operations
As described previously, operations of the cache can be divided into five types:
normal read, normal write, read-miss, write-miss, and update operations. The
operations of these types are depicted in Figures 57, 58, and 59. Fig. 57 shows
Figure 57: A Timing Diagram for Read/Write Operations
Figure 58: A Timing Diagram for Read Operations with a Line Miss
Figure 59: A Timing Diagram for Write Operations with a Line Miss
how the normal read and write operations, including update operations required
by other caches, are processed. The operations on the left hand side of the figure
are the read operations with an update operation, while on the right hand side are
the write operations with an update operation. We can see from the figure that the
operations are pipelined; A1, A2, etc. are an address sequence on
the address bus while D1, D2, ... are the corresponding data sequence on the data
bus. The valid period of the SEARCHINT signal labeled Receiving is for the
cache to update the data after receiving an update request from another cache on
SEARCHINT, and the period labeled Sending is used to send an update request
onto the SEARCHINT line. The address labeled UD is the address for updating
and similarly the data labeled UD are the data being updated. Fig. 58 illustrates
the signal operations of the cache during a line miss caused by a read operation.
The data labeled 0 to 7 on the data bus are eight words of the missing line obtained
from the main memory by interleaving. Fig. 59 shows operations during a line miss
caused by a write operation. In the figure, the period labeled Updating Operation
is updating the cache during the request for the use of the system bus caused by a
line miss. The period labeled Send an Updating Request is used to send the data
to be written and the corresponding address onto the system bus for updating
other caches and rewriting the main memory. The period labeled Line Transfer is
transferring the missing line from the main memory into the cache.
5.3 Consideration of the System Bus and Main Memory
A typical multiprocessor system usually consists of a set of processors and/or a
set of memory and I/O modules linked together by means of an interconnection
network. Information exchange between either the processors themselves or the
processors and shared main memory is accomplished by the interconnection
network. Therefore, the interconnection network is a very important part of the
system. No generally accepted standard for an interconnection network exists, and
since the interconnection network costs are a significant part of the system cost, the
interconnection network is normally designed according to the requirements of the
specific application. Here the bus-oriented network (the system bus) is discussed.
There are several typical implementation policies for the single-bus arbiter. One
implementation of the N-user 1-server bus arbiter is based on a first-request
first-service policy. In this way, the bus arbiter always serves the request which was
made the longest time ago among the bus requests. In the case that there is more
than one request being made at the same time, the arbiter satisfies the one made
by the processor whose logical number is smallest. Fig. 60 depicts a block diagram
of the bus arbiter. It mainly consists of N circuit blocks, each of which corresponds
to a cache, and a state storage block. Signals R_0 to R_{N-1} are the bus requests from
N different caches for use of the system bus. For any block i, there is a signal
BusAck_i informing the corresponding cache that it may use the bus. Validation
of BusAck_i depends on the request R_i, the request grant G_i from the storage, and
C_{i-1}. All the circuit blocks are connected in a daisy chain by C_0 to C_{N-1} so that
Figure 60: The N-User 1-Server Bus Arbiter
block i can be invoked if and only if block 0 to block i - 1 are not invoked by bus
requests, and C_i will lock the following blocks (block i + 1 to block N - 1)
so that they cannot be invoked. Therefore, none of the succeeding requests can be
answered by the bus arbiter at this time. After the system bus is released by the served
request, the next request will be granted use of the system bus under the same
strategy. The state storage holds the information about the time the requests are
made. Whenever the system bus serves a request, the storage selects the one which
was made the longest time ago by asserting G_i. After a request is served, the block
i will be reset and the state storage updated for the next service.
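The first-request first-service policy with its tie-break can be modelled behaviourally as follows. This is a software sketch only, not the daisy-chain circuit of Fig. 60; the class and method names are hypothetical, and request times are supplied by the caller.

```python
class FirstRequestFirstServiceArbiter:
    """Behavioural model of an N-user 1-server bus arbiter (sketch)."""

    def __init__(self, n: int):
        self.n = n
        self.request_time = {}   # cache number -> time the request was raised
        self.owner = None        # cache currently holding the bus

    def request(self, cache: int, time: int) -> None:
        if not 0 <= cache < self.n:
            raise ValueError("unknown cache number")
        # A repeated request keeps its original (earliest) time stamp.
        self.request_time.setdefault(cache, time)

    def grant(self):
        """Grant the bus to the oldest request; None if busy or idle."""
        if self.owner is not None or not self.request_time:
            return None
        # Oldest request first; ties go to the smallest logical number.
        cache = min(self.request_time,
                    key=lambda c: (self.request_time[c], c))
        del self.request_time[cache]
        self.owner = cache
        return cache

    def release(self) -> None:
        """Called when the served request gives up the bus."""
        self.owner = None
```

With requests from caches 2 and 1 at time 0 and from cache 0 at time 1, successive grants (each followed by a release) go to cache 1, then 2, then 0.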
In general, devices in a multiprocessor system have different priorities for use of
the system bus. They can be grouped in terms of their priorities. The scheme
shown in Fig. 61 can be used for the arbitration unit, in which there is a two-level
Figure 61: The Arbitration Unit
parallel bus arbitration. The first level is organized with arbiters shown in Fig. 60.
For each group of requests with the same priority, an arbiter can be employed to
select one request. The CHAIN signal lines from all arbiters are connected and
the second-level arbitration selects the highest priority arbiter using a daisy chain.
As indicated previously, the main memory can be divided into modules which
are connected to the system bus. It is assumed that the shared main memory
for the system under consideration is partitioned into eight modules as shown in
Fig. 62. The data bandwidth of each module is one word (32 bits). The memory is
organized in such a way that the 8 words of a line are stored in 8 modules, respectively.
That is, the first word of line i resides in module 0 while the second word of
line i resides in module 1, the third is in module 2, and so on. Hence, a
line can be transferred easily by interleaving. When there is a request for transfer
of a missing line, each module sends a word in that line and the system bus delivers
all of them by interleaving so that the delay is reduced.
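The word-to-module mapping described above amounts to simple address arithmetic. The sketch below assumes byte addresses, 4-byte words and 8-word lines as given in the thesis; the function names are hypothetical.

```python
WORD_BYTES = 4        # one word is 32 bits
WORDS_PER_LINE = 8    # also the number of memory modules


def locate_word(byte_address: int):
    """Return (line number, module number) holding the addressed word."""
    word = byte_address // WORD_BYTES
    return word // WORDS_PER_LINE, word % WORDS_PER_LINE


def line_transfer_order(line: int):
    """(module, byte address) pairs in the order the words are interleaved."""
    return [(module, (line * WORDS_PER_LINE + module) * WORD_BYTES)
            for module in range(WORDS_PER_LINE)]
```

Word k of every line therefore always comes from module k, which is what lets all eight modules work on one line transfer at the same time.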
Also, for each module there is a buffer queue for write requests. Thus, the
specified cache only needs a short time to send the write request (including the
data to be written and the corresponding address) to the given module without
waiting for the main memory to complete the request. Whenever there is a write
request entering one module, the module controller first checks to see if there is
a write request in the buffer queue which is accessing the same location as the
entering request. If so, the data of the request already waiting in the queue will
be replaced by that of the entering request, and the entering request removed.
Otherwise, the entering request is inserted at the end of the queue. Thus, the
write-write competition is eliminated in the memory module. When there is
a cache miss, a line transfer is required, and all the modules transfer the missing
line immediately without inserting the request in the queues. In the case that
the line miss is caused by a write operation, first the module being overwritten
checks the queue to see if there is a request in the queue for the same location.
If so, the request in the queue is removed. Otherwise, the queue is unchanged.
Then the module serves the write request causing the line miss by updating the
requested memory location with the data on the system bus; the updated word is
sent to the requesting cache with the other 7 words from the respective modules by
interleaving. Meanwhile, the other 7 modules serve the transfer request as they do
for a transfer request caused by a read operation. When the line miss is caused by
Figure 62: The Shared Main Memory Partitioned into 8 Modules
a read operation, each module checks its buffer queue to see if there is a request in
the queue for a write into the location which the transfer request will access. If
there is such a request, it is removed from the queue and served immediately.
Thus, the so-called read-write memory competition is handled. The module then
sends the requested word onto the system bus. All 8 words from different modules
are sent on the bus by interleaving. This can be done by the bus controller. Note
that there is no read-read memory competition since the main memory only
serves read operations during a missing line transfer and the single system bus
only serves one request at a time.
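The queue handling of one memory module can be sketched as below. This is a behavioural illustration under assumptions: the class and method names are hypothetical, an `OrderedDict` stands in for the hardware buffer queue, and unwritten memory cells are assumed to read as 0.

```python
from collections import OrderedDict


class MemoryModule:
    """One interleaved module with a write buffer queue (sketch)."""

    def __init__(self):
        self.cells = {}             # address -> stored word
        self.queue = OrderedDict()  # pending writes in arrival order

    def enqueue_write(self, address: int, data: int) -> None:
        # Write-write competition: a pending write to the same location
        # keeps its queue position but takes the newer data; a write to
        # a new location is appended at the end of the queue.
        self.queue[address] = data

    def drain_one(self) -> None:
        """Complete the oldest pending write, if any."""
        if self.queue:
            address, data = self.queue.popitem(last=False)
            self.cells[address] = data

    def transfer_for_read_miss(self, address: int) -> int:
        # Read-write competition: a pending write to the requested
        # location is served first, then the word is sent.
        if address in self.queue:
            self.cells[address] = self.queue.pop(address)
        return self.cells.get(address, 0)

    def transfer_for_write_miss(self, address: int, data: int) -> int:
        # The stale pending write is dropped; the bus data win.
        self.queue.pop(address, None)
        self.cells[address] = data
        return data
```

Only the module holding the overwritten word would call `transfer_for_write_miss`; the other seven modules behave as in `transfer_for_read_miss`.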
5.4 Simulations of the Cache-based Multiprocessor
In previous sections, functions and structures of the cache memory and a
multiprocessor environment were described, based on a write-through with updating cache
coherence protocol, in which the multiple caches are used. In this section, a typical
simulation model, as shown in Fig. 62, will be used to study the efficiency of such
a shared-memory multiprocessor system. That is, we would like to determine how
many processors can be used in the system without reaching saturation of this
system. For simplification, a single bus is employed as the interconnection network
between multiple caches and the shared memory, although using a more complex
interconnection network such as a multi-bus system makes the system more
efficient. Furthermore, in the model all the processors are identical and each processor
has a private cache.
The model consists of a process for each processor, a process for each cache, and
a process for the single bus. Each processor generates a memory reference sequence
to the associated cache. Memory reference streams in the system are produced with
a specified write operation ratio and a given cache miss ratio in the steady state.
Write operations are produced at random with a given read/write ratio. Each
cache is implemented as has been described. For each processing subsystem, if
the reference is a read operation and the requested line is present in the cache,
the cache spends one cache cycle and the processor continues. If the reference is
a write operation and the requested line is found in the cache, the cache puts an
update request into the service queue of the system bus, and spends two cycles to
update all the caches and the main memory via the single bus as soon as the update
request is acknowledged by the bus. It is assumed that there is a buffer queue in
each memory module for write requests so that a write request can be sent to an
appropriate memory module in two cache cycles. Otherwise (the requested line is
not in the cache), a miss occurs. In this
case, the cache needs to insert the transfer request into the service queue of the
system bus, and the bus takes 11 cycles to transfer a requested line into the cache
when the request is served by the bus. The time required to transfer a missing
line is based on an assumption that a main memory cycle time is four cache cycles
and the main memory transfers the missing line by interleaving, one cache cycle
per word. So the transfer of a missing line requires one main memory cycle time
(4 cache cycles) for the first word plus one cache cycle for each additional word
(7 cache cycles). When the cache receives an update request from the bus, if the
requested data are found, the cache spends two cache cycles. Otherwise, it only
takes one cycle to search the directory. Note that those caches waiting in the bus
service queue for use of the system bus must halt until they are released from the
bus queue after service. The bus process receives service requests from all caches
and serves them in first-in first-out order, implemented by a first-in first-out service
queue in the simulation model. For one request in the queue, there are four items:
Cache Number from which the request is made, Write/Miss, Address, and Cache
Cycles to be used by the system bus. In this model, it is assumed that there is no
delay when the bus services a request.
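The cycle counts of the model can be checked with a short sketch. It is not the simulator used in the thesis: the names are hypothetical, and every miss (read or write) is assumed to occupy the bus for the 11-cycle line transfer quoted above.

```python
from collections import deque
from dataclasses import dataclass

MEMORY_CYCLE = 4      # main memory cycle time, in cache cycles
WORDS_PER_LINE = 8


def line_transfer_cycles(words=WORDS_PER_LINE, memory_cycle=MEMORY_CYCLE):
    """First word costs one memory cycle, each further word one cache cycle."""
    return memory_cycle + (words - 1)


@dataclass
class BusRequest:
    cache_number: int   # cache that made the request
    kind: str           # "write" or "miss"
    address: int
    cycles: int         # cache cycles the bus is occupied


def make_request(cache_number, is_write, hit, address):
    """Return the bus request a reference produces, or None for a read hit."""
    if hit and not is_write:
        return None                       # read hit: no bus use
    if hit:
        return BusRequest(cache_number, "write", address, 2)
    return BusRequest(cache_number, "miss", address, line_transfer_cycles())


def serve_fifo(requests):
    """Serve requests in FIFO order; return (cache, finish time) pairs."""
    queue = deque(requests)
    clock = 0
    finished = []
    while queue:
        req = queue.popleft()
        clock += req.cycles               # no extra bus delay is modelled
        finished.append((req.cache_number, clock))
    return finished
```

A write hit from cache 0 followed by a miss from cache 1 keeps the bus busy for 2 + 11 = 13 cache cycles, and cache 1 halts for that whole period.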
Fig. 63 summarizes the results of simulation for evaluation of the multiprocessor
system with the proposed caches as private caches. Simulation outputs include
bus utilization figures and other parameters under which the simulations are run.
The system power is defined as the total sum of the processor utilization in the
multiprocessor system, multiplied by 100, and processor utilization is measured by the
ratio of time spent doing useful work in a processor to the total running time. In
each figure, the simulation results obtained with the indicated parameter values
are shown for one to fifteen processors. The parameter Cycles gives the
cache cycles executed during simulation. Fig. 63 (a) shows the simulations with an
overall cache miss ratio of 0.05 while Fig. 63 (b) gives the simulation results with
an overall miss ratio of 0.03. In each figure, there are five curves, each of which
indicates a simulation with write operation ratios varying from 10 percent to 30
percent of memory references. The system power rises until the system bus begins
to reach saturation. When the bus utilization approaches 100 percent, the system
power levels out. For each figure, it is seen that the system power becomes higher
and the minimum number of processors at which the single system bus reaches
saturation increases as the write operation ratio decreases. Comparison of the two
figures indicates that a decrease in the overall miss ratio of the caches increases
system power.
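The definition of system power can be written out directly. The function names and the sample figures in the usage note are illustrative only and are not simulation results from the thesis.

```python
def processor_utilization(useful_cycles: int, total_cycles: int) -> float:
    """Fraction of the run a processor spends doing useful work."""
    return useful_cycles / total_cycles


def system_power(useful_cycles_per_processor, total_cycles: int) -> float:
    """Sum of the processor utilizations, multiplied by 100."""
    return 100 * sum(processor_utilization(u, total_cycles)
                     for u in useful_cycles_per_processor)
```

Two processors that are busy for 50000 and 25000 of 100000 cycles would give a system power of 75; an ideal system of n fully busy processors would reach 100n.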
Although use of the specified caches in this given multiprocessor structure can
greatly reduce references to the slow main memory, the single bus system seems to
be a bottleneck in the multiprocessor since write operations and transfer of miss
lines under the write-through protocol still make the bus busy. In order to further
improve the system performance, an increase in system power and bus ability can
be achieved as follows:
First, data can be labeled as either private or shared. If there is a write operation
Figure 63: Simulations for the Multiprocessor System. Panels (a) Miss Ratio: 0.05 and (b) Miss Ratio: 0.03 plot System Power (0 to 300) against Number of Processors (1 to 17), with Cycles: 100000 and Write Rate: 10%, 15%, 20%, 25%, 30%.
on private data, the cache does not issue an update request to all other caches while
it sends an overwrite request to the main memory via the system bus. In this case,
other caches can do useful work without being interrupted for an update operation.
Thus, the system power would be increased although the system bus has the same
traffic as that without this enforcement. Note that in this scheme, there are two
signals required instead of SEARCHINT: one is used to inform all caches to
update copies of the shared data and the other is used to request the main memory
for a write operation.
Second, use of caches with larger sizes can increase the system performance,
since a lower overall miss ratio for the caches results in both a higher system power
and a higher bus utilization in the simulations. Also, the larger cache size makes
the cache have a lower miss ratio. The multiprocessor system can take advantage
of the efficient caches since a large cache memory for each processor can easily
be formed by using several cache chips selected by chip selects without decreasing
the cache speed.
Third, the use of multiple busses would significantly increase system
performance, because the waiting time of each cache for use of the system bus would
certainly be decreased.
5.5 Testing the Cache Memory
Testing the circuit is the final step of a VLSI design, to determine if the circuit
being tested has both correct logic and circuit operations. Although testing will be
done only after fabrication, how to complete the testing task has to be considered
during circuit and architecture design.
Testing of the cache memory can be done using a microcomputer. All the I/O
lines of the cache memory chip to be tested are connected to the microcomputer
via an interface. All the signals for controlling operations of the cache memory,
and the addresses and corresponding data, are sent to the tested cache and all the
corresponding results are received by the microcomputer. The microcomputer will
check to see if the cache operations are correct. The microcomputer, step by step,
will test all the functions of the cache memory.
In order to test this cache memory chip, there is a simple additional logic block
in the chip to shift out the cache memory address from the memory register.
Therefore, for each main memory address referenced, the corresponding cache memory
address and its content can be observed. Thus, we can create a table containing
the main memory addresses for testing, the corresponding cache memory addresses,
and the corresponding data during testing.
Fig. 64 illustrates the test circuit which can shift out the cache memory
addresses. The 8-bit cache addresses are imposed on bits Bit0 - Bit7 and the
multiplexer is controlled by a 3-bit counter. Whenever there is a pulse at TEST-IN,
the counter is increased by 1. One of the 8 bits of the cache address passes the
multiplexer to the output TEST-OUT. Thus, for shift-out of a cache address, 8
pulses are imposed on TEST-IN and the 8 bits of the address can be received
at TEST-OUT one by one.
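The counter-and-multiplexer arrangement can be modelled in a few lines. This is a behavioural sketch with hypothetical names; it assumes the counter starts at 0 and that each pulse presents the currently selected bit before the counter advances, so the bits appear in order from bit 0 to bit 7.

```python
class AddressShiftOut:
    """3-bit counter driving an 8-to-1 multiplexer (sketch)."""

    def __init__(self, cache_address: int):
        if not 0 <= cache_address < 256:
            raise ValueError("cache line address must fit in 8 bits")
        # bits[i] is the value on multiplexer input Bit i.
        self.bits = [(cache_address >> i) & 1 for i in range(8)]
        self.counter = 0

    def pulse(self) -> int:
        """Apply one TEST-IN pulse and return the bit seen on TEST-OUT."""
        bit = self.bits[self.counter]
        self.counter = (self.counter + 1) & 0b111   # wraps after 8 pulses
        return bit


def shift_out(cache_address: int):
    """Collect the 8 TEST-OUT bits produced by 8 TEST-IN pulses."""
    circuit = AddressShiftOut(cache_address)
    return [circuit.pulse() for _ in range(8)]
```

The test microcomputer can rebuild the address from the observed bits with `sum(bit << i for i, bit in enumerate(bits))`.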
Figure 64: The Testing Circuit for Shifting-out Cache Line Addresses
In this chapter, use of the cache in a multiprocessor environment is described,
in which a system bus and a shared main memory are assumed. A write-through
with updating strategy is proposed and employed to keep data in the system
coherent. The system bus and shared memory structures are discussed. A queueing
model is created and the system simulations have been done to evaluate the system
performance.
6 CONCLUSIONS
This cache memory system has been laid out within a chip, using the 3 micron
NT CMOS3 technology, and simulated. It has an 8K-byte cache memory (4 bytes
for each word, 8 words for each line), and it is organized as an 8-way set-associative
cache. The cache memory is directly accessible to one, two, three, or four bytes (one
word) at once by the associated processor. A two-phase clock is used to synchronize
and pipeline the system. The clock period is 40 nanoseconds.
In the directory, there are 32 sets, so the 8 line slots of each set can be
compared simultaneously. The address translation can be finished in 18 nanoseconds.
Thus, the cache can safely turn out a result in 20 nanoseconds during read
operations without line misses.
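The figure of 32 sets follows from the cache parameters, and it fixes how a 32-bit address splits into fields. The sketch below assumes the conventional layout (byte offset in the lowest bits, set index above it, tag in the remaining bits), which the text does not state explicitly; the names are hypothetical.

```python
CACHE_BYTES = 8 * 1024
LINE_BYTES = 32
WAYS = 8
ADDRESS_BITS = 32

LINES = CACHE_BYTES // LINE_BYTES            # 256 lines
SETS = LINES // WAYS                         # 32 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 5 bits select a byte in a line
INDEX_BITS = SETS.bit_length() - 1           # 5 bits select a set
TAG_BITS = ADDRESS_BITS - INDEX_BITS - OFFSET_BITS   # 22 bits compared per slot


def split_address(address: int):
    """Return (tag, set index, byte offset) under the assumed layout."""
    offset = address & (LINE_BYTES - 1)
    set_index = (address >> OFFSET_BITS) & (SETS - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, offset
```

Under this layout the directory compares one 22-bit tag against the 8 line slots of the selected set in parallel.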
The least recently used line replacement strategy is employed in the replacement
unit. There are 32 components, each one implemented by a bit matrix
corresponding to a set.
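One common way to realise LRU with a bit matrix is the reference-matrix scheme sketched below, shown for a single 8-way set. Whether the chip uses exactly this update rule is an assumption; the class name is hypothetical.

```python
class BitMatrixLRU:
    """Reference-matrix LRU for one set (sketch of a standard scheme)."""

    def __init__(self, ways: int = 8):
        self.ways = ways
        self.m = [[0] * ways for _ in range(ways)]

    def touch(self, way: int) -> None:
        """Record an access: set row `way` to ones, then clear column `way`."""
        for col in range(self.ways):
            self.m[way][col] = 1
        for row in range(self.ways):
            self.m[row][way] = 0

    def victim(self) -> int:
        """The least recently used way is the one whose row is all zeros."""
        for row in range(self.ways):
            if not any(self.m[row]):
                return row
        raise AssertionError("matrix invariant broken")
```

After touching ways 0 to 7 in order, `victim()` returns 0; touching way 0 again moves the victim to way 1. Before every way has been touched, several rows are all zero and the lowest-numbered one is chosen.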
This cache memory can be used in a multiple processor system to improve
the system performance; a write-through with updating policy, a combination of a
write-through and an updating algorithm, is employed to keep the information in the
main memory consistent with that of the caches and to make the multicaches in
the system coherent. The hit ratios of this cache memory, in terms of the Cache
Design Target Miss Ratios Table, are predicted to be over Wi percent.
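The effect of the write-through with updating policy can be shown with a small behavioural model. The classes `Cache` and `Bus` are hypothetical, data is tracked per address rather than per 32-byte line, and a write that misses in the writer's own cache is assumed not to allocate a line; none of these details are taken from the design itself.

```python
class Cache:
    """One processor's cache, reduced to an address -> value map."""

    def __init__(self):
        self.lines = {}


class Bus:
    """Single shared bus connecting all caches to the main memory."""

    def __init__(self, caches):
        self.caches = caches
        self.memory = {}

    def read(self, cache, addr):
        if addr not in cache.lines:               # miss: fetch from memory
            cache.lines[addr] = self.memory.get(addr, 0)
        return cache.lines[addr]

    def write(self, addr, value):
        self.memory[addr] = value                 # write-through to memory
        for cache in self.caches:                 # update every cached copy
            if addr in cache.lines:
                cache.lines[addr] = value


c0, c1 = Cache(), Cache()
bus = Bus([c0, c1])
bus.memory[0x100] = 1
bus.read(c0, 0x100)
bus.read(c1, 0x100)
bus.write(0x100, 7)                               # issued by processor 0
print(c1.lines[0x100], bus.memory[0x100])         # 7 7
```

Every write appears on the bus, which keeps main memory and all copies identical at the cost of bus traffic.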
Compared with an on-chip cache memory, this cache memory chip has a larger
cache size and a lower miss ratio. Unlike a cache system consisting of a cache
controller and RAM chips, it is more flexible to build a cache system which has a
larger cache capacity (more than 8K bytes) for one processing element with
several of the proposed cache memory chips by using the chip select signal; this does
not decrease the system speed. It is also easy to implement a cache system with
separate cache memories for data and instructions. This multi-chip cache system
also eliminates the delay time caused by wire connections between the cache controller
and RAM chips (off-chip delay).
Although this cache has many advantages, there are several drawbacks, due
to limitations of the VLSI technology used. It does not have a "SNOOP" directory
which can be used to snoop the system bus for update operations, and in turn
to eliminate the directory search time for updates. It does not further reduce
the references to the main memory caused by write operations, which can create
heavy interconnection network traffic under the write-through policy, especially in
a single-bus shared-memory multiprocessor system.
An implementation using a more modern process technology, say a 1.5 micron
technology, would permit a larger cache memory chip, or rather a large on-chip
cache memory, with dual directories and faster address translation. It could also
allow the implementation of both the write-through and write-back policies in a
cache memory, which could give the cache a great performance improvement.
For higher performance, assuming the data are classified into shared and
unshared as mentioned before, if there is a request for writing a shared read/write
variable, the write-through policy is used to keep the multicaches and main memory
consistent, since the shared read/write variable may be modified by several
processors. If the accessed data are an unshared read/write variable, the write-back policy
is employed to decrease the network traffic, since only one valid copy of this variable
can exist in one cache. In the case that the line containing the unshared read/write
variables is to be replaced by a new requested line at a cache miss, if this line has
been updated since it existed in the cache, it is written back to the main memory
before transfer of the new line to make the information in the main memory correct.
If it has been unchanged since it resided in the cache, it, like the line which
contains the read-only data (including the shared and unshared read-only variables as
well as instructions), is simply overwritten by the new requested line, since it is
consistent with that in the main memory.
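The mixed policy described above can be summarized in a short sketch. The `Line` record, its `shared` and `writable` flags, and the functions `write` and `evict` are hypothetical names; how a line comes to be classified as shared or unshared is assumed to be known in advance, and main memory is modelled as a plain dictionary.

```python
from dataclasses import dataclass


@dataclass
class Line:
    data: int
    shared: bool          # classification assumed to be given
    writable: bool        # False for read-only data and instructions
    dirty: bool = False   # set when an unshared line is modified


def write(line, value, memory, addr):
    """Write-through for shared data, write-back for unshared data."""
    if not line.writable:
        raise PermissionError("write to read-only data")
    line.data = value
    if line.shared:
        memory[addr] = value      # write-through keeps memory current
    else:
        line.dirty = True         # defer the memory update


def evict(line, memory, addr):
    """Replace a line; return True if it had to be written back."""
    if line.dirty:
        memory[addr] = line.data  # write back before the new line arrives
        line.dirty = False
        return True
    return False                  # clean line is simply overwritten


memory = {0: 1, 1: 2}
private = Line(data=2, shared=False, writable=True)
write(private, 20, memory, 1)
print(memory[1])                  # 2
print(evict(private, memory, 1))  # True
print(memory[1])                  # 20
```

Only modified unshared lines generate memory traffic at replacement time, which is the traffic reduction the paragraph above argues for.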




