How to Use 1000 Registers by Sites, Richard L.
HOW TO USE 1000 REGISTERS 
Richard L. Sites 
Department of Applied Physics and Information Science 
University of California, San Diego 
ABSTRACT 
The advent of VLSI technology will allow the
fabrication of complete computers plus memory on one
chip. There will be an architectural challenge in
the very near future to adjust to this trend by
designing balanced architectures using hundreds or
thousands of registers or other small blocks of
memory. As the relative price of memory (vs. random
logic) drops even further, the need for register-heavy
architectures will become even more pronounced. In
this paper, we discuss a spectrum of ways to exploit
more registers in an architecture, ranging from
programmer-managed cache (large numbers of
explicitly-addressed registers, as in the Cray-1)
to better schemes for automatically-managed cache.
A combination of compiler and hardware techniques
will be needed to maximize effective register use
while minimizing transmission bandwidth between
various memories. Discussed techniques include
merging activation records at compile time,
predictive cache loading, and "dribble-back" cache
unloading.
I. INTRODUCTION.
VLSI technology will soon make it possible to
put an entire computer plus a large number of
storage locations (perhaps 100-1000 registers) on a
single chip. On the larger scale of a computer
occupying a few printed-circuit boards, VLSI
memories will allow economical designs with a number
of localized memories that are "closer" to the
processor logic using them than the larger main
memory (figure 1). How can computer architects make
effective use of such short-term memories?
Figure 1. A computer with two localized short-term
memories.
[Diagram: processing logic, registers, cache, and main
memory. The registers and the cache are labeled
short-term memory; main memory is labeled long-term
memory.]
In this paper, we first present a framework of
issues for comparing short-term memory designs, then
we present some new techniques for facing these
issues. Our basis of comparison is a simple computer
with no short-term memory, but only long-term
main memory. All operations are done memory-to-memory,
with no intermediate registers. Our quest is
to find economical ways of adding short-term
memory(ies) to this base machine.
A generalized short-term memory cell (STM
cell) consists of three fields, some or all of which
may be physically realized: a short name, a long
name, and some data, as shown in figure 2. In
this model, a set of eight ordinary general-purpose
registers would be represented by eight STMs, having
short names 0-7, no long names, and one word of
data each. An ordinary 1K word cache memory with
4-word lines (i.e. groups of four contiguous words
are moved into or out of the cache) would be
represented by 256 STMs, each having an anonymous
short name (the physical cache memory location), a
main memory address for a long name, and four words
of data. Other STMs will be presented below.
Figure 2. Generalized short-term memory cell (STM
cell):
   : short name : long name : data :
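The STM cell model can be written down as a small data structure. The following is only a sketch: the class name `STMCell` and its field names are illustrative choices, not notation from the paper, and an absent field is represented as `None`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class STMCell:
    """One generalized short-term memory cell; any field may be absent."""
    short_name: Optional[int]   # e.g. a register number; None if anonymous
    long_name: Optional[int]    # e.g. a main memory address; None if unused
    data: List[int] = field(default_factory=list)  # one or more words


# Eight general-purpose registers: short names 0-7, no long names,
# one word of data each.
registers = [STMCell(short_name=n, long_name=None, data=[0]) for n in range(8)]

# A 1K-word cache with 4-word lines: 256 cells with anonymous short names.
# The long name (a main memory address) is filled in when a line is loaded.
WORDS, LINE = 1024, 4
cache = [STMCell(short_name=None, long_name=None, data=[0] * LINE)
         for _ in range(WORDS // LINE)]
```

Under this model the two ends of the spectrum differ only in which of the three fields are populated.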
Our spectrum of discourse ranges from one
extreme of ordinary registers, whose use is totally
explicit and completely visible to the
machine-language programmer, to the other extreme of
ordinary cache memory, whose use is totally
automatic and intended to be invisible to the
programmer. We will examine two middle-range
concepts of partly-explicit, partly-automatic
management, as shown in figure 3.
Figure 3. Spectrum of discourse:
   Registers     Renaming     Cache of         Cache
                              Register Sets
   :-------------:------------:----------------:
   <---- explicitly              automatically ---->
         managed                 managed

II. DESIGN ISSUES.
There are three major motivations for, and three
related design issues in, introducing short-term
memory into a computer design:
1. Fewer address bits.
2. Faster access.
3. Lower bus bandwidth to long-term memory.
4. Load/store instructions for short-term
memory.
5. Access to the most recent data.
6. Usage density of short-term memory.
Each of these ideas is briefly explained below.
Fewer address bits. If a frequently-used data
item is moved from main memory to a general
register, then the register number (short name) can
be used to refer to that datum instead of the main
memory address (long name). This results in
instruction formats that are more compact than
formats for our pure memory-to-memory base machine,
and this compactness is a major advantage of
conventional register-machine architectures. Pure
CALTECH CONFERENCE ON VLSI, January 1979 
stack-machine architectures carry this concept even
further by using zero instruction bits to specify
the top of stack. An ordinary cache memory does not
save any address bits in instruction formats.
Faster access. If a frequently-used data item
is moved from main memory to a short-term memory
location, then access to that datum is often speeded
up by a factor of 4-20 (or even 1000 when
considering main memory to be a cache on a paging
drum). This faster access comes about through a
combination of faster circuit technology for the
short-term memory cell, shorter wire delays, less
bus contention, and simpler address decoding.
Explicitly-addressed register files will
always offer slightly faster access than cache
designs, since the cache must always take some time
to find a match for the long name presented. Today's
fastest cache designs have two-clock-cycle access,
while registers often have one-cycle access.
Lower bus bandwidth. If most memory accesses
are to the short-term memory, then the bandwidth
needed for the long-term memory bus can be
substantially lower than in our base machine. This
allows a more economical design if multiple
processors or I/O devices are connected to a single
long-term memory, as in the PDP-11/60 design [1].
This may also allow use of a serial-multiplexed bus
instead of a more expensive fully-parallel bus, thus
saving wires, pins, or silicon area.
Load/store instructions. If moving data into
and out of the short-term memory is done explicitly
via software instructions, then two costs are
suffered: first, a programmer or compiler must
decide where to insert the data movement
instructions; and second, instruction bits and time
are consumed by these explicit commands. This
overhead runs counter to the address bit savings
discussed in the first point above. Cache memories
neither save instruction bits nor cost data movement
instructions. On the other hand, explicit register
loads are easy to implement and can be positioned to
pre-fetch data so that it arrives at the short-term
memory just before it is needed; demand-fetch cache
schemes cannot do this. Predictive cache hardware
is just beginning to be investigated [2], and has
recently been implemented in the Amdahl 470/V8. The
address stream presented to a cache can be viewed as
a group of interleaved arithmetic progressions. If a
simple algorithm can be used to decompose address
streams into these progressions, then a cache could
prefetch data in each progression.
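One possible "simple algorithm" of this kind is sketched below. It is an illustrative greedy scheme, not the method of [2] or of any particular machine: the function name `predict_prefetches`, the `max_stride` limit, and the pairing rule are all assumptions made for the example.

```python
def predict_prefetches(addresses, max_stride=64):
    """Greedily split an address stream into arithmetic progressions.

    Returns a list of (position, predicted_next_address) pairs, one for
    each reference that was recognized as part of a progression.
    """
    streams = []      # each stream is [last_address, stride]
    pending = []      # addresses not yet assigned to any stream
    predictions = []
    for i, addr in enumerate(addresses):
        for s in streams:
            if s[0] + s[1] == addr:          # addr continues this progression
                s[0] = addr
                predictions.append((i, addr + s[1]))
                break
        else:
            # Try to start a new progression from an earlier unmatched address.
            for p in pending:
                stride = addr - p
                if 0 < abs(stride) <= max_stride:
                    pending.remove(p)
                    streams.append([addr, stride])
                    predictions.append((i, addr + stride))
                    break
            else:
                pending.append(addr)
    return predictions


# Two interleaved progressions: 100, 104, 108, ... and 5000, 4998, 4996, ...
trace = [100, 5000, 104, 4998, 108, 4996, 112, 4994]
predictions = predict_prefetches(trace)
```

For this trace, each stream is recognized at its second reference, and every later reference in the trace has already been predicted by the time it arrives.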
There is another kind of load/store overhead
associated with using a short-term memory: when
calling a subroutine, switching tasks, or starting
an I/O transfer, it is often necessary to save the
current machine state, or to force it to be
consistent. This means explicitly saving and
restoring all the programmer-visible registers in
a machine architecture, and perhaps explicitly
copying a cache to main memory, or purging a
translation lookaside buffer (TLB) or some otherwise
"hidden" short-term memory. As short-term memories
become larger and more prevalent, this load/store
overhead will become a dominant speed factor.
Already, machines such as the IBM 370 have introduced
partial purge instructions to avoid invalidating an
entire TLB of 128 entries, and the Cray-1 software
has to struggle with trying not to save all
8+8+64+64+512 = 656 registers at every subroutine
call or interrupt [3].
ARCHITECTURE SESSION 
THE ADVENT OF LARGE, CHEAP SHORT-TERM
MEMORIES DEMANDS BETTER SOLUTIONS TO
THE LOAD/STORE OVERHEAD PROBLEM.
Stale data. If a datum is copied to a short-term
memory, and then one of the two copies is
changed, subsequent access to the other copy will
result in fetching stale data. This is obviously a
disaster. For a hardware-managed cache memory,
simple preventatives for the stale data problem
involve notifying the cache of the addresses of all
main memory cells changed by any processing or I/O
logic in any part of a computer system. This runs
counter to the lower bus bandwidth issue above. For
a programmer- (or compiler-) managed register, the
stale data problem is often prevented by storing the
register just before some operation that might
access the long-term memory copy, then reloading the
short-term memory after that operation. Such
operations are surprisingly frequent unless an
exhaustive analysis of the program is performed. For
example, if a program makes many references to one
element of an array, say A(3), then it is desirable
to keep a copy of that element in some register.
However, any other reference to the same array, such
as A(I), potentially accesses the third element, so
the register copy of A(3) must be stored before a
fetch from A(I), and reloaded after an assignment to
A(I). Depending on the language involved, it can
take extensive flow analysis on the part of the
compiler or programmer just to discover whether a
reference to A(I) is possible during the time that
A(3) is in a register. For example, in Fortran it is
possible that A(3) is kept in a register inside some
loop, but the loop includes a branch to some far-away
piece of program that changes A(I) then
branches back into the loop! If a compiler or
programmer is not willing to do this sort of flow
analysis, then IT IS NOT POSSIBLE TO KEEP A(3) IN A
REGISTER without being exposed to the stale data
problem. This is the fundamental reason why simple
compilers rarely make effective use of registers,
and why many assembly-language programs are
difficult to modify by someone other than the
original author. A copy of A(3) could be kept in a
cache memory with no stale data problem, since the
cache monitors all accessed addresses for a possible
match, and supplies the most recent data if one is
found.
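The A(3) versus A(I) hazard can be reproduced in a few lines. In this sketch the loop body (increment A(3), then double A(I)) is an invented example, and the function names and the `guard` flag are hypothetical. A local variable `r` stands in for the register, and a Python list stands in for main memory.

```python
def reference(A, indices):
    """Memory-to-memory version: every access goes to the array itself."""
    A = list(A)
    for i in indices:
        A[3] += 1          # A(3) = A(3) + 1
        A[i] *= 2          # A(I) = A(I) * 2, where I may equal 3
    return A


def with_register(A, indices, guard):
    """Keep A(3) in a 'register' r; guard=True adds the store/reload pair."""
    A = list(A)
    r = A[3]               # LOAD r, A(3)
    for i in indices:
        r += 1
        if guard:
            A[3] = r       # store the register before the fetch from A(I)
        A[i] *= 2
        if guard:
            r = A[3]       # reload the register after the assignment to A(I)
    A[3] = r               # final STORE
    return A


A = [0, 10, 20, 30, 40]
indices = [1, 3, 2]        # the second iteration aliases A(3)
correct = reference(A, indices)
guarded = with_register(A, indices, guard=True)
unguarded = with_register(A, indices, guard=False)
```

Here `guarded` equals `correct`, while `unguarded` ends with a different value in element 3, because the doubling through A(I) operated on the stale memory copy and was then overwritten by the final store.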
The stale data problem also forces the saving
and restoring of almost all registers across a
subroutine call on register machines: continuing the
above example, the subroutine might refer to A(3),
expecting to find the most recent value in its
allocated main memory location, not in some
register.
There is one more aspect to the stale data
problem -- aliases. If a long-term memory location
can be accessed via more than one name, either
because two distinct virtual memory addresses are
mapped to the same physical address, or because two
distinct high-level language variables in fact refer
to the same location (e.g. one is a global variable,
and the other is the same variable passed as a
parameter), then it is possible that neither a
hardware nor a software (compiler) mechanism will
detect that an assignment to one name should update
a copy of the other name kept in some short-term
memory. In virtual memory systems, avoiding this
problem involves either prohibiting aliases by
software convention, or building cache hardware that
compares only physical addresses, not virtual ones.
In compiler systems, aliases are either detected
through extensive analysis of a program, or no
copies of variables can be kept in registers across
references to global variables, parameters, pointer
assignments, subroutine calls, or a number of other
such common occurrences.
AS SHORT-TERM REGISTER MEMORIES GET LARGER,
SUBROUTINE CALLS WILL GET SLOWER, UNLESS WE
FIND BETTER SOLUTIONS TO THE STALE DATA AND
ALIAS PROBLEMS.
The stale data problem in all its forms is
probably the hardest design problem to be faced in
any system that creates copies of data. The
extensive compiler analysis required to take full
advantage of fast registers is one reason that cache
memories have become so popular -- the hardware
substitutes continual address comparisons during
execution for compile-time comparisons. Thus, we
have a trade-off: for simply-compiled code an N-word
cache memory performs better than an N-word register
memory, while for carefully-optimized code an N-word
register memory performs faster (because of the
inherently faster access mentioned above), and is
simpler to build.
Usage density. If an architecture provides 200
words of short-term memory, but most programs use
only 50 of these words, the memory is under-utilized.
One "solution" in such a situation is to
make the short-term memory smaller, but in the long
run the opposite is preferable -- design the
software to make effective use of more short-term
memory. One example is in order: the Cray-1 provides
64 T-registers, each 64 bits wide with 1-cycle
access (compared to 11-cycle access to a random word
in main memory). To avoid load/store overhead, some
system software uses none of these registers. One
compiler that does generate code that uses the
T-registers is the Pascal compiler at Los Alamos. It
places local scalar variables into T-registers, but
the short subroutines encouraged by clean Pascal
coding style often have fewer than five such local
variables. Thus many Pascal programs use only about
10% of the available short-term registers. For such
a machine, we need software designs that use more
registers. One such design is described in Section V
below.
III. CACHE MEMORIES. 
We will briefly summarize how ordinary cache
memories fare with respect to the above six design
issues. Figure 4 shows a cache memory as an STM cell
associating a long name with some data.
Figure 4. Cache memory as an STM cell:
   : short name (anonymous) : long name : data :
Address bits. Cache memories save nothing in
instruction formats.
Access time. Faster than long-term memories,
but not quite as fast as registers built out of
identical circuits.
Bus bandwidth. As effective as registers with
the same load/store characteristics. Often cuts down
bandwidth by a factor of 10 (see e.g. [1]).
Load/store overhead. No instruction overhead,
except for rare "purge the cache"-type
instructions.
Stale data. The forte of cache design -- once
the virtual address alias problem is dealt with,
cache memories completely solve the software-level
alias problem.
Usage density. This is also a strong point of
cache systems -- blindly doubling the size of a
cache will usually give a much larger performance
improvement than blindly doubling the number of
equivalent registers.
IV. REGISTERS. 
We will briefly summarize how ordinary register
memories fare with respect to the above six design
issues. Figure 5 shows a register memory as an STM
cell associating a short name with some data.
Figure 5. Register memory as an STM cell:
   : short name : data :
Address bits. The forte of register designs --
instruction formats shrink.
Access time. The simplicity of explicitly and
directly addressed registers gives an inherent speed
advantage over caches.
Bus bandwidth. Similar to cache in cutting
down data accesses.
Load/store overhead. On many register
machines, 25% of all instructions are Loads
or Stores (see [1, p. 351] or [4] for examples). The
instruction bits for these must be balanced against
the address bits saved in other instructions. Data
may be pre-fetched.
Stale data. No hardware or execution time is
"wasted" in trying to detect stale data, but
effective use of registers demands compile-time
analysis.
Usage density. Again, careful compile-time
analysis is needed to take advantage of more
registers. Changing assembly language code to use
more registers cannot usually be done automatically.
V. TECHNIQUES FOR EFFECTIVE USE OF LARGE SHORT-TERM
MEMORIES.
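Notice how complementary the above two lists are
(compared to our base machine):

                  cache                    registers
   address bits   no savings               formats shrink
   access time    fast                     fastest
   bus bandwidth  reduced                  reduced
   load/store     no instruction overhead  explicit Loads/Stores
   stale data     handled by hardware      needs compile-time analysis
   usage density  strong point             needs compile-time analysis
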
Can we find some way to use the best features of
both schemes? Are there techniques that are a
merging of the two extremes? How can we trade off
compile-time analysis vs. run-time analysis? Some 
possible solutions are discussed below. 
Renaming. We can separate the idea of short
names from the idea of fast access by defining a
RENAME operator: RENAME X,Y means that the short
name X will be used to access the long name Y until
another RENAME involving the same X occurs. RENAME
is like LOAD of a register in that subsequent
accesses to Y can use just the short name, but it
differs from a LOAD because no data movement is
implied. Hence, we get the short name without
necessarily getting faster access. So what is the
advantage of RENAME over LOAD? Figure 6 shows a
RENAME mechanism as an STM cell associating a short
name with a long name. A similar instruction was
implemented for the index registers of the IBM 7030
(Stretch) [5].
Figure 6. Rename memory as an STM cell:
   : short name : long name : data :
First, RENAME can be implemented in conjunction
with a cache memory, such that RENAME gives strong
hints to the cache to load (or pre-fetch) the data
at location Y. This restores the speed improvement
of LOAD. Second, no explicit STORE instruction is
associated with RENAME -- the use of the short name
X instead of the long name Y is just discontinued at
some point in a program. This saves a little
instruction space, and it means that, for example, a
compiler does not have to do the flow analysis to
detect all branches out of a loop in order to find
all the places to insert the STORE X,Y to match a
LOAD X,Y at the beginning of the loop. Third and
most importantly, a compiler does not have to do any
alias analysis. As discussed above, when using
LOAD/STORE to keep a copy of Y in register X, all
other references to Y must be found and changed,
or X must be appropriately restored to Y and
refetched around any constructs that potentially
touch Y. With RENAME, the implementation must
ensure that references using the short name X and
the long name Y both access the same actual data.
Under these circumstances, use of the short name X
does not require any flow analysis to find other
uses or potential uses of Y.
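The semantics of RENAME can be sketched with a table that maps short names to long names. This is an illustration only: the class `RenameMachine` and its method names are hypothetical, and a single dictionary stands in for "cache plus main memory kept consistent by hardware". The `prefetch_hints` list merely records where a real implementation might start a cache load.

```python
class RenameMachine:
    """Short names are aliases for long names; no data is copied."""

    def __init__(self, memory):
        self.memory = memory        # long name (address) -> word
        self.table = {}             # short name -> long name
        self.prefetch_hints = []    # addresses a cache could pre-fetch

    def rename(self, x, y):
        """RENAME X,Y: bind short name x to long name y (no data movement)."""
        self.table[x] = y
        self.prefetch_hints.append(y)

    def read_short(self, x):
        return self.memory[self.table[x]]

    def write_short(self, x, value):
        self.memory[self.table[x]] = value

    def read_long(self, y):
        return self.memory[y]

    def write_long(self, y, value):
        self.memory[y] = value


m = RenameMachine({0x100: 7, 0x200: 1})
m.rename(0, 0x100)
m.write_long(0x100, 9)     # an aliased store through the long name...
seen_by_short = m.read_short(0)   # ...is visible through the short name
m.write_short(0, 11)
seen_by_long = m.read_long(0x100)
m.rename(0, 0x200)         # no STORE needed; the old binding is just dropped
```

Because both names resolve to the same datum, neither access path can observe stale data, which is the property the text asks of any RENAME implementation.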
Cache of register sets. Consider a machine
with 16 general registers in its architecture.
These registers are normally saved in main memory
when calling a subroutine, and reloaded from main
memory when returning from a subroutine. As
discussed above, we desire to build machines with
many more than 16 registers, but we don't want to
slow down all subroutine calls. Assuming that
almost all registers are in use at the point of
call, and almost all will be used by the subroutine
(so that we cannot avoid some sort of save/restore),
then one way to speed up the call linkage is to have
duplicate register sets. Say there are four sets,
0-3, and that the calling subroutine is using set
1. Then the called routine just starts using set 2,
and no data movement of set 1 to main memory is
needed. This makes the subroutine call quite fast,
and it also makes the linkage overhead no longer
proportional to the number of registers. When the
subroutine returns, the machine just switches back
from set 2 to set 1.
There are two flaws in the above scheme that
need fixing. First is the obvious question of how
to do the fifth nested subroutine call. The answer
is that after switching from register set 1 to
register set 2, a cache-like mechanism is needed to
dribble-back register set 1 to the place in main
memory that it would have gone in the simple
machine. Dribble-back means that the requisite 16
STOREs are queued at a low priority, so that
whenever the running subroutine (using set 2) does
not need a bus cycle to main memory, one of the
queued stores is done. After the first 16 unused
memory cycles pass, all of register set 1 is properly
stored in main memory, so more nested subroutine
calls can reuse that register set. This scheme
stands in stark contrast to existing machines that
provide multiple register sets, such as the RCA
Spectra 45 (IBM 360-like), or some models of the
PDP-11. These have four register sets, but the sets
have dedicated uses (operating system, kernel, real-time
interrupt, and all user code is a typical allocation
of the four), and there is no automatic recycling of
data to main memory.
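The timing of dribble-back can be illustrated with a tiny simulation. This is a sketch under simple assumptions: one queued STORE completes in each cycle that the running subroutine leaves the bus idle, and the function name `dribble_back` and the `bus_busy` trace format are invented for the example.

```python
from collections import deque

REGS_PER_SET = 16


def dribble_back(bus_busy):
    """Return the cycle in which a dribbled-back register set is fully stored.

    bus_busy[c] is True when the running subroutine needs the memory bus
    in cycle c; a queued store is performed only in the other cycles.
    Returns None if the trace ends before all 16 stores are done.
    """
    queue = deque(range(REGS_PER_SET))   # the 16 low-priority STOREs
    for cycle, busy in enumerate(bus_busy):
        if not busy:
            queue.popleft()              # one queued store uses the idle cycle
            if not queue:
                return cycle
    return None


idle_bus = dribble_back([False] * 16)            # bus never needed
half_busy = dribble_back([True, False] * 20)     # every other cycle is free
saturated = dribble_back([True] * 50)            # no free cycle at all
```

With an idle bus the set is saved after 16 cycles; with every other cycle free it takes 32; and with a saturated bus the save never completes, so a deeper call that needs the set would have to wait.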
The dribble-back technique also stands in stark
contrast to the usual STORE MULTIPLE of registers at
time of call, because the called subroutine need not
wait until the stores finish before starting its
execution. In fact, by making the priority of
dribble-back stores lower than that of other stores,
the register saving always uses otherwise wasted bus
cycles, i.e. the register saving is completely
free in terms of execution time. Since register
save/restore is already a significant overhead on
many machines with general registers, and since the
trend is toward more registers and more short
subroutines as a programming style, dribble-back
will become even more significant for saving
subroutine linkage time. The Amdahl 470/V6 already
uses a form of dribble-back to implement the PURGE
TLB instruction (which must invalidate all
locations in the virtual address lookaside buffer)
[6]. The implementation involves a duplicate set of
VALID bits for the TLB, so the PURGE TLB instruction
simply switches to the other set, which has
previously all been set to "invalid". During the
next 128° 3 machine cycles, each bit of the just-used
set is changed to "invalid". So long as two PURGE
TLB instructions are separated by at least that many
machine cycles, the implementation is extremely fast
(in direct contrast to the IBM 370/168
implementation). This matches the operating system
software, which only rarely executes a PURGE TLB. If
a second such instruction is issued too soon, the
470/V6 CPU just waits until the previous
invalidation cycle is finished.
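The duplicate-VALID-bit idea can be sketched as follows. This is not a description of the actual Amdahl hardware: the class `DoubleValidTLB`, the default of 128 entries, and the rate of one bit reset per `tick` are assumptions chosen to keep the example small.

```python
class DoubleValidTLB:
    """Two sets of VALID bits; a purge switches sets and cleans up lazily."""

    def __init__(self, entries=128):
        self.entries = entries
        self.valid = [[False] * entries, [False] * entries]
        self.active = 0        # which set of VALID bits is in use
        self.dirty_left = 0    # bits of the inactive set not yet reset

    def tick(self):
        """Background work for one machine cycle: reset one stale bit."""
        if self.dirty_left:
            self.dirty_left -= 1
            self.valid[1 - self.active][self.dirty_left] = False

    def purge(self):
        """Switch VALID sets; return the number of cycles stalled."""
        stalled = 0
        while self.dirty_left:         # previous invalidation not finished
            self.tick()
            stalled += 1
        self.active = 1 - self.active  # the other set is already all invalid
        self.dirty_left = self.entries
        return stalled


tlb = DoubleValidTLB()
tlb.valid[tlb.active][5] = True        # pretend one translation is cached
first_stall = tlb.purge()              # nothing pending: no stall
for _ in range(28):
    tlb.tick()                         # 28 cycles of background clean-up
second_stall = tlb.purge()             # issued too soon: waits for the rest
```

The first purge costs nothing, while the second, issued after only 28 cycles, stalls for the remaining 100 resets, which mirrors the "just waits" behavior described above.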
A subroutine return can simply start using a
previous register set, unless that set was
dribbled-back to main memory then overwritten with registers
for a more deeply nested subroutine call. In this
case, the registers need to be reloaded from the
data in main memory. In general, it is easy to keep
a COPY bit associated with each register set, such
that the COPY bit is on if the register set is an
exact copy of the corresponding data in main
memory. The COPY bit is turned on when the last
register has been dribbled-back to main memory, and
it is turned off again if a nested call reuses that
set. It is also turned on when the last register is
reloaded from main memory. Then subroutine calls
and returns can just use a register set if its COPY
bit is on, and must wait for the main memory data
movement to catch up if the bit is off.
Consider a deep subroutine nest of A calls B
calls C calls D calls E calls F, with four sets of
registers. A uses set 0, B set 1, C set 2, D set
3, and E uses set 0 after all of A's data is
dribbled-back to memory. Similarly, F uses set 1
after B's data is saved. When F returns to E, it is
possible to start reloading B's data, so that when C
is later ready to return to B, there will be no
delay. On the other hand, if the very next
instruction after F's return to E is a call from E
to G, set 1 is needed for G to use, and any of B's
data loaded into set 1 is wasted effort.
The thoughtful reader will have noticed that we
are just running a top-of-stack buffer for a stack
of register sets. For such buffers, an amount of
hysteresis is useful: once a register set is
stored, do not reload it immediately. Instead, wait
until the probable time to reload matches the
probable time until the reloaded data will be
needed. In the case of nested subroutine calls
above, we would like to start reloading B's
registers into set 1 exactly 16 main memory cycles
before C returns to B (assuming no outside access
interference). In general, we cannot exactly
predict when to start reloading B's data, but we
can perhaps safely wait until E returns to D and D
returns to C. Similarly, we could apply hysteresis
at the other end of the buffer by not even starting
the stores of A's registers until 16 cycles before
D calls E, i.e. until just before that register set
will need to be reused by a deeper subroutine.
The major effect of introducing some hysteresis
along with multiple register sets is that we
diminish the needed bandwidth to main memory. In
fact, instead of asking for a given bus how much
bandwidth must be supplied, the computer designer
could ask "here is a fixed bandwidth: how much
short-term memory and hysteresis must be supplied in order
to exceed that bandwidth only rarely?" If we delay
storing a subroutine's registers until, say, two
more levels of subroutine call have been done, then
we never even bother to save registers of a
subroutine that only calls one level down then
returns. For a software system that rarely nests
calls three deep, it would be possible to run for
hours without spending any time or bus bandwidth
saving and restoring registers, yet the occasional
call chain that is 12 deep is handled gracefully,
and never with more data transfer than the simple
scheme with only one register set.
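The bandwidth claim can be checked on a call/return trace. The sketch below models the extreme case of hysteresis, where a register set is saved only when a deeper call must reuse it and reloaded only when a return actually needs it. The function name `transfers` and the 'C'/'R' trace encoding are assumptions made for this example.

```python
def transfers(trace, num_sets):
    """Count register-set saves and reloads for a call/return trace.

    trace is a string of 'C' (call) and 'R' (return). The routine at
    depth 0 occupies one register set when the trace starts.
    """
    depth = 0          # current nesting depth
    resident_low = 0   # shallowest depth whose registers are still in a set
    saves = reloads = 0
    for op in trace:
        if op == 'C':
            depth += 1
            if depth - resident_low >= num_sets:   # every set is occupied
                saves += 1                         # spill the oldest set
                resident_low += 1
        else:
            depth -= 1
            if depth < resident_low:               # caller's set was spilled
                reloads += 1
                resident_low = depth
    return saves, reloads


shallow = transfers('CR' * 1000, num_sets=4)             # never nests deeply
deep = transfers('C' * 12 + 'R' * 12, num_sets=4)        # one chain 12 deep
simple = transfers('C' * 12 + 'R' * 12, num_sets=1)      # one register set
```

The shallow trace causes no memory traffic at all, and the 12-deep chain needs 9 saves and 9 reloads with four sets, compared with 12 of each for the single-set scheme.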
A few paragraphs back, we mentioned two flaws.
The second flaw is that after subroutine A calls B,
but before A's registers are dribbled back to main
memory, B may try to fetch from the place in main
memory that is supposed to contain one of A's
registers. Alternately, after A's registers are all
safely copied to main memory, B may change the
contents of one of those memory locations. If B
then returns to A without calling anyone else, the
simple description above would have A use the stale
data in its register set, without ever reloading the
changed word in main memory. The solution to this
flaw involves using standard cache techniques: the
unused register sets that contain copies of main
memory data are exactly cache locations, and all
accesses to the corresponding main memory locations
must update the cache also. Thus, four register
sets look like a four-line cache, with a main memory
address tag associated with each line (register
set), and with an associative lookup of these four
tags whenever main memory is referenced. This
scheme effectively ties together the two ends of our
short-term memory spectrum.
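A minimal sketch of register sets acting as tagged cache lines follows. The class `RegisterSetCache`, the convention that a tag is the base address of a set's save area, and the write-through policy in `store` are all assumptions made for illustration.

```python
class RegisterSetCache:
    """Register sets treated as cache lines tagged with a memory address."""

    def __init__(self, memory, num_sets=4, regs=16):
        self.memory = memory                  # list of words (main memory)
        self.regs = regs
        self.tags = [None] * num_sets         # base address of each save area
        self.sets = [[0] * regs for _ in range(num_sets)]

    def _lookup(self, addr):
        """Associative search of the tags; return (set, register) or None."""
        for s, tag in enumerate(self.tags):
            if tag is not None and tag <= addr < tag + self.regs:
                return s, addr - tag
        return None

    def store(self, addr, value):
        hit = self._lookup(addr)
        if hit is not None:
            s, r = hit
            self.sets[s][r] = value           # keep the register copy current
        self.memory[addr] = value

    def load(self, addr):
        hit = self._lookup(addr)
        if hit is not None:
            s, r = hit
            return self.sets[s][r]            # register copy is most recent
        return self.memory[addr]


mem = [0] * 64
c = RegisterSetCache(mem)
c.tags[0] = 16            # A's registers belong at addresses 16..31
c.sets[0][2] = 99         # A's register 2, not yet dribbled back
before = c.load(18)       # B fetches A's save location: sees the register
c.store(18, 5)            # B changes that location
after = c.sets[0][2]      # A's register set is updated as well
```

Both halves of the second flaw are covered: a fetch from an address that maps to a register set returns the register's value, and a store to such an address updates the register set along with memory.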
VI. CONCLUSIONS AND FUTURE RESEARCH.
Registers are simple to build, fast, and small
numbers of them are easy for programmers and
optimizing compilers to use effectively. Cache
memories are more complicated, but easier to use.
Providing many registers is an attractive way for
the hardware designer to use VLSI technology to
support economical short-term memory. Providing a
combination of hardware alias resolution and stale
data prevention via cache-like address comparisons,
along with many registers, may be the best
total-system design for effective use of 1000 or more
register locations.
Cached register sets are particularly attractive
for implementing fast subroutine calls, but the
same ideas also apply to implementation of hardware
stacks or queues (contrast the Burroughs 7700, with
32 top-of-stack buffer registers, the automatic
saving and restoring of which significantly slows
down subroutine calls [7]), and to the
implementation of task switching. In the latter
case, complete duplicate machine states could be
kept in multiple register sets.
One line of future research is to measure
existing software to discover how much short-term
memory hardware would be useful, and what are the
parameters for managing it (for example, carefully
gathered statistics on dynamic subroutine
call/return activity could help decide an optimum
number of register sets, plus the parameters of the
hysteresis algorithm).
A second line of research is just the converse
-- given a fixed arbitrary amount of short-term
memory hardware, how can software be automatically
re-done to take full advantage of that amount? If
only a few levels of subroutine nesting can be
handled quickly, automatic insertion of subroutine
code inline at the point of call would decrease the
number of levels of call in a software package. If
many subroutines have only 5 local variables
available for short-term storage and a machine
provides 30 short-term registers, then a
compile-time mapping of the local variables from six
subroutines into one merged activation record [3]
could provide a much better match to the machine --
the usage density goes way up, and calls between
subroutines in such a group would not need to save
or restore registers at all: each subroutine just
uses a different five of the 30 registers.
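The compile-time mapping for a merged activation record can be sketched as a simple allocation of disjoint register blocks. The function name `merge_activation_records` and its dictionary interface are hypothetical, and the sketch ignores everything a real compiler would also consider, such as recursion or calls that leave the group.

```python
def merge_activation_records(local_counts, num_registers):
    """Assign each subroutine in a group a disjoint block of registers.

    local_counts maps a subroutine name to its number of local scalars.
    Returns a dict mapping each name to the range of registers it owns.
    """
    assignment, next_free = {}, 0
    for name, count in local_counts.items():
        if next_free + count > num_registers:
            raise ValueError("group does not fit in the register file")
        assignment[name] = range(next_free, next_free + count)
        next_free += count
    return assignment


# Six subroutines with five locals each, on a machine with 30 registers.
group = {name: 5 for name in "ABCDEF"}
plan = merge_activation_records(group, 30)
used = sum(len(block) for block in plan.values())   # registers in use
```

Since no two subroutines in the group share a register, a call from one to another needs no save or restore, and all 30 registers are in use.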
Both lines of research must be pursued
simultaneously if we are to take full advantage of
the short-term memory architectures that VLSI
technology makes economical.
REFERENCES 
[1] C.G. Bell, J.C. Mudge, and J.E. McNamara,
Computer Engineering, chapter 13. Digital
Press, Bedford MA, 1978.
[2] A.J. Smith, "Sequential Program Prefetching in
Memory Hierarchies", IEEE Computer, December
1978, pp. 7-21.
[3] R.L. Sites, "An Analysis of the Cray-1
Computer", Fifth Annual Symposium on Computer
Architecture, April 1978, pp. 101-106.
[4] R.P. Blake, "Exploring a Stack Architecture",
IEEE Computer, May 1977, pp. 30-38.
[5] IBM Corp., Reference Manual: 7030 Data
Processing System, form A22-6530, 1960.
[6] Amdahl Corp., Amdahl 470/V6 Hardware Reference
Manual, 1976.
[7] E.I. Organick, Computer System Organization,
Academic Press, New York NY, 1973, pp. 91, 101.
