A systolic LRU processor and its top-down development  by Luk, Wayne & Brown, Geoffrey
Science of Computer Programming 15 (1990) 217-233 
North-Holland 
A SYSTOLIC LRU PROCESSOR AND ITS 
TOP-DOWN DEVELOPMENT 
Wayne LUK 
Programming Research Group, Oxford University Computing Laboratory, 11 Keble Road, Oxford,
England OX1 3QD, United Kingdom
Geoffrey BROWN 
School of Electrical Engineering, Cornell University, Ithaca, NY 14853, USA
Revised June 1990
Abstract. We present a novel systolic processor that implements the least-recently-used (LRU)
policy for multi-level storage systems. The design is developed by successively refining a high-level
description of the algorithm. The effect of varying the degree of pipelining on performance is
discussed. We also show how the design methodology used for the LRU processor can be applied 
to the development of other systolic systems. 
1. Introduction
In multi-level storage systems, data are partitioned into pages with frequently 
used pages kept in a small, fast primary storage and with less frequently used pages 
kept in a large, slower secondary storage. It may become necessary for performance 
reasons to move a page from secondary to primary storage, because the frequency 
with which pages are accessed varies with time. The replacement policy determines 
the best candidate for replacement among the pages in primary storage; LRU dictates 
that this candidate is the page that has been accessed least recently. 
An LRU implementation needs to perform two tasks: first, to maintain a sequence 
of the pages, ordered by most recent time of access, that currently reside in the 
primary storage; and second, to provide a mechanism for detecting the least recently 
used page in this sequence. Given that the next page to be accessed is p, updating 
the sequence of pages consists of deleting p from the sequence and prepending p 
to the result. If p is not an element of the current sequence, then the addition
of p is accompanied by the removal of the last element of the sequence.
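The update rule just described can be sketched in a few lines of Python. This is only an illustrative reference model, assuming the pages in the sequence are distinct and ordered from most to least recently used; the name `lru_access` is hypothetical and does not appear in the paper.

```python
def lru_access(pages, p):
    """Return the page sequence after page p is accessed (most recent first)."""
    if p in pages:
        rest = [x for x in pages if x != p]  # delete p from the sequence
    else:
        rest = pages[:-1]                    # p is absent: drop the least recently used page
    return [p] + rest                        # prepend p to the result


def lru_victim(pages):
    """The replacement candidate is the last element of the sequence."""
    return pages[-1]
```

For example, accessing page 2 in `[1, 2, 3]` moves it to the front, while accessing an absent page 4 pushes page 3 out.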
Our proposed LRU processor is based upon a nonsystolic algorithm originally 
described, without proof, by Dijkstra [1]. It consists of a chain of identical
components similar in both size and speed to the cells of a shift register. The novel
aspects of our development of this design include: 
• the architecture of the processor is obtained by successively refining a high-level
description of the LRU algorithm;
• the design is captured as a succinct expression with parameters that can be
varied to give implementations with different performance trade-offs;
• the development method is quite general and has been used in deriving a
number of word-level and bit-level systolic designs [4,7].
0167-6423/90/$03.50 © 1990 - Elsevier Science Publishers B.V. (North-Holland)
We shall first introduce the notation for describing hardware and the associated 
algebraic theorems for transforming designs. We then indicate how the LRU 
algorithm, expressed as a set of recursion equations, can be recast in this notation. 
The resulting representation is further refined by algebraic transformations to pro- 
duce a parametrised description from which a range of designs with different 
performance trade-offs can be generated. The application of this development 
method in deriving other systolic processors is also briefly discussed. Finally we 
summarise our work and provide the proofs of relevant theorems in Appendix A. 
2. Notation 
A simple notation for describing recursive algorithms and for expressing such 
algorithms using combinators will be presented. To deal with sequential systems 
some additional notions, such as relations and streams, will be considered. 
2.1. Recursion equations and combinators
Objects in our notation are either atoms (such as numbers) or sequences of objects:
for instance the object $(0, (1, 2))$ is a 2-sequence containing the number 0 and the
sequence $(1, 2)$. A sequence is an ordered collection of elements with the empty
sequence denoted by $(\,)$. Sequences are appended using the operator "${}^\frown$", so
$$(1, 2, 3, 4) = (1)^\frown (2, 3)^\frown (4)^\frown (\,)$$
$\#x$ denotes the number of elements in sequence $x$. The function $\mathit{last}$ is used to
extract the last element of a sequence; so $\mathit{last}\ (x_0, x_1, x_2, x_3) = x_3$. The $k$th
component of an $N$-element sequence can be extracted by the projection function
$\pi_k$ $(1 \le k \le N)$; for example $\pi_2\ (x, (y, z)) = (y, z)$.
Notice that function application is denoted by juxtaposition, and this can be
extended to two or more arguments. For instance, the value of a function $f$ with
arguments $x$ and $y$ is written as $f\ x\ y$ and means $(f\ x)\ y$. To illustrate this style of
description, an algorithm for summing a given number and the elements of a
sequence of numbers is as follows:
$$\mathit{sum}\ s\ (\,) \overset{\mathrm{def}}{=} s$$
$$\mathit{sum}\ s\ ((x)^\frown xs) \overset{\mathrm{def}}{=} \mathit{sum}\ (s + x)\ xs$$
While recursion equations like these are adequate for describing algorithms, a 
proliferation of such equations tends to produce unstructured descriptions. It is 
often useful to recast a recursive algorithm in combinators which are higher-order 
functions encapsulating common patterns of computation. For instance given the 
combinator reduce where 
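$$\mathit{reduce}\ f\ a\ (\,) \overset{\mathrm{def}}{=} a$$
$$\mathit{reduce}\ f\ a\ ((x)^\frown xs) \overset{\mathrm{def}}{=} \mathit{reduce}\ f\ (f\ a\ x)\ xs$$
by matching these definitions with those for $\mathit{sum}$ it is clear that
$$\mathit{sum} = \mathit{reduce}\ \mathit{add}$$
where $\mathit{add}\ x\ y \overset{\mathrm{def}}{=} x + y$.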
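The correspondence between the recursion equations and the combinator form can be checked with a small Python sketch. Sequences are modelled as Python lists, and the trailing underscores in `sum_` and `reduce_` are only there to avoid clashing with Python built-ins; both are assumptions of this illustration.

```python
def sum_(s, xs):
    """sum s xs, transcribed directly from the two recursion equations."""
    if not xs:
        return s
    return sum_(s + xs[0], xs[1:])


def reduce_(f, a, xs):
    """reduce f a xs: fold f over xs from the left, starting from a."""
    for x in xs:
        a = f(a, x)
    return a


def add(x, y):
    return x + y
```

Calling `reduce_(add, s, xs)` gives the same result as `sum_(s, xs)` for any starting value and list of numbers.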
There are two obvious benefits of expressing algorithms in combinators. The first 
benefit is that the absence of bound variables in combinatory expressions results in 
useful algebraic properties. These properties enable designs to be optimised by 
equational reasoning, and we shall illustrate that in the next section. The second 
benefit arises from the structure associated with a combinator which indicates how 
components can be connected together; for instance
$$\mathit{reduce}\ \mathit{add}\ s\ (x_0, x_1, x_2, x_3)$$
corresponds to the connection structure in Fig. 1.
Fig. 1. $\mathit{reduce}\ \mathit{add}\ s\ (x_0, x_1, x_2, x_3) = y$.
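Two combinators will be needed in developing the LRU processor. The first
combinator is (reverse) functional composition,
$$(f ; g)\ x \overset{\mathrm{def}}{=} g\ (f\ x)$$
The second combinator is $\mathit{row}$, a slight generalisation of $\mathit{reduce}$,
$$\mathit{row}\ f\ (a, (\,)) \overset{\mathrm{def}}{=} ((\,), a) \tag{1}$$
$$\mathit{row}\ f\ (a, (x)^\frown xs) \overset{\mathrm{def}}{=} ((y)^\frown ys, b) \tag{2}$$
$$\text{where}\quad (y, a') \overset{\mathrm{def}}{=} f\ (a, x), \qquad (ys, b) \overset{\mathrm{def}}{=} \mathit{row}\ f\ (a', xs)$$
which corresponds to a linear array of components with connections on every side
(Fig. 2).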
Combinators such as reduce and row provide a target template for recasting 
algorithms in the first phase of our development process. 
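A minimal Python sketch of the non-lifted $\mathit{row}$ combinator of equations (1) and (2) is given below. Sequences are modelled as lists, and the cell `running_total` is a hypothetical example cell used only to exercise the combinator; it is not taken from the paper.

```python
def row(f, a, xs):
    """row f (a, xs): thread the horizontal signal a through one cell per element.

    Each cell maps (a, x) to (y, a2). The result is (ys, b), where ys collects
    the vertical outputs and b is the horizontal output of the last cell.
    """
    ys = []
    for x in xs:
        y, a = f(a, x)
        ys.append(y)
    return ys, a


def running_total(a, x):
    # Hypothetical cell: emit the running total and pass it on to the next cell.
    return a + x, a + x
```

With an empty sequence the horizontal input is passed straight through, which is the base case (1).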
2.2. Relations and streams
To deal with sequential circuits, a combinatory expression is promoted to a binary 
relation that relates a stream (an infinite sequence of data) in its domain to a stream
in its range, an approach first suggested by Sheeran [7]. Our main motivation for
using relations is that it allows a nonconstructive description of circuits with feedback
loops. We shall illustrate later how this description facilitates the statement and
proof of useful transformation rules.
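Different brackets will be used to indicate that the operations and data are "lifted"
to the corresponding stream versions; for instance $\langle a, b, c\rangle$ denotes a stream of
sequences formed by interleaving the sequence of streams $(a, b, c)$. Hence if $y_t$
represents the value of stream $y$ at time $t$, then for all $t$ the value at time $t$ of the
stream version of "${}^\frown$" applied to two streams is obtained by appending the values
of those two streams at time $t$.
For instance, given two streams $x$ and $y$, $\langle x, y\rangle$ represents a stream of pairs such
that for all $t$, $\langle x, y\rangle_t = (x_t, y_t)$.
We shall write binary relations in infix form, so that an adder can be defined by:
$$\begin{aligned} \langle x, y\rangle\ \mathit{Add}\ z &\overset{\mathrm{def}}{=} \forall t.\ z_t = \mathit{add}\ x_t\ y_t \\ &= \forall t.\ z_t = x_t + y_t \end{aligned}$$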
We shall follow the convention to denote a stream representation of a combinational
circuit (such as $\mathit{Add}$) by capitalising the first letter of the corresponding
expression (such as $\mathit{add}$).
We can also define combinators for relations on streams. Two components with
a common interconnection can be described by the combinator relational composition,
which is similar to the functional composition combinator defined earlier:
$$x\ (Q ; R)\ z \overset{\mathrm{def}}{=} \exists y.\ (x\ Q\ y) \wedge (y\ R\ z)$$
For combinational circuits $Q$ and $R$, it is the case that $\forall t.\ z_t = (q ; r)\ x_t$ implies
$x\ (Q ; R)\ z$.
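A homogeneous pipe is obtained by repeatedly composing the same component
using relational composition. Given that $\mathit{Id}$ represents the identity relation such that
$$x\ \mathit{Id}\ y \overset{\mathrm{def}}{=} x = y$$
we have
$$R^0 \overset{\mathrm{def}}{=} \mathit{Id} \tag{3}$$
$$R^{n+1} \overset{\mathrm{def}}{=} R ; R^n \tag{4}$$
Another common combinator is parallel composition, which describes two devices
operating independently on the components of a stream of pairs,
$$\langle x, v\rangle\ (Q \parallel R)\ \langle u, w\rangle \overset{\mathrm{def}}{=} (x\ Q\ u) \wedge (v\ R\ w)$$
We shall adopt the abbreviations
$$\mathit{fst}\ Q \overset{\mathrm{def}}{=} Q \parallel \mathit{Id} \quad\text{and}\quad \mathit{snd}\ Q \overset{\mathrm{def}}{=} \mathit{Id} \parallel Q$$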
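The stream version of the $\mathit{row}$ combinator will be needed, which can be defined
as follows:
$$\langle a, x\rangle\ (\mathit{row}_0\ R)\ \langle y, b\rangle \overset{\mathrm{def}}{=} (a = b) \wedge (x = y = \langle\,\rangle) \tag{5}$$
$$\langle a, \langle x\rangle^\frown xs\rangle\ (\mathit{row}_{n+1}\ R)\ \langle\langle y\rangle^\frown ys, b\rangle \overset{\mathrm{def}}{=} \exists c.\ \langle a, x\rangle\ R\ \langle y, c\rangle \wedge \langle c, xs\rangle\ (\mathit{row}_n\ R)\ \langle ys, b\rangle \tag{6}$$
The subscript denotes the number of components in the row, and is omitted when
the meaning is clear.
This stream version of $\mathit{row}$ can be obtained from the non-lifted version of $\mathit{row}$
(equations (1) and (2)) with the appropriate decomposition of streams of sequences
into sequences of streams, and vice versa. Some discussions of this method can be
found in [3].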
A delay is given by
$$x\ \mathcal{D}\ y \overset{\mathrm{def}}{=} \forall t.\ x_{t-1} = y_t$$
An anti-delay, $\mathcal{D}^{-1}$, is such that $\mathcal{D} ; \mathcal{D}^{-1} = \mathcal{D}^{-1} ; \mathcal{D} = \mathit{Id}$. This identity holds if $t$ belongs
to the set of positive and negative integers. A latch can be modelled by a delay with
forward dataflow (such that $x$ is the input and $y$ is the output in the above definition
for $\mathcal{D}$) and by an anti-delay with backward dataflow.
Note that delays and anti-delays can be used on signals of any type.
Given that a component is delay-commutative, that is $R ; \mathcal{D} = \mathcal{D} ; R$, a theorem
concerning pipes and delays is
$$R^{km} = (R^k ; \mathcal{D})^m ; \mathcal{D}^{-m} \tag{7}$$
which can be verified by induction on $m$. This theorem corresponds to pipelining
clusters of $k$ components by inserting latches between them. The expression $\mathcal{D}^{-m}$
corresponds to the latency (the number of clock cycles needed to produce the result)
incurred as a consequence of pipelining. We shall use this theorem to pipeline our
LRU design.
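The pipelining theorem (7) can be exercised on a simple executable model. The sketch below makes several assumptions that are not in the paper: streams are modelled as Python functions from an integer time $t$ to a value, relations are restricted to functional stages, and the component `R` is a hypothetical pointwise function (pointwise components commute with the delay). It is an illustration of the identity, not a proof.

```python
def lift(f):
    """Stream version of a combinational function: apply f at every time step."""
    return lambda x: (lambda t: f(x(t)))


def delay(x):
    """D: the output at time t is the input at time t - 1."""
    return lambda t: x(t - 1)


def anti_delay(x):
    """D^-1: the output at time t is the input at time t + 1."""
    return lambda t: x(t + 1)


def compose(*stages):
    """Q ; R ; ... for functional stages, applied from left to right."""
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run


def power(stage, n):
    """R^n: a pipe of n copies of the same stage (the identity when n = 0)."""
    return compose(*([stage] * n))


R = lift(lambda v: 2 * v + 1)   # hypothetical delay-commutative component
k, m = 2, 3
unpipelined = power(R, k * m)                         # R^(km)
pipelined = compose(power(compose(power(R, k), delay), m),
                    power(anti_delay, m))             # (R^k ; D)^m ; D^-m
x = lambda t: t * t             # an arbitrary input stream defined for all integers t
```

Sampling `unpipelined(x)` and `pipelined(x)` at any time gives the same value, and `compose(delay, anti_delay)` behaves as the identity.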
Another combinator that we need is a looping construct, defined by:
$$x\ (\mathit{loop}\ R)\ y \overset{\mathrm{def}}{=} \exists z.\ \langle x, z\rangle\ R\ \langle z, y\rangle \tag{8}$$
(Fig. 3). $\mathit{loop}\ (R ; \mathit{fst}\ \mathcal{D})$ corresponds to a circuit with a delay on the feedback path,
a standard state machine configuration.
Fig. 3. The loop combinator.
A useful result concerning rows, pipes and loops is:
$$\mathit{loop}\ (\mathit{row}_n\ R) = (\mathit{loop}\ R)^n \tag{9}$$
which can be verified by induction on $n$ (see Appendix A). An instance of this
theorem is shown in Fig. 4.
This theorem is important because it allows the designer to concentrate on 
developing the state-transition logic of a single state machine and subsequently 
decomposing it into a cascade of state machines. The alternative, designing and
synchronising individual state machines from the outset, is usually more complex.
Fig. 4. An instance of a theorem involving loop, row, and pipe: $\mathit{loop}\ (\mathit{row}_3\ R) = (\mathit{loop}\ R)^3$.
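Note. The $\mathit{loop}$ combinator is defined such that equation (9) is expressed in its
simplest form. In Ruby [7] the loop combinator, written $\mathit{loop}'$ here, is defined by:
$$x\ (\mathit{loop}'\ R)\ y \overset{\mathrm{def}}{=} \exists z.\ \langle x, z\rangle\ R\ \langle y, z\rangle$$
so that, given that
$$x\ R^{-1}\ y \overset{\mathrm{def}}{=} y\ R\ x$$
the theorem $(\mathit{loop}'\ R)^{-1} = \mathit{loop}'\ (R^{-1})$ holds. The relationship between the two looping
constructs is given by:
$$\mathit{loop}\ R = \mathit{loop}'\ (R ; \mathit{Swap})$$
where
$$\mathit{swap}\ (x, y) \overset{\mathrm{def}}{=} (y, x)$$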
3. Developing the LRU processor 
We are now ready to develop the LRU processor. There will be two phases in 
this development: in the first phase we specify the LRU processor and transform 
the specification into a combinatory expression to obtain a preliminary design; in 
the second phase we optimise the preliminary design by algebraic theorems to obtain 
a range of designs with different performance trade-offs. 
3.1. Specifying the LRU processor
Our goal is to develop LRU0, a sequential implementation of the LRU algorithm.
LRU0 should have the following characteristics: its state is the sequence being
maintained, its output is the last element in this sequence, and its input is the page,
if any, to insert on the next clock cycle. This circuit will be formed by adding latches
and feedback paths to a purely combinational circuit, $\mathit{InsImp}$, which implements
the state-transition logic. Hence we define LRU0 as follows:
$$\mathit{LRU0} \overset{\mathrm{def}}{=} \mathit{loop}\ (\mathit{InsImp} ; \mathit{fst}\ \mathcal{D}) \tag{10}$$
We adopt a specification which requires $\mathit{insImp}$, the static version of $\mathit{InsImp}$, to satisfy
$$\mathit{insImp}\ ((p, b), xs) = (xs', y) \quad\text{where}\quad xs' = \mathit{ins}\ p\ b\ xs, \quad y = \mathit{last}\ xs \tag{11}$$
In this equation $xs$ and $xs'$ are respectively the current state and the next state and
are sequences of pages. The output $y$ is the last element (the least recently used
page) of the current state $xs$. The boolean input $b$ issues a request to insert page
$p$ in the current state $xs$,
$$\mathit{ins}\ p\ \mathit{true}\ xs \overset{\mathrm{def}}{=} \mathit{insert}\ p\ xs$$
$$\mathit{ins}\ p\ \mathit{false}\ xs \overset{\mathrm{def}}{=} xs$$
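The functions $\mathit{insert}$ and $\mathit{delete}$ capture the LRU algorithm: $\mathit{insert}\ p\ xs$ prepends
$p$ to the sequence $xs$ and calls $\mathit{delete}$ to modify $xs$,
$$\mathit{insert}\ p\ (\,) \overset{\mathrm{def}}{=} (\,) \tag{12}$$
$$\mathit{insert}\ p\ ((x)^\frown xs) \overset{\mathrm{def}}{=} (p)^\frown \mathit{delete}\ p\ ((x)^\frown xs) \tag{13}$$
and $\mathit{delete}\ p\ xs$ removes the first instance of $p$ from $xs$ or the last element of $xs$ if
there is no such instance:
$$\mathit{delete}\ p\ (\,) \overset{\mathrm{def}}{=} (\,) \tag{14}$$
$$\mathit{delete}\ p\ ((x)^\frown xs) \overset{\mathrm{def}}{=} \begin{array}[t]{l} \mathbf{if}\ (x = p) \vee (xs = (\,))\ \mathbf{then}\ xs \\ \Box\ (x \ne p) \wedge (xs \ne (\,))\ \mathbf{then}\ (x)^\frown \mathit{delete}\ p\ xs \\ \mathbf{fi} \end{array} \tag{15}$$
Notice that $\mathit{insert}$ preserves the size of the sequence of pages, since $\#(\mathit{insert}\ p\ xs) =
\#xs$. This ensures that the system state is maintained at a constant size: that is
$\#xs' = \#xs$ in (11).
Expanding the definition of $\mathit{insert}$ using (12) and (13), we obtain
$$\begin{aligned} \mathit{ins}\ p\ \mathit{true}\ (\,) &= (\,) \\ \mathit{ins}\ p\ \mathit{true}\ ((x)^\frown xs) &= (p)^\frown \mathit{delete}\ p\ ((x)^\frown xs) \\ \mathit{ins}\ p\ \mathit{false}\ xs &= xs \end{aligned} \tag{16}$$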
This completes the specification of the LRU processor. 
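The specification can be transcribed almost line by line into Python, which gives an executable reference against which later implementations can be compared. This is a sketch that models sequences as lists; the function names simply mirror equations (12) to (16).

```python
def delete(p, xs):
    """Equations (14)-(15): drop the first p, or the last element if p is absent."""
    if not xs:
        return []
    x, rest = xs[0], xs[1:]
    if x == p or not rest:
        return rest
    return [x] + delete(p, rest)


def insert(p, xs):
    """Equations (12)-(13): prepend p after deleting p (or the last element) from xs."""
    if not xs:
        return []
    return [p] + delete(p, xs)


def ins(p, b, xs):
    """Equation (16): insert p when the request flag b is set, otherwise keep xs."""
    return insert(p, xs) if b else list(xs)
```

In every case the length of the sequence is preserved, matching the observation that $\#xs' = \#xs$.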
3.2. Obtaining a preliminary design 
There are many ways to implement the LRU algorithm. Since a systolic 
implementation is desired, we shall implement the state-transition logic specified 
by the function ins by a linear array of N identical cells, where N is the size of 
the system state. The state machine LRU0 (equation (10)) can then be constructed
by adding latches and feedback paths to the array of cells; developing a systolic 
version of this machine will consist of distributing latches between the cells. 
In order to make use of the algebraic theorems for the combinators described in 
Section 2.2, we need to transform ins into a form compatible with the row combinator 
which describes a linear array structure. This is a crucial step that demands insight: 
like conducting other inductive proofs, the difficulty is to find an appropriate 
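generalisation of the induction hypothesis. In this case we generalise $\mathit{ins}$ to a function
$\mathit{update}$ by introducing a new argument $q$ which replaces the second instance of $p$
on the right-hand side of (16),
$$\mathit{update}\ p\ q\ \mathit{true}\ (\,) \overset{\mathrm{def}}{=} (\,) \tag{17}$$
$$\mathit{update}\ p\ q\ \mathit{true}\ ((x)^\frown xs) \overset{\mathrm{def}}{=} (p)^\frown \mathit{delete}\ q\ ((x)^\frown xs) \tag{18}$$
$$\mathit{update}\ p\ q\ \mathit{false}\ xs \overset{\mathrm{def}}{=} xs \tag{19}$$
so that
$$\mathit{ins}\ p\ b\ xs = \mathit{update}\ p\ p\ b\ xs \tag{20}$$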
We now show by induction that $\mathit{update}$ can be implemented by a linear array of
cells which will be called $\mathit{insCell}$. The definition of $\mathit{insCell}$ will be chosen so that
$$\mathit{update}\ p\ q\ b\ xs = \pi_1\ (\mathit{row}\ \mathit{insCell}\ ((p, q, b), xs)) \tag{21}$$
A schematic of the structure on the right-hand side of (21) is shown in Fig. 5.
Fig. 5. $xs' = \pi_1\ (\mathit{row}\ \mathit{insCell}\ ((p, q, b), xs))$; IC = $\mathit{insCell}$.
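Consider first the base case of equation (21):
$$\begin{aligned}
&\quad \mathit{update}\ p\ q\ b\ (\,) \\
&= \{\text{equations (17) and (19): definition of } \mathit{update}\} \\
&\quad (\,) \\
&= \{\pi_1\ (x, y) \overset{\mathrm{def}}{=} x\} \\
&\quad \pi_1\ ((\,), (p, q, b)) \\
&= \{\text{equation (1): definition of } \mathit{row}\} \\
&\quad \pi_1\ (\mathit{row}\ \mathit{insCell}\ ((p, q, b), (\,)))
\end{aligned}$$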
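Consider now the induction cases of $\mathit{update}$.
$$\begin{aligned}
&\quad \mathit{update}\ p\ q\ \mathit{false}\ ((x)^\frown xs) \\
&= \{\text{equation (19): definition of } \mathit{update}\} \\
&\quad (x)^\frown xs \\
&= \{\text{equation (19): definition of } \mathit{update}\} \\
&\quad (x)^\frown \mathit{update}\ x\ q\ \mathit{false}\ xs
\end{aligned}$$
In Appendix A we show that
$$\mathit{update}\ p\ q\ \mathit{true}\ ((x)^\frown xs) = (p)^\frown \mathit{update}\ x\ q\ (q \ne x)\ xs$$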
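hence for the induction case of equation (21),
$$\begin{aligned}
&\quad \mathit{update}\ p\ q\ b\ ((x)^\frown xs) \\
&= \{\text{combining the two induction cases for } \mathit{update}\} \\
&\quad (u)^\frown us \\
&\qquad \text{where } u = \mathbf{if}\ b\ \mathbf{then}\ p\ \mathbf{else}\ x\ \mathbf{fi} \\
&\qquad\qquad\ \ us = \mathit{update}\ x\ q\ ((q \ne x) \wedge b)\ xs \\
&= \{\text{induction hypothesis}\} \\
&\quad \pi_1\ ((u)^\frown us, v) \\
&\qquad \text{where } u = \mathbf{if}\ b\ \mathbf{then}\ p\ \mathbf{else}\ x\ \mathbf{fi} \\
&\qquad\qquad\ \ (us, v) = \mathit{row}\ \mathit{insCell}\ (y, xs) \\
&\qquad\qquad\ \ y \overset{\mathrm{def}}{=} (x, q, (q \ne x) \wedge b) \\
&= \{\text{define } y \text{ to be the second output of } \mathit{insCell}\} \\
&\quad \pi_1\ ((u)^\frown us, v) \\
&\qquad \text{where } (u, y) = (\mathbf{if}\ b\ \mathbf{then}\ p\ \mathbf{else}\ x\ \mathbf{fi},\ (x, q, (q \ne x) \wedge b)) \overset{\mathrm{def}}{=} \mathit{insCell}\ ((p, q, b), x) \\
&\qquad\qquad\ \ (us, v) = \mathit{row}\ \mathit{insCell}\ (y, xs) \\
&= \{\text{equation (2): definition of } \mathit{row}\} \\
&\quad \pi_1\ (\mathit{row}\ \mathit{insCell}\ ((p, q, b), (x)^\frown xs))
\end{aligned}$$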
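From these calculations we have proved that
$$\begin{aligned} \mathit{update}\ p\ q\ b\ xs &= \pi_1\ (\mathit{row}\ \mathit{insCell}\ ((p, q, b), xs)) \\ &= (\mathit{row}\ \mathit{insCell} ; \pi_1)\ ((p, q, b), xs) \end{aligned} \tag{22}$$
where
$$\mathit{insCell}\ ((p, q, b), x) \overset{\mathrm{def}}{=} (\mathbf{if}\ b\ \mathbf{then}\ p\ \mathbf{else}\ x\ \mathbf{fi},\ (x, q, (q \ne x) \wedge b)) \tag{23}$$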
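Equation (22) can also be checked mechanically on small inputs. The Python sketch below compares the recursive definition of $\mathit{update}$ with a row of $\mathit{insCell}$ cells over every sequence of up to four pages drawn from a three-page alphabet. The bounds and the helper name `check` are assumptions of this illustration, and an exhaustive test on a small domain complements the inductive proof rather than replacing it.

```python
from itertools import product


def delete(q, xs):
    """Equations (14)-(15)."""
    if not xs:
        return []
    if xs[0] == q or len(xs) == 1:
        return xs[1:]
    return [xs[0]] + delete(q, xs[1:])


def update(p, q, b, xs):
    """Equations (17)-(19)."""
    if not b or not xs:
        return list(xs)
    return [p] + delete(q, xs)


def ins_cell(a, x):
    """Equation (23): new cell contents, and the signal passed to the next cell."""
    p, q, b = a
    return (p if b else x), (x, q, (q != x) and b)


def row(f, a, xs):
    """Equations (1)-(2)."""
    ys = []
    for x in xs:
        y, a = f(a, x)
        ys.append(y)
    return ys, a


def update_by_row(p, q, b, xs):
    """Right-hand side of equation (22): pi_1 of the row of cells."""
    return row(ins_cell, (p, q, b), xs)[0]


def check(pages=range(3), max_len=4):
    """Compare both sides of (22) for all small inputs; True if they always agree."""
    for n in range(max_len + 1):
        for xs in product(pages, repeat=n):
            for p, q, b in product(pages, pages, (False, True)):
                if update(p, q, b, list(xs)) != update_by_row(p, q, b, list(xs)):
                    return False
    return True
```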
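It remains for us to show that $\mathit{insImp}$ is realised by appropriately combining instances
of $\mathit{insCell}$, so that $\mathit{InsImp}$ is realised by combining instances of $\mathit{InsCell}$ (the stream
version of $\mathit{insCell}$):
$$\begin{aligned}
&\quad \mathit{insImp}\ ((p, b), xs) \\
&= \{\text{equation (11): requirement for } \mathit{insImp}\} \\
&\quad (\mathit{ins}\ p\ b\ xs,\ \mathit{last}\ xs) \\
&= \{\text{equation (20): } \mathit{ins} \text{ generalised to } \mathit{update}\} \\
&\quad (\mathit{update}\ p\ p\ b\ xs,\ \mathit{last}\ xs) \\
&= \{\text{equation (22): } \mathit{update} \text{ expressed in } \mathit{row}\} \\
&\quad ((\mathit{row}\ \mathit{insCell} ; \pi_1)\ ((p, p, b), xs),\ \mathit{last}\ xs) \\
&= \{\text{given } \mathit{dupfst}\ (p, b) \overset{\mathrm{def}}{=} (p, p, b) \text{ and } \mathit{fst}\ f\ (x, y) \overset{\mathrm{def}}{=} (f\ x, y)\} \\
&\quad ((\mathit{fst}\ \mathit{dupfst} ; \mathit{row}\ \mathit{insCell} ; \pi_1)\ ((p, b), xs),\ \mathit{last}\ xs)
\end{aligned}$$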
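From (23) we obtain $\mathit{insCell} ; \pi_2 ; \pi_1 = \pi_2$, thus
$$\begin{aligned} \mathit{last}\ xs &= \pi_2\ ((p, p, b), \mathit{last}\ xs) \\ &= (\mathit{insCell} ; \pi_2 ; \pi_1)\ ((p, p, b), \mathit{last}\ xs) \\ &= (\mathit{fst}\ \mathit{dupfst} ; \mathit{row}\ \mathit{insCell} ; \pi_2 ; \pi_1)\ ((p, b), xs) \end{aligned}$$
Given that
$$\mathit{snd}\ f\ (x, y) \overset{\mathrm{def}}{=} (x, f\ y)$$
and since $(\pi_1\ z,\ (\pi_2 ; \pi_1)\ z) = \mathit{snd}\ \pi_1\ z$ for any pair $z$ whose second component is a sequence,
we get
$$\begin{aligned} \mathit{insImp}\ ((p, b), xs) &= ((\mathit{fst}\ \mathit{dupfst} ; \mathit{row}\ \mathit{insCell} ; \pi_1)\ ((p, b), xs),\ \mathit{last}\ xs) \\ &= (\mathit{fst}\ \mathit{dupfst} ; \mathit{row}\ \mathit{insCell} ; \mathit{snd}\ \pi_1)\ ((p, b), xs) \end{aligned}$$
Hence
$$\mathit{insImp} \overset{\mathrm{def}}{=} \mathit{fst}\ \mathit{dupfst} ; \mathit{row}\ \mathit{insCell} ; \mathit{snd}\ \pi_1 \tag{24}$$
will satisfy equation (11).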
To summarize, in this section we first captured the LRU algorithm as a set of
recursion equations. These equations were then transformed into a combinatory 
expression, and during the process of transformation we determined the behaviour 
of the cells and the connection structure of the implementation. 
It should be noted that only the implementation of the state-transition logic has 
been verified correct with respect to the LRU algorithm. In general the designer 
must also ensure that the system will be initialised to an appropriate state. Fortunately 
our LRU processor is self-initialising: it will give the correct result after N insertions 
where N is the number of cells in the processor. 
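The behaviour of the unpipelined state machine, including the self-initialising property, can be illustrated with a small cycle-by-cycle simulation. The sketch below assumes the feedback latches start with arbitrary contents (the value 9 is a placeholder) and that $N \ge 1$; `run_lru0` is a hypothetical helper name, not notation from the paper.

```python
def ins_cell(a, x):
    """Equation (23)."""
    p, q, b = a
    return (p if b else x), (x, q, (q != x) and b)


def row(f, a, xs):
    """Equations (1)-(2)."""
    ys = []
    for x in xs:
        y, a = f(a, x)
        ys.append(y)
    return ys, a


def ins_imp(p, b, xs):
    """Equation (24): fst dupfst ; row insCell ; snd pi_1 (assumes xs is non-empty)."""
    ys, (last, _q, _flag) = row(ins_cell, (p, p, b), xs)
    return ys, last


def run_lru0(initial, requests):
    """Clock LRU0: the next state is fed back through one bank of latches per cycle."""
    xs, outputs = list(initial), []
    for p, b in requests:
        xs, y = ins_imp(p, b, xs)
        outputs.append(y)   # y is the least recently used page of the current state
    return xs, outputs
```

Starting from `[9, 9, 9, 9]`, four insertions of pages 1 to 4 leave the state `[4, 3, 2, 1]`, after which the outputs track the least recently used page.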
3.3. Optimising the preliminary design
So far the LRU processor has been expressed as a single state machine with a
single bank of latches and long feedback paths. Our next step is to decompose this
state machine into a cascade of state machines, which can then be pipelined so that
the clock speed is independent of the number of processors. In other words, we
shall first construct a semi-systolic array that will subsequently be made fully systolic.
To make use of the theorems in Section 2.2, we promote $\mathit{insImp}$ to work on
streams by using the stream version of the components and combinators. We assume
that there are $N$ components in the row of $\mathit{InsCell}$, and that $N = KM$:
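$$\begin{aligned}
&\quad \mathit{LRU0} \\
&= \{\text{equation (10): definition of } \mathit{LRU0}\} \\
&\quad \mathit{loop}\ (\mathit{InsImp} ; \mathit{fst}\ \mathcal{D}) \\
&= \{\text{equation (24): definition of } \mathit{InsImp}\} \\
&\quad \mathit{loop}\ (\mathit{fst}\ \mathit{Dupfst} ; \mathit{row}_N\ \mathit{InsCell} ; \mathit{snd}\ \pi_1 ; \mathit{fst}\ \mathcal{D}) \\
&= \{\mathit{snd}\ F ; \mathit{fst}\ G = \mathit{fst}\ G ; \mathit{snd}\ F \text{ and } \mathit{loop}\ (\mathit{fst}\ F ; G ; \mathit{snd}\ H) = F ; \mathit{loop}\ G ; H\} \\
&\quad \mathit{Dupfst} ; \mathit{loop}\ (\mathit{row}_N\ \mathit{InsCell} ; \mathit{fst}\ \mathcal{D}) ; \pi_1 \\
&= \{\mathit{row}\ F ; \mathit{fst}\ \mathcal{D} = \mathit{row}\ (F ; \mathit{fst}\ \mathcal{D})\} \\
&\quad \mathit{Dupfst} ; \mathit{loop}\ (\mathit{row}_N\ (\mathit{InsCell} ; \mathit{fst}\ \mathcal{D})) ; \pi_1 \\
&= \{\text{equation (9): a row expressed as a pipe of loops}\} \\
&\quad \mathit{Dupfst} ; (\mathit{loop}\ (\mathit{InsCell} ; \mathit{fst}\ \mathcal{D}))^N ; \pi_1 \\
&= \{\mathit{loop}\ (\mathit{InsCell} ; \mathit{fst}\ \mathcal{D}) ; \mathcal{D} = \mathcal{D} ; \mathit{loop}\ (\mathit{InsCell} ; \mathit{fst}\ \mathcal{D}) \text{ and equation (7): pipelining a pipe}\} \\
&\quad \mathit{Dupfst} ; ((\mathit{loop}\ (\mathit{InsCell} ; \mathit{fst}\ \mathcal{D}))^K ; \mathcal{D})^M ; \mathcal{D}^{-M} ; \pi_1 \\
&= \{\mathcal{D} ; \pi_1 = \pi_1 ; \mathcal{D}\} \\
&\quad \mathit{LRU1} ; \mathcal{D}^{-M}
\end{aligned}$$
where
$$\mathit{LRU1} \overset{\mathrm{def}}{=} \mathit{Dupfst} ; ((\mathit{loop}\ (\mathit{InsCell} ; \mathit{fst}\ \mathcal{D}))^K ; \mathcal{D})^M ; \pi_1$$
An instance of LRU1 is shown in Fig. 6.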
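The relationship LRU0 = LRU1 ; $\mathcal{D}^{-M}$ says that the pipelined design produces the outputs of the unpipelined one $M$ cycles later. The Python sketch below simulates both designs cycle by cycle to illustrate this. It rests on assumptions that are not in the paper: the pipeline latches are taken to reset to an idle signal `(None, None, False)`, both designs start from the same cell contents, and the names `run_lru0` and `run_lru1` are hypothetical.

```python
def ins_cell(a, x):
    """Equation (23)."""
    p, q, b = a
    return (p if b else x), (x, q, (q != x) and b)


def run_lru0(cells, requests):
    """Unpipelined design: each request ripples through all N cells in one cycle."""
    cells, outputs = list(cells), []
    for p, b in requests:
        a = (p, p, b)
        for i, x in enumerate(cells):
            cells[i], a = ins_cell(a, x)
        outputs.append(a[0])            # pi_1: contents of the last cell before the update
    return outputs


def run_lru1(cells, requests, K):
    """Pipelined design: one latch after every cluster of K cells (N = K * M)."""
    cells = list(cells)
    M = len(cells) // K
    latches = [(None, None, False)] * M  # assumed reset value: no request in flight
    outputs = []
    for p, b in requests:
        outputs.append(latches[-1][0])   # pi_1 of the last pipeline latch
        a, new_latches = (p, p, b), []
        for j in range(M):
            for i in range(j * K, (j + 1) * K):
                cells[i], a = ins_cell(a, cells[i])
            new_latches.append(a)        # latched at the end of this cycle
            a = latches[j]               # the next cluster sees last cycle's value
        latches = new_latches
    return outputs
```

For any request sequence, `run_lru1(cells, requests, K)[M:]` should equal `run_lru0(cells, requests)` with its last `M` entries dropped; the first `M` outputs of the pipelined run depend only on the assumed reset values.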
Note that LRU1 can be used to produce pipelined versions of LRU0. The
parameter $K$ controls the degree of pipelining: the array is fully pipelined when
$K = 1$ and $M = N$, otherwise signal rippling through $K$ cells will occur. Moreover
LRU1 has a latency of $M + 1 = (N + K)/K$ cycles and requires $3M + N =
N(K + 3)/K$ latches; hence a smaller $K$ results in a faster circuit, but the latency
and the number of latches in the design will increase.
Fig. 6. Design LRU1 ($N = 6$, $K = 2$, $M = 3$); D = $\mathcal{D}$ and IC = $\mathit{InsCell}$.
A designer should therefore select the value of K to achieve the optimal trade-off 
in speed, latency and the amount of hardware for a particular LRU processor 
implementation. The readers are referred to [5] for additional discussions and
examples on controlling pipelining in regular computational arrays. 
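The trade-off can be tabulated directly from the formulas above. The following Python sketch evaluates the latency $M + 1$ and the latch count $3M + N$ for a given $N$ and $K$; the function name `lru1_costs` and the dictionary layout are illustrative choices, and $K$ is assumed to divide $N$.

```python
def lru1_costs(N, K):
    """Latency in cycles and number of latches of LRU1 with N cells pipelined every K cells."""
    if K <= 0 or N % K != 0:
        raise ValueError("K must be a positive divisor of N")
    M = N // K
    return {"M": M, "latency": M + 1, "latches": 3 * M + N}
```

For the instance in Fig. 6 ($N = 6$, $K = 2$) this gives $M = 3$, a latency of 4 cycles and 15 latches, while the fully pipelined choice $K = 1$ needs 24 latches.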
3.4. Further refinement 
Two observations will be reported in this section. First of all, one can check that
a true value on the top horizontal output of the proposed architecture (Fig. 6)
indicates that the input page is not already residing in the primary storage. Hence
our design can be used for generating requests for page replacements in the primary
storage.
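This first observation follows from equation (23): the boolean component of the horizontal signal stays true only while no cell holds the requested page. A short Python sketch of the check is shown below, using the unpipelined row for simplicity; `miss_flag` is a hypothetical helper name.

```python
def ins_cell(a, x):
    """Equation (23)."""
    p, q, b = a
    return (p if b else x), (x, q, (q != x) and b)


def miss_flag(p, xs):
    """Boolean component of the last cell's horizontal output for an insertion of p."""
    a = (p, p, True)
    for x in xs:
        _, a = ins_cell(a, x)
    return a[2]
```

The flag is true exactly when `p` does not occur in `xs`, so it can serve as a page-replacement request.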
Next, we shall sketch how the number of latches in the LRU processor can be
further reduced by adopting a two-phase non-overlapping clock scheme. In such a
scheme a latch is made up of two half-latches; for instance in NMOS technology
a half-latch is implemented by connecting together a pass transistor and an inverter.
Two adjacent half-latches are activated in opposite phases of a two-phase clock,
such that $\mathcal{D}_{\phi_1}$ is activated during phase $\phi_1$ and $\mathcal{D}_{\phi_2}$ is activated during phase $\phi_2$.
One can model this situation by regarding the system to contain two interleaved
but independent computations, with the intermediate results of one computation
stored in $\mathcal{D}_{\phi_1}$'s and those of the other computation stored in $\mathcal{D}_{\phi_2}$'s, forming a 2-slow
system [7]. Note that if a component is delay-commutative, it is also commutative
with $\mathcal{D}_{\phi_1}$ and with $\mathcal{D}_{\phi_2}$.
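Now the core of LRU1 consists of the expression $(\mathit{Kloopcells} ; \mathcal{D})^M$ where
$$\mathit{Kloopcells} \overset{\mathrm{def}}{=} (\mathit{loop}\ (\mathit{InsCell} ; \mathit{fst}\ \mathcal{D}))^K$$
indicating that LRU1 is pipelined by every $\mathit{Kloopcells}$. Given that $M$ is an even
number, half of the pipelining latches can be saved if we are content to pipeline by
every two $\mathit{Kloopcells}$ instead, giving
$$\begin{aligned} (\mathit{Kloopcells}^2 ; \mathcal{D})^{M/2} &= (\mathit{Kloopcells}^2 ; \mathcal{D}_{\phi_1} ; \mathcal{D}_{\phi_2})^{M/2} \\ &= (\mathit{Kloopcells} ; \mathcal{D}_{\phi_1} ; \mathit{Kloopcells} ; \mathcal{D}_{\phi_2})^{M/2} \end{aligned}$$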
Of course, the speed of the system is halved as well. 
Further discussions on n-slow systems can be found in [7]. 
4. Developing other systolic processors
Remember that the LRU processor has been developed in two steps: casting the 
algorithm in the combinator notation, and optimising the resulting combinatory 
expression using algebraic theorems. This is a general strategy for developing systolic 
processors [4]; and while the first step is usually problem-dependent, the algebraic 
theorems used in the second step can be applied to rewrite any expression in the 
required form provided that the preconditions associated with the theorems (such 
as delay commutativity of components for pipelining theorems) are satisfied. 
Our optimisation of the LRU processor (Section 3.3) consists of a rewriting 
sequence for an expression in the form $\mathit{loop}\ (\mathit{row}\ F ; \mathit{fst}\ \mathcal{D})$. This optimisation can be
applied to any design with its state-transition logic expressed in row. In the following 
we shall outline two examples, one involving a numerical algorithm and the other 
a nonnumerical algorithm, which are amenable to this treatment. 
Polynomial evaluation
Evaluation of a polynomial by Horner's rule can be described by the following
recursive algorithm:
$$\mathit{peval}\ (s, x)\ (\,) \overset{\mathrm{def}}{=} s$$
$$\mathit{peval}\ (s, x)\ ((a)^\frown as) \overset{\mathrm{def}}{=} \mathit{peval}\ (s \times x + a, x)\ as$$
(Notice that given
$$\mathit{mcell}\ (s, x)\ a \overset{\mathrm{def}}{=} (s \times x + a, x)$$
we could have expressed the algorithm as $\mathit{peval} = \mathit{reduce}\ \mathit{mcell}$.) It can be shown that
$$\mathit{peval}\ (s, x)\ as = (\mathit{row}\ \mathit{madd} ; \pi_2 ; \pi_1)\ ((s, x), as)$$
where
$$\mathit{madd}\ ((s, x), a) \overset{\mathrm{def}}{=} (a, (s \times x + a, x))$$
Having expressed the algorithm in $\mathit{row}$ and checked that $\mathit{Madd}$ is delay-commutative,
we can follow the optimisation steps described in Section 3.3 to obtain
$((\mathit{loop}\ (\mathit{Madd} ; \mathit{fst}\ \mathcal{D}))^K ; \mathcal{D})^M$, a polynomial evaluator with a serial input and with
constant coefficients. This description abstracts from the details of initialising the
feedback latches with the sequence of polynomial coefficients.
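The claimed equivalence between $\mathit{peval}$ and the row of $\mathit{madd}$ cells can be illustrated in Python. The sketch covers only the combinational (non-stream) level; coefficients are given highest order first, as the recursion implies, and `peval_by_row` is a hypothetical helper name.

```python
def peval(s, x, coeffs):
    """Horner's rule, following the two recursion equations for peval."""
    for a in coeffs:
        s = s * x + a
    return s


def madd(sx, a):
    """madd ((s, x), a): keep the coefficient in place and pass on the partial result."""
    s, x = sx
    return a, (s * x + a, x)


def row(f, a, xs):
    """Equations (1)-(2)."""
    ys = []
    for x in xs:
        y, a = f(a, x)
        ys.append(y)
    return ys, a


def peval_by_row(s, x, coeffs):
    """(row madd ; pi_2 ; pi_1) ((s, x), coeffs)."""
    _ys, (result, _x) = row(madd, (s, x), coeffs)
    return result
```

For example, with `coeffs = [1, 2, 3]` and `x = 2` both functions evaluate $x^2 + 2x + 3$ to 11, and the vertical outputs of the row are the unchanged coefficients.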
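Sorting
The function $\mathit{insort}\ a$ takes a sorted sequence and inserts the element $a$ at the
appropriate place with respect to the ordering relation:
$$\mathit{insort}\ a\ (\,) \overset{\mathrm{def}}{=} (a)$$
$$\mathit{insort}\ a\ ((x)^\frown xs) \overset{\mathrm{def}}{=} \begin{array}[t]{l} \mathbf{if}\ a \le x\ \mathbf{then}\ (a, x)^\frown xs \\ \Box\ a \ge x\ \mathbf{then}\ (x)^\frown \mathit{insort}\ a\ xs \\ \mathbf{fi} \end{array}$$
It can be shown that
$$\mathit{last}\ (\mathit{insort}\ a\ xs) = (\mathit{row}\ \mathit{scell} ; \pi_2)\ (a, xs)$$
where
$$\mathit{scell}\ (a, x) \overset{\mathrm{def}}{=} \mathbf{if}\ a \le x\ \mathbf{then}\ (a, x)\ \mathbf{else}\ (x, a)\ \mathbf{fi}$$
Again we can follow the rewriting steps described in Section 3.3, since $\mathit{Scell}$ is
delay-commutative. This results in $((\mathit{loop}\ (\mathit{Scell} ; \mathit{fst}\ \mathcal{D}))^K ; \mathcal{D})^M$, a sorter with a serial
input and a serial output, provided that the feedback latches are initialised with the
greatest element given by the ordering relation.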
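The sorting example can be checked in the same way as the polynomial evaluator. The Python sketch below works at the combinational level only and assumes the input sequence is already sorted in ascending order; the names mirror those in the text.

```python
def insort(a, xs):
    """Insert a into the sorted list xs, following the recursion equations."""
    if not xs:
        return [a]
    x, rest = xs[0], xs[1:]
    if a <= x:
        return [a, x] + rest
    return [x] + insort(a, rest)


def scell(a, x):
    """scell (a, x): keep the smaller value in the cell and pass on the larger one."""
    return (a, x) if a <= x else (x, a)


def row(f, a, xs):
    """Equations (1)-(2)."""
    ys = []
    for x in xs:
        y, a = f(a, x)
        ys.append(y)
    return ys, a
```

With `xs = [1, 3, 5]` and `a = 4`, the row returns `([1, 3, 4], 5)`: the horizontal output is the last element of `insort(4, xs)`, and the vertical outputs followed by it reproduce the whole sorted sequence.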
5. Conclusion 
Our implementation of the LRU algorithm consists of a regular array of com- 
ponents and is suitable for integrated circuit technology. The fully-pipelined version 
can accept page insertions at a very high rate, comparable to the speed of a shift 
register. Furthermore it is very compact: for a system with $N$ pages of primary
storage, it contains approximately $(3N \log_2 N + N)$ bits of storage (for feedback
and pipelined latches) and $N \log_2 N$ exclusive-or gates for equality testing.
A survey of systematic methods for systolic array design can be found in [2]. In
deriving the LRU processor we adopt a simple notation to express both the algorithm 
and its implementation. This approach allows design to be transformed using 
“traditional” mathematical manipulations such as inductive proofs and equational 
reasoning. The resulting expressions are concise and can be used to generate designs 
with different performance trade-offs; and it has been shown that the transformation 
strategy is general enough to optimise other systolic architectures. Currently tools 
[6] are being prototyped to support this style of systolic processor development. 
Appendix A. Proofs 
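We shall first prove that
$$\mathit{loop}\ (\mathit{row}_n\ R) = (\mathit{loop}\ R)^n \tag{A.1}$$
Proof. The proof is by induction on $n$. Consider the base case of equation (A.1):
$$\begin{aligned}
&\quad x\ (\mathit{loop}\ (\mathit{row}_0\ R))\ y \\
&= \{\text{equation (8): definition of } \mathit{loop}\} \\
&\quad \exists z.\ \langle x, z\rangle\ (\mathit{row}_0\ R)\ \langle z, y\rangle \\
&= \{\text{equation (5): definition of } \mathit{row}\} \\
&\quad \exists z.\ (x = y) \wedge (z = \langle\,\rangle) \\
&= \{\text{equation (3): definition of pipe}\} \\
&\quad x\ (\mathit{loop}\ R)^0\ y
\end{aligned}$$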
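as required. Consider now the induction case of equation (A.1):
$$\begin{aligned}
&\quad x\ (\mathit{loop}\ (\mathit{row}_{n+1}\ R))\ y \\
&= \{\text{equation (8): definition of } \mathit{loop}\} \\
&\quad \exists z, zs.\ \langle x, \langle z\rangle^\frown zs\rangle\ (\mathit{row}_{n+1}\ R)\ \langle\langle z\rangle^\frown zs, y\rangle \\
&= \{\text{equation (6): definition of } \mathit{row}\} \\
&\quad \exists u, z.\ \langle x, z\rangle\ R\ \langle z, u\rangle \wedge \exists zs.\ \langle u, zs\rangle\ (\mathit{row}_n\ R)\ \langle zs, y\rangle \\
&= \{\text{equation (8): definition of } \mathit{loop}\} \\
&\quad \exists u.\ x\ (\mathit{loop}\ R)\ u \wedge u\ (\mathit{loop}\ (\mathit{row}_n\ R))\ y \\
&= \{\text{induction hypothesis}\} \\
&\quad \exists u.\ x\ (\mathit{loop}\ R)\ u \wedge u\ (\mathit{loop}\ R)^n\ y \\
&= \{\text{equation (4): definition of pipe}\} \\
&\quad x\ (\mathit{loop}\ R)^{n+1}\ y \qquad \Box
\end{aligned}$$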
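Next, we shall prove that
$$\mathit{update}\ p\ q\ \mathit{true}\ ((x)^\frown xs) = (p)^\frown \mathit{update}\ x\ q\ (q \ne x)\ xs$$
Proof. The proof is by folding and unfolding the definitions of $\mathit{update}$ and $\mathit{delete}$:
$$\begin{aligned}
&\quad \mathit{update}\ p\ q\ \mathit{true}\ ((x)^\frown xs) \\
&= \{\text{equation (18): definition of } \mathit{update}\} \\
&\quad (p)^\frown \mathit{delete}\ q\ ((x)^\frown xs) \\
&= \{\text{equation (15): definition of } \mathit{delete}\} \\
&\quad (p)^\frown\ \mathbf{if}\ (q = x) \vee (xs = (\,))\ \mathbf{then}\ xs \\
&\qquad\quad\ \Box\ (q \ne x) \wedge (xs \ne (\,))\ \mathbf{then}\ (x)^\frown \mathit{delete}\ q\ xs \\
&\qquad\quad\ \mathbf{fi} \\
&= \{\text{equation (18): definition of } \mathit{update}\} \\
&\quad (p)^\frown\ \mathbf{if}\ (q = x) \vee (xs = (\,))\ \mathbf{then}\ xs \\
&\qquad\quad\ \Box\ (q \ne x) \wedge (xs \ne (\,))\ \mathbf{then}\ \mathit{update}\ x\ q\ \mathit{true}\ xs \\
&\qquad\quad\ \mathbf{fi} \\
&= \{\text{expanding } \mathbf{if}\} \\
&\quad (p)^\frown\ \mathbf{if}\ q = x\ \mathbf{then}\ xs \\
&\qquad\quad\ \Box\ xs = (\,)\ \mathbf{then}\ xs \\
&\qquad\quad\ \Box\ (q \ne x) \wedge (xs \ne (\,))\ \mathbf{then}\ \mathit{update}\ x\ q\ \mathit{true}\ xs \\
&\qquad\quad\ \mathbf{fi} \\
&= \{\text{equations (17) and (19): definition of } \mathit{update}\} \\
&\quad (p)^\frown\ \mathbf{if}\ q = x\ \mathbf{then}\ \mathit{update}\ x\ q\ (q \ne x)\ xs \\
&\qquad\quad\ \Box\ xs = (\,)\ \mathbf{then}\ \mathit{update}\ x\ q\ (q \ne x)\ xs \\
&\qquad\quad\ \Box\ (q \ne x) \wedge (xs \ne (\,))\ \mathbf{then}\ \mathit{update}\ x\ q\ (q \ne x)\ xs \\
&\qquad\quad\ \mathbf{fi} \\
&= \{\text{simplify}\} \\
&\quad (p)^\frown \mathit{update}\ x\ q\ (q \ne x)\ xs \qquad \Box
\end{aligned}$$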
Acknowledgement 
We thank Michael Goldsmith, Graham Hutton, David Gries, Frank Luk, Fred 
Schneider and an anonymous referee for their careful reading of earlier drafts of 
this paper. The first author also expresses his gratitude to the Croucher Foundation 
and to Rank Xerox (UK) Limited for sponsoring his research. 
References 
[1] E.W. Dijkstra, Monotonic replacement algorithms and their implementation (19 December
1974), in: E.W. Dijkstra, ed., Selected Writings on Computing: A Personal Perspective (Springer,
Berlin, 1982) 84-88.
[2] J.A.B. Fortes, K.S. Fu and B.W. Wah, Systematic approaches to the design of algorithmically specified
systolic arrays, in: Proceedings International Conference on Acoustics, Speech and Signal Processing,
Tampa (1985) 300-303.
[3] G. Jones and M. Sheeran, Timeless truths about sequential circuits, in: S.K. Tewksbury, B.W.
Dickinson and S.C. Schwartz, eds., Concurrent Computations: Algorithms, Architectures and
Technology (Plenum, New York, 1988) 245-259.
[4] W. Luk and G. Jones, From specification to parametrised architectures, in: G.J. Milne, ed., The
Fusion of Hardware Design and Verification (North-Holland, Amsterdam, 1988) 267-288.
[5] W. Luk and G. Jones, Parametrized retiming of regular computational arrays, in: P.M. Dew, R.A.
Earnshaw and T.R. Heywood, eds., Parallel Processing for Computer Vision and Display
(Addison-Wesley, Reading, MA, 1989) 50-63.
[6] W. Luk, G. Jones and M. Sheeran, Computer-based tools for regular array design, in: J. McCanny,
J. McWhirter and E. Swartzlander, eds., Systolic Array Processors (Prentice Hall International, Hemel
Hempstead, 1989) 589-598.
[7] M. Sheeran, Retiming and slowdown in Ruby, in: G.J. Milne, ed., The Fusion of Hardware Design
and Verification (North-Holland, Amsterdam, 1988) 289-308.
