Abstract. We present a novel systolic processor that implements the least-recently-used (LRU) policy for multi-level storage systems. The design is developed by successively refining a high-level description of the algorithm. The effect of varying the degree of pipelining on performance is discussed. We also show how the design methodology used for the LRU processor can be applied to the development of other systolic systems.
Introduction
In multi-level storage systems, data are partitioned into pages with frequently used pages kept in a small, fast primary storage and with less frequently used pages kept in a large, slower secondary storage. It may become necessary for performance reasons to move a page from secondary to primary storage, because the frequency with which pages are accessed varies with time. The replacement policy determines the best candidate for replacement among the pages in primary storage; LRU dictates that this candidate is the page that has been accessed least recently.
An LRU implementation needs to perform two tasks: first, to maintain a sequence of the pages, ordered by most recent time of access, that currently reside in the primary storage; and second, to provide a mechanism for detecting the least recently used page in this sequence. Given that the next page to be accessed is p, updating the sequence of pages consists of deleting p from the sequence and prepending p to the result. If p is not an element of the current sequence, then the prepending of p is accompanied by the removal of the last element of the sequence.
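The update rule just described can be sketched in a few lines of Python. This is only an illustration of the specification, not the processor developed later; the function name `lru_access` and the use of strings as page identifiers are assumptions made for the example, and the sequence is assumed to be non-empty and free of duplicates.

```python
from typing import List, Optional, Tuple


def lru_access(pages: List[str], p: str) -> Tuple[List[str], Optional[str]]:
    """Return the updated sequence and the evicted page (None on a hit).

    `pages` is ordered from most to least recently used and is assumed
    to be non-empty with no repeated entries.
    """
    if p in pages:
        # Hit: delete p from the sequence and prepend it to the result.
        rest = [x for x in pages if x != p]
        return [p] + rest, None
    # Miss: prepend p and remove the last (least recently used) element.
    return [p] + pages[:-1], pages[-1]


state = ["a", "b", "c"]
hit_state, hit_evicted = lru_access(state, "b")
miss_state, miss_evicted = lru_access(state, "d")
```

In both cases the length of the sequence is unchanged, which is what allows a fixed-size array of cells to hold it.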
Our proposed LRU processor is based upon a non-systolic algorithm originally described, without proof, by Dijkstra [1]. It consists of a chain of identical components similar in both size and speed to the cells of a shift register. The novel aspects of our development of this design include: the architecture of the processor is obtained by successively refining a high-level description of the LRU algorithm; the design is captured as a succinct expression with parameters that can be varied to give implementations with different performance trade-offs; and the development method is quite general and has been used in deriving a number of word-level and bit-level systolic designs [4], [7].
We shall first introduce the notation for describing hardware and the associated algebraic theorems for transforming designs. We then indicate how the LRU algorithm, expressed as a set of recursion equations, can be recast in this notation. The resulting representation is further refined by algebraic transformations to produce a parametrised description from which a range of designs with different performance trade-offs can be generated. The application of this development method in deriving other systolic processors is also briefly discussed. Finally we summarise our work and provide the proofs of relevant theorems in the Appendix.
Notation
A simple notation for describing recursive algorithms and for expressing such algorithms using combinators will be presented. To deal with sequential systems some additional notions, such as relations and streams, will be considered.
Recursion equations and combinators
Objects in our notation are either atoms (such as numbers) or sequences of objects: for instance the object $\langle 0, \langle 1, 2 \rangle \rangle$ is a 2-sequence containing the number 0 and the sequence $\langle 1, 2 \rangle$. A sequence is an ordered collection of elements, with the empty sequence denoted by $\langle\,\rangle$. Sequences are appended using the operator $^\frown$, so $\langle 1, 2, 3, 4 \rangle = \langle 1 \rangle^\frown \langle 2, 3 \rangle^\frown \langle 4 \rangle^\frown \langle\,\rangle$. $\#x$ denotes the number of elements in sequence $x$. The function $\mathit{last}$ is used to extract the last element of a sequence; so $\mathit{last}\ \langle x_0, x_1, x_2, x_3 \rangle = x_3$. The $k$-th component of an $N$-element sequence can be extracted by the projection function $\pi_k$ ($1 \le k \le N$); for example $\pi_2\ \langle x, \langle y, z \rangle \rangle = \langle y, z \rangle$.
Notice that function application is denoted by juxtaposition, and this can be extended to two or more arguments. For instance, the value of a function $f$ with arguments $x$ and $y$ is written as $f\ x\ y$ and means $(f\ x)\ y$. To illustrate this style of description, an algorithm for summing a given number and the elements of a sequence of numbers is as follows:
$$\begin{aligned} \mathit{sum}\ s\ \langle\,\rangle &\stackrel{\mathrm{def}}{=} s, \\ \mathit{sum}\ s\ (\langle x \rangle^\frown xs) &\stackrel{\mathrm{def}}{=} \mathit{sum}\ (s + x)\ xs. \end{aligned}$$
While recursion equations like these are adequate for describing algorithms, a proliferation of such equations tends to produce unstructured descriptions. It is often useful to recast a recursive algorithm in combinators, which are higher-order functions encapsulating common patterns of computation. For instance, given the combinator $\mathit{reduce}$ where
$$\begin{aligned} \mathit{reduce}\ f\ a\ \langle\,\rangle &\stackrel{\mathrm{def}}{=} a, \\ \mathit{reduce}\ f\ a\ (\langle x \rangle^\frown xs) &\stackrel{\mathrm{def}}{=} \mathit{reduce}\ f\ (f\ a\ x)\ xs, \end{aligned}$$
by matching these definitions with those for $\mathit{sum}$ it is clear that $\mathit{sum} = \mathit{reduce}\ \mathit{add}$, where $\mathit{add}\ x\ y \stackrel{\mathrm{def}}{=} x + y$.
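As a hedged illustration, the recursion equations for reduce and the identity sum = reduce add can be transcribed into Python as follows. Python lists stand in for sequences, and the name `sum_from` is an assumption chosen to avoid shadowing Python's built-in `sum`.

```python
from typing import Callable, List


def reduce(f: Callable[[int, int], int], a: int, xs: List[int]) -> int:
    """Direct transcription of the two recursion equations for reduce."""
    if not xs:
        # reduce f a <> = a
        return a
    # reduce f a (<x>^xs) = reduce f (f a x) xs
    return reduce(f, f(a, xs[0]), xs[1:])


def add(x: int, y: int) -> int:
    return x + y


def sum_from(s: int, xs: List[int]) -> int:
    """sum = reduce add: adds s and every element of xs."""
    return reduce(add, s, xs)


total = sum_from(10, [1, 2, 3])
```

The accumulator `a` plays the role of the value passed along the chain of adders in the connection structure of a reduce.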
There are two obvious benefits of expressing algorithms in combinators. The first benefit is that the absence of bound variables in combinatory expressions results in useful algebraic properties. These properties enable designs to be optimised by equational reasoning, and we shall illustrate that in the next section. The second benefit arises from the structure associated with a combinator, which indicates how components can be connected together: for instance $\mathit{reduce}\ \mathit{add}\ s\ \langle x_0, x_1, x_2, x_3 \rangle$ corresponds to the connection structure in Figure 1.
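Two combinators will be needed in developing the LRU processor. The first combinator is (reverse) functional composition, $(f ; g)\ x \stackrel{\mathrm{def}}{=} g\ (f\ x)$. The second combinator is $\mathit{row}$, a slight generalisation of $\mathit{reduce}$,
$$\mathit{row}\ f\ \langle a, \langle\,\rangle \rangle \stackrel{\mathrm{def}}{=} \langle \langle\,\rangle, a \rangle \qquad (1)$$
$$\mathit{row}\ f\ \langle a, \langle x \rangle^\frown xs \rangle \stackrel{\mathrm{def}}{=} \langle \langle y \rangle^\frown ys, b \rangle \qquad (2)$$
where $\langle y, z \rangle \stackrel{\mathrm{def}}{=} f\ \langle a, x \rangle$ and $\langle ys, b \rangle \stackrel{\mathrm{def}}{=} \mathit{row}\ f\ \langle z, xs \rangle$,
which corresponds to a linear array of components with connections on every side (Figure 2). Combinators such as $\mathit{reduce}$ and $\mathit{row}$ provide a target template for recasting algorithms in the first phase of our development process.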
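The behaviour of row in equations (1) and (2) may be easier to see in executable form. The following Python sketch is a direct transcription under the assumption that pairs are modelled as tuples and sequences as lists; the cell `running_sum_cell` is a hypothetical example component, not one used in the paper.

```python
from typing import Callable, List, Tuple


def row(f: Callable[[int, int], Tuple[int, int]], a: int,
        xs: List[int]) -> Tuple[List[int], int]:
    """row f <a, xs>: thread a value through a linear array of cells."""
    if not xs:
        # Equation (1): an empty array passes the horizontal input through.
        return [], a
    # Equation (2): the first cell yields a vertical output y and a value z
    # that is handed to the remaining cells.
    y, z = f(a, xs[0])
    ys, b = row(f, z, xs[1:])
    return [y] + ys, b


def running_sum_cell(a: int, x: int) -> Tuple[int, int]:
    """Hypothetical cell: emits the partial sum downwards and to the right."""
    s = a + x
    return s, s


prefix_sums, final = row(running_sum_cell, 0, [1, 2, 3])
```

Unlike reduce, each cell also produces its own output, so the array returns a sequence as well as the value leaving the last cell.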
Relations and streams
To deal with sequential circuits, a combinatory expression is promoted to a binary relation that relates a stream (an infinite sequence of data) in its domain to a stream in its range, an approach first suggested by Sheeran [7]. Our main motivation for using relations is that they allow a non-constructive description of circuits with feedback loops. We shall illustrate later how this description facilitates the statement and proof of useful transformation rules.
For instance, given two streams $x$ and $y$, $\langle\!\langle x, y \rangle\!\rangle$ represents a stream of pairs such that for all $t$, $\langle\!\langle x, y \rangle\!\rangle_t = \langle x_t, y_t \rangle$.
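We shall write binary relations in infix form, so that an adder can be defined by
$$\langle\!\langle x, y \rangle\!\rangle\ \mathit{Add}\ z \;\stackrel{\mathrm{def}}{=}\; \forall t : z_t = \mathit{add}\ x_t\ y_t \;\Leftrightarrow\; \forall t : z_t = x_t + y_t.$$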
We shall follow the convention of denoting a stream representation of a combinational circuit (such as $\mathit{Add}$) by capitalising the first letter of the corresponding `static' expression (such as $\mathit{add}$). We can also define combinators for relations on streams. Two components with a common interconnection can be described by the combinator relational composition, which is similar to the functional composition combinator defined earlier:
$$x\ (Q ; R)\ z \;\stackrel{\mathrm{def}}{=}\; \exists y : (x\ Q\ y) \wedge (y\ R\ z).$$
For combinational circuits $Q$ and $R$, it is the case that $\forall t : z_t = (q ; r)\ x_t$ implies $x\ (Q ; R)\ z$.
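A homogeneous pipe is obtained by repeatedly composing the same component using relational composition. Given that $\mathit{Id}$ represents the identity relation such that $x\ \mathit{Id}\ y \stackrel{\mathrm{def}}{=} x = y$, we have
$$Q^0 \stackrel{\mathrm{def}}{=} \mathit{Id}, \qquad Q^{n+1} \stackrel{\mathrm{def}}{=} Q ; Q^n.$$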
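Another common combinator is parallel composition, which describes two devices operating independently on the components of a stream of pairs,
$$\langle\!\langle x, y \rangle\!\rangle\ (Q \parallel R)\ \langle\!\langle u, v \rangle\!\rangle \;\stackrel{\mathrm{def}}{=}\; (x\ Q\ u) \wedge (y\ R\ v).$$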
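We shall adopt the abbreviations $\mathit{fst}\ Q \stackrel{\mathrm{def}}{=} Q \parallel \mathit{Id}$ and $\mathit{snd}\ Q \stackrel{\mathrm{def}}{=} \mathit{Id} \parallel Q$.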
The stream version of the row combinator will be needed, which can be defined as follows:
The subscript denotes the number of components in the row, and is omitted when the meaning is clear. This stream version of row can be obtained from the non-lifted version of row (equation 1 and equation 2) with the appropriate decomposition of streams of sequences into sequences of streams, and vice versa. Some discussion of this method can be found in [3].
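A delay $D$ is given by
$$x\ D\ y \;\stackrel{\mathrm{def}}{=}\; \forall t : y_t = x_{t-1},$$
and feedback is described by the $\mathit{loop}$ combinator (Figure 3). $\mathit{loop}\ (R ; \mathit{fst}\ D)$ corresponds to a circuit with a delay on the feedback path, a standard state machine configuration. A useful result concerning rows, pipes and loops is
$$\mathit{loop}\ (\mathit{row}_n\ R) = (\mathit{loop}\ R)^n, \qquad (9)$$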
which can be verified by induction on $n$ (see Appendix). An instance of this theorem is shown in Figure 4. This theorem is important because it allows the designer to concentrate on developing the state-transition logic of a single state machine and subsequently decomposing it into a cascade of state machines. The alternative, designing and synchronising individual state machines from the outset, is usually more complex.
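Note. The $\mathit{loop}$ combinator is defined such that equation 9 is expressed in its simplest form. In Ruby [7] the loop combinator, written $\mathit{loop}'$ here to distinguish it from ours, is defined by $x\ (\mathit{loop}'\ R)\ y \stackrel{\mathrm{def}}{=} \exists z : \langle\!\langle x, z \rangle\!\rangle\ R\ \langle\!\langle y, z \rangle\!\rangle$ so that, given that $u\ R^{-1}\ v \stackrel{\mathrm{def}}{=} v\ R\ u$, the theorem $(\mathit{loop}'\ R)^{-1} = \mathit{loop}'\ (R^{-1})$ holds. The relationship between the two looping constructs is given by $\mathit{loop}\ R = \mathit{loop}'\ (R ; \mathit{Swap})$ where $\mathit{swap}\ \langle x, y \rangle \stackrel{\mathrm{def}}{=} \langle y, x \rangle$.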
(End of note.)
Developing the LRU processor
We are now ready to develop the LRU processor. There will be two phases in this development: in the first phase we specify the LRU processor and transform the specification into a combinatory expression to obtain a preliminary design; in the second phase we optimise the preliminary design by algebraic theorems to obtain a range of designs with different performance trade-offs.
Specifying the LRU processor
Our goal is to develop $\mathit{LRU}_0$, a sequential implementation of the LRU algorithm. $\mathit{LRU}_0$ should have the following characteristics: its state is the sequence being maintained, its output is the last element in this sequence, and its input is the page, if any, to insert on the next clock cycle. This circuit will be formed by adding latches and feedback paths to a purely combinational circuit, $\mathit{InsImp}$, which implements the state-transition logic.
Obtaining a preliminary design
There are many ways to implement the LRU algorithm. Since a systolic implementation is desired, we shall implement the state-transition logic specified by the function $\mathit{ins}$ by a linear array of $N$ identical cells, where $N$ is the size of the system state. The state machine $\mathit{LRU}_0$ (equation 10) can then be constructed by adding latches and feedback paths to the array of cells; developing a systolic version of this machine will consist of distributing latches between the cells.
In order to make use of the algebraic theorems for the combinators described in Section 2.2, we need to transform $\mathit{ins}$ into a form compatible with the row combinator which describes a linear array structure. This is a crucial step that demands insight: like conducting other inductive proofs, the difficulty is to find an appropriate generalisation of the induction hypothesis.
We now show by induction that $\mathit{update}$ can be implemented by a linear array of cells which will be called $\mathit{insCell}$. The definition of $\mathit{insCell}$ will be chosen so that
$$\mathit{update}\ p\ q\ b\ xs = \pi_1\ (\mathit{row}\ \mathit{insCell}\ \langle \langle p, q, b \rangle, xs \rangle). \qquad (21)$$
A schematic of the structure on the right-hand side of equation 21 is shown in Figure 5. will satisfy equation 11. To summarise, in this section we first captured the LRU algorithm as a set of recursion equations. These equations were then transformed into a combinatory expression, and during the process of transformation we determined the behaviour of the cells and the connection structure of the implementation.
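To make the shape of equation 21 concrete, the following Python sketch builds the state-transition logic from a row of identical cells and checks it against the delete-and-prepend specification. The cell `ins_cell` below is a hypothetical definition chosen for illustration; it is an assumption, not necessarily the insCell derived in the paper, and the meaning and polarity of the flag b are likewise assumed. The state is assumed to be non-empty and free of duplicates.

```python
from typing import List, Tuple

Carry = Tuple[str, str, bool]  # (p, q, b): page sought, page shifted in, "p already met"


def row(f, a, xs):
    """Linear array of cells, following equations (1) and (2)."""
    if not xs:
        return [], a
    y, z = f(a, xs[0])
    ys, b = row(f, z, xs[1:])
    return [y] + ys, b


def ins_cell(carry: Carry, x: str) -> Tuple[str, Carry]:
    """Hypothetical cell (an assumption): x is the page held by the cell."""
    p, q, found = carry
    if found:
        # p was met in an earlier cell, so this cell keeps its page.
        return x, (p, q, True)
    # Otherwise take the page shifted in, pass the old page on,
    # and report whether the old page was p.
    return q, (p, x, x == p)


def ins(p: str, xs: List[str]) -> List[str]:
    """State-transition logic: the first projection of the row's output."""
    new_state, _ = row(ins_cell, (p, p, False), xs)
    return new_state


def ins_spec(p: str, xs: List[str]) -> List[str]:
    """Specification: delete p, prepend p, and drop the last page on a miss."""
    if p in xs:
        return [p] + [x for x in xs if x != p]
    return [p] + xs[:-1]


results = {p: ins(p, ["a", "b", "c"]) for p in "abcd"}
```

Each cell needs only an equality test on page identifiers and a choice between two values, which is consistent with the claim that the cells are comparable in size to those of a shift register.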
It should be noted that only the implementation of the state-transition logic has been verified correct with respect to the LRU algorithm. In general the designer must also ensure that the system will be initialised to an appropriate state. Fortunately our LRU processor is self-initialising: it will give the correct result after $N$ insertions where $N$ is the number of cells in the processor.
Optimising the preliminary design
So far the LRU processor has been expressed as a single state machine with a single bank of latches and long feedback paths. Our next step is to decompose this state machine into a cascade of state machines, which can then be pipelined so that the clock speed is independent of the number of processors. In other words, we shall rst construct a semi-systolic array that will subsequently be made fully systolic.
To make use of the theorems in Section 2.2, we promote $\mathit{insImp}$ to work on streams by using the stream version of the components and combinators. We assume that there are $N$ components in the row of $\mathit{insCell}$, and that $N = KM$ where $N \ge K \ge 1$. An instance of $\mathit{LRU}_1$ is shown in Figure 6. Note that $\mathit{LRU}_1$ can be used to produce pipelined versions of $\mathit{LRU}_0$. The parameter $K$ controls the degree of pipelining: the array is fully pipelined when $K = 1$ and $M = N$, otherwise signal rippling through $K$ cells will occur. Moreover $\mathit{LRU}_1$ has a latency of $M + 1 = (N + K)/K$ cycles and requires $3M + N = N(K + 3)/K$ latches; hence a smaller $K$ results in a faster circuit, but the latency and the number of latches in the design will increase.
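The trade-off controlled by K can be tabulated directly from the two formulas above. The helper below is a small illustrative sketch; the function name `lru1_cost` and the dictionary keys are assumptions made for the example, and the counts are in latches and clock cycles exactly as stated in the text.

```python
from typing import Dict


def lru1_cost(n: int, k: int) -> Dict[str, int]:
    """Latency and latch count for N = n cells with K = k cells per stage."""
    if not (1 <= k <= n) or n % k != 0:
        raise ValueError("require N >= K >= 1 and N = K * M for an integer M")
    m = n // k
    return {
        "M": m,
        "ripple_cells": k,         # cells a signal ripples through per cycle
        "latency_cycles": m + 1,   # M + 1 = (N + K) / K
        "latches": 3 * m + n,      # 3M + N = N (K + 3) / K
    }


fully_pipelined = lru1_cost(8, 1)
partly_pipelined = lru1_cost(8, 4)
```

For N = 8, moving from K = 4 to K = 1 shortens the combinational path from four cells to one, at the price of more latches and a longer latency.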
A designer should therefore select the value of $K$ to achieve the optimal trade-off in speed, latency and the amount of hardware for a particular LRU processor implementation. Readers are referred to [5] for additional discussion and examples on controlling pipelining in regular computational arrays.
Further refinement
Two observations will be reported in this section. First of all, one can check that a true value on the top horizontal output of the proposed architecture (Figure 6) indicates that the input page is not already residing in the primary storage. Hence our design can be used for generating requests for page replacements in the primary storage.
Next, we shall sketch how the number of latches in the LRU processor can be further reduced by adopting a two-phase non-overlapping clock scheme. In such a scheme a latch is made up of two half-latches: for instance in NMOS technology a half-latch is implemented by connecting together a pass transistor and an inverter. Two adjacent half-latches are activated in opposite phases of a two-phase clock. Of course, the speed of the system is halved as well.
Further discussion of n-slow systems can be found in [7].
Developing other systolic processors
Remember that the LRU processor has been developed in two steps: casting the algorithm in the combinator notation, and optimising the resulting combinatory expression using algebraic theorems. This is a general strategy for developing systolic processors [4]; and while the first step is usually problem-dependent, the algebraic theorems used in the second step can be applied to rewrite any expression in the required form provided that the preconditions associated with the theorems (such as delay commutativity of components for pipelining theorems) are satisfied. Our optimisation of the LRU processor (Section 3.3) consists of a rewriting sequence for an expression in the form $\mathit{loop}\ (\mathit{row}\ F ; \mathit{fst}\ D)$. This optimisation can be applied to any design with its state-transition logic expressed in row. In the following we shall outline two examples, one involving a numerical algorithm and the other a non-numerical algorithm, which are amenable to this treatment. This results in $((\mathit{loop}\ (\mathit{Scell} ; \mathit{fst}\ D))^K ; D)^M$, a sorter with a serial input and a serial output, provided that the feedback latches are initialised with the greatest element given by the ordering relation.
Conclusion
Our implementation of the LRU algorithm consists of a regular array of components and is suitable for integrated circuit technology. The fully pipelined version can accept page insertions at a very high rate, comparable to the speed of a shift register. Furthermore it is very compact: for a system with $N$ pages of primary storage, it contains approximately $3N \log_2 N + N$ bits of storage (for feedback and pipelined latches) and $N \log_2 N$ exclusive-or gates for equality testing. A survey of systematic methods for systolic array design can be found in [2]. In deriving the LRU processor we adopt a simple notation to express both the algorithm and its implementation. This approach allows designs to be transformed using `traditional' mathematical manipulations such as inductive proofs and equational reasoning. The resulting expressions are concise and can be used to generate designs with different performance trade-offs; and it has been shown that the transformation strategy is general enough to optimise other systolic architectures. Currently tools [6] are being prototyped to support this style of systolic processor development.
The proof is by folding and unfolding the definitions of $\mathit{update}$ and $\mathit{delete}$.
