We present a hardware-oriented priority queue algorithm requiring n/2 comparators and swappers to maintain an n-item queue. It supports two operations, insert and extract minimum (or alternatively, extract maximum), both of which operate in a single cycle. Thus, sorting time is O(n).
Introduction
A priority queue is an essential component in many software systems. This paper was motivated by the apparent lack of a priority queue algorithm that could be efficiently implemented in hardware. Such a device could be used in a wide range of applications, from rapid scheduling (e.g. for multithreaded processors [Moo94] or ATM network routers [The93]) to event timers which can efficiently handle multiple events.
The next section presents a statement of objectives. Section 3 reviews background material, starting with a hardware view of the popular software priority queue, the heap sort. Then the hardware-oriented rebound (section 3.3) and up/down (section 3.4) sorters are discussed because they provide inspiration for the tagged up/down sorter (section 4). Whilst the tagged up/down sorter is a deceptively simple algorithm, it has a complex behaviour. To show that the algorithm conforms to the objectives, a machine-checked formal proof of correctness is given (section 5). Finally, implementation strategies are presented in section 6 and conclusions are drawn in section 7.
Objectives
There are two objectives:
1. To design a device which can perform an insert or extract minimum (or as a variant, an extract maximum) operation every clock cycle.
2. Records with identical keys should be extracted in FIFO order of insertion (particularly useful for some scheduling operations).
Manuscript received January, 1995. This work was supported in part by the UK EPSRC in the form of a studentship (91307980) and a grant (GR/J 11140).
† S.W. Moore
Background
There are many hardware sorting techniques, most of which aim to sort a complete set of data in the minimum time using as little hardware as possible (e.g. Batcher sorting networks [Bat68, Dij87], heap sort on a systolic array [Lin93] and others [BDHM84]). Unfortunately these do not meet our first objective of single cycle insertion and extraction. In order to meet the first objective it is essential that any number inserted must be compared (and possibly swapped) with the current minimum value. An obvious solution would be to maintain a sorted list, but this would require n-1 compare and swap units to sort n numbers in a single cycle.
Non-scalable solutions
There are several non-scalable solutions. For example, if key size can be kept very small then a FIFO may be allocated for every possible key [HPDL92]. Sorting is then just a multiplexing operation. Alternatively, if there is a wide range of possible keys but only a small number of records then a parallel search through all the records, using a content addressable memory structure, could be used [HK89].
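The FIFO-per-key scheme can be illustrated with a minimal software sketch. This is not the hardware design of [HPDL92]; the class name `BucketPriorityQueue` and the `num_keys` parameter are hypothetical, and the loop over buckets stands in for the multiplexer.

```python
from collections import deque


class BucketPriorityQueue:
    """One FIFO per possible key (illustrative; practical only for tiny key ranges)."""

    def __init__(self, num_keys):
        # num_keys is an assumed parameter: keys are integers in range(num_keys).
        self.fifos = [deque() for _ in range(num_keys)]

    def insert(self, key, data):
        # Records with the same key queue up in arrival order.
        self.fifos[key].append(data)

    def extract_min(self):
        # The "multiplexer": select the first non-empty FIFO, lowest key first.
        for key, fifo in enumerate(self.fifos):
            if fifo:
                return key, fifo.popleft()
        raise IndexError("extract from empty queue")
```

Storage grows with the number of possible keys rather than the number of records, which is why the approach does not scale to wide keys.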
If fast insertion is required but slow extraction is sufficient then a priority packet queue may be used [PF95] . In such a scheme inserted keys are compared with the current minimum and the larger result is buffered. Extraction picks up the current minimum and then an exhaustive search of the buffer is performed to find the new minimum, either serially or with some degree of parallelism.
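A software sketch of this fast-insert, slow-extract behaviour follows. It is only an illustration of the idea described above, not the design of [PF95]; the class and attribute names are hypothetical, and the exhaustive search is done serially.

```python
class PacketQueue:
    """Fast insert, slow extract (illustrative sketch)."""

    def __init__(self):
        self.current_min = None  # (key, data) of the current minimum, or None
        self.buffer = []         # every other record, unsorted

    def insert(self, key, data):
        record = (key, data)
        if self.current_min is None:
            self.current_min = record
        elif key < self.current_min[0]:
            # One comparison: the larger of the two records is buffered.
            self.buffer.append(self.current_min)
            self.current_min = record
        else:
            self.buffer.append(record)

    def extract_min(self):
        if self.current_min is None:
            raise IndexError("extract from empty queue")
        result = self.current_min
        if self.buffer:
            # Exhaustive (here serial) search of the buffer for the new minimum.
            index = min(range(len(self.buffer)), key=lambda i: self.buffer[i][0])
            self.current_min = self.buffer.pop(index)
        else:
            self.current_min = None
        return result
```

Note that this sketch makes no attempt to preserve FIFO order among equal keys.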
Variations on the heap sort
Typically software implementations of priority queues utilise the heap sort technique which takes O(log(n)) time to insert or extract [CLR91] . Insertions are made at the bottom of the heap and the heap is then massaged into a correct ordering (this process is sometimes called the "heapify" function). Extraction of the minimum is from the top and the hole it leaves is filled by a value from the bottom followed by an invocation of the heapify function.
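For comparison, a conventional software priority queue can be sketched with Python's standard `heapq` module. The sequence counter used to obtain FIFO order among equal keys is an addition for illustration (a plain binary heap does not guarantee it), and the class name is hypothetical.

```python
import heapq
from itertools import count


class HeapPriorityQueue:
    """Binary-heap priority queue with O(log n) insert and extract."""

    def __init__(self):
        self._heap = []
        self._sequence = count()  # tie-breaker: order of insertion

    def insert(self, key, data):
        # Entries are ordered by key first, then by insertion order.
        heapq.heappush(self._heap, (key, next(self._sequence), data))

    def extract_min(self):
        key, _, data = heapq.heappop(self._heap)
        return key, data
```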
A hardware variant to meet the objectives would require insertion and extraction initiated from the top of the heap. A dedicated processing element (PE) could be placed at each level of the tree structure within the heap. Thus, large values (assuming an extract minimum is required) would ripple down through the levels dislodging even larger values and settling in their place.
The only problem now is to maintain a balanced heap in order to prevent the algorithm degenerating into a sorted list structure. One approach would be to maintain a count of the number of nodes below every node, so that at each level of the tree a decision can be made about which lower level should store the next value. This appears to be inefficient in terms of storage, but it should be noted that the number of PEs and the size of the counter for each node would grow as O(log(n)). Thus, in terms of silicon real-estate this would work well for large datasets. Unfortunately an insertion or extraction takes at least two cycles (read and examine followed by a write).
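The top-down insertion with per-node counts might be sketched as follows. This is a software illustration under stated assumptions: the `Node` class, the `insert_from_top` function and the rule of steering towards the child with the smaller count are hypothetical choices, and extraction is not shown.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.count = 0      # number of nodes below this one
        self.left = None
        self.right = None


def insert_from_top(root, key):
    """Insert key starting at the root; returns the root of the heap."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            # The smaller value settles here; the larger one ripples down.
            node.key, key = key, node.key
        node.count += 1
        if node.left is None:
            node.left = Node(key)
            return root
        if node.right is None:
            node.right = Node(key)
            return root
        # Assumed balancing rule: steer towards the subtree holding fewer nodes.
        if node.left.count <= node.right.count:
            node = node.left
        else:
            node = node.right
```

Each iteration of the loop corresponds to the work one level's PE would do, which is where the read-then-write cost per level arises.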
The rebound sorter
The rebound sorter was proposed by T.C. Chen et al. [BDHM84] and improved upon by Ahn and Murray [AM89]. Whilst it is unsuitable for our application it forms the basis for more suitable approaches.
The basic sorting element (see figure 1) consists of two memory elements capable of storing one word, a comparator and various data paths. Incoming data consists of two words: a word of key and a word of associated data to form a record. Records are inserted key first followed by the data on the subsequent cycle. The comparator is used to compare keys stored in the $L_n$ and $R_n$ parts of the sorting element in order to determine the direction that the (key, data) pairs should take; this is known as the decision cycle. The values input in the following cycle will be the associated data, so the decision made in the current cycle is used again in order that the data follows its key; this is known as the continuation cycle. Figure 2 illustrates the sorting behaviour. The principle of the algorithm is that incoming values proceed down the left side until they rebound off the bottom (hence the name) or hit a larger value on the diagonally lower right.
It can be seen that records take two cycles to insert or extract and all of the insertions must take place followed by all the extractions. Thus, this algorithm does not meet our objectives. Furthermore, it should be noted that n-1 comparators are required to sort n records and that these comparators are only used every other cycle.
Figure 2. An example of a rebound sort (time steps t0 to t15; n' denotes the data associated with key n; the figure marks comparison points, the result of making a decision, and the continuation from the last decision).
The up/down sorter
The up/down sorting algorithm was originally designed to be implemented in bubble memory technology [LCW81]. It is constructed as a linear array of sorting elements in a similar manner to the rebound sorter described in the previous section. However, (key, data) pairs are inserted in parallel and the sorting element (see figure 3) is more complex, primarily because of the implementation technology.
Initially all of the sorting elements contain infinity, which may be indicated by the maximum possible number. An inserted value arrives in $A_n$. Simultaneously a copy of $C_n$ is made to $B_n$, and $D_n$ is transferred to $A_{n+1}$. A compare and steer operation takes place, resulting in the maximum of $A_n$ and $B_n$ being transferred to $D_n$ and the minimum of $A_n$ and $B_n$ being transferred to $C_n$. Extraction similarly involves $C_n$ being removed or transferred to $B_{n-1}$, $D_n$ copied to $A_n$ and $C_{n+1}$ transferred to $B_n$, followed by the compare and steer operation. An example is given in figure 4.
Interestingly this algorithm allows insert and extract operations to be interleaved, and only requires n/2 sorting elements to sort n numbers. Unfortunately this implementation uses four storage areas per sorting element, but if implemented in digital electronics this may be reduced to just two storage areas. The next section abstracts this algorithm, determines that FIFO ordering of identical keys is not maintained and suggests solutions. Then a clocked digital implementation is presented.
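The register transfers described above can be simulated in a few lines. The following is a behavioural sketch only: the class name `UpDownSorter`, the use of floating-point infinity for empty cells and the overflow and underflow checks are assumptions made for illustration, and keys only (no data words) are modelled.

```python
INF = float("inf")  # assumed stand-in for "the maximum possible number"


class UpDownSorter:
    """Behavioural model of a linear array of up/down sorting elements."""

    def __init__(self, elements):
        self.c = [INF] * elements  # C_n: smaller value held by element n
        self.d = [INF] * elements  # D_n: larger value held by element n

    def _compare_and_steer(self, a, b):
        # In every element the minimum goes to C_n and the maximum to D_n.
        self.c = [min(x, y) for x, y in zip(a, b)]
        self.d = [max(x, y) for x, y in zip(a, b)]

    def insert(self, key):
        if self.d[-1] != INF:
            raise OverflowError("sorter is full")
        a = [key] + self.d[:-1]   # new value enters A_0; D_n moves to A_{n+1}
        b = list(self.c)          # C_n is copied to B_n
        self._compare_and_steer(a, b)

    def extract_min(self):
        if self.c[0] == INF:
            raise IndexError("sorter is empty")
        result = self.c[0]        # C_0 leaves the array
        a = list(self.d)          # D_n is copied to A_n
        b = self.c[1:] + [INF]    # C_{n+1} moves to B_n
        self._compare_and_steer(a, b)
        return result
```

With four elements the model holds up to eight keys, and inserts and extracts may be interleaved freely.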
Ensuring FIFO ordering
FIFO ordering could be assured by associating an order of entry number with each record. However, a cleaner solution is to tag records by setting a single tag bit when they arrive on the right so that if they are swapped to the left they can be forced to swap back to the right on the next cycle. This works because once a record arrives on the right it must be sorted with respect to the other keys on the right. If a record gets swapped to the left, then on the next cycle (regardless of whether an insert or extract takes place) it will be compared with the right value which was previously physically below it. Thus the right key must be either greater than the one on the left or have the same key. However, the record on the right was inserted later than the record on the left. Therefore, a swap must be performed if the ordering on the right is to be maintained (see figure 7 for an example). This is formally proved in the next section. The tagged up/down sorting algorithm may thus be defined as a two stage process: 
Figure 7. Using tagging to ensure FIFO ordering
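The tagging idea can be explored with a small list-pair model. This is a sketch reconstructed from the prose description, not the authors' definition: it assumes each operation first shifts one side (insert pushes onto the top of the left, extract pops the top of the right) and then compares each level, swapping when the left key is smaller or the left record is tagged. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Record:
    key: int
    data: Any = None
    tag: bool = False  # set once the record has arrived on the right


class TaggedUpDownSorter:
    """List-pair model; index 0 is the top of each side."""

    def __init__(self) -> None:
        self.left: List[Record] = []
        self.right: List[Record] = []

    def _compare_and_swap(self) -> None:
        # A bottom-left record with no partner moves across to the right.
        if len(self.left) > len(self.right):
            record = self.left.pop()
            record.tag = True
            self.right.append(record)
        for i in range(len(self.left)):
            l, r = self.left[i], self.right[i]
            # Swap if the left key is smaller, or if the left record is
            # tagged (it was displaced from the right on an earlier cycle).
            if l.key < r.key or l.tag:
                l.tag = True
                self.left[i], self.right[i] = r, l

    def insert(self, key: int, data: Any = None) -> None:
        self.left.insert(0, Record(key, data))  # shift the left side down
        self._compare_and_swap()

    def extract_min(self) -> Record:
        if not self.right:
            raise IndexError("extract from empty queue")
        record = self.right.pop(0)               # shift the right side up
        self._compare_and_swap()
        return record
```

Note that an untagged left record with a key equal to its right-hand partner is left in place, so the earlier record stays on the extraction side.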
Proof of correctness of the algorithm
Formal verification of an algorithm entails making a formal definition of the algorithm and proving that, under particular constraints, certain necessary (and formally defined) properties are assured. These necessary properties give a more abstract specification of the behaviour of the algorithm. In this work, the key property is that the least record currently in the queue is the one returned by the extract operation. An invariant is defined to specify well-formed states of the queue and it is proved that the key property holds for these states. The invariance is proved by showing that it holds on an empty queue and is maintained by both insert and extract operations.
The proof of these properties is realised using the HOL system [GM93], a high integrity machine-implementation of a classical higher order logic. The HOL system has been used often for reasoning about properties of hardware designs [Gor86, Coh88, Coh89, Gra92, Mel93]. The scope of other applications can be gleaned from the proceedings of the annual HOL Users Group Workshops [AJLW91, CG92, JS93, MC94]. The use of this system lends a high level of assurance that the proof is valid.
Using formal methods demands a precise and complete specification of the algorithm's data structures and its operations, and likewise of the invariant and desired properties. Concepts expressed informally in natural language need translation into precise logical expressions, and these must be subjected to examination to confirm that they capture the intended concepts adequately. Hence in presenting the verification we dwell at length on the representation of the queue and the definition of the invariant.
Representing the queue and algorithm
Both insert and extract operations involve shifting one side of the stack one position relative to the other. We represent the nonempty locations of the queue as a pair of lists of records. Inserting a record consists of augmenting the left list with a new record at its head, while extraction removes the head record from the right list. Both are followed by a compare and swap function passing over the list pair.
A record datatype has a key field of type :num (natural numbers), a data field of arbitrary type, a tag field of type :bool, and a timestamp field also of type :num. The timestamp serves only to record the order of insertion of records for specifying the required FIFO behaviour when two records have the same key, and is used by the algorithm specification. Record field selectors key, time, and tag, and a tagging function set_tag are defined, all with the obvious meanings.
A few auxiliary functions are needed to define operations on lists and list pairs. We use the symbol ":" as the infix (cons) operator that adds a new element at the head of a list, and "::" for the infix append operator. The (higher order) function Map applies a function to every element in a list.
The meaning of the function is expressed below, representing the primitive recursive definition that exists in the HOL system. Note that function application is indicated by juxtaposition, lists are enclosed by "[" and "]", elements are separated by ";", and the empty list is "[]".
Defining an invariant
The invariant captures many intuitions about the queue operation. These include that all records on the right are ordered, that tagged records on the left are BELOW the next lower record on the right, and that pairs of records at each level are ordered. Additionally, the right side has as many or just one more record than the left side, untagged records in the left must have a later timestamp than every lower record with the same key in either side, and every record on the right is tagged. Formulating precise definitions of these properties relies on the Map functions described above. Each pair of records at the same depth in the queue is ordered, with the one on the right BELOW the left one. Two more predicates are required. The timestamp must reflect the order of insertion of records. Thus the timestamp of an inserted record must be later than that of every record with the same key already in the queue. Also, records are not tagged prior to insertion.
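\[ \mathit{load\_constraint}\; x\; (ll, rr) \;\stackrel{\mathrm{def}}{=}\; \neg(x.\mathit{tag}) \;\wedge\; \mathrm{All}\,(\mathrm{Map}\,(\lambda a.\,(x.\mathit{key} = a.\mathit{key}) \Rightarrow (a.\mathit{time} < x.\mathit{time}))\; ll) \;\wedge\; \mathrm{All}\,(\mathrm{Map}\,(\lambda a.\,(x.\mathit{key} = a.\mathit{key}) \Rightarrow (a.\mathit{time} < x.\mathit{time}))\; rr) \]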
Finally, the Least predicate expresses that a record is the minimum with respect to the BELOW ordering for all records in the queue.
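\[ \mathrm{Least}\; x\; (ll, rr) \;\stackrel{\mathrm{def}}{=}\; \mathrm{All}\,(\mathrm{Map}\,(\lambda a.\,\mathrm{BELOW}\,(x, a))\; ll) \;\wedge\; \mathrm{All}\,(\mathrm{Map}\,(\lambda a.\,\mathrm{BELOW}\,(x, a))\; rr) \]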
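The two predicates above can be mirrored as executable checks, which is convenient for testing a model of the queue. This is an informal transliteration rather than the HOL text. In particular the `below` function is an assumed stand-in for BELOW (smaller key, or equal key and earlier timestamp), since the ordering is not defined in this section; the `Record` tuple is likewise hypothetical.

```python
from collections import namedtuple

# Hypothetical record shape: only the fields used by the predicates.
Record = namedtuple("Record", ["key", "time", "tag"])


def below(a, b):
    # ASSUMPTION: BELOW orders by key, then by time of insertion.
    return a.key < b.key or (a.key == b.key and a.time < b.time)


def load_constraint(x, ll, rr):
    # x is untagged, and every queued record with the same key is older.
    def older_if_same_key(a):
        return (x.key != a.key) or (a.time < x.time)
    return (not x.tag) and all(map(older_if_same_key, ll)) \
        and all(map(older_if_same_key, rr))


def least(x, ll, rr):
    # x precedes every record on both sides of the queue.
    return all(below(x, a) for a in ll) and all(below(x, a) for a in rr)
```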
Results
The correctness result comprises three theorems. (Note that HOL theorems are identified by the $\vdash$ symbol.) These theorems express that the invariant holds on an empty queue, it is preserved through both queue operations, and it assures the extracted record is the least.
load constraint x(ll; rr)) 
In the other case we split the proof into requirements on the heads of the list and the rest, with the latter solved by theorem (2). The argument for the separate requirements parallels those for theorem (2), with the added premises that the new record is not tagged, and has a later timestamp than any record with the same key already in the queue.
The next theorem assures that the top right record is the Least with respect to the BELOW ordering.
\[ \mathrm{Invariant}\,(ll, (r : rr)) \Rightarrow \mathrm{Least}\; r\; (ll, rr) \]
Since the right side is ordered, and record pairs at each level are ordered, the result follows by the transitivity of BELOW.
Theorem (1c) combines the results from theorems (2) and (3), and the definition of Extract.
The proofs of all theorems have been completed using the HOL system. The informal proof sketches presented above outline the reasoning behind the mechanical proofs, and reflect the sequence of proof steps (tactics) applied. We submit that the formal proof lends a very high assurance of the validity of the proof. Such assurance cannot replace peer review in evaluating results, but has been effective in discovering omissions in informal proofs developed prior to and along with the formal proof. The invariant was strengthened in response to each omission, and the final definition was marked version number five. This demonstrates the practical advantage of machine-checked proof, even when the subject is a relatively simple system.
We observe that the result verifies the abstract algorithm. The results can be applied to the verification of a concrete design by incorporating limits on the number of records, thus constraining both insert and extract operations, and defining an abstraction from the loaded cells of a queue implementation to the list pair representation. This remains as future work. 
Controlling the single cycle design
The two crossbars are controlled by $x$, which maps $A_n$ to the left and $B_n$ to the right if $x = \mathit{true}$; otherwise $A_n$ maps to the right and $B_n$ to the left. Insertion and extraction are controlled by the insert and extract signals, which are mutually exclusive. An insert or extract is performed by pulsing the appropriate control line high (see figure 10), which clocks the required latches for $A_n$, $B_n$ and $x$ on the falling edge. The signals ac, atc, bc and btc control the clocking of the $(A_n, at_n)$ and $(B_n, bt_n)$ latches. If $x = \mathit{true}$ then the left value is in A, so ac and atc will be clocked if insert is pulsed. If $x = \mathit{false}$ then the right value is in A, so ac and atc will be clocked if extract is pulsed; however, to force tagging of right hand side values, atc will also be clocked if insert is pulsed and the OR gate arrangement into $at_n$ will set the tag bit. Thus, the tag is set on the following cycle. The corresponding logic is required for B, but with $\neg x$. The control equations for the one step control logic are defined by (assuming that the flip-flops are negative edge-triggered, i.e. they latch on the falling clock edge):
let x = control for crossbars: (lout, rout) = if (x) (1)
Discussion of the operation of the single cycle design
First we consider the insert operation. The record to be inserted is presented at $l_n$ (see figure 9), the insert signal is pulsed true and extract remains false. The latch used to hold the record will depend upon the value of $x$, which is determined by the contents of $(A_n, at_n)$, $(B_n, bt_n)$ and oldx before the insert takes place. If, for example, we take $x = \mathit{true}$ (so the $A_n$ and $at_n$ latches are holding the left record) and perform an insert (so extract = false) then the control equations (2) through (5) become:
Thus, since $x = \mathit{true}$ the new record (at $l_n$) will be placed on the inputs of $A_n$ and $at_n$, which will be latched into place on the falling edge of the insert signal by ac and atc (the original record in $(A_n, at_n)$ being propagated to the next sorting element). We can also see that latch $B_n$ is not clocked because bc remains false, but that the tag bit is set by the OR gate arrangement into $bt_n$ and the clocking signal btc.
On the falling edge of insert the current value of $x = \mathit{true}$ is transferred to the variable oldx and the next value of $x$ is calculated from (1):
\[ x = ((A_n.k = B_n.k) \vee (A_n.k > B_n.k)) \wedge \neg A_n.t \]
Thus, the crossbar only causes a swap ($x$ goes from true to false) if $(A_n.k < B_n.k) \vee at_n$, which conforms with the algorithm in section 4.2. Furthermore, it should be noted that if a swap has occurred then $(B_n, bt_n)$ has been remapped from the right output ($r_n$) to the left output ($l_{n+1}$) and that this record has been correctly tagged. Likewise, if we had started with $x = \mathit{false}$ then similar confirmation would be obtained.
Now consider the extract operation. If we start with $x = \mathit{false}$ (so the $A_n$ and $at_n$ latches are holding the right record and $r_{n+1}$ is the input) then equations (2) through (5) become:
Thus, $r_{n+1}$ will be latched into $A_n$, and the tag bit $at_n$ will be set, on the falling edge of extract. At this point the value of oldx will be set to false, resulting in the following calculation of the next value for $x$:
\[ x = (A_n.k > B_n.k) \vee \neg B_n.t \]
It can be seen that the conditions for $x$ to change from false to true, thereby causing a swap, correspond with the specification in section 4.2 (remembering that oldx = false, so $B_n$ was mapped onto the left and $A_n$ onto the right). Furthermore, the value shifted in has correctly had its tag set.
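As a small sanity check on the insert case discussed above (oldx = true, so A holds the left record), the next-state expression for x and the stated swap condition can be compared exhaustively over a few key values. The function names are hypothetical and the check covers only that one case; it is a sketch, not part of the authors' verification.

```python
from itertools import product


def next_x_when_a_is_left(a_key, a_tag, b_key):
    # Next crossbar state for oldx = true, as given in the text:
    # x = ((A.k = B.k) or (A.k > B.k)) and not A.t
    return (a_key == b_key or a_key > b_key) and not a_tag


def swap_needed(left_key, left_tag, right_key):
    # Swap condition stated in the text: (A.k < B.k) or at
    return left_key < right_key or left_tag


# x should stay true exactly when no swap is needed, so any case where the
# two functions agree would be a mismatch.
mismatches = [
    (a, t, b)
    for a, t, b in product(range(3), (False, True), range(3))
    if next_x_when_a_is_left(a, t, b) == swap_needed(a, t, b)
]
```

An empty `mismatches` list indicates that the two formulations are complementary for all the cases tried.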
Implementing the single cycle design
A single cycle implementation has been produced based upon the schematic of figure 9 and the control equations in section 6.1. Mentor Graphics' GDT ECAD system was used with ES2's 1 µm 2-layer metal CMOS technology files.
The size of the key and length of the sorting structure were varied to assess scalability. Results showed that performance remained virtually constant as the length grew, the only difficulty being efficient distribution of insert and extract signals for very long structures. Performance decreases with key size due to the comparators. However, careful comparator design reduces this to O(log(key size)). Silicon area grows almost linearly with the length and key size.
A detailed analog simulation was undertaken on a sorter with 8 bits of key and data and a length of 8 (i.e. it can sort up to 16 records). The automatically routed design, using minimum size transistors, consumed a silicon area of 5.7 mm² without pads. The cycle time is approximately 10 ns. It is anticipated that a full custom implementation using dynamic logic would reduce the silicon area and improve the cycle time.
Conclusions
We have presented the algorithm, formal proof of correctness and clocked digital implementation for the tagged up/down sorter. The algorithm requires just n/2 comparators in order to sort n records. We have fulfilled our objectives of single cycle insert and extract operations. Furthermore, extract always removes the record with the least key, and in the case of repeated keys FIFO ordering is maintained.
A formal verification of the operating properties of the single cell design described in section 6.1 has been completed but is not presented. Future work will include extending this verification to an n-element sorter implementation.
We are also interested in exploring self-timed implementations of the tagged up/down sorter.
