Systolic Priority Queues by Leiserson, Charles E.
SYSTOLIC PRIORITY QUEUES
Charles E. Leiserson 
Department of Computer Science 
Carnegie-Mellon University 
Pittsburgh, Pennsylvania 15213
Copyright (c) 1979 by Charles E. Leiserson
Reproduced by Permission
This research is supported in part by the National Science Foundation under Grant 
MCS 75-222-55, the Office of Naval Research under Contract N00014-76-C-0370,
NR 044-422, and by the Fannie and John Hertz Foundation. 
CALTECH CONFERENCE ON VLSI, January 1979 
1. Introduction
Very large scale integrated (VLSI) circuit technology has made it possible to build 
multiprocessor hardware devices to aid in the rapid solution of sophisticated problems. 
An algorithms designer wishing to take full advantage of the massive parallelism offered 
by VLSI must address geometric issues hitherto relegated to layout artists. The reason 
for this is that VLSI is a planar technology in which the interconnections among 
components on a chip may cost more than the components themselves. The designer of a 
multiprocessor algorithm to be implemented in this technology must consider the 
complexity of the data paths between processors in evaluating the algorithm. 
Many programming applications require the ability to insert records into a set, and at
any time to retrieve from the set the record having the smallest key according to some 
ordering. A data structure that provides such services is called a priority queue. (See 
Knuth [1973], pp. 150-152 and Aho, Hopcroft, and Ullman [1974], pp. 147-152.) The 
operation INSERT(Q,c) replaces the set Q with the set Q ∪ {c}. The operation
EXTRACT_MIN(Q) returns the smallest element c of Q and replaces Q with Q - {c}. This
paper shows how high-performance priority queues can be built using the VLSI 
technology. 
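
The two definitions above can be read directly as operations on a set. The following is a minimal sketch of that reading in Python; the function names `insert` and `extract_min` are chosen here for illustration, and the linear-time `min` is only a specification of the behavior, not a statement about how a fast queue would be built.

```python
def insert(queue: set, c) -> None:
    """INSERT(Q,c): replace Q with Q ∪ {c}."""
    queue.add(c)


def extract_min(queue: set):
    """EXTRACT_MIN(Q): return the smallest element c of Q and replace Q with Q - {c}."""
    c = min(queue)
    queue.remove(c)
    return c


q = set()
for key in (7, 2, 5):
    insert(q, key)
smallest = extract_min(q)
```

After the loop, `smallest` holds the least of the three keys and the set holds the other two.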
Section 2 of this paper discusses systolic systems, the model of parallel computation 
used for this work. Section 3 presents a systolic array implementation of a priority 
queue. Section 4 shows how multiple priority queues can be implemented as a single 
device that shares processors among the queues. The organization of the shared 
structure is presented in Section 5. Section 6 deals with the geometric layout of the
multiple queue device in VLSI. The conclusion is presented in Section 7.
2. Systolic Systems 
A systolic system is a network of processors that rhythmically compute and pass data 
among themselves. The analogy is to the rhythmic contraction of the heart which pulses 
blood through the circulatory system of the body. Each processor in a systolic network 
can be thought of as a heart that pumps multiple streams of data through itself. The 
regular beating of these parallel processors keeps up a constant flow of data throughout 
the entire network. As a processor pumps data items through, it performs some 
constant-time computation and may update some of the items.
INNOVATIVE LSI DESIGNS SESSION
Systolic systems provide a realistic model of computation which captures the concepts
of pipelining, parallelism, and interconnection structures. Kung and Leiserson [1978]
demonstrates that many basic matrix computations can be performed by systolic systems
whose underlying network is array structured. These systolic arrays are suitable for
implementation as VLSI hardware devices. This paper will show the utility of systolic
trees. 
Unlike the closed-loop circulatory system of the body, a systolic computing system 
usually has ports into which inputs flow, and ports from which the results are retrieved. 
Thus a systolic system can be a pipelined system - input and output occur with every 
pulsation. This makes them attractive as peripheral processors attached to the data 
channel of a host computer. Figure 1 illustrates how a special-purpose systolic device
might form a part of a PDP-11 system. A systolic system might be attached directly to 
the CPU of a Von Neumann machine, much as a floating-point processor may be added to 
extend the instruction set of a computer. 
[Diagram: a CPU, primary memory, a systolic device, a disk, and a tape attached to a common UNIBUS.]
Figure 1: A systolic device connected to the UNIBUS of a PDP-11.
The activities of the processors in a systolic system can be assumed to be
synchronous. With each pulse of a clock, a processor executes the same constant-time
program. Furthermore, each processor is only allowed a fixed number of input and
output lines and a constant amount of local storage. It is possible to view the processors
as being asynchronous, each computing its output values when all its inputs are available,
as in the data flow model. For the results of this paper, the synchronous approach is
more direct and intuitive. 
3. A Simple Systolic Priority Queue*
A linear systolic array can implement a fast priority queue. Each processor in this 
array has two registers A and a, and each processor can access the registers of its two
neighbors, as shown in Figure 2. The A registers hold elements in the queue in sorted
order, with the smallest element in A_1. The a registers contain elements that are being
inserted into the queue. Initially, all the elements in the queue are +∞. The priority
queue operations INSERT and EXTRACT_MIN are performed by the user at the left end in
the diagram. As items are inserted by the host, they displace overflow elements which
are output at the right end. Normally, the overflow element will be +∞, but when this is
not the case, a real overflow has occurred.
[Diagram: a linear array of processors with the host computer at the left end and the overflow output at the right end.]
Figure 2: A simple systolic priority queue.
Even and odd numbered processors alternately pulsate, each time executing the 
following: 
1. a_i ← a_{i-1}.
2. Arrange the elements in A_{i-1}, A_i, and a_i so that A_{i-1} ≤ A_i ≤ a_i.
Processor 0 is a dummy processor which does not execute any code, but whose registers 
can be altered by the host machine. The array pulses twice each time an operation is 
performed by the host machine, once for odd numbered processors and once for those
with even positions. The operation INSERT(Q,a) is implemented by placing the item a in
a_0 and -∞ in A_0 just before processor 1 pulses. Each element travels to the right until
it finds its place in the array. 
By loading A_0 and a_0 with +∞, the pulsation of the systolic array causes
EXTRACT_MIN to be performed, the minimum value being found in A_0. With each pair of
pulsations the systolic array is ready to execute another INSERT or EXTRACT_MIN
operation. Figure 3 shows several steps in the execution of this systolic array. The
initial configuration in the figure shows the insertion of items 9 and 15 already in
progress. Although it may take an element a long time to find its place in the systolic
array, to the host computer an INSERT operation appears to take only constant time.
Since the minimum element in the queue is always at the front, an EXTRACT_MIN
operation also appears to take constant time. The operation of the systolic array is
pipelined so that no degradation occurs even when the host executes many priority
requests in a row. Thus we may say that the systolic array has a response time which is
a constant, independent of the length of the array.
[Diagram: register contents of processors 0 through 7 at steps 0 through 4.2 as the host issues INSERT(6), EXTRACT_MIN, and INSERT(8).]
Figure 3: Several steps in the execution of the systolic array shown in Figure 2.
*This section describes research done jointly with H.T. Kung.
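
The two-step pulse and the host protocol described in this section can be tried out in software. The sketch below is a sequential Python model written for illustration, not the hardware design itself: the class name `SystolicPriorityQueue`, the `overflowed` list, and the use of `math.inf` for ±∞ are assumptions of the sketch. It relies on the observation that processors of the same parity never share a register, so updating them one after another is equivalent to pulsing them simultaneously.

```python
import math

INF = math.inf


class SystolicPriorityQueue:
    """Software model of the linear array; index 0 is the dummy processor."""

    def __init__(self, size):
        self.n = size
        self.A = [INF] * (size + 1)   # A registers: queue contents
        self.a = [INF] * (size + 1)   # a registers: items in transit
        self.overflowed = []          # finite items pushed out at the right end

    def _pulse(self, parity):
        # Processors of one parity are never neighbors, so a sequential loop
        # gives the same result as a simultaneous pulse.
        first = 1 if parity == 1 else 2
        for i in range(first, self.n + 1, 2):
            self.a[i] = self.a[i - 1]                                # step 1
            lo, mid, hi = sorted((self.A[i - 1], self.A[i], self.a[i]))
            self.A[i - 1], self.A[i], self.a[i] = lo, mid, hi        # step 2
        # If the last processor just pulsed, a finite value in its a register
        # is a real overflow.
        if self.n % 2 == parity and self.a[self.n] != INF:
            self.overflowed.append(self.a[self.n])

    def _cycle(self):
        self._pulse(1)   # odd numbered processors
        self._pulse(0)   # even numbered processors

    def insert(self, item):
        # Host loads the dummy processor, then the array pulses twice.
        self.A[0], self.a[0] = -INF, item
        self._cycle()

    def extract_min(self):
        self.A[0], self.a[0] = INF, INF
        self._cycle()
        return self.A[0]   # the minimum is found in A_0


q = SystolicPriorityQueue(4)
for item in (5, 3, 8):
    q.insert(item)
first, second = q.extract_min(), q.extract_min()
```

Each host call costs exactly two pulses regardless of `size`, which is the constant response time claimed above; an item may still be travelling to the right when the next operation is issued.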
4. The Systolic Multiqueue 
Suppose several of the simple priority queues in Section 3 are attached as a device to
a host computer. No matter how a fixed number of processors are allocated, the capacity 
of any particular queue may be exceeded while most of the other queues are empty. In 
this section a single device is presented that is capable of implementing many priority 
queues that dynamically share processors. Like the simple queue in the previous section, 
the systolic multiqueue can perform INSERT and EXTRACT_MIN for a single host
computer, on any of m queues, with a response time that is a constant, independent of 
the size of the queue. 
Figure 4 illustrates the organization of the systolic multiqueue. Each of the m queues 
to be implemented requires a systolic array of the type presented in Section 3. These 
can be accessed directly by the host computer. When a systolic array overflows, the 
overflow element travels through a switching network to a large systolic tree. Each time 
the minimum of a particular queue is extracted from the corresponding systolic array, the 
minimum of the elements that have overflowed from that queue is removed from the 
systolic tree. The internal structure of this shared overflow area is examined in Section 
5. Here, we only need to know its behavior.
The records stored in the systolic tree are the same as those in the systolic arrays, 
but an additional field is used to identify the queue from which the item originated. Thus 
items are stored in the systolic tree according to a composite record <Q,c> where c was
originally inserted by an INSERT(Q,c) operation and eventually overflowed from the 
systolic array corresponding to that queue. 
[Diagram: the systolic arrays, connected to the host computer, feed a switching network that leads to the systolic search tree.]
Figure 4: The systolic multiqueue device.
The operations that can be performed by the systolic tree are very similar to those 
commands given to the entire systolic multiqueue by the host. The composite record
<Q,a> is inserted into the systolic tree by INSERT(Q,a), and EXTRACT_MIN(Q) removes the
smallest element in the systolic tree which has Q as its first field. As will be seen in
Section 5, a systolic tree of size n can perform each operation in time O(log n). The
operations are pipelined, however, so that the systolic tree can process several
operations in parallel, waiting only constant time between successive operations. For
example, if a sequence of EXTRACT_MIN's are started constant time apart, it will take
O(log n) for the first minimum to be retrieved, but then results appear with every cycle
of the systolic tree. Thus the systolic tree provides high throughput, with O(log n)
response. 
The claim was made, however, that the systolic multiqueue had constant response time 
for each of m queues. Systolic arrays of size proportional to log n + log m are used to 
achieve this goal by satisfying any immediate requests from the host. When the host 
executes EXTRACT_MIN(Q), that operation is performed on the corresponding systolic
array. At the same time, a request is put into the systolic tree to perform an
EXTRACT_MIN(Q). A result is yielded by the systolic tree O(log n) time later and takes
O(log m) more time to traverse the switching network. It is then inserted into the
systolic array at the same end the host computer uses. Even if the host has performed
log n + log m EXTRACT_MIN(Q) operations in the meantime, the quick response systolic
array has been able to satisfy the requests. Now if the host continues to perform
EXTRACT_MIN(Q)'s, a stream of results from the systolic tree will be inserted into the
array just in time to satisfy the requests. The systolic array will always have at least 
one item in it because operations on the systolic tree are pipelined. It does not matter 
whether or not the host accesses different queues. Since it can only access one queue 
at a time, no systolic array will empty before the beginning of a stream of items from the 
systolic tree has reached the systolic array. 
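
The latency-hiding argument can be checked with a toy model. In the sketch below, written for illustration only, a `front` buffer stands in for one systolic array, `backing` stands in for the systolic tree, and a single parameter `latency` lumps together the O(log n) tree delay and the O(log m) switching delay; all of these names, and the one-request-per-cycle timing, are assumptions of the sketch. The host issues one EXTRACT_MIN per cycle, and each request also asks the tree for a replacement that arrives `latency` cycles later.

```python
from collections import deque


def simulate_extractions(items, buffer_size, latency):
    """Return what the host sees on each cycle; None marks an empty array."""
    ordered = sorted(items)
    front = deque(ordered[:buffer_size])     # smallest items, held in the array
    backing = deque(ordered[buffer_size:])   # overflowed items, held in the tree
    in_flight = deque()                      # (arrival cycle, item) pairs
    seen = []
    for cycle in range(len(ordered)):
        # Replacements from the tree that have arrived join the array. They are
        # larger than everything already there, so order is preserved.
        while in_flight and in_flight[0][0] <= cycle:
            front.append(in_flight.popleft()[1])
        seen.append(front.popleft() if front else None)
        # The same request is forwarded to the tree.
        if backing:
            in_flight.append((cycle + latency, backing.popleft()))
    return seen


enough = simulate_extractions(range(10), buffer_size=3, latency=3)
too_small = simulate_extractions(range(10), buffer_size=2, latency=3)
```

With the buffer as long as the latency, `enough` contains every item in order with no gaps; with a shorter buffer, `too_small` contains a `None` where the host found the array empty.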
The number of processors in the systolic arrays is m log n. If the size of the systolic 
tree is doubled, this means only m more processors need be added to the systolic arrays. 
The amount of sharing of processors among the m queues is clearly substantial. 
Furthermore, the systolic multiqueue will not overflow until the shared systolic tree 
overflows. In fact, overflow of the systolic tree can be handled "nicely" as will be seen 
in Section 5. 
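
The remark that doubling the tree costs only m more array processors follows from a one-line calculation, written out here with logarithms taken base 2 (an assumption, since the text does not state the base):

```latex
m \log_2(2n) \;=\; m\,(\log_2 n + 1) \;=\; m \log_2 n + m
```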
5. Systolic Trees 
It seems natural to use a tree-structured hardware device to achieve pipelined 
performance with O(log n) response time for INSERT(Q,a) and EXTRACT_MIN(Q). After all,
a software implementation on a sequential machine can guarantee O(log n) performance
by using a height-balanced binary search tree. AVL trees, 2-3 trees, and B-trees are
popular data structures which have this performance. For a sequential implementation of
a single priority queue, a heap is an attractive data structure because heap storage can 
be managed as easily as stack storage. (Aho, Hopcroft, and Ullman [1974] has a good 
presentation of several of these techniques.) Unlike most programmed implementations of 
priority queues, however, the parallel structure required by the systolic multiqueue 
cannot use a separate data structure for each queue. 
A major problem in the design of a hardware search tree is that the standard balanced 
tree schemes do not map well onto a fixed interconnection structure. A sequential 
algorithm can move the tree pointers to maintain the balance of the tree. Data usually 
remains in fixed locations. Since the "pointers" in a hardware tree are electrical wires,
data must be moved to maintain the balance of the tree. 
Because the systolic multiqueue requires the operations INSERT(Q,a) and
EXTRACT_MIN(Q), keys are considered to be from a composite record <Q,a>. A dummy
queue number +∞ is used to indicate an empty record. Records are compared by
lexicographic ordering, that is, <Q,a> < <Q',a'> if Q < Q' or if Q = Q' and key(a) < key(a').
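
The comparison rule can be written out directly. In this small Python sketch a composite record is modelled as a pair `(queue, key)`; the function name `less`, the constant `EMPTY`, and the filler key 0 in the empty record are assumptions made for illustration. Python's built-in tuple comparison happens to implement the same lexicographic rule, which the sketch uses as a cross-check.

```python
import math

EMPTY = (math.inf, 0)   # dummy queue number +inf; the key 0 is an arbitrary filler


def less(r1, r2):
    """<Q,a> < <Q',a'> if Q < Q', or if Q = Q' and key(a) < key(a')."""
    (q1, k1), (q2, k2) = r1, r2
    return q1 < q2 or (q1 == q2 and k1 < k2)


records = [(2, 3), EMPTY, (1, 9), (2, 1), (1, 4)]
in_order = sorted(records)   # tuple comparison follows the same rule
```

Sorting groups the records by queue number, orders each group by key, and leaves empty records at the far end.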
It is useful to view all operations on the tree as occurring in pairs
[EXTRACT_MIN(Q'), INSERT(Q,a)]. Normally, the paired operation involves the dummy
queue +∞. For example, when an insertion is performed on an arbitrary queue, a +∞
record is deleted by EXTRACT_MIN(+∞). If the systolic tree overflows from too many
insertions, however, this exceptional condition can be handled by the operating system of
the host computer. The job using a particular queue can be disabled and the elements in
that queue can be removed. When an EXTRACT_MIN frees up some space, the elements
of that queue can be "swapped" back in by the paired INSERT. The analogy to a virtual
memory computer which has a swapping drum is a good one. Queues can be managed
just like any other operating system resource. A small amount of bookkeeping is
required to keep, for each queue, the number of items in the tree.
One scheme for implementing the paired operations is illustrated in Figure 5. Each
processor in a systolic array is also a leaf of a systolic tree. A processor P_i contains
one record <Q_i,a_i>. The tree serves to broadcast paired operations to the processors
and to retrieve the EXTRACT_MIN results from the processors. A paired operation
[EXTRACT_MIN(Q'), INSERT(Q,a)] will reach all the processors at the same time.
Figure 5: The systolic array-tree.
Each processor P_i executes:
1. Extraction. If Q_i = Q' and Q_{i-1} < Q', then send <Q_i,a_i> up the tree as the
result of EXTRACT_MIN(Q').
2. Left shift. If Q_{i-1} ≥ Q', then shift <Q_i,a_i> left to P_{i-1}.
3. Right shift. The elements to the right of the position for the insertion slide right to
make room.
4. Insertion. If <Q_{i-1},a_{i-1}> ≤ <Q,a> < <Q_i,a_i>, then P_i gets <Q,a>.
During the first step, each processor checks to see whether it contains the item to be 
extracted. After that item is sent on its way up the tree, the elements to the right slide
left to take up the empty slot. Then the position for the insertion is determined, and the
elements to the right of that processor slide right to make room. Finally, the item to be
inserted is placed in the slot left for it. Naturally, the shifts can be optimized so that 
those elements that slide both left and right do not actually have to move. 
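
The net effect of one paired operation on the row of leaf processors can be modelled on an ordinary sorted list. The Python sketch below is a sequential stand-in, not the parallel hardware: the extraction test is the per-processor condition quoted above, while the shifts are delegated to list deletion and `bisect.insort`. The names `paired_operation` and `EMPTY`, the filler key 0 in empty records, and the `LookupError` raised when the requested queue has no record are all assumptions of the sketch.

```python
import bisect
import math

EMPTY = (math.inf, 0)   # dummy queue number +inf; the key 0 is an arbitrary filler


def paired_operation(cells, extract_queue, new_record):
    """Apply [EXTRACT_MIN(extract_queue), INSERT(new_record)] to a sorted list."""
    # Extraction: the cell whose queue matches and whose left neighbor
    # belongs to a smaller queue holds the minimum of that queue.
    hits = [i for i, (q, _) in enumerate(cells)
            if q == extract_queue and (i == 0 or cells[i - 1][0] < extract_queue)]
    if not hits:
        raise LookupError("no record for that queue")
    result = cells.pop(hits[0])        # left shift: popping closes the gap
    bisect.insort(cells, new_record)   # right shift and insertion in one call
    return result


cells = [(1, 4), (1, 9), (2, 3), EMPTY, EMPTY]
removed_empty = paired_operation(cells, math.inf, (1, 6))   # a plain INSERT(1, 6)
smallest_of_1 = paired_operation(cells, 1, EMPTY)           # a plain EXTRACT_MIN(1)
```

Because every operation removes exactly one record and adds exactly one, the list keeps a fixed length, which mirrors the fixed number of processors.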
Whereas the array-tree keeps all the data at the leaves of the tree, the systolic tree 
shown in Figure 6 keeps the data in the internal nodes. Consequently, the structure is 
more like a standard search tree. The processor at each node holds two records, and 
has connections to its father and two sons. A depth-first tree traversal that prints the 
left record, recursively visits the left son, recursively visits the right son, and then prints 
the right record will print out the values in lexicographic order. There is a good reason 
for having the pointers between the records rather than the normal search tree method 
of a record between pointers. A balancing similar to the shift step in the systolic 
array-tree can be performed top-down to permit pipelining of the paired operations. 
Each processor need only look at itself and its two sons to determine the shift. The 
topology of this tree is in some sense superior to the array-tree because there are 
fewer connections. This will be examined more closely in the next section.
Figure 6: The systolic search tree.
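
The traversal order described for the systolic search tree is easy to state in code. The Python sketch below is illustrative only: the `Node` class, its field names, and the six sample records are assumptions, chosen so that each node's two records bracket everything stored beneath it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Record = Tuple[float, float]   # (queue number, key)


@dataclass
class Node:
    left_record: Record
    right_record: Record
    left_son: Optional["Node"] = None
    right_son: Optional["Node"] = None


def traverse(node: Optional[Node]) -> List[Record]:
    """Left record, then left son, then right son, then right record."""
    if node is None:
        return []
    return ([node.left_record]
            + traverse(node.left_son)
            + traverse(node.right_son)
            + [node.right_record])


tree = Node(
    left_record=(1, 2),
    right_record=(3, 9),
    left_son=Node((1, 5), (1, 8)),
    right_son=Node((2, 1), (2, 7)),
)
visited = traverse(tree)
```

For this sample tree the traversal returns the records in lexicographic order, and the root's left record is the smallest record in the tree.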
6. VLSI Geometry of the Systolic Multiqueue 
Simple and regular interconnections in a VLSI design lead to cheap implementations and 
high densities. Communication is costly in VLSI, and as the technology improves, the time 
and energy required for communication grows in comparison with that needed for
processing. Therefore, the geometry of the systolic multiqueue must be considered in 
evaluating the cost of a VLSI implementation. 
The linearly connected systolic array easily satisfies the requirement of having a 
simple geometric realization. The number of external data paths is small as well, 
emanating only from the ends of the structure. As was shown in Kung and Leiserson 
[1978], linearly connected systolic arrays are ideal for implementation in VLSI. 
More interesting are the structures of the systolic binary trees. Mead and Rem [1978]
substantiates the assumption that communicating information from the leaf of a VLSI tree
to the root takes time proportional to the height of the tree. The fact that the root is 
the only off-chip connection is highly desirable for VLSI where the number of pins on an 
IC package is a severe constraint. 
The topology of a tree makes a planar embedding easy. In a technology where 
interconnections frequently occupy much of a chip, the simple connections of a binary 
tree leave more room for processing elements. It is easy to embed a binary tree in the 
plane using O(n log n) area for an n node tree. Figure 7 shows a geometry which
realizes this bound. In fact, it is possible to embed a binary tree in the plane using area
that is only linear in the number of nodes. This embedding is shown in Figure 8.
Figure 7: Embedding a binary tree in area O(n log n).
Figure 8: Embedding of a binary tree in linear area.
Whereas the systolic search tree presented in Section 5 can use either geometry, 
routing the linear connections in Figure 8 appears to be more complex. The tree part of 
the array-tree is used only for broadcasting, however, and a linear ordering of the
leaves need not be the natural ordering shown in Figure 5. This makes the problem 
simpler, and leads to the linear area geometry of Figure 9. 
Figure 9: The systolic array-tree embedded in linear area. 
There is an advantage to the geometry shown in Figure 7 over that in Figure 8. The
linear area solution does not permit connections between the leaves of the tree and the 
edge of the chip. Although the systolic tree in the systolic multiqueue does not need 
connections to the leaves, the O(n log n) area embedding can be used to make a chip that 
will permit a larger systolic tree to be built up from several chips based on the linear 
area embedding. Figure 10 shows how this might be done. The decomposition can be 
very efficient since the number of linear area chips dwarfs the number of O(n log n) area 
chips. 
7. Conclusion 
The systolic multiqueue can be attached to a traditional computer system just like any
other device. Because each operation on a queue takes constant time, however, it is
reasonable to connect it directly to the CPU, and make the device visible to the user by
extending the instruction set of the computer. Thus the INSERT operation might be a
three operand instruction taking a queue number, a key, and a pointer to data. In a
multiprogramming or timesharing environment there might be many users of the systolic
multiqueue at the same time.
[Figure 10 legend: O(n log n) area chip with external connections at leaves; linear area chip with external connection at root only.]
Figure 10: A large systolic tree as several VLSI chips.
A priority queue is not an obscure data structure, and the uses of priority queues are 
many. The computation time of sorting alone is sufficient to justify the systolic
multiqueue. Internal sorting normally takes O(n log n) time, but using the systolic
multiqueue, we save a factor of log n. Just insert the n items into one of the queues, and
then execute n EXTRACT_MIN's. Not only does the computation take less time, but the
load on the CPU of the host machine is lessened. External sorts frequently use priority
queues. For instance, the popular replacement selection algorithm (Knuth [1973], pp.
251-256) has a priority queue as its primary data structure.
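
The sorting recipe, n insertions followed by n EXTRACT_MIN's, is sketched below in Python with the standard `heapq` module standing in for one queue of the device. This is an assumption made purely for illustration: a software heap spends O(log n) per operation, so the sketch shows the sequence of host requests rather than the claimed saving, which depends on each request taking constant time.

```python
import heapq


def priority_queue_sort(items):
    """Sort by inserting every item, then extracting the minimum n times."""
    queue = []
    for item in items:
        heapq.heappush(queue, item)        # stands in for INSERT(Q, item)
    # range(len(queue)) is evaluated once, before any item is popped.
    return [heapq.heappop(queue) for _ in range(len(queue))]   # EXTRACT_MIN(Q)


ordered = priority_queue_sort([5, 1, 4, 1, 3])
```

The host issues 2n requests in total, so with constant response time per request the whole sort would take time linear in n.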
Many types of search can be speeded up by utilizing fast priority queues. The A* 
algorithm (Nilsson [1971], pp. 57-65), for instance, chooses the best of many possible 
alternatives at each stage of the search. Many game playing programs using alpha-beta 
search sort the moves at each level in the game tree to increase the number of cut-offs. 
As the program searches deeper and deeper, the combinatorial explosion makes this very 
expensive, and therefore the sort is frequently abandoned at greater search depths. 
This need not be the case if the computer system has a systolic multiqueue.
In relational databases, the join operation is frequently implemented by sorting on the 
chosen fields of two relations and then performing a merge. Algorithms for finding the 
minimum spanning tree or convex hull of a set of points in a plane can use the systolic 
multiqueue. A priority queue is also useful for hidden line elimination. Priority queues 
are used in operating systems for resource management. 
The systolic multiqueue provides insight into the organization of special-purpose 
multiprocessor devices with an emphasis on a VLSI implementation. The sharing of 
processors by several independent hardware structures is a key issue. A tree may not 
be the optimal shared structure for a given set of constraints. For example, a systolic 
array structure that performs like a Young tableau (Knuth [1973], pp. 48-72) might be
better under certain conditions, although asymptotically the number of processors
dedicated to a single queue grows as the square root of the number of shared 
processors rather than the logarithm. 
The systolic multiqueue can be optimized and modified in many ways. For instance, it 
is easy to convert the systolic multiqueue into a systolic multideque that implements the 
priority deque operations INSERT, EXTRACT_MIN, and EXTRACT_MAX. Sten Andler has
observed that because only one systolic array in the systolic multiqueue need operate at 
a time, the m systolic arrays of length O(log n) might be implemented using only O(log n) 
processors each having enough memory to hold m items. Various modifications can be 
made to the broadcast tree in the systolic array-tree and to the switching network in the 
systolic multiqueue. 
Advances in microelectronics have made the realization of "smart" data structures a 
practical reality. VLSI gives us the capability of building logic-in-memory hardware that
will drastically change how things are computed. Models of computation based solely on 
the Von Neumann architecture will be insufficient to evaluate algorithms. Multiprocessor 
devices like the systolic multiqueue will introduce new cost functions to the sequential 
algorithm designer. But much work must be done to define and examine the models of 
parallel computation that lie between the mathematical world of computable functions and 
the physical world of space and time. 
References 
Aho, Hopcroft, and Ullman [1974] Aho, Alfred V., John E. Hopcroft, and Jeffrey
D. Ullman, The Design and Analysis of Computer Algorithms,
Addison-Wesley, Reading, Massachusetts.
Knuth [1973] Knuth, Donald E., Sorting and Searching, Addison-Wesley, Reading,
Massachusetts.
Kung and Leiserson [1978] Kung, H. T. and Charles E. Leiserson, "Systolic Arrays (for
VLSI)," in Mead and Conway [1978].
Mead and Conway [1978] Mead, Carver A. and Lynn A. Conway, Introduction to VLSI
Systems, to be published.
Mead and Rem [1978] Mead, Carver A. and Martin Rem, "Highly Concurrent
Structures with Global Communication," in Mead and Conway [1978].
Nilsson [1971] Nilsson, Nils J., Problem Solving Methods in Artificial Intelligence,
McGraw-Hill, New York.
