We present two deterministic routing networks, the pruned butterfly and the sorting fat-tree. Both networks are area-universal, i.e., they can simulate any other routing network fitting in similar area with polylogarithmic slowdown. Previous area-universal networks were either for the off-line problem, where the message set to be routed is known in advance and substantial precomputation is permitted, or involved randomization, yielding results that hold only with high probability. Our two networks are the first that are simultaneously deterministic and on-line, and they use two substantially different routing techniques. The performance of our routing algorithms depends on the difficulty of the problem instance, which is measured by a quantity $\lambda$ known as the load factor. The pruned butterfly algorithm runs in time $O(\lambda \log^2 N)$, where $N$ is the number of possible sources and destinations for messages and $\lambda$ is assumed to be polynomial in $N$. The sorting fat-tree algorithm runs in $O(\lambda \log N + \log^2 N)$ time for a restricted class of message sets including partial permutations. Other results of this work include a new type of sorting circuit, an area-universal circuit, and an area-time lower bound for routers.
Introduction
The performance of a general-purpose parallel computer depends fundamentally on the ability of the interconnection network to quickly route arbitrary sets of messages. Considerable attention has been given to the design of sparse interconnection networks and routing algorithms for them [VB81, Ale82, Upf84, Val82, Pip84, Ran87, LMR88]. If the cost of a network is taken to be the number of nodes, networks constructed using expander graphs generally achieve the best performance for routing and related operations [AKS83, Lei85b, PU87, PU89, HB88, Her89a, Her89b, Upf89, LM89]. The basic advantage of expanders is the high bandwidth available across most cuts of the network, although designing a routing algorithm that fully exploits this bandwidth often requires considerable ingenuity.
In the more realistic VLSI model [Tho80] that we adopt in this paper, the cost of a network is the chip area it occupies in a two-dimensional layout. When the cost of the wires is taken into account, the problem of designing routing networks with the best performance has a different solution. In the VLSI model, high-bandwidth networks like expanders give good routing performance only in the special case where the number of nodes of the network is much smaller than the area cost. In a fundamental paper [Lei85b], Leiserson initiated the investigation of routing networks that are area-universal, in the sense that they can route almost as efficiently as any other network of similar area. Before we can make this notion more precise, we need a few definitions.
We model a routing network as a graph whose vertices represent constant-size logic gates, and whose edges represent wires connecting the corresponding gates. A distinguished subset of $N$ vertices called terminals perform I/O functions and can be sources or destinations of messages. A message is a triple [...] If $A$ is a subset of the terminals, a cut $S$ is a set of edges such that every path from a terminal in $A$ to a terminal not in $A$ includes some edge in $S$. The load placed on cut $S$ by a message set $M$ is defined as the number of messages that must cross it, and is denoted $l(M, S)$. The load factor of a cut is $l(M, S)/|S|$, denoted $\lambda(M, S)$. The load factor on the entire network, $\lambda(M)$, is the maximum load factor of any cut. Clearly $b\lambda(M)$ is a lower bound on the number of bit steps necessary to route $M$.
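To make the definition of the load factor of a single cut concrete, here is a minimal Python sketch. It assumes that a message is represented only by its (source, destination) pair of terminals and that the cut separates the terminal set `side_a` from all other terminals; the function names and this representation are illustrative and are not part of the paper.

```python
def cut_load(messages, side_a):
    """Number of messages whose source and destination lie on opposite sides of the cut."""
    return sum(1 for src, dst in messages if (src in side_a) != (dst in side_a))


def cut_load_factor(messages, side_a, cut_size):
    """Load of the cut divided by the number of edges (wires) in the cut."""
    return cut_load(messages, side_a) / cut_size


# Three messages on four terminals; the cut separates {0, 1} from {2, 3}.
example_messages = [(0, 2), (1, 3), (0, 1)]
example_load = cut_load(example_messages, {0, 1})
example_factor = cut_load_factor(example_messages, {0, 1}, cut_size=2)
```

The load factor of the whole network would be the maximum of this quantity over all cuts, which the sketch does not attempt to enumerate.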
Expressing an algorithm's performance in terms of the load factor measures how close to optimal it is for the network, but says nothing about how well the network can simulate other networks of similar area. To prove results about area-universality, it is helpful to have a measure of routing difficulty that reflects layout constraints. The best bound one could hope for on the layout area of an $N$-terminal router is $\Theta(N)$, and any such layout would place constraints on the time to route. In particular, a rectangle of area $O(N/2^i)$ in a hierarchical decomposition of the layout would contain $O(N/2^i)$ terminals enclosed by a perimeter of length $O(\sqrt{N/2^i})$. The following definition captures the lower bound on routing time imposed by the limited bandwidth crossing the perimeter of any such rectangle.
Let $T$ be an $N$-leaf complete binary tree with the leaves labeled $0$ to $N-1$ from left to right. Let an edge at height $h$ above the leaves have a weight of $2^{\lceil h/2 \rceil}$. Now consider the embedding of the message set's graph in $T$, where each message edge is mapped to the simple path in the tree from the leaf labeled by its source to the leaf labeled by its destination. Then $\eta(M)$, the reference load factor, is defined to be the maximum over all edges $e$ of $T$ of the congestion of $e$ divided by its weight.
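The definition of the reference load factor can be evaluated directly on small examples. The following Python sketch does so under two assumptions that are not stated in the text: a message is given as a (source, destination) pair of leaf labels, and the edge joining a node at height h (leaves are at height 0) to its parent is the edge "at height h". The function name is hypothetical.

```python
def reference_load_factor(messages):
    """Maximum over tree edges of (number of message paths using the edge) / (edge weight).

    Leaves are labeled 0..N-1 left to right, so the ancestor of leaf x at height h
    has index x >> h among the nodes of that height.
    """
    congestion = {}  # (height, node index) -> number of paths using the edge above that node
    for src, dst in messages:
        a, b, h = src, dst, 0
        # Climb from both endpoints until they meet at the lowest common ancestor.
        while a != b:
            congestion[(h, a)] = congestion.get((h, a), 0) + 1
            congestion[(h, b)] = congestion.get((h, b), 0) + 1
            a, b, h = a >> 1, b >> 1, h + 1
    # The weight of an edge at height h is 2^ceil(h/2); (h + 1) // 2 equals ceil(h/2).
    return max((count / 2 ** ((h + 1) // 2) for (h, _), count in congestion.items()),
               default=0.0)


# Two messages on a 4-leaf tree: each leaf edge carries one path, and the two
# height-1 edges (weight 2) carry two paths each.
example_eta = reference_load_factor([(0, 3), (1, 2)])
```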
The usefulness of the reference load factor stems from the following result, a simple variant of Theorem 10 in [Lei85b] . O ( m r ( N ) ) in time.
In this paper and in the rest of the literature, Proposition 1 is used to prove universality results in two steps. The first is to design an $N$-terminal router with area not much larger than $A$ in such a way that for any message set $M$, $\lambda(M) \le \eta(M)$. The second is to design a routing algorithm for the network that runs in time close to $O(\lambda(M))$. The first step was made easier by the introduction in [Lei85b] of the "fat-tree" framework for routing networks. A fat-tree is a complete binary tree with subnetworks at the nodes that perform switching functions. Two neighboring nodes in the tree are joined by a group of wires called a channel. The number of edges in a channel $c$ is referred to as its capacity, written cap($c$). The fat-trees proposed in the literature and in this paper have the useful property that the cuts corresponding to the channels are sufficient for determining the load factor. Thus choosing the capacity of a channel at height $h$ above the leaves to be $2^{\lceil h/2 \rceil}$ guarantees $\lambda(M) \le \eta(M)$. [...]ing time (with high probability). In [Gre90], area-universality is investigated under alternative assumptions about wire delay, and fat-tree networks with processors of various sizes are considered.
In Sections 2 and 3 we give the first deterministic solutions for area-universal on-line routing. We propose two routers, the pruned butterfly and the sorting fat-tree. The $N$-terminal pruned butterfly has area $O(N \log^2 N)$ and can route an arbitrary message set of polynomial size in time $O(\lambda \log^2 N)$. The sorting fat-tree routes only message sets with constant degree, but achieves a better performance on this important class of message sets. The sorting fat-tree of $N$ terminals has area $O(N \log^2 N)$ and can route on-line in time $O(\lambda \log N + \log^2 N)$. [...] sorting fat-tree actually achieves the bandwidth lower bound. Note that this special case is common: for a random permutation, the expected value of $\lambda$ is [...] Proposition 1 implies that both the pruned butterfly and the sorting fat-tree can simulate any $N$-terminal router of area $O(N \log^2 N)$ with a slowdown of $O(\log^2 N)$, or any router of area $O(N)$ with slowdown $O(\log N)$.
In the construction of the sorting fat-tree, we deploy a new type of sorting circuit of independent interest. It has area $A$ and, for any $n$ with $a \le n \le A/\log A$, it can sort $n$ words of $(\log n + O(\log n))$ bits with optimal $AT^2 = O(n^2 \log^2 n)$. Previously known VLSI circuits [Lei85a, BP85] achieve $AT^2 = O(n^2 \log^2 n)$, but for a fixed value of $A$, different circuits are needed for different values of $n$. The novel feature of our sorter is that the same circuit can process all input sizes in the given range in optimal time. In Section 4, we construct a deterministic area-universal VLSI circuit. The circuit has area $O(A)$ and can be "programmed" to simulate any circuit of area $A$ with an $O(\log A)$ slowdown. This result is in the same spirit as those on size-universal combinational boolean circuits. Finally, in Section 5 we establish an existential $AT^2 = \Omega(\eta^2 b^2 N \log^2(N/\eta^2))$ tradeoff between area and worst-case time to route message sets with reference load factor $\eta$.

Deterministic On-line Routing on the Pruned Butterfly
In this section we present the pruned butterfly switching network (Subsection 2.1) which, augmented with some auxiliary circuitry (Subsection 2.2), supports an efficient on-line routing algorithm for arbitrary message sets (Subsection 2.3).
The Pruned Butterfly
We give the name pruned butterfly to the graph $G(V, E)$ defined as follows, for $N$ a power of 4. [...]
Auxiliary Circuitry
For the on-line routing of a message set, the switching structure of the pruned butterfly needs to be augmented with some circuitry supporting auxiliary functions such as buffering, counting, sorting, and partial-sum computation. Each leaf node of the pruned butterfly has a bit-serial interface with a processor. The processor stores a set of messages, each consisting of a record with an $O(\log N)$-bit information field, a $\log N$-bit destination field, and a $\log\log N$-bit peak-level field, storing the level of the minimum common ancestor of the message's current position and destination. The messages are kept in a priority queue organized by peak-level (minimum level at the top), which can receive, every $O(\log N)$ bit steps, either an insert or a deletemin instruction from the leaf.
A leaf node is responsible for initializing the peak-level of the messages originating at the attached processor, and for updating the peak-level of a message before inserting it into the queue. A leaf also maintains $\log N$ counters, each storing the number of messages in the queue with a given peak level. In addition, a leaf is equipped with a comparator for $\log N$-bit numbers and a circuit to compute $a \bmod b$, where $a$ and $b$ are $\log N$-bit numbers. The leaf can be laid out in a square region of side length $O(\log N)$, and can perform any of the mentioned operations in $O(\log N)$ time.
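The behavior of the leaf's message store (a priority queue ordered by peak level, together with one counter per level) can be illustrated in software. The sketch below is only a functional model under the assumption that messages are opaque values; the class name `LeafQueue` is hypothetical, and nothing here reflects the bit-serial hardware or its timing.

```python
import heapq
from collections import Counter


class LeafQueue:
    """Priority queue of messages keyed by peak level, with one counter per level."""

    def __init__(self):
        self._heap = []         # entries: (peak_level, sequence_number, message)
        self._seq = 0           # tie-breaker, so messages themselves are never compared
        self.count = Counter()  # count[level] = number of queued messages with that peak level

    def insert(self, peak_level, message):
        heapq.heappush(self._heap, (peak_level, self._seq, message))
        self._seq += 1
        self.count[peak_level] += 1

    def deletemin(self):
        """Remove and return (peak_level, message) for a message of minimum peak level."""
        peak_level, _, message = heapq.heappop(self._heap)
        self.count[peak_level] -= 1
        return peak_level, message
```

Among messages with equal peak level, this model returns them in insertion order; the text does not specify a tie-breaking rule, so that choice is an assumption.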
By adding a single-bit full adder and an $O(\log N)$-bit shift register to each node of the pruned butterfly tree, and using a straightforward bit-pipelined version of the tree implementation of prefix computation algorithms [DS82, BP89], we have: [...]

Lemma 3 Let $T$ be an $n$-leaf subtree of the pruned butterfly routing network, and suppose a message set $M$ of cardinality at most $\sqrt{n}$ is initially at the root of $T$. Then in $O(\log N)$ time the message set can be sorted by destination and output at the root of $T$.
An internal node at the root of an $n$-leaf subtree can be laid out in area $O(n + \log N)$, and a leaf can be laid out in area $O(\log^2 N)$, so an H-tree layout of the entire circuit takes area $O(N \log^2 N)$. We now consider two useful forms of data movement on the pruned butterfly, called compression and expansion. A set of messages $m_0, m_1, \ldots, m_{s-1}$, where $m_h$ has source at leaf $l_h$ and destination at root $r_h$, is said to form a compression if, for $h = 1, \ldots, s-1$, $l_{h-1} \le l_h$ and $r_h = h \bmod n$. In an expansion, the roles of leaves and roots are reversed.
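The compression condition is easy to check mechanically. The following Python sketch tests it literally as stated, for h = 1, ..., s-1, assuming the sources and destinations are given as two integer lists and that `n` is the modulus from the definition; the function name is illustrative.

```python
def is_compression(leaves, roots, n):
    """Check l_{h-1} <= l_h and r_h == h mod n for h = 1, ..., s-1.

    leaves[h] is the source leaf of message m_h, roots[h] its destination root.
    """
    s = len(leaves)
    if len(roots) != s:
        return False
    return all(leaves[h - 1] <= leaves[h] and roots[h] == h % n
               for h in range(1, s))
```

For instance, `is_compression([0, 0, 2, 5], [0, 1, 2, 0], 3)` holds, whereas a message sequence whose sources are not in nondecreasing order fails the test.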
Let us consider the greedy routing strategy where, at each step, a message $m_h$ tries to advance on its unique source-to-destination path and gets dropped if it conflicts with a message $m_k$ where $k < h$. This simple strategy has the following useful property. [...]

Corollary 1 If the load factor of a compression does not exceed 1, the compression is routed by the greedy strategy without conflicts.
An interesting property of the $N$-leaf pruned butterfly (reportedly [GL89] already observed by Leiserson and Leighton) is that it embeds an $N$-leaf mesh of trees with constant dilation and load. Therefore, the $\Omega(N \log^2 N)$ lower bound for the area of the mesh of trees transfers to the pruned butterfly. An $O(N \log^2 N)$ layout of the latter graph is easily achieved by the H-tree method, as shown in Figure 3. Moreover, mesh of trees algorithms can be readily adapted to the pruned butterfly.
Routing Algorithm
Routing is performed by a sequence of $\log N$ stages: stage $0, \ldots,$ stage $\log N - 1$. Let $M_i$ be the set of messages with peak at level $i$ at the beginning of stage $i$. During stage $i$ each message of $M_i$ is moved to a new leaf, possibly different from the final destination, but always lowering the message's peak. A message not routed to its true destination is said to be sidetracked, and is processed again in the stage associated with its new peak. A crucial property maintained by the algorithm is that, for each $i$, $\lambda(M_i)$ is $O(\lambda(M))$.
Stage $i$ is conveniently described in terms of the activity of a generic subtree $T$ with root at level $i + 1$. Such a subtree will interact only with its sibling $T'$, to which it will send and from which it will receive some messages. Let $M_T$ be the set of messages in $M_i$ with source in $T$, and let [...] A bit is prepended to each message, indicating whether the message is active or sidetracked. The bit is initialized to "active". A direction bit is associated with each pruned butterfly vertex and initialized to "left". Routing is bit-serial. Decisions are decentralized and made on-the-fly by individual vertices of the pruned butterfly according to the following rules. (a) If there is no conflict, an active message is routed toward its destination. (b) If two active messages compete for the same edge, then the one with the larger destination is sidetracked and the corresponding bit is set. (c) An active message has precedence over a sidetracked one. (d) Each time a vertex receives a single sidetracked message from its parent(s), it sends it down the edge indicated by the direction bit of that vertex, and toggles the bit.
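Rules (b), (c), and (d) can be modeled by a small piece of Python. This is a sketch of the local decision logic only, with hypothetical names (`Message`, `arbitrate`, `Vertex`); it does not model the bit-serial transmission, and the handling of two competing sidetracked messages and of equal destinations is an assumption, since the rules above do not cover those cases.

```python
from dataclasses import dataclass


@dataclass
class Message:
    dest: int
    sidetracked: bool = False  # the prepended status bit; False means "active"


def arbitrate(m1, m2):
    """Return (winner, loser) for two messages competing for the same edge."""
    if m1.sidetracked != m2.sidetracked:
        # Rule (c): an active message has precedence over a sidetracked one.
        return (m2, m1) if m1.sidetracked else (m1, m2)
    if not m1.sidetracked:
        # Rule (b): both active, so the larger destination is sidetracked.
        # Ties go to m1 (assumption; the text does not discuss equal destinations).
        winner, loser = (m1, m2) if m1.dest <= m2.dest else (m2, m1)
        loser.sidetracked = True
        return winner, loser
    # Both sidetracked: not specified in the text; keep the given order (assumption).
    return m1, m2


class Vertex:
    """Holds the direction bit used by rule (d)."""

    def __init__(self):
        self.direction = "left"

    def next_edge_for_sidetracked(self):
        """Return the edge for a sidetracked message, then toggle the direction bit."""
        edge = self.direction
        self.direction = "right" if edge == "left" else "left"
        return edge
```

The alternation of the direction bit is what spreads consecutive sidetracked messages evenly over the two children of a vertex.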
During stages of the algorithm before stage $i$, parts of some message batches destined for $T'$ could become sidetracked at the parent of the root of $T'$, increasing the number of messages that must leave $T$ during stage $i$. Using the version of Lemma 1 for expansions one can show that no such batch part increases $\lambda_a(M_T)$ unless the root channel of $T'$ is completely utilized by active messages. Since $\lambda(M)$ is an upper bound on the number of times that channel can be completely utilized, we have:
Lemma 5 For each subtree $T$, $\lambda_a(M_T)$ is $O(\lambda(M))$.
An inductive argument using the direction bit policy gives the following lemma about the even distribution of sidetracked messages.
Lemma 6 No processor stores more than $2\lambda(M)(\log N + 1)$ messages at any time during the execution of the routing algorithm.
Combining the results of this section, we arrive at the following theorem. 
Deterministic On-line Routing on the Sorting Fat-tree
The partition of messages by peak level in the pruned butterfly may unnecessarily serialize the routing of subsets of messages which use different channels and hence could be routed simultaneously. The sorting fat-tree, to be described next, circumvents this problem by first bringing all messages to their peaks and storing them. Unlike the pruned butterfly, a given node $v$ of the sorting fat-tree can then reorganize all of the messages with peak $v$ for more efficient transmission down to their destinations. This strategy leads in certain cases to an optimal routing time of $O(\lambda \log N)$. However, the strategy requires that all messages be present in the routing network simultaneously and hence limits the class of message sets that can be handled. Subsection 3.1 describes the network and Subsection 3.2 the routing algorithm.
The Sorting Fat-Tree
The sorting fat-tree is an $N$-terminal routing network whose structure is a hybrid of a fat-tree and a mesh. Groups of $\log^2 N$ terminals are interconnected by $\log N \times \log N$ two-dimensional meshes. Each terminal in a mesh is allocated a square region with side length $O(\log N)$. Thus each mesh occupies a square region with side length $O(\log^2 N)$. The $N/\log^2 N$ meshes are placed at the leaves of a fat-tree laid out in the H-tree style. The internal node of the fat-tree at the root of each subtree with $n$ terminals is connected to its parent by a channel of capacity $\sqrt{n}$. This node is also equipped with a sorting circuit, to be described next.
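As a quick consistency check on these parameters, the total area occupied by the meshes alone can be worked out as follows. The calculation ignores the constants hidden in the $O$-notation and does not account for the area of the fat-tree channels and internal nodes, so it is only a partial check of the overall area bound.

```latex
\[
\underbrace{\frac{N}{\log^2 N}}_{\text{number of meshes}}
\;\times\;
\underbrace{\bigl(O(\log^2 N)\bigr)^2}_{\text{area of one mesh}}
\;=\;
\frac{N}{\log^2 N} \cdot O(\log^4 N)
\;=\;
O(N \log^2 N).
\]
```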
The sorting circuit placed at a given node of the fat-tree must satisfy certain performance requirements, as indicated by the following considerations. Roughly speaking, to achieve an overall routing time proportional to the message set load factor, it is desirable that each node of the fat-tree operate in time proportional to the load factor of its incident channels or, equivalently, to the size of the set of messages that must traverse the node. Since an important step of our algorithm consists in sorting this set, we need a sorter that can operate in time proportional to its input size. Qualitatively, we will call such a circuit a flexible sorter.
Although VLSI sorting has been investigated extensively, the circuits proposed in the literature for sequences of length $r$ between, say, $s$ and $l$ take, for every value of $r$, the time corresponding to $l$ and are hence not flexible. Combining ideas from several known constructions [Ben65, Wak68, Law75, BP85, Lei85a], we have designed a flexible sorter. While the details of the design are left for the full paper, the result obtained is stated in the following theorem. [...]

Some auxiliary circuitry, similar to that described in Subsection 2.2, is required to support the routing algorithm, and can be added without increasing the order of the area.
Routing Algorithm
The algorithm of this section works for any message set M of constant degree. Since such a message set can be partitioned into a constant number of partial permutations, for simplicity, we describe the algorithm in this special case. The following list of steps outlines the routing algorithm.
1. Messages with source and destination in the same mesh are routed by a standard technique.
2. Messages that must be sent between meshes are partitioned into $\log N$ batches.
3. Within each mesh, messages are reorganized into row major order by batch number.
4. The batches are routed to their peaks consecutively.
5. All the messages with the same peak are sorted in order of destination.
6. The messages are again partitioned into $\log N$ batches.
7. The batches are routed down to their destination meshes.
8. Within each mesh, the messages are routed to their final positions.
Step 1 can be accomplished in $O(\log N)$ time by standard mesh routing techniques based on sorting and prefix computation.
Step 2 is more involved. We will first describe the partition, then how the network computes it.
Let $\bar{M}$ be the set of messages that must be routed between meshes. Recall that the ascending load factor $\lambda_a(\bar{M}, c)$ is defined to be the load factor of the source-to-peak movement of $\bar{M}$. Let $M_i$ be the set of messages in $\bar{M}$ with their peak at level $i$ ($0 \le i < \log N - 2\log\log N$). Assign to each message in $M_i$ a rank between $0$ and $|M_i| - 1$ according to increasing order of the message sources. The set $M_i$ is then partitioned according to rank into $M_{i,0} \cup M_{i,1} \cup \cdots \cup M_{i,\log N - 1}$, where $M_{i,j}$ is the set of messages in $M_i$ whose rank modulo $\log N$ is $j$. Then the $\log N$ batches of the partition are given by $\bar{M}_j = \bigcup_i M_{i,j}$, for each $0 \le j < \log N$.
To compute the partition of $\bar{M}$ just described, the network first computes for each message its peak level $i$. This is a simple matter of comparing the source and destination bits and can be done for all messages in parallel in $O(\log N)$ time. The network then prepends $i$ to the message's destination field to simplify later routing operations. Computing the rank of each message within the appropriate $M_i$ can be done with a prefix operation for each $M_i$. Since each prefix operation takes $O(\log N)$ time, all the ranks, and hence all the batch numbers, can be assigned to the messages in $O(\log^2 N)$ time.
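The partition into batches can be written as a short sequential Python sketch. It is a functional model only, under several assumptions not taken from the text: messages are (source, destination) pairs over the leaves of a plain complete binary tree, the peak level is the height of the lowest common ancestor computed from the bits where source and destination differ, and the number of batches is passed in as a parameter (the text uses $\log N$). The function names are hypothetical, and the parallel prefix computation is replaced by ordinary sorting.

```python
def peak_level(src, dst):
    """Height of the lowest common ancestor of two leaves: position of the highest differing bit."""
    return (src ^ dst).bit_length()


def partition_into_batches(messages, num_batches):
    """Assign each message to batch (rank within its peak level) mod num_batches."""
    by_peak = {}
    for msg in messages:
        by_peak.setdefault(peak_level(*msg), []).append(msg)
    batches = [[] for _ in range(num_batches)]
    for level in sorted(by_peak):
        # Rank by increasing source; in a partial permutation sources are distinct.
        for rank, msg in enumerate(sorted(by_peak[level])):
            batches[rank % num_batches].append(msg)
    return batches


example_batches = partition_into_batches([(0, 1), (2, 3), (4, 7), (1, 6)], num_batches=2)
```

In this example the two messages with the lowest peak are split across the two batches, while the single messages at the higher peaks both receive rank 0 and fall into the first batch.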
The purpose of Step 3 is to prepare the messages so that when they are routed upward, the entire channel will be busy at once: there will be no gaps between messages. Since each mesh holds $\log^2 N$ terminals and the capacity of a channel is the square root of the number of terminals below it, the channel that joins a given mesh to the fat-tree has one wire for each column of the mesh. Regardless of the mesh's orientation in the layout, for the purposes of defining row major order, we will consider the top of the mesh to be the side to which the channel is attached. Steps 3 and 8 are similar to Step 1 and can be accomplished in $O(\log N)$ time.
During Step 4 the batches of messages are routed in order from their source meshes up the tree to their peaks. For simplicity of presentation and analysis, we assume that the nodes of the fat-tree proceed in lockstep, under the control of a global synchronizer. A more asynchronous mode of operation would be more efficient, but also more involved to explain.
Step 4 can be viewed as a sequence of stages numbered consecutively. In the even stages, each node at an even level receives the messages of a given batch from its children, reorganizes them, and sends them to its parent. In the odd stages, nodes at odd levels do the same. A stage is completed only when all the nodes have finished their jobs.
Before the messages of a given batch are sent across any channel in the fat-tree, they are grouped for efficient transmission into sets of size cap($c$) called waves. Each wave is sent across the channel in time proportional to the message length, with one message on each wire of the channel. A set $B$ of messages is sent across channel $c$ in time $O((|B| \log N)/\mathrm{cap}(c)) = O(\lambda(B, c) \log N)$.
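The grouping into waves and the resulting transmission time can be illustrated with a few lines of Python. This is a sketch under the assumption that every message has the same length `message_bits` (of order $\log N$ in the text) and that one wave occupies the channel for exactly that many bit steps; the function names are illustrative.

```python
import math


def split_into_waves(batch, capacity):
    """Group a batch into consecutive waves of at most `capacity` messages each."""
    return [batch[k:k + capacity] for k in range(0, len(batch), capacity)]


def transmission_time(num_messages, capacity, message_bits):
    """Bit steps needed to send the messages across the channel, one wave at a time."""
    return math.ceil(num_messages / capacity) * message_bits


example_waves = split_into_waves(list(range(7)), capacity=3)
example_time = transmission_time(7, capacity=3, message_bits=10)
```

With 7 messages and a channel of capacity 3, three waves are needed, the last one only partially filled.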
At each stage, after the node has received all messages in the current batch from both children, it separates messages that have their peak at the node from those that must be sent up to the parent. This is easily accomplished by sorting on the peak field previously prepended to the message destinations. As the messages exit the sorter, they are already properly organized into waves and are immediately sent to the parent node. The messages currently at their peaks are shifted into the unused portion of the buffer where they remain until Step 5.
A fat-tree node with parent channel c can execute the sorting and transmission operations on batch j in 
Area-Universal Circuits
Intimately related to universal routing is the subject of universal circuits. Such circuits can be "programmed" to simulate any other circuit of slightly smaller area. The programming is done off-line, and takes the form of loading some control registers that are thereafter treated as read-only registers. A universal circuit can be based on a universal router whose messages take paths corresponding to wires of the circuit. An upper bound on the area of the simulated circuit translates into an upper bound on the load factor of the message set. To obtain an efficient solution we have refined this strategy to take advantage of the fact that only off-line routing is needed, and that addresses can be eliminated from the messages, reducing their length to one bit. The nodes of the square regions of the universal network are programmable finite-state machines with inputs from and outputs to their mesh neighbors. A piece of wire of unit length is simulated by implementing the identity function between the appropriate I/O ports of the corresponding finite-state machine. To simulate one step of the square regions, the universal circuit executes a number of steps proportional to the length of the longest wire in those regions. The full paper proves a technical lemma showing that the length of the longest wires in the regions can be reduced to $O(\log A)$ by a modification of the layout that increases the area by at most a constant factor.
Theorem 4
The capacities of the fat-tree channels are $\log A$ for each of the $A/\log^2 A$ leaves, and for the first $2\log\log A$ levels going toward the root. Then the capacity doubles every other level, reaching the value $\sqrt{A}/\log A$ at the root. Let $M$ be the set of single-bit messages corresponding to the signals on the wires joining different square regions of the simulated circuit. By standard bifurcator techniques it is easy to see that [...]
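The root capacity quoted in this construction can be checked with a short calculation. The sketch below assumes that logarithms are base 2, that the tree has $\log(A/\log^2 A)$ levels above its $A/\log^2 A$ leaves, and that rounding is ignored.

```latex
\begin{align*}
\text{levels above the leaves}
  &= \log\frac{A}{\log^2 A} = \log A - 2\log\log A, \\
\text{levels on which the capacity grows}
  &= (\log A - 2\log\log A) - 2\log\log A = \log A - 4\log\log A, \\
\text{capacity at the root}
  &= \log A \cdot 2^{(\log A - 4\log\log A)/2}
   = \log A \cdot \frac{\sqrt{A}}{\log^2 A}
   = \frac{\sqrt{A}}{\log A}.
\end{align*}
```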
Lower Bounds
Let $R$ be an $N$-terminal router of area $A$. By bifurcator-based techniques, the terminals of $R$ can be labeled in such a way that the time $T$ to route $M$ satisfies the tradeoff $AT^2 = \Omega(b^2 \eta^2(M) N)$, where $\eta(M)$ is the reference load factor. This lower bound captures the bandwidth constraints imposed by a certain set of cuts of the network. It is natural to ask whether there is a router that can achieve this lower bound, delivering any message set with performance $AT^2 = O(b^2 \eta^2(M) N)$.
The answer is negative, at least for a wide class of routers, as stated by the following theorem. [...]

(2) Combining (1) and (2) and applying the result to $\mathcal{M}(G, a, N/a)$, we obtain $AT^2 = \Omega(b^2 \eta^2 N \log^2(N/\eta^2))$.
An argument showing that the reference load factor of $\mathcal{M}(G, a, N/a)$ is $\Theta(\eta)$ concludes the proof. [...] Therefore the result of Theorem 3 is existentially tight for the class of constant-degree message sets with reference load factor $\eta$, whenever $\Omega(\log N) \le \eta \le O(N^{1/2-\delta})$, for some fixed constant $\delta > 0$.
Concluding Remarks
The main problem left open by this research is to design an $N$-terminal network that can route any message set in $O(\lambda \log N)$ time and $O(N \log^2 N)$ area, matching the lower bound of Theorem 5.
We conjecture that such an optimal router would combine features from both of our fat-trees. Specifically, in order to achieve logarithmic diameter, it is desirable that, as in the pruned butterfly, the internal nodes of the tree be constant-depth circuits (perhaps of the expander type). Moreover, the routing algorithm should simultaneously deal with messages that have peaks at different levels, as in our sorting fat-tree.
