We investigate implementations of butterfly networks. Obvious mappings of network nodes to chips lead to implementations with expensive wiring. We consider Ranade's butterfly routing algorithm. For this algorithm, we present a new mapping of network nodes to chips. This mapping needs only half the number of chips and links between chips. The chips' interconnections still form a butterfly network.
Introduction
Interconnection networks are important parts of parallel architectures, and therefore should provide high throughput with small delay. All network nodes should be identical. They should have constant degree, and the routing algorithm should only require constant length buffers per node. Then one kind of network node can be used to build networks of arbitrary size. A network that meets these requirements is the butterfly network.
Definition 1 A butterfly network with $n = 2^u$ inputs and outputs is a graph $G_n$ that consists of $u+1$ stages with $n$ nodes per stage. $G_1$ consists of a single node. $G_{2n}$ can be constructed by taking two copies of $G_n$ and $2n$ additional nodes that form the last stage of $G_{2n}$. Node $i$, where $0 \le i < n$, in the last stage of the smaller butterflies is connected to nodes $i$ and $i + n$ in stage $u + 1$.

This work was partly funded by the German Science Foundation (DFG) in SFB 124, TP D4. D. Cross is currently with Mentor Graphics Corporation, 1001 Ritter Park Drive, San Jose 95131.
The construction is shown in fig. 1.
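The recursive construction of Definition 1 can be sketched in a few lines of Python. This is only an illustration: the function name `butterfly_edges`, the representation of a node as a `(stage, index)` pair, and the numbering of stages from 0 to u are assumptions that do not appear in the text.

```python
def butterfly_edges(u):
    """Return the edge set of the butterfly with n = 2**u inputs.

    Nodes are (stage, index) pairs, stages numbered 0..u (an assumption).
    An edge joins a node in stage s to a node in stage s + 1.
    """
    if u == 0:
        return set()              # a single node, no edges
    half = 2 ** (u - 1)           # size n of the two smaller butterflies
    sub = butterfly_edges(u - 1)
    edges = set()
    for copy in (0, 1):
        off = copy * half         # index offset of this copy
        # Keep the internal edges of the copy, shifted by its offset.
        for (s, a), (t, b) in sub:
            edges.add(((s, a + off), (t, b + off)))
        # Node i in the last stage of the copy is connected to
        # nodes i and i + n of the new last stage.
        for i in range(half):
            edges.add(((u - 1, i + off), (u, i)))
            edges.add(((u - 1, i + off), (u, i + half)))
    return edges
```

Under these assumptions, every node outside the last stage has exactly two outgoing edges, so the network with n = 2^u inputs has 2nu edges.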
Ranade was the first to develop a randomized packet routing algorithm for butterfly networks that meets the requirement of constant length buffers [6]. (Pippenger published a similar result in [5], but his algorithm could end in a deadlock.)
Ranade used his algorithm to design a very elegant emulation of a shared memory parallel machine on a processor network [7]. The emulation overhead is $c \log n$. A reengineered version of his emulation was shown to have an emulation overhead where $c$ is very small [2], making the emulation interesting for practical use. Prior to [2], shared memory emulations were thought to be impractical because of the large constant factors involved. Because shared memory parallel machines are easier to program than distributed memory machines, they could become a serious competitor to the latter, if there are prac-
This reduces delays on wires and allows for an increase in speed. Engineering aspects such as cooling and power supply are also simplified.
In section 2 we will sketch the design of a network node that implements Ranade's routing algorithm and will discuss the problem of pin count restriction that leads to mapping $M$. Section 3 shows how a slightly different mapping $M'$ doubles gate utilization of the chips. This implies that we need only half the number of chips and links between chips. We finish with the proof that mapping $M'$ preserves the interconnection structure of the chips.
Original network nodes
We assume that packets consist of an address specifying the output, one word of data, and control information. At the beginning, log n packets are fed into each input. These packets are sorted by address, and the sorted order is kept during routing.
A node behaves as follows: If two packets are waiting in its input buffers, the packet with the smaller address is transmitted. The address also specifies the output along which the packet has to be sent.
If one input buffer is empty, the packet in the other buffer has to wait until it can be sure that no packet with a smaller address will arrive at the empty input in the future. Otherwise, the order would be destroyed. Each time a packet is transmitted along one output, a GHOST packet is transmitted along the other output. The GHOST packet also carries the packet's address, but has different control information. A GHOST packet arriving in an input buffer guarantees that only packets with an address larger than the GHOST's address will arrive on that input in the future.
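The node behaviour described above can be illustrated with a minimal Python sketch of one clock tick. Several details are assumptions made for the example and are not specified in the text: the names `Packet` and `node_step`, the parameter `bit` that selects which address bit chooses the output, the rule that ties go to input 0, and the rule that a GHOST winning the comparison is passed on as a GHOST on both outputs.

```python
from collections import deque
from typing import Deque, NamedTuple, Optional, Tuple


class Packet(NamedTuple):
    address: int
    ghost: bool = False  # True for a GHOST packet


def node_step(in0: Deque[Packet], in1: Deque[Packet], bit: int
              ) -> Tuple[Optional[Packet], Optional[Packet]]:
    """Simulate one tick; return the packets put on (output 0, output 1)."""
    if not in0 or not in1:
        # One buffer is empty: a smaller address might still arrive
        # there, so the waiting packet must not be sent yet.
        return None, None
    # Transmit the head packet with the smaller address
    # (assumption: ties are resolved in favour of input 0).
    src = in0 if in0[0].address <= in1[0].address else in1
    pkt = src.popleft()
    ghost = Packet(pkt.address, ghost=True)
    if pkt.ghost:
        # Assumption: a winning GHOST is forwarded on both outputs.
        return ghost, ghost
    # Assumption: address bit number `bit` selects the output.
    out = (pkt.address >> bit) & 1
    return (pkt, ghost) if out == 0 else (ghost, pkt)
```

For instance, a packet with address 5 waiting alone in one buffer is held back; once a GHOST with address 9 arrives in the other buffer, the packet is sent and a GHOST with address 5 leaves on the other output.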
Some of the packets can contain requests for a device on an output of the network to send back an answer, e.g. requests to a memory module to read the contents of a cell. The answers only consist of data. Their return path through the network is determined by keeping track of the requests' path. This is done by maintaining a direction queue in each node. For each request that passes the node, input and output are recorded. As the sorted order guarantees that answers will arrive in the same order as the requests passed, this information can be used to send back the answers [6].
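A direction queue of this kind might be sketched as a simple FIFO. The class name `DirectionQueue`, its method names, and the choice to make an answer wait when it is not the next one expected are assumptions for illustration only.

```python
from collections import deque


class DirectionQueue:
    """FIFO of (input, output) pairs, one per request that passed the node."""

    def __init__(self):
        self._q = deque()

    def record(self, inp, out):
        # Called when a request enters on `inp` and leaves on `out`.
        self._q.append((inp, out))

    def route_answer(self, arrived_on):
        """Return the input to send the answer back on, or None to wait."""
        if not self._q:
            return None
        inp, out = self._q[0]
        if out != arrived_on:
            # Assumption: an answer that is not next in request order waits.
            return None
        self._q.popleft()
        return inp
```

Because answers carry no address, the recorded order is the only routing information the node needs for the return path.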
A network node that implements these functions is shown in figure 2. We will ignore the dashed line in this section. In practice, each address and data word will consist of 32 bits, and the control information will consist of 8 bits. In order to have a high speed transmission, each node-to-node link should be capable of transmitting one packet and one answer per clock tick. To achieve this, the link must have a width of 72 bits in one direction to transmit a whole packet and 32 bits in the opposite direction to transmit an answer. Each link then has a total width of 104 bits. Such a setup causes severe problems due to pin restrictions if one node is implemented on a single chip. As one node has four outgoing links, it needs $4 \times 104 = 416$ pins to obtain full speed.
Custom chips can be obtained with up to $p = 240$ pins at reasonable prices. This forces sending packets in two pieces, yielding $416/2 = 208$ pins. Compared to pin count, however, gate utilization is very low. Thus, most of the silicon on the network chips is wasted.
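The pin count arithmetic can be checked with a short Python sketch. The constants are the figures given in the text; the function names and the idea of searching for the smallest number of pieces are assumptions added for illustration.

```python
ADDRESS_BITS = 32
DATA_BITS = 32
CONTROL_BITS = 8
LINKS_PER_NODE = 4
MAX_PINS = 240  # p, the pin limit for affordable custom chips


def link_width():
    """Width of one link: a packet one way plus an answer the other way."""
    packet = ADDRESS_BITS + DATA_BITS + CONTROL_BITS  # 72 bits
    answer = DATA_BITS                                # 32 bits
    return packet + answer                            # 104 bits


def pins_per_node(pieces=1):
    """Pins needed when every transfer is split into `pieces` parts."""
    total = LINKS_PER_NODE * link_width()
    return -(-total // pieces)  # ceiling division


def min_pieces(max_pins=MAX_PINS):
    """Smallest number of pieces that fits within the pin limit."""
    pieces = 1
    while pins_per_node(pieces) > max_pins:
        pieces += 1
    return pieces
```

With these values, `pins_per_node(1)` gives 416 and `pins_per_node(2)` gives 208, so two pieces is the smallest split that fits within 240 pins.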
To increase the gate/pin ratio of the network [...]

Although the node has two inputs and two outputs, it can be cut into two halves such that only one link crosses the cut. This is due to the fact that only one packet is transmitted at a time. The cut is shown as a dashed line in figure 2. We now implement a $2 \times 2$ butterfly in one chip but take only the lower part of the nodes of the first stage and the upper part of the nodes of the second stage. [...]

Step: Assume that the claim holds for some $t = n/2$. We show that $G'_n = G_{n/2}$. To do that we recall the inductive definition of a butterfly network $G_n$. By the induction hypothesis we know that we can shrink both subnetworks $G_{n/2}$ to butterflies $G_{n/4}$. Shrinking the subnets $G_2$ of the last stage of $G_n$ results in $n/2$ nodes in the last stage of $G'_n$. We now know that $G'_n$ is constructed by taking two subnets $G_{n/4}$ and $n/2$ nodes as an additional stage. We only have to prove that the $i$th and the $(i + n/4)$th nodes in this last stage are connected to the $i$th nodes in the last stages of the two subnets $G_{n/4}$, $0 \le i < n/4$. Then $G'_n = G_{n/2}$. But this is obvious from the inductive definition of $G_n$ if we look at how $G_n$ is constructed from $G_{n/4}$ subnets. The $i$th subnets $G_2$, $0 \le i < n/4$, in the last stages of both graphs $G_{n/2}$ are connected to subnets $G_2$ with numbers $i$ and $i + n/4$ in the last stage of $G_n$. $\Box$
