Abstract
Introduction
Interconnection networks play an important role in providing low-latency, high-bandwidth communication in multicomputers. Examples of interconnection networks used in commercial machines are the IBM SP2 multistage interconnection network [1], the Cray T3D 3-dimensional torus [2, 3], the Connection Machine fat tree [4, 5], and the Intel Paragon mesh [6]. Routing in an interconnection network can be classified as adaptive or non-adaptive depending on the dynamics of route selection. In non-adaptive (or oblivious) routing, there is a fixed routing decision at each intermediate switching element (switch) along a path between a source node and a destination node: each switch can use only one output port for forwarding a message packet. Adaptive routing methods allow more than one choice of output port. Switches try to minimize network contention by exploring alternate routes to destinations [5, 7, 8, 9]. On the other hand, some networks employ oblivious routing methods such as the source-based routing (source routing) used in the SP2 [1, 10, 11] because of their flexible choice of network topology and simplicity of switch design. In the source routing method, the packet route is deterministic and is completely determined at the source processor, which encodes the routing instructions in the packet header.

Although many adaptive routing networks have been constructed and proposed to date, they have all been destination-based adaptive routing networks. This paper presents the first attempt to combine source routing and adaptive routing, referred to as the adaptive source routing (ASR) method. The proposed combination has the advantages of both methods. The route and the adaptivity of each packet are determined at the source processor node. Every packet can be routed in a fully adaptive, partially adaptive, or oblivious manner, all within the same network at the same time.
The ASR method also provides support for multiple types of network traffic, in-order delivery of multiple packets, and network partitioning.
In the following, we give an overview of adaptive, destination-based, and source-based routing methods.
In Section 2, we describe the proposed Adaptive Source Routing method. In Section 3, we present a performance comparison of the ASR networks and the oblivious routing networks by simulations. In Section 4, we present a route generation algorithm that finds maximally adaptive routes between the processor nodes. This algorithm enables the ASR method.
Background
In adaptive routing networks, message packets make use of multiple paths between source-destination node pairs [7]. Switches alleviate congestion by sending packets via less busy alternate routes. Typically, a busy output port will cause an adaptive routing switch to use another output in routing a packet to its destination. In a (destination-based) adaptive routing network, a switch element must therefore "know" which of its outputs lead to the intended destination. As a result, a common characteristic of many adaptive routing networks is a regular and simply described network topology such as a hypercube, mesh, k-ary n-cube, or fat tree [4, 5, 7, 8, 9]. The switches then have implicit knowledge of the entire network topology and can route packets accordingly. A disadvantage of adaptive routing is that it limits the choice of network topologies. In an alternative approach, each switch may have a routing table that maps destination processor addresses to switch port numbers; however, this table occupies real estate on the switch chips, and the bounded size of the tables may limit the scalability of the networks.

Proceedings of IPPS '96. 1063-7133/96 $5.00 © 1996 IEEE
In destination-based routing, the address (e.g., the position) of the destination processor in the network, or an address difference between source and destination, is encoded in the packet header, and the network then decides how to route the message. The source processor has no influence on the routing decisions. This method also requires switches to have global knowledge of the network topology in order to route packets correctly. Some examples of destination-based networks are the CM-5 fat tree [5] and the Intel Paragon 2-dimensional mesh [6].
In the source routing method, unlike destination-based routing, switches need not know the topology; the source processor determines the route and encodes the routing instructions in the packet header. Switches then follow these instructions to forward the packet to its destination. The Cray T3D [3] and IBM SP2 [1] systems are based on source routing networks. For example, in the SP2 multistage network, which consists of 8 x 8 switches, the packet header initially contains 3-bit routing words $R_1, R_2, \ldots, R_n$, where n is the number of network stages to travel. Each word indicates a switch port numbered from 0 to 7. The source processor determines the route and puts the respective words in the header. Each switch forwards the packet through the output port indicated in the first route word and strips off that word before forwarding the packet to the next level in the network [1]. Thus, the packet contains no routing information upon arriving at its destination. In the source routing method, routing headers are typically computed only once and then kept in a route table in each processor node. The route table approach enables faulty links and switches to be mapped out easily, allows more choices of network topology than destination-based routing, and allows multiple routes to be defined per destination.
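The header handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header is modeled as a plain list of port numbers rather than packed 3-bit fields, and the function names `encode_route`, `switch_forward`, and `deliver` are hypothetical, not taken from the SP2 design.

```python
def encode_route(ports, bits_per_word=3):
    """Build a source-routing header: one route word (a port number) per stage."""
    limit = 1 << bits_per_word  # 3-bit words address ports 0..7
    for port in ports:
        if not 0 <= port < limit:
            raise ValueError(f"port {port} does not fit in {bits_per_word} bits")
    return list(ports)


def switch_forward(header):
    """Model one switch: read the first route word, then strip it off."""
    if not header:
        raise ValueError("packet has no routing words left")
    return header[0], header[1:]


def deliver(header):
    """Pass the packet through successive switches until the header is empty."""
    trace = []
    while header:
        port, header = switch_forward(header)
        trace.append(port)
    return trace, header


# A hypothetical 4-stage route: ports 5, 2, 7, 0 at successive switches.
trace, remaining = deliver(encode_route([5, 2, 7, 0]))
```

After the last stage, `remaining` is empty, which mirrors the observation that the packet carries no routing information when it arrives.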
Architecture
In the adaptive source routing method proposed here, the key idea is in the definition of the routing words in a packet header. Each routing word indicates a set of permitted output ports, rather than a specific output port. Each m-bit word has the format

$$b_{m-1}\, b_{m-2} \cdots b_1\, b_0$$

where m is the number of switch ports. One bits in the routing word indicate the set of outputs that the switch is permitted to use for forwarding the packet: bit $b_i$ is set when output port $i$ is permitted. The source processor is responsible for encoding the correct routing instructions in the header, as in the source routing method.
Each switch examines the first word of each packet and (adaptively) selects an unused port from among the permitted outputs to forward the packet to the next network stage. If none of the permitted ports is available, the packet is blocked and cannot proceed until at least one of them becomes available. As before, the switch strips off the first route word before forwarding the packet. For example, in the 32-node network given in Fig. 14, the header of a packet from processor 4 to processor 30 may consist of the words $R_1 = 11110000$, $R_2 = 11110000$, $R_3 = 10000000$, $R_4 = 01000000$. The header indicates to the first-, second-, third-, and last-stage switches that they may forward the packet through one of the four ports 4-7 ($R_1$), one of the four ports 4-7 ($R_2$), port 7 ($R_3$), and port 6 ($R_4$), respectively. In general, the number of distinct paths a packet may follow from source to destination is

$$N_{path} = \prod_{i=1}^{n} |R_i|$$

where $|R_i|$ is defined as the number of one bits in the routing word $R_i$. Not only must $N_{path}$ paths exist between the source and the destination, but any combination of the outputs specified in the successive routing words in the header must correctly lead the packet to its intended destination.
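The following Python sketch illustrates how a switch might interpret an ASR routing word and how the number of distinct paths follows from the header. It assumes the bit ordering implied by the example above (the leftmost bit of an 8-bit word corresponds to port 7, so 11110000 permits ports 4-7); the helper names are hypothetical, and picking the lowest-numbered idle port is a simplification, not the selection policy of any real switch.

```python
import math


def permitted_ports(word, m=8):
    """Ports whose bit is set in the routing word (bit i <-> port i)."""
    return [port for port in range(m) if (word >> port) & 1]


def select_port(word, busy, m=8):
    """Return a permitted port that is not busy, or None if the packet is blocked."""
    for port in permitted_ports(word, m):
        if port not in busy:
            return port
    return None


def n_path(header):
    """Number of distinct paths: product of the one-bit counts of the words."""
    return math.prod(bin(word).count("1") for word in header)


# Header from the example: processor 4 to processor 30.
header = [0b11110000, 0b11110000, 0b10000000, 0b01000000]
paths = n_path(header)                                # 4 * 4 * 1 * 1
choice = select_port(header[0], busy={4, 5})          # ports 6 and 7 remain
blocked = select_port(header[2], busy={7})            # only port 7 was permitted
```

Here `paths` evaluates to 16, matching the product of the one-bit counts, and `blocked` is `None` because the single permitted port is in use.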
Network Performance
We are primarily interested in the effect of adaptive routing on bidirectional multistage (BMIN) topologies similar to the topologies used in the IBM SP2, the Thinking Machines CM-5 [12], and the Meiko CS-2 [13]. Figure 2 illustrates a BMIN with 16 processor nodes and shows sample routes from source node 0 to destination nodes 3 and 10. The 16 ports on the right side are unused in this configuration. The BMIN switches (in this example, devices with 8 inputs and 8 outputs) could be permitted to forward packets from any input port to any output port, including ports on the same "side".
We have seen few studies directly comparing adaptive versus oblivious routing for BMINs, although the CM-5 machine employs destination-based adaptive routing. In addition, we are interested in assessing the effects of adaptive routing when used in combination with switches that incorporate central buffers similar to SP2 switches [1].
To evaluate the performance of adaptive source routing, we conducted network simulations based upon a C++ model of SP2-like switches. These switches implement buffered wormhole routing [1] for flow control and contain a 1 KB dynamically shared central buffer. Under light to medium loading, a switch is typically able to buffer an entire arriving packet when that packet becomes blocked due to output port contention. Thus, in effect, the switch often operates in virtual cut-through [14] fashion, completely removing blocked packets from network links. However, under heavy loading the central buffer may become full, and packets may then be blocked across several switches, just as in wormhole routing [15].
In BMIN networks, adaptive choices can typically be made while the packet is traveling "away" from the processors until it reaches any switch that is a least common ancestor of both the source and the destination node. When more than one output port is both idle and permitted for adaptive routing, our simulations assume the choice of output port is made on a least-recently-granted basis. The path "back" to the destination from the least common ancestor switch is unique. We assume minimal paths: if there exists an h-hop path between source and destination, no path longer than h hops may be traversed for communication between them.
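A least-recently-granted choice among idle, permitted ports can be sketched as follows. The simulator's actual arbitration logic is not described in the text, so this is only one plausible reading: the class name `LRGArbiter` and its priority-list representation are assumptions made for illustration.

```python
class LRGArbiter:
    """Grant the eligible port that was granted longest ago."""

    def __init__(self, num_ports):
        # Ports ordered from least recently granted to most recently granted.
        self.order = list(range(num_ports))

    def grant(self, permitted, idle):
        """Return the least recently granted port that is permitted and idle."""
        for port in self.order:
            if port in permitted and port in idle:
                # Move the winner to the back: it is now most recently granted.
                self.order.remove(port)
                self.order.append(port)
                return port
        return None  # every permitted port is busy, so the packet blocks


arbiter = LRGArbiter(8)
permitted = {4, 5, 6, 7}
first_four = [arbiter.grant(permitted, idle={4, 5, 6, 7}) for _ in range(4)]
fifth = arbiter.grant(permitted, idle={4, 5, 6, 7})
contended = arbiter.grant(permitted, idle={4, 7})
```

With all four permitted ports idle, successive grants rotate through ports 4, 5, 6, and 7, which spreads packets over the available "away" links.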
All simulations assume an open network model containing idealized processor nodes: the nodes contain an infinite transmit queue buffer, and packet flits are immediately pulled from the network as they arrive. We assume an exponential distribution for message injection time (message arrival time). We apply a range of loading to the network, where a load of 1.0 indicates that each node is injecting packets into the network at the maximum link data rate. Latency curves include input queueing time and are not shown after saturation (steady-state latency is infinite after saturation, assuming infinite input queues). The maximum packet size is 255 bytes, and messages longer than 255 bytes are broken into multiple packets before transmission.
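The traffic model above can be approximated with the short Python sketch below. It is not the authors' C++ simulator: the function names are hypothetical, packet header overhead is ignored, the link rate is normalized to one byte per time unit, and the mapping from load to mean interarrival time (message size divided by load times link rate) is an assumption about how a load of 1.0 corresponds to the maximum link data rate.

```python
import random

MAX_PACKET_BYTES = 255  # maximum packet size stated in the text


def packetize(message_bytes):
    """Split a message into packet sizes of at most MAX_PACKET_BYTES."""
    full, rest = divmod(message_bytes, MAX_PACKET_BYTES)
    return [MAX_PACKET_BYTES] * full + ([rest] if rest else [])


def mean_interarrival(message_bytes, load, link_bytes_per_unit=1.0):
    """Mean time between messages so that a node offers the given load."""
    return message_bytes / (load * link_bytes_per_unit)


def arrival_times(count, message_bytes, load, seed=0):
    """Draw exponentially distributed interarrival gaps and accumulate them."""
    rng = random.Random(seed)
    mean = mean_interarrival(message_bytes, load)
    now, times = 0.0, []
    for _ in range(count):
        now += rng.expovariate(1.0 / mean)
        times.append(now)
    return times


packets = packetize(2000)          # seven full packets plus one of 215 bytes
times = arrival_times(5, 500, 0.5)
```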
The open network model makes it possible to "stress" the network to a far greater degree.

For each experiment, we compare adaptive routing with oblivious routing schemes. For instance, in SP2 systems, each node maintains a route table containing 4 valid minimal routes for each destination node. If there are fewer than 4 unique minimal routes, as when the source and destination node are connected to the same switch, then 2 or more of these routes are identical. Choosing among 4 routes reduces the effect of contention and reduces the probability of creating "hot-spots" in the network.
Permutation traffic simulation
In this section we investigate the relative performance of adaptive routing when the communication pattern is a static permutation. We test 2 permutations: bit-reversal and transpose. In bit-reversal, a source processor represented in binary by $s_{n-1} s_{n-2} \ldots s_1 s_0$ sends messages to destination $s_0 s_1 \ldots s_{n-2} s_{n-1}$. In transpose, for even n, the destination is $s_{n/2-1} s_{n/2-2} \ldots s_1 s_0 s_{n-1} s_{n-2} \ldots s_{n/2+1} s_{n/2}$. We simulate 16-way and 64-way BMINs. Adaptive routing attained both the lowest latency and the highest saturation bandwidth for this difficult permutation. For our 1-route oblivious routing, each packet traverses a "straight" path to a least common ancestor switch, and then the packet proceeds on the unique path to the destination. This topology has a maximum of 4 distinct paths between pairs of nodes, and thus 4-route oblivious routing is equivalent to randomized routing for the 16-way topology shown in Figure 2. In general, 1-route oblivious routing performs either very well or very poorly depending on the permutation. Its dismal worst-case performance and high variability make it a poor choice for a general routing strategy, and we will not consider it further in this paper.
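The two permutations defined above can be written compactly on integer node addresses. The sketch below is illustrative only; the function names are hypothetical, and n denotes the number of address bits (n = 4 for a 16-way system, n = 6 for a 64-way system).

```python
def bit_reversal(src, n):
    """Destination whose n-bit address is the source address read backwards."""
    dst = 0
    for i in range(n):
        if (src >> i) & 1:
            dst |= 1 << (n - 1 - i)
    return dst


def transpose(src, n):
    """Destination obtained by swapping the upper and lower halves of the address."""
    if n % 2:
        raise ValueError("transpose is defined here for even n only")
    half = n // 2
    low = src & ((1 << half) - 1)   # bits s_{n/2-1} ... s_0
    high = src >> half              # bits s_{n-1} ... s_{n/2}
    return (low << half) | high


# 16-way system: every source maps to exactly one destination.
reversal_16 = [bit_reversal(s, 4) for s in range(16)]
transpose_16 = [transpose(s, 4) for s in range(16)]
```

Both lists are permutations of the node indices 0 to 15; nodes whose address is a palindrome (bit-reversal) or has equal halves (transpose) send to themselves.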
For this 16-way topology, adaptive routing and 4-route oblivious routing have exactly the same paths available. However, with adaptive routing any packets traveling a 3-hop path are guaranteed not to contend with any other packets while traversing the first switch stage. Why? For this first hop, only 4 input ports (the "left" input ports in Figure 2) are contending for the 4 "right" output ports of the switching element (packets cannot enter and then exit the "right" side of the switching element, because the resulting path would not be minimal). Therefore, if a packet is entering the "left" side, there are at most 3 other input ports currently sending packets to the "right" side, leaving at least one "right" output port open. The 4-route oblivious packets may often contend in the first stage, and this is the major cause of higher latency for this experiment.

We have established that adaptive routing performs well for several types of permutation traffic on small systems. We now briefly examine the performance of one permutation on a larger system to illustrate that the benefits of adaptive routing extend over a range of system sizes. Figure 5 displays the latency curves for the transpose permutation on a 64-way BMIN topology. Adaptive routing still obtains lower latency and higher saturation throughput, although it no longer achieves the "no contention" curve of the 16-way system. For the 64-way system, packets with source and destination in different 16-way groups will traverse 5 switches and have 16 possible least common ancestors. Thus, 4-route oblivious routing no longer corresponds to random routing, and we include the 16-route oblivious case to demonstrate that adaptive routing maintains performance advantages over random routing as the system size grows. Other permutations and system sizes support similar conclusions, but we will not exhaustively detail results to further support these claims here.
Random traffic simulation
In other experiments, we injected traffic with a uniform destination distribution: for each message, the source randomly chooses any node except itself as the destination. Figure 6 plots message latency for adaptive routing and 4-route oblivious routing for short (100-byte and 500-byte) messages. Latency before saturation is lower and saturation load is higher for adaptive routing, although neither criterion is significantly better than that of oblivious routing.
To see how the effect of adaptive routing for random traffic changes with system size, Figure 7 shows the results of the same short message experiment conducted on a 128-way BMIN, an example of which can be found in [1]. For this larger topology, the positive effects of adaptive routing on random traffic are more pronounced. There are more stages in which adaptive routing avoids contention compared with oblivious routing. Figure 8 shows the results of the same 128-way experiment conducted with longer (2000-byte and 8000-byte) messages. For the longer messages, adaptive routing saturates the network at a 25% higher input load than 4-route oblivious routing. As messages become longer, the effect of hot-spots becomes greater, and adaptive routing tends to shift traffic away from heavily loaded parts of the BMIN network.

To summarize: for BMINs, adaptive routing is generally superior to oblivious routing for both permutation and random traffic. The advantages accrue for two reasons: (1) Adaptive routing does not contribute to contention on the path "away" from the processors, because for this portion of the path each packet always finds an output port link available. (2) Even in the absence of contention, adaptive routing randomizes traffic by choosing among several available output ports going "away" from the nodes.

GENERATE-ROUTES(G_T, s)
    G_Ts ← BFS1(G_T, s)
    for each processor d ≠ s do
        G_R ← BFS2(G_Ts, d)
        G_S ← ALL-FEASIBLE-ROUTES(G_R)
        RT[s, d] ← MAX-ADAPTIVE-ROUTE(G_S)
    return the routing table RT

Figure 9: Generating routes from a processor to other processors.
Routing Algorithm
In this section, we describe an algorithm that generates the adaptive routing headers of the message packets. The algorithm maximizes the adaptivity, $N_{path}$, of the header. The problem of maximizing the adaptivity may be complicated by irregularities in the network topology, such as faulty links and switches. Here, we present an approach that is applicable to any multistage interconnection network, including networks with faults and partitioned networks.
We represent the topology of the network by a directed graph. We will work out an example on the 32-node network shown in Fig. 14. The topology graph $G_T = (V_T, E_T)$ contains 48 vertices, which represent the switching nodes and processor nodes. The processors are indexed from 0 to 31 and the switches are indexed from 32 to 47.
In the examples to follow, the message source will be processor 4 and its destination will be processor 30. 
Routability Graph
A routability graph $G_R = (V_R, E_R)$ enumerates all possible shortest paths from a source to a destination processor. A routability graph contains only switching nodes; it is the subgraph of the topology graph obtained by eliminating all switching nodes and edges that are not on a shortest path from the source to the destination node. For example, Fig. 15 shows the routability graph for the source-destination pair (4, 30) of the network given in Fig. 14. In the first step, a BFS-like algorithm (BFS1($G_T$, $s$) in Fig. 10) constructs the predecessors subgraph $G_{Ts} = (V_{Ts}, E_{Ts})$, which is different from the breadth-first tree generated during conventional BFS [18]. In $G_{Ts}$, $V_{Ts}$ contains all processor nodes of $G_T$ and those switching nodes of $G_T$ which are on the shortest route from the source processor s to at least one destination processor other than s. Similarly, $E_{Ts}$ contains, in reverse direction, those edges (links) of $G_T$ which are on the shortest route from the source processor s to at least one destination processor other than s. As seen in Fig. 10, each node $v \in V_{Ts}$ contains multiple parents stored in its $\pi[v]$ field, which also denotes the adjacency list of vertex v in $G_{Ts}$. Hence, the edge list $E_{Ts}$ of $G_{Ts}$ is constructed on $G_T$ in adjacency list format by the $\pi$ fields of the vertices in $V_{Ts}$.
In the second step, the routability graph for a processor pair (s, d) can easily be constructed by running another BFS-like algorithm (BFS2($G_{Ts}$, $d$) in Fig. 11) on $G_{Ts}$, starting from destination processor d. In Fig. 11, each non-black (white or gray) vertex $v \in V_{Ts}$ encountered while scanning the adjacency list of a vertex u of depth j from the destination switch constitutes an edge from vertex v to u at stages i and i + 1 of $G_R$, respectively, where i = n - j - 1.
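The two steps can be illustrated with a small Python sketch. It is not a transcription of Figs. 10 and 11, which are not reproduced here: the functions `bfs1` and `bfs2` only follow the prose description (a BFS that keeps every shortest-path parent, then a backward walk from the destination), and the 4-processor, 4-switch topology is a made-up example rather than the network of Fig. 14.

```python
from collections import deque


def bfs1(adj, s):
    """BFS from s that records all parents on shortest paths, not just one."""
    dist, preds = {s: 0}, {s: []}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:                 # first time v is reached
                dist[v] = dist[u] + 1
                preds[v] = [u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:      # another equally short way to v
                preds[v].append(u)
    return dist, preds


def bfs2(preds, s, d):
    """Walk parent lists back from d; keep switch-to-switch edges (u, v)."""
    edges, seen, queue = set(), {d}, deque([d])
    while queue:
        v = queue.popleft()
        for u in preds[v]:
            if u != s and v != d:             # the routability graph holds switches only
                edges.add((u, v))
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return edges


# Hypothetical topology: processors 0-3, first-stage switches 4-5, second-stage 6-7.
adj = {0: [4], 1: [4], 2: [5], 3: [5],
       4: [0, 1, 6, 7], 5: [2, 3, 6, 7], 6: [4, 5], 7: [4, 5]}
dist, preds = bfs1(adj, 0)
routability_edges = bfs2(preds, 0, 3)
```

For source 0 and destination 3, the resulting edge set contains the two shortest switch paths 4-6-5 and 4-7-5. Because `bfs1` is run once per source, `bfs2` can then be repeated cheaply for each destination.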
Solution Graph
A solution graph $G_S = (V_S, E_S)$ enumerates every feasible adaptive route solution (route-word encoding). The solution graph $G_S$ generated at the end of the second outer for-loop (lines 7-21) may contain vertices and edges which are not involved in any feasible solution path from the source to the destination because of vertices at later stages which do not have any outgoing edges. These infeasible vertices and edges are removed in the last outer for-loop (lines 22-27) in order to reduce the computational complexity of the subsequent dynamic programming.
Maximizing Adaptivity
Once the solution graph is created, the maximally adaptive route may be found by finding a path from source to destination node in the solution graph that maximizes the product of the adaptivity values of edges.
The adaptivity of an edge $e \in E_S$ is defined as the number of 1-bits (i.e., $|R[e]|$) in its edge label $R[e]$, representing the number of common output port choices of the switches in $S_a$ that can be used to lead the messages at those switches to the destination. The adaptivity of a path from source to destination is the product of the adaptivity values of the edges on the path. Hence, the problem reduces to finding a path from the source vertex to the destination vertex in $G_S$ with maximum adaptivity. As an example, in Fig. 16, the topmost path has a product cost (adaptivity) of 1 x 4 x 1 = 4 (i.e., |00010000| x |11110000| x |10000000|), which indicates that the given sequence of routing words results in 4 different routes between the source and destination processors. Likewise, the bottommost path has a product cost of 4 x 4 x 1 = 16 (i.e., |11110000| x |11110000| x |10000000|), which shows that the given sequence of routing words results in 16 different routes between the source and destination processors. Note that the bottommost path happens to be the solution with the maximum adaptivity; there are no more than 16 distinct shortest paths from processor 4 to processor 30, as can be verified from Figs. 14 and 15. Therefore, the route header encoding with the maximum adaptivity is $R_1 = 11110000$, $R_2 = 11110000$, $R_3 = 10000000$, and $R_4 = 01000000$ in this example.

Since the adaptivity of the optimal path from the destination switch to the destination processor is 1, the adaptivity of the optimal routes from all vertices of $G_S$ can easily be computed by performing a backward pass over the vertex stages of $G_S$, as shown in Fig. 13. The ADP value of the source vertex contains the adaptivity of the optimal routing solution(s) when the first for-loop (lines 2-9) terminates. In this for-loop, the next attribute of each vertex is computed to enable the construction of an optimal routing in the second outer for-loop (lines 11-14). This for-loop constructs an optimal routing by simply following the next fields of the vertices in the forward direction, starting from the source switch at stage 1.
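The backward pass and forward reconstruction can be sketched in Python as follows. This is an illustration of the idea rather than the procedure of Fig. 13, which is not reproduced here: the function name, the dictionary-based graph representation, and the vertex labels are assumptions, and the small graph only mimics the two paths discussed above (it is not the actual solution graph of Fig. 16). The sketch also assumes infeasible vertices have already been pruned, so every vertex reaches the destination.

```python
def max_adaptive_route(stages, edges, source, dest):
    """Find the route-word sequence with the largest product of one-bit counts.

    stages: list of vertex lists in stage order, ending with [dest].
    edges:  dict mapping a vertex to a list of (successor, route_word) pairs.
    """
    adp = {dest: 1}   # adaptivity of the best route from each vertex to dest
    nxt = {}          # best outgoing (successor, route_word) for each vertex
    for stage in reversed(stages[:-1]):          # backward pass over the stages
        for v in stage:
            best = 0
            for w, word in edges.get(v, []):
                value = bin(word).count("1") * adp.get(w, 0)
                if value > best:
                    best, nxt[v] = value, (w, word)
            adp[v] = best
    words, v = [], source
    while v != dest:                             # forward pass along next fields
        v, word = nxt[v]
        words.append(word)
    return adp[source], words


# Hypothetical 4-stage solution graph with a "top" and a "bottom" branch.
stages = [["a"], ["b_top", "b_bot"], ["c"], ["d"]]
edges = {
    "a": [("b_top", 0b00010000), ("b_bot", 0b11110000)],
    "b_top": [("c", 0b11110000)],
    "b_bot": [("c", 0b11110000)],
    "c": [("d", 0b10000000)],
}
adaptivity, route_words = max_adaptive_route(stages, edges, "a", "d")
```

Because the objective is a product of positive edge values over a staged graph, one backward sweep suffices; each edge is examined once.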
Conclusion
In this paper, we presented the first attempt to combine the source routing and adaptive routing methods, referred to as the adaptive source routing (ASR) method. We showed that the route and the adaptivity of message packets are determined at the source processor node, and that packets can be routed in a fully adaptive, partially adaptive, or oblivious manner in the same network, at the same time. We described how the ASR method may support multiple types of network traffic, in-order delivery of multiple packets to avoid overtaking, and network partitioning. The source routing nature of the ASR method eliminates the need for routing tables on the switch chips, which may limit scalability and occupy valuable real estate on silicon. We presented a performance comparison of adaptive versus oblivious routing networks. We found adaptive routing to be generally superior to oblivious routing for both permutation and random traffic.
We presented an algorithm that generates maximally adaptive routing headers for the message packets. The algorithm is applicable to multistage networks in general, including faulty networks and irregular topologies.
