Adaptive routing is widely regarded as a promising approach to improving interconnection network performance.
Introduction
Wormhole routing [4] has become the switching technique of choice in distributed-memory multiprocessors. Implementations of wormhole routing divide each message into flits. The header flit of a message contains the routing information, and the data flits of the message follow the header flit through the network. When the header arrives at an intermediate router, the router immediately forwards the message header to a neighboring router if an output channel the message can use is available.
D. N. Jayasimha*, Intel Corporation, MS FW2-02, 2200 Mission College Blvd., Santa Clara, CA 95052, djayasim@mipos2.intel.com
*This work was done while this author was with the Department of Computer and Information Science at Ohio State University in Columbus, Ohio.
Since the flits of a message are forwarded as soon as possible, the message latency is largely insensitive to the distance between the source and destination. In addition, wormhole routing requires only enough storage on a router to buffer a few flits, rather than the entire message. These two properties account for the popularity of wormhole routing in distributed-memory multiprocessors.
Although wormhole routing can reduce the communication overhead in large-scale multiprocessors, it does not provide a complete solution to the problem of minimizing communication overhead in parallel programs [2]. The primary drawback of wormhole routing is the contention that can occur with even moderate traffic, which leads to higher message latency and increased communication overhead. Channel contention can be reduced through software techniques such as mapping, hardware techniques such as adaptive routing, or a combination of both.
Hardware Techniques
A cost-effective method of reducing message latency, proposed by Dally [3], is to allow multiple virtual channels to share the same physical channel. Each virtual channel has a separate buffer, with multiple messages multiplexed over the same physical channel. Both latency and contention can be further reduced by using the multiple paths between the source and destination nodes. Many adaptive routing algorithms have been proposed to exploit this possibility.
The simplest routing algorithms are oblivious and define a single path between the source and destination. Adaptive routing algorithms, on the other hand, support multiple paths between the source and destination. Minimal routing algorithms allow only shortest paths to be chosen. Adaptive routing algorithms can be further differentiated by the fraction of shortest paths they allow. Fully adaptive routing algorithms allow all messages to use any shortest path. Some fully adaptive routing algorithms allow more adaptiveness than others by placing fewer restrictions on the choice of channels.
Many researchers have proposed adaptive routing algorithms to address the problem of contention. The hypothesis is that adaptive routing improves network throughput and performance by allowing messages to route around congested channels. This claim has been substantiated by comparing the average message latency among routing algorithms. With few exceptions, the trend has been that increasing adaptiveness results in lower average message latency, even when the additional complexity of adaptive routing [1] is included in the comparison [9]. These comparisons are based on traffic patterns such as transpose and uniform traffic, which may not adequately reflect the message traffic seen in typical parallel programs.
Software Techniques
Average message latency does not correlate well with the execution time of a parallel program. The reason becomes clear once one considers how communication among processors affects the execution time.
Before a parallel program is executed, it is first decomposed into tasks. Each task requires some computation time. Tasks may also require messages from other tasks prior to execution and may transmit results to other tasks after execution. The mapping problem is the problem of assigning these tasks to the processors so that the parallel program executes as quickly as possible.
Early work on the mapping problem focused on reducing the total communication cost. (Please see [12] for references on mapping.) Reducing the communication cost, however, does not necessarily reduce the total execution time, because the impact on total execution time depends on which messages are delayed. This is due to what is commonly referred to as the critical path. The critical path is the longest path, counting both computation and communication costs, from a source node of the task graph to the last node of the task graph. Delaying a message on the critical path (a critical edge) increases the total execution time of the task graph. Other messages can become critical edges if they are delayed too long.
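The critical-path definition above can be illustrated with a minimal Python sketch. The task names, computation times, and message costs are hypothetical and are not taken from the DAGs used in this paper. The function ignores processor limits and simply finds the longest path, counting both node and edge weights.

```python
from collections import defaultdict, deque

def critical_path_length(comp, comm):
    """Length of the longest path through a task DAG.

    comp: dict task -> computation time
    comm: dict (src, dst) -> communication time of the message src -> dst
    """
    succs = defaultdict(list)
    indeg = {t: 0 for t in comp}
    for (u, v) in comm:
        succs[u].append(v)
        indeg[v] += 1

    # finish[t] = earliest finish time of t, ignoring processor limits.
    finish = {t: comp[t] for t in comp if indeg[t] == 0}
    start = defaultdict(int)
    queue = deque(finish)
    while queue:                      # process tasks in topological order
        u = queue.popleft()
        for v in succs[u]:
            start[v] = max(start[v], finish[u] + comm[(u, v)])
            indeg[v] -= 1
            if indeg[v] == 0:
                finish[v] = start[v] + comp[v]
                queue.append(v)
    return max(finish.values())

def delayed(comm, edge, extra):
    """Copy of comm with `extra` time added to one message."""
    out = dict(comm)
    out[edge] += extra
    return out

# Hypothetical four-task diamond: A -> {B, C} -> D.
comp = {"A": 10, "B": 20, "C": 5, "D": 10}
comm = {("A", "B"): 4, ("A", "C"): 4, ("B", "D"): 4, ("C", "D"): 4}

base = critical_path_length(comp, comm)                             # path A-B-D
small = critical_path_length(comp, delayed(comm, ("A", "C"), 10))   # within slack
large = critical_path_length(comp, delayed(comm, ("A", "C"), 20))   # exceeds slack
print(base, small, large)
```

In this example the message from A to C has 15 time units of slack, so a delay of 10 leaves the total unchanged, while a delay of 20 turns it into a critical edge and lengthens the execution.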
The average message latency has been used to measure the performance of a routing algorithm, but the performance of a parallel program is determined by the total execution time. The usual unstated assumption is that a lower average message latency results in a lower total execution time, but that is not necessarily the case because of critical edges. The critical edges are typically a small subset of the edges in a parallel program. Thus, the average message latency has little relationship to the communica- 
Program Characteristics
In order to determine whether average message latency provides a good indication of execution time, or whether the critical edges play a major role in determining execution time, we simulated some parallel programs. Each parallel program used in the simulation is represented using the directed acyclic graph (DAG) model. The DAGs used in this paper are derived from the Cholesky factorization of three different irregular sparse matrices taken from real applications [5]. To provide differing computation-to-communication ratios, CPU clock cycle times of 20 nanoseconds and 5 nanoseconds are used. DAGs 1, 3, and 5 execute on the slower processors; the corresponding DAGs on the faster processors are 2, 4, and 6. The characteristics of each DAG are shown in table 1.
An additional one-byte header is prepended to each message during simulation. This header contains the destination address, which is used by all the routing algorithms to determine the path from the source to the destination. To reduce the simulation time while maintaining the original DAG structure, the task weights and message lengths have been proportionally reduced from those in the original DAG [5]. The message start-up time is also correspondingly reduced to ten CPU clock cycles. We assume a communication coprocessor is available to transmit messages with no further interaction with the CPU. If a task sends more than one message, these messages are generated at intervals of ten clock cycles to simulate the start-up time of each message.
The communication characteristics of a DAG change when the tasks are assigned to the processors, because the source and destination tasks are sometimes placed on the same processor. The number of messages and their length characteristics are also shown in table 1. The execution time of the tasks affects the task assignment, so a different cycle time changes the number of injected messages.
The mapping is performed by a simulator, referred to as the mapper, which accurately computes the execution time of the mapping. The mapper has an interface to a network simulator [8]. The simulation is event-driven, with each message corresponding to a unique event. The simulator allows different message traffic patterns to be specified by modifying a single process, called the user process. For these simulations, the user process must support message traffic that is generated by a parallel program. This requires interaction between the mapper and the simulator as depicted in figure 1, which shows the high-level operation of the user process. In essence, the user process provides the connection between the processors and the network. When a task completes execution, the messages sent by this task are injected into the network. The user process passes these injected messages to the simulator. The simulator then routes these messages. When a message is delivered, the delivery time is passed to the user process, which in turn passes this information back to the mapper. The mapper uses this information to update the status of the destination task. If all the messages have arrived for this task and the processor is free, then the mapper initiates this task and subsequently injects additional messages into the network.
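The interaction between the mapper and the network simulator can be sketched as a small event loop. This is only an illustration under stated assumptions: the function name `simulate`, the callback `deliver_time` (a stand-in for the network simulator), the first-come-first-served processor policy, and the fixed-latency network model are all hypothetical and are not the implementation used for the results in this paper.

```python
import heapq

def simulate(comp, comm, placement, deliver_time):
    """Event-driven execution of a task DAG on a set of processors.

    comp: task -> computation time
    comm: (src, dst) -> message length
    placement: task -> processor
    deliver_time: hypothetical network model, called as
        deliver_time(src_proc, dst_proc, length, inject_time) -> delivery time
    """
    waiting = {t: 0 for t in comp}      # undelivered messages per task
    succs = {t: [] for t in comp}
    for (u, v) in comm:
        succs[u].append(v)
        waiting[v] += 1

    proc_free = {p: 0 for p in set(placement.values())}
    ready_at = {t: 0 for t in comp}
    finish = {}
    events = [(0, "ready", t) for t in comp if waiting[t] == 0]
    heapq.heapify(events)

    while events:
        now, kind, t = heapq.heappop(events)
        if kind == "ready":
            # Mapper starts the task as soon as its processor is free.
            p = placement[t]
            end = max(now, proc_free[p]) + comp[t]
            proc_free[p] = end
            heapq.heappush(events, (end, "done", t))
        else:
            # Task finished: inject its messages and record deliveries.
            finish[t] = now
            for v in succs[t]:
                if placement[v] == placement[t]:
                    arrival = now       # same processor, no network message
                else:
                    arrival = deliver_time(placement[t], placement[v],
                                           comm[(t, v)], now)
                ready_at[v] = max(ready_at[v], arrival)
                waiting[v] -= 1
                if waiting[v] == 0:
                    heapq.heappush(events, (ready_at[v], "ready", v))
    return max(finish.values())

def fixed_latency(src, dst, length, inject_time):
    # Assumed contention-free model: 2 time units of overhead plus the length.
    return inject_time + 2 + length

comp = {"A": 10, "B": 20, "C": 5, "D": 10}
comm = {("A", "B"): 4, ("A", "C"): 4, ("B", "D"): 4, ("C", "D"): 4}
two_procs = {"A": "P0", "B": "P0", "C": "P1", "D": "P0"}
one_proc = {t: "P0" for t in comp}

print(simulate(comp, comm, two_procs, fixed_latency))
print(simulate(comp, comm, one_proc, fixed_latency))
```

A real network simulator would replace `fixed_latency` with a model in which the delivery time depends on contention, which is exactly the feedback the mapper needs.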
Mapping Heuristics
Two different mapping heuristics are used. The heuristic proposed by Yang, Bic, and Nicolau [15], referred to as the YBN heuristic, determines the mapping independently of contention in the interconnection network. The mapping is generated independently of the routing algorithm, so a comparison of different routing algorithms with the same mapping is possible.
The other heuristic, which was proposed by Schwiebert and Jayasimha [12] and is referred to as our heuristic, produces a mapping by iteratively adjusting the previous mapping. Each mapping is simulated in order to determine the network contention, which is used to generate an alternative mapping. Different mappings are produced for each routing algorithm, which optimizes the mapping for that routing algorithm.
Both mapping heuristics require the input DAG to be a clustered task graph with one cluster per processor. The Dominant Sequence Clustering (DSC) heuristic proposed by Yang and Gerasoulis [17] is used to cluster the task graphs, and the cluster merging algorithm [6] implemented in the Pyrros [16] system is used to produce the correct number of clusters. After the clustering and merging steps, there is one cluster per processor. We calculate the latest starting time of each task and execute the ready task with the minimum latest starting time first. A task becomes ready once it has received all of its messages. The messages generated by a task are prioritized by injecting them into the network in increasing order of the latest starting time of the destination tasks.
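The scheduling rule above can be sketched in Python. The text does not say how the latest starting time is computed, so this sketch assumes one common definition: the critical-path length minus the longest path from the start of the task to the end of the program. The function names and the example DAG are hypothetical.

```python
from collections import defaultdict
from functools import lru_cache

def latest_start_times(comp, comm):
    """Latest time each task can start without lengthening the critical path.

    Assumed definition: critical-path length minus the task's bottom level
    (longest path from the task to an exit, including communication costs).
    """
    succs = defaultdict(list)
    for (u, v), cost in comm.items():
        succs[u].append((v, cost))

    @lru_cache(maxsize=None)
    def bottom_level(t):
        tail = max((cost + bottom_level(v) for v, cost in succs[t]), default=0)
        return comp[t] + tail

    critical = max(bottom_level(t) for t in comp)
    return {t: critical - bottom_level(t) for t in comp}

def pick_ready_task(ready, lst):
    """Ready task with the minimum latest starting time."""
    return min(ready, key=lambda t: lst[t])

def injection_order(task, comm, lst):
    """Destinations of the messages sent by `task`, most urgent first."""
    dests = [v for (u, v) in comm if u == task]
    return sorted(dests, key=lambda v: lst[v])

# Hypothetical four-task diamond: A -> {B, C} -> D.
comp = {"A": 10, "B": 20, "C": 5, "D": 10}
comm = {("A", "B"): 4, ("A", "C"): 4, ("B", "D"): 4, ("C", "D"): 4}

lst = latest_start_times(comp, comm)
print(lst)
print(pick_ready_task({"B", "C"}, lst))
print(injection_order("A", comm, lst))
```

Here task B lies on the critical path, so it is executed before C when both are ready, and the message to B is injected before the message to C.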
The interaction between the simulator and the mapper allows the contention experienced by messages to be relayed to our mapping heuristic. Based on the feedback from each simulation, the mapping is modified by moving clusters with critical edges to processors that are topologically closer. The number of different mappings that are tried is at most the number of processors. Since different routing algorithms experience different contention patterns, our mapping heuristic adjusts automatically to match the particular characteristics of each routing algorithm. The YBN heuristic does not use contention information to select a mapping.
Routing Algorithms
Two fully adaptive routing algorithms and an oblivious routing algorithm have been used in these experiments. Only minimal routing is used. The fully adaptive routing algorithms are opt-y [13] and mad-y [7]. Opt-y is more adaptive than mad-y [13]. Adaptive routers are more complex than oblivious routers [1]. The calculation of network cycle time based on router complexity permits a more realistic comparison of the performance of different routing algorithms. A network cycle time of 5.92 nanoseconds is used for the adaptive routing algorithms and 5.41 nanoseconds is used for oblivious routing.
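To give a sense of what the two cycle times imply, the following short Python calculation uses only the figures quoted above. The break-even interpretation is an assumption of this sketch: it treats the network time of a message as the number of network cycles multiplied by the cycle time, and the function name is hypothetical.

```python
# Network cycle times from the text, in nanoseconds.
ADAPTIVE_CYCLE_NS = 5.92
OBLIVIOUS_CYCLE_NS = 5.41

def break_even_cycle_reduction(adaptive_ns, oblivious_ns):
    """Fraction of network cycles the adaptive router must save so that
    cycles * cycle_time is no worse than with the oblivious router."""
    return 1.0 - oblivious_ns / adaptive_ns

penalty = ADAPTIVE_CYCLE_NS / OBLIVIOUS_CYCLE_NS - 1.0
saving = break_even_cycle_reduction(ADAPTIVE_CYCLE_NS, OBLIVIOUS_CYCLE_NS)
print(f"cycle-time penalty: {penalty:.1%}")       # about 9.4%
print(f"break-even cycle saving: {saving:.1%}")   # about 8.6%
```

Under this simplified view, an adaptive router has to deliver a message in roughly 8.6% fewer network cycles than the oblivious router just to match it in elapsed time.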
Simulation Results
The total execution time (in nanoseconds) of each DAG, averaged over three runs, is presented. In addition to the execution time of the DAG, the average message latency is also computed.
2D Mesh Simulations
Simulations were run for both 4
Our mapping heuristic always produces a better mapping than the YBN mapping heuristic, because our heuristic adjusts the mapping to reduce contention. Our mapping heuristic always chooses different mappings for each of the routing algorithms, so we are able to address the contention characteristics of each routing algorithm individually.
Scatter plots of the execution times and average message latencies of 64 different mappings of DAG 6 are shown in figures 2-4. Similar results were obtained for the other DAGs. The figures show little relationship between the average message latency and the execution time. The correlation coefficients for opt-y, mad-y, and XY are 0.12, 0.51, and 0.14, respectively, suggesting that average message latency is a poor predictor of total execution time.
Table 3 shows the execution times using the same cycle time for all the routing algorithms. The adaptive routing algorithms now often perform better than oblivious routing. When oblivious routing is better, the difference is much less than the difference shown in table 2.
The average message latency shows a lack of correlation with the execution times. For example, with DAG 4 using the YBN heuristic (the same mapping), the average message latency is 247ns with opt-y, 242ns with mad-y, and 251ns with XY. In other words, XY performs slightly better even though the average message latency is higher. A similar result can be seen with our mapping heuristic. For example, with DAG 5 the average message latencies with opt-y, mad-y, and XY are 939ns, 1154ns, and 973ns, respectively, even though the performance is worse with opt-y than XY. On the other hand, sometimes the average message latency is lower along with the execution time. For example, with DAG 6 using our heuristic, the average message latencies with opt-y, mad-y, and XY are 221ns, 233ns, and 240ns, respectively.
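The correlation coefficients reported for the scatter plots can be put in perspective with a short Python sketch. It assumes the reported values are Pearson correlation coefficients, which the text does not state explicitly. The helper `pearson_r` and the sample data are hypothetical; only the three coefficients come from the text.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical (latency, execution time) pairs, one per mapping.
latencies = [1.0, 2.0, 3.0, 4.0]
exec_times = [2.0, 1.0, 2.0, 1.0]
print(pearson_r(latencies, exec_times))

# Squaring a coefficient gives the fraction of the variance in one
# quantity that a linear fit on the other would explain.
reported = {"opt-y": 0.12, "mad-y": 0.51, "XY": 0.14}
explained = {name: round(r * r, 4) for name, r in reported.items()}
print(explained)
```

Under that assumption, a linear fit on average message latency would account for about 1% of the variation in execution time for opt-y, about 2% for XY, and about 26% for mad-y.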
Discussion and Conclusions
We have explained why the average message latency is a poor predictor of performance in a real parallel processor. The simulation results support our arguments. If the multiprocessor is used in a multiprogramming environment, then the total execution time of individual applications may not be as important as increasing the through-
there is good reason to believe that the results will hold with a larger number of DAGs. The reason for the relatively poor performance of adaptive routing was not because the chosen DAGs had communication patterns that were better suited for oblivious routing. In fact, the results show exactly the opposite; adaptive routing usually had lower average message latency. The delay of critical edges was the cause of poor performance and most parallel programs have a relatively small set of critical edges.
Furthermore, DAGs with a different computation-to-communication ratio are unlikely to produce substantially different results. DAGs with little communication experience less contention, so the routing algorithm makes even less of a difference. The delay of a few critical edges, however, could still result in a mismatch between average message latency and total execution time. DAGs with more communication experience more contention, so the difference in average message latency between adaptive and oblivious routing would increase. The total execution time, however, would still be relatively independent of the average message latency, because the critical edges are a small subset of the messages in most DAGs.
Several topics for future work are currently being explored. One possibility for improving performance is assigning higher priority to critical edges, with lower priorities assigned to the other messages depending on when each needs to arrive at the destination. Routing algorithms designed to use these priorities could improve the total execution time, perhaps at the expense of the average message latency. Even at run-time, it should be possible to partially determine priorities.
The interaction between adaptive routing and mapping affects the performance of the routing algorithms. The relative performance of adaptive routing may improve if more intelligent mapping heuristics are used. For example, one consequence of achieving deadlock freedom with routing restrictions is that adaptive routing often introduces imbalance into the network by favoring certain message routes. The mapping heuristic may realize better performance by exploiting this imbalance.
The results also suggest that an improvement in the hardware design of adaptive routers is needed to compensate for the added complexity of adaptive routing. Efficient designs of adaptive routers will decrease the difference in router cycle times and should lead to improved performance for adaptive routing. Our simulation results for the hypercube, which use the same router cycle time for both adaptive and nonadaptive routing, support this claim [14].
Finally, additional measures of routing algorithm performance may be required to obtain a clear picture of expected performance under real conditions. For example, if two routing algorithms have similar average message latencies, the one with a lower variance seems likely to give better performance in practice [10].
