Abstract − This paper develops detailed analytical performance models for k-ary n-cube networks with single-flit or infinite buffers, wormhole routing, and the non-adaptive deadlock-free routing scheme proposed by Dally and Seitz. In contrast to previous performance studies of such networks, the system is modeled as a closed queueing network that (1) includes the effects of blocking and pipelining of messages in the network, (2) allows for arbitrary source-destination probability distributions, and (3) explicitly models the virtual channels used in the deadlock-free routing algorithm.
Introduction
Multiprocessor mesh interconnection networks are 2-dimensional networks, with the processors arranged at the nodes of a grid, and point-to-point links connecting each node to its neighbors. Mesh interconnection networks are a special case of k-ary n-cube networks in which the number of dimensions, n, is two. Recent studies of k-ary n-cubes with wormhole routing (a low-latency pipelined routing scheme [9] ) have shown that under reasonable assumptions, the optimal value for n is two or three [2, 8, 10] . Many existing and emerging multiprocessor systems use such low-dimensional direct networks to interconnect the processors, including the Intel Paragon, Cray T3D, Stanford Dash [14] , M.I.T. Alewife [1] , M.I.T. J-Machine [16] and CMU-Intel iWarp [5] .
In this paper, we develop performance models to study k-ary n-cube networks with wormhole routing, with either single-flit or infinite network buffers. Our model for the single-flit buffer case includes the deadlock free routing algorithm of Dally and Seitz [9] . In contrast to previous analyses of these networks [2, 10, 11] , the models we derive are closed queueing network models. Also in contrast to previous work, (i) we include the effects of ----blocking and pipelining of messages in the network, (ii) we allow for arbitrary source-destination probability distributions, and (iii) we explicitly model the virtual channels used in the deadlock avoidance algorithm. In the single-flit buffer model, the representation of message pipelining and blocking and the asymmetric virtual channel loadings of the deadlock avoidance algorithm require an approximate Mean Value Analysis solution that is rather complex. These features, however, have a significant impact on system performance and are thus important to model. The model provides a further example that approximate Mean Value Analysis can be used for accurate performance prediction of highly complex systems with non-product-form queueing behavior.
We use the models to examine several performance issues for two-dimensional networks. We study network performance and scalability with processors that must block after each request, as well as with processors that can make multiple requests before blocking for responses. We compare the performance of three mesh network topologies: the unidirectional and bidirectional tori (meshes with end-around links connecting corresponding nodes on opposite edges) and the bidirectional mesh without end-around links. We first study the above issues under a uniform traffic pattern. We then examine the impact of communication locality on network performance and scalability, and discuss how the other conclusions obtained under uniform communication change in the presence of varying degrees of locality. We also study network performance when a communication hot-spot occurs, including the effect of a hot-spot on other traffic in the network. Finally, we analyze and explain a potentially important performance implication of the deadlock avoidance algorithm. Specifically, this algorithm produces asymmetric loads on the virtual channels sharing each physical network link. The performance analysis shows that this asymmetry can lead to a perceptible difference between the efficiencies of processors at different locations in the mesh.
The remainder of this paper is organized as follows. Section 2 describes the mesh network and key performance issues in more detail, and states the assumptions about the system workload. Section 3 presents an overview of the models, and gives the details for the new techniques developed. The models also use several previously developed Mean Value Analysis approximations; the complete set of equations is given in Appendix B. Section 4 first presents the results of the model validations we performed using simulation, and then presents the performance analysis using the analytical models. Section 5 contains the conclusions of our study.
System Description
We describe the system and workload assumptions made in this study in Sections 2.1 through 2.5. In Section 2.6, we discuss several performance issues related to mesh networks that will be studied using the model. 
Mesh Network Topologies
The basic topology of multiprocessor mesh interconnection networks is illustrated in Figure 2. 1. There are a number of variations on this basic topology. The connection between each pair of adjacent nodes may be unidirectional or bidirectional, with the latter usually being implemented as two unidirectional links. With the unidirectional topology, end-around connections that connect a node at one edge to the corresponding node at the opposite edge (as shown in Figure 2 .1) are necessary. A mesh with end-around connections is often called a torus. Endaround connections may also be included in the bidirectional case to reduce the average number of hops a message must travel in the network. A torus can be organized so that all links are of equal length, with each link being about twice as long as in the case without end-around connections [8] .
Organization of a Node
A node in the system typically consists of one or more processors, some associated local memory, and a hardware switch that controls the routing of messages through the node (Figure 2 .2). When the node needs to send a message to another node, it queues the message in a local buffer (not shown in the figure). The message waits until the node-to-switch link (connecting the processor and memory to the local switch) becomes free, and then until space becomes available in the outgoing virtual channel buffer. (The message must compete for the channel buffer with messages from neighboring nodes that request the same buffer; we assume that the switch chooses among competing requests in first-come-first-serve order.) Thereafter, the message is forwarded down the link into the channel buffer in the pipelined manner of wormhole routing, which we describe next. We assume that the ----processor is not involved in transferring the message from the local buffer into the outgoing channel in the local switch.
Wormhole Routing
In wormhole routing a switch begins forwarding a message as soon as the header is received and the required channel buffer in the next switch can accept one or more flits of this message. Thus, the flits of a message are transmitted from one switch to the next in a pipelined fashion and may occupy several channels along the path from source to destination. Only the header flit of a message contains routing information. If the header flit of a message is blocked because the required buffer in the next switch along its path is full all the flits in the message are blocked and, therefore, so are the channels they occupy. If more than one flit can be buffered at a node, flits behind the header can "catch up" until the available buffer space is filled. At this point they block and can continue only after the header is unblocked. We assume this method of routing throughout the paper.
Deadlock Avoidance for Finite Buffers
In the ideal case, when buffer capacity is unlimited, deadlock cannot occur in the network and the wormhole routing scheme is equivalent to an optimized form 1 of the virtual cut-through routing algorithm defined for data communication networks [12] . In practice, buffer capacity in a node is limited and deadlock can occur in the networks with end-around connections because all the buffers in a cycle could be filled, with no message able to make progress along that cycle.
Dally and Seitz have proposed a deterministic routing scheme that uses the concept of virtual channels to break cycles and prevent deadlock in the networks with end-around connections [9] . In this scheme, each physical link is shared by two virtual channels that are fed by separate buffers. As long as both virtual channels have messages to send, they alternate their flits on the physical link. If one of the buffers is empty or blocked, the other channel can transmit continuously, using the entire link bandwidth. When a message is blocked, all the virtual channels occupied by the flits of that message are also blocked, and no other messages can use those channels.
The algorithm is illustrated in Figure 2 .3, and operates as follows. Each node in the k-ary n-cube is assigned an n-digit, base-k number, which specifies the position of the node in the cube. Dimensions are numbered and messages are always routed in decreasing order of dimension. (For example, in Figure 2 .3, d=1 for the columns, d=0 for the rows, and routing is column first.) In each dimension d, d=n −1, . . . ,0, a message is routed in that 1. The virtual cut-through algorithm specifies that, if the header flit is blocked at a switch, the entire message has to be received before the message is forwarded. Instead, wormhole routing allows a partially received message to be forwarded as soon as its outgoing channel becomes available.
---- virtual channels along a column Unidirectional 4×4 torus Unidirectional 8×8 torus Channels on path from S=12 to D=20 Uniform traffic distribution.
dimension until it reaches a node whose dth digit agrees with the dth digit of the destination node. The message is routed along the "high" virtual channel if the dth digit of the destination address is greater than the dth digit of the present node's address. Otherwise, it is routed along the "low" virtual channel. For example, in a unidirectional mesh network (as in Figure 2. 3), a message to a node with a higher row number is routed on the high virtual channel along the column until it crosses the link out of node on row 0 (shown at the edge of the network in the figure), and thereafter uses the low virtual channel on the column until it reaches the destination row.
The algorithm imposes a total order on the virtual channels that are used in each direction along any dimension of the network. Furthermore, the requirement that messages are routed in decreasing order of dimension implies that no cycles exist across dimensions. The algorithm is thus deadlock-free because it imposes a partial order on all virtual channels in the network.
The above deadlock-free routing scheme generates asymmetric loads on the virtual channels in the network even when all processors have a uniform message destination probability distribution (i.e., even when the loads on the physical links are balanced). Figure 2 .4 shows the fraction of total link traffic that uses each virtual channel for the links on a single column in a unidirectional 8×8 torus, assuming uniform message distribution. Note that all traffic on the link leaving the processor on row 0 uses the high virtual channel, and thus the buffer space of the low virtual channel is completely unused. In general, on a physical link near the "edge" row or column (after ----which traffic crosses over from the high to low channel or vice versa), the traffic tends to be concentrated on one of the two virtual channels. For links far away from the terminal row or column, the traffic is more evenly balanced on the two virtual channels. In parallel work, Bolding has recently observed the same phenomenon [4] and gives similar data as in Figure 2 .4, showing the buffer utilizations on the high and low channels for bidirectional and unidirectional topologies.
Workload Assumptions
We assume that all the processors in the system execute subtasks of large MIMD parallel programs. 2 Most previous studies of mesh networks, the hypercube and other k-ary n-cubes, have assumed a message-passing workload [8, 10, 11] , but a number of recent shared-memory systems have also been based on mesh networks (Alewife [1] , Dash [14] , Cray T3D). Our model of the network is applicable to both types of workloads.
We allow a processor to have a maximum of N out requests outstanding before it is required to block for a reply. (N out is a parameter of the model.) For the model and results described in this paper, we assume that the rate of generation of requests is independent of how many requests are outstanding, until the maximum of N out is reached. This assumption can accurately capture the behavior of systems in which each processor can switch between multiple contexts [1, 21] , and should also be a reasonable model for many message-passing workloads.
The assumption may be somewhat more approximate for processors that permit non-blocking memory operations (e.g, as with buffered writes, non-blocking caches or prefetching), where the intervals between successive requests may depend in complex ways on the number of outstanding requests. Finally, as explained in Section 3, a simple modification would allow the model to capture the behavior of hierarchical multiprocessors [14] containing multiple processors per node (by allowing the rate at which a node generates requests be proportional to the number of additional requests it can make before blocking).
Our model does not restrict the communication patterns in the system. Each processor sends a message to each other processor with a specified probability, and these arbitrary probability distributions are inputs to the model. This permits us to study the effect of non-uniform traffic patterns on system efficiency.
The workload and system parameters used in the study are defined in Section 3.
2. We do not explicitly model synchronization events. Instead, we assume that these are reflected in the rate at which processors generate messages.
cases that bound the performance of any particular finite buffer size, and show how much can be gained by increasing switch buffer sizes.
Multiple outstanding requests:
Allowing a node to have more than one request outstanding has the potential to at least partially hide the latency of remote communication, but there is also potential for higher congestion in the network. We investigate how much improvement in absolute system efficiency is possible with multiple outstanding requests (due to overlapping communication with execution) and whether at some point network congestion cancels this gain. We also investigate how system scalability is affected by allowing multiple outstanding requests per processor.
Mesh Topology: There are performance and cost tradeoffs between the three network topologies mentioned in Section 1. In a k-ary n-cube network, the mean number of links that a message must traverse assuming all other nodes are equally likely to be the destination is approximately The extra links in the bidirectional networks imply a higher cost, however, and this must be accounted for to allow a fair comparison between the various topologies. When comparing these networks, we assume that the number of input and output wires per switch is fixed, which implies that the channels of the unidirectional network can be twice as wide as those in either of the bidirectional networks, offsetting the larger mean number of hops required. Furthermore, the bidirectional mesh without end-around connections, unlike the torus networks, does not require the deadlock-prevention algorithm of Dally and Seitz since the fixed-dimension-first routing is sufficient to prevent cyclical dependencies between links in the network. To allow a fair comparison, we compare the torus networks with single-flit buffers per virtual channel to non-torus mesh networks with two-flit buffers per physical link. This ensures that the buffer capacity per switch is equal for all three topologies. [17] . We use our model to study the effect of communication hot-spots in mesh networks, with processors that block after a limited number of outstanding requests. We study the degradation in overall system performance due to a hot-spot, as well as the effect of a hot-spot on the latency of other traffic in the network.
Performance imbalance caused by the deadlock-avoidance algorithm:
In Section 2.4 we pointed out that the deadlock-free routing algorithm of Dally and Seitz generates asymmetric loads on the virtual channels in the network. The asymmetry does not necessarily imply that the processors near the edge are more adversely affected than the processors near the center of the mesh. The actual effect is complicated and requires careful reasoning about the pipelining effects of wormhole routing. A more detailed explanation of the asymmetry, and a quantitative analysis of its potential impact on performance, are given in Section 4.7. Time to respectively read and write one word from a memory module
The Model
In order to study the design trade-offs outlined at the end of the previous section, we have created closed queueing network models of the k-ary n-cube network for each of two buffer sizes: finite buffers of size one flit, ----and infinite buffers. The parameters of the models are defined in In our models, each processor forms a class of customers with its own destination probability distribution and with population equal to N out . In other words, each possible message a processor can have outstanding is modeled as a separate customer in the system. When there are n < N out requests outstanding, the remaining N out −n customers are served in FCFS order at the processor, 3 as in [22] . Thus, each customer in the system repeatedly performs the following actions:
− execute for an amount of time measured in switch cycles that is geometrically distributed with mean τ, − visit a remote node and return to the processor (representing a remote memory access, or sending a message and receiving an acknowledgement), − queue at the processor to resume execution.
We develop the equations assuming that a remote processor is not interrupted when it receives an incoming message, which would be true for a shared memory system. In this case, an incoming message only requires a memory access at the remote node. (The equations can easily be modified to reflect message processing by the node processor or message-handling co-processors.)
We choose to develop approximate Mean Value Analysis models because of the previous success of this technique for analyzing other interconnection networks with features that violate separable model assumptions [20, 22] . Approximate Mean Value Analysis is based on estimating the mean round-trip time, or cycle time, for each class of customers in the queueing network, relative to some reference point. The processor serves as the reference point for the residence time equations in our model. The mean round-trip time for a customer of class i is the sum of its mean residence times (queueing and service) in the local processor, in the network, and at the remote node: 3 . FCFS service at the processor is appropriate for systems such as those that maintain multiple contexts at each processor since only one context executes at any time. An infinite server would be more accurate for hierarchical systems. The equations for queueing at the processor can easily be modified for this case.
----
Each processor in the system has a distinct mean round-trip time because of non-uniform virtual channel loads as well as possibly non-uniform communication patterns.
The mean round-trip time in the network is the weighted sum of the mean times for the message and the response, for each type of request:
where r j,sd is the mean time for a message of type j from node s to node d.
To calculate r j,sd , we need to model the routing, pipelining, and blocking of messages in the network. These features require an approximate model solution. Our model for systems with infinite channel buffers is similar to the model developed for Banyan networks in [22] . Their equations assume that processor cycles are required to transfer a message into the network; we do not make this assumption. The only other difference is that we use a somewhat more accurate technique to estimate residence times at the processor for N out >1. This technique is also employed in our finite buffer model, and is discussed in Section 3.3 below. Otherwise, we do not give the model equations for the infinite buffer case. 
Overview of the Model with Single-Flit Buffers
Since messages in the mesh network can occupy several channels simultaneously, the mean message residence time, r j,sd , is the sum of the following three terms:
(i) the mean waiting time for the link from the node to its switch, (w node,sd | j ),
(ii) the mean residence time for the header flit on each virtual channel c between s and d (r j,c,sd [1] ), and (iii) the mean delay until the remaining flits of the message reach d (T catchup ):
where the summation is over virtual channels, c, on the path from s to d, including the channels out of the ----processor at s and into the processor at d. 4 Note that the above equation is similar in form to the equation used in [2] and [22] . One difference between our equation and the corresponding one in [2] is that we include the waiting time for the link that connects the processor to the switch, not just for the first switch buffer. A difference between our model and both [2, 22] is that T catchup is not deterministic, since at each link the flits may or may not have to alternate with flits on the link's other virtual channel. To compute T catchup , we assume that the probability a flit must share a link is approximately equal to the utilization of the link by messages on the other virtual channel mapped to the link. Appendix B contains the details.
Further development of the model equations requires new techniques for estimating r j,c,sd [1] , r proc , and the waiting time for the first network virtual channel when there are multiple channels from processor to switch.
These approximations are motivated and outlined in Sections 3.2 -3.4 respectively. Section 3.5 concludes with a discussion of the model complexity.
Mean Channel Residence Time (r j,c,sd [k])
Let r j,c,sd [k ] denote the average residence time of the k th flit of a type j message from s to d on channel c.
The mean residence time for a header flit (k=1) on channel c is itself the sum of three terms:
(i) the average waiting time for the next channel on the path from s to d (we denote this channel by (c +1) sd ),
(ii) the average waiting time for a flit on the virtual channel that is multiplexed onto the same physical link as c; we denote this channel by c , and approximate this term by u link,c , the mean utilization of the link by messages on c (i.e., the fraction of time the link is actually transmitting flits from c ), and (iii) the one cycle for transferring the flit to the next queue:
where a message from s to d enters (c +1) sd via input port I. The possible input ports to a virtual channel are the virtual channels from neighboring switches or the channel from the processor at the current node. The waiting time for (c +1) sd is a function of the input port because the traffic on (c +1) sd coming from the various input ports is asymmetric, in general. 
The key question for the model is how to calculate w c | I , the waiting time for virtual channel c experienced by a header flit of a message that enters c via input port I. This waiting time is the sum of three terms: 5 (i) the mean residual residence time of a message in service at c that arrived via some other input port i≠I, if any,
(ii) the mean residual residence time for the last flit of a message in service at c from port I, if any, and (iii) the mean time to serve messages waiting to use c from other input ports (at most one per port):
The calculation of each of these terms is explained below.
(i) The total residence time of a message at c is random with an unknown distribution. Rather than assume knowledge of this distribution to calculate the mean residual life of a message in service, we assume that the residence time of each flit of a message is deterministic, i.e. its mean residual life is r j,c,i [k ] / 2. We expect that this assumption will be good for low to moderate network traffic, and will introduce only small error at higher loads since a flit residence time is small compared with total message residence time. Thus, the mean residual residence time of an entire message (seen by an arrival on input port I) can be calculated by conditioning on the event that the arrival finds the k th flit of a type j message that arrived from input port i in service at channel c. The (iii) The third term is the average waiting time for messages that are waiting on input ports other than i when a message arrives at input port I (we assume that these will be transmitted by channel c before the arriving message). For each input port i≠I, the probability that a type j message is waiting to use c is approximated by the utilization of channel (c −1) i by header flits of type j messages that will next use c : u j, (c −1) i [1] . Multiplying by the total residence time of such a message and summing over i≠I and all message types j gives the third term.
The remaining unknowns in the above equations (u j,c,i [k ] , u link,c ) are calculated using previously developed MVA techniques [20, 22] , as described in Appendix B.
Processor Residence Times (r proc [i])
The processor is modeled as an FCFS queueing center, where the service time is geometrically distributed with a mean value of τ cycles. In early model validation experiments, we found that the widely-used Schweitzer approximation for product-form networks [18] is not sufficiently accurate for the processor queues since the customer population for each processor, N out , can be small. Furthermore, previously developed approximations such as Linearizer [6] , which achieve greater accuracy by solving the equations at a few neighboring populations, introduce too much additional complexity into the model. Below we develop a new approximation for r proc [i ] that is empirically accurate and yet requires very little additional computation when N out is not large. The key idea is that we solve for r proc [i ] recursively (i.e., similar to exact Mean Value Analysis) without recursively solving for the residence times at other queues in the system for each customer population. Empirically, we found the new approximation to be considerably more accurate than the Schweitzer approximation. Furthermore, we note that the approximation is applicable to any multiclass queueing network where all the demand at some queue comes from a small fraction of the customers in the network. As far as we are aware, this approach has not been previously reported in the literature.
Consider some processor i. Define r proc (i,n) to be the steady-state average residence time at processor i if there are n customers in its class. Thus, r proc [i ] = r proc (i,N out ). Similarly, let q proc (i,n) and u proc (i,n) denote the mean queue length and the mean processor utilization at processor i with n customers in its class. r proc (i,n) is the sum of the mean service time (τ), the mean waiting time for customers found waiting in the queue, and the mean residual service time (res proc ) for the customer in service, if any. We estimate the mean queue length and processor utilization seen by an arriving customer by q proc (i,n −1) and u proc (i,n −1) respectively (just as in exact MVA), producing a recursion over n = 1, . . . ,N out . The key to the approximation is that q proc (i,n −1) and u proc (i,n −1) are calculated using the same values of r network [i ] and r remote [i ] , for all n = 1, . . . ,N out . (These values are available from the previous iteration in the numerical solution of the overall model.) Thus,
(The equations are numbered to correspond to Appendix B.) The mean residual service time of a customer found in service, res proc , has to be calculated as seen by the tail flit, not the header flit, since the customer is queued up at the processor only when its tail flit arrives. Conditioned on finding a customer in service, the mean residual service time seen by the head is τ, by the memoryless property of the geometric distribution. A returning message is of length L resp 1 with probability P 1 and L resp 2 with probability P 2 . In one cycle, the probability that the customer in service at the processor does not complete service is γ = 1 − (1/τ). Therefore, the average residual service time seen by the tail flit is approximated by:
Equations 14-18 are solved for each processor i separately, in order to calculate all the r proc [i ].
Waiting Time for First Virtual Channel With Multiple Processor-to-Switch Channels
In our mesh network analysis, we found the physical link that connects each processor to its associated switch to be a bottleneck under high loads, in the network with single-flit channel buffers. Therefore, we investigated the use of multiple physical links connecting the processor to the switch, one for each outgoing virtual channel from the switch to neighboring nodes. (In practice, only two to three physical processor-to-switch links should provide about the same performance since almost all the messages out of a processor are concentrated on a few outgoing virtual channels.) With this organization, when a message from the processor finds its link to the switch busy and later reaches the head of the queue for this link (where it is has to wait for the outgoing virtual channel buffer), it cannot find a message from any other input port occupying the channel; it can find only the tail flit from the preceding message. Furthermore, such an arriving message is more likely to find waiting messages at other input ports (which were blocked by the preceding message). These observations lead to a somewhat different expression for the waiting time for the outgoing channel (i.e., the first virtual channel along the path of the message) in the case of multiple processor-to-switch links.
Consider a tagged customer of class q, and let c denote the first virtual channel along its path. Define w′ c | q to be the average time this message has to wait before entering channel c. As in Equation (6), w′ c | q is the sum of ----three terms:
the mean waiting time for messages in service at c that arrived from input ports i≠PROC, if the tagged message found the processor-to-switch link idle (PROC denotes the input port used by messages arriving to c from the processor), the mean waiting time for the tail flit of previous message from PROC, if the tagged message found the processor-to-switch link busy, and the mean waiting time for messages blocked at input ports i≠PROC that are waiting to use channel c:
The form of the first term for w′ c | q is the same as the first term in equation (6), with
is the probability that a message of class q finds channel c busy serving the k th flit of a type j message that arrived via input port i≠PROC, j ∈ {msg 1,msg 2,resp 1,resp 2}. This is estimated as follows:
fraction of time c not serving message coming from node
where b node, j,c | q [k ] denotes the probability that a class q message finds the processor-to-switch link corresponding to outgoing channel c busy serving the k th flit of a type j message.
The second term in w′ c | q is straightforward because the probability that the processor-to-switch link for channel c is found busy with a message of type j is merely
, and the tagged message must see all but one cycle of the residence time of the last flit of the message ahead of it.
For the third term, q′ j,c,i | q is defined as the average number of messages of type j from i≠PROC found waiting to use c by the arriving class q message. The calculation of q′ j,c,i | q requires another observation about the blocking phenomena in the mesh. A message following another message out of the node and into c is more likely than a random arrival at c to find a message on input port i already waiting for channel c. To account for this, we calculate the probability that a random message using c via I =PROC blocks a type j message on incoming virtual channel i. (The latter message will then be waiting to use c when the following message from I =PROC reaches the head of its queue for c.) Similarly, we consider messages from each i′≠PROC (and i′≠i) blocking messages ----from i to c. Therefore, q′ j,c,i | q is as follows:
We illustrate the calculation of one of these terms here. A key observation we make is that Pr{a message from I =PROC blocks a type j message from i} is proportional to the relative number of messages to c that arrive from i and I respectively (counting only type j messages from i). But this relative number is exactly the ratio of the visit ratio of type j messages from i to c to the visit ratio of all message types from I =PROC to c. This ratio of visit ratios, multiplied by the probability that a random message from i is blocked by a message from I =PROC then gives us the probability that a random message from I to c blocks a message on i. Thus, define V j,c,I to be the sum of the visit ratios of customers of all classes as type j messages to channel c via input port i. Then,
where the summations in the parentheses sum to the probability that a random message arriving to c from i has to wait for a message from I =PROC (which may be in service at c, or blocked on the processor-switch link to c).
The above discussion highlights the main points of this new heuristic used to calculate the waiting time for the first virtual channel. The complete equations are given with the rest of the model in Appendix B.
Complexity of the Models
The space complexity, where L max is the length of the longest message type. As a result, the model with L max =11 and N=64 cannot practically be run on systems with fewer than ten megabytes of main memory. Nevertheless, solving the model is still about ten to a hundred times faster than simulating the wormhole routing protocol under a statistical workload. Furthermore, the model allows us to explore various issues and design trade-offs under the realistic assumptions of arbitrary message sizes, and blocking due to finite buffers. For example, the effects of asymmetric channel loads and hot spot traffic are a direct result of limited buffer space for the channels. Finally, the model with infinite buffers is highly efficient and can be used to explore many of the mesh network design trade-offs for larger systems. 
Results
In this section we describe the results of extensive analyses of two-dimensional networks using the above models. We assumed a shared-memory workload for these experiments, as discussed below. We first present the ranges of input parameter values used in our study (Section 4.1), and the results of validation experiments (Section 4.2). In Section 4.3, we evaluate the performance and scalability of a baseline system that we use as a reference point for studying further network design issues. In Section 4.4, we study the impact of allowing multiple outstanding requests. In Section 4.5, we compare the alternative mesh topologies. In Section 4.6, we study the performance impact of near-neighbor workloads, and re-evaluate the design issues studied in Sections 4.3-4.5 under such workloads. In Section 4.7, we study the degradation in system performance due to communication hot-spots.
Finally, in Section 4.8, we analyze the imbalance in processor efficiencies caused by the asymmetric loads on the virtual channels in the deadlock prevention algorithm (see Section 2.4).
Model Input Parameter Values and Performance Measures
The measures of system performance we use are individual and average processor efficiency, defined as the fraction of time a processor spends doing useful work:
Other measures that are obtained from the model equations include steady state mean channel queue lengths and steady state link utilizations. We validated the accuracy of several of these detailed measures as well.
The ranges of values used for the various model input parameters are given in Table 4 showed very little further improvement in processor efficiency (for our parameter settings). Also, since average remote access latencies are typically greater than 20 switch cycles in an 8×8 mesh, it is difficult to envisage programs that make requests faster than about one every 20 cycles executing with any reasonable efficiency on this or larger systems. Thus, we believe the above range of τ should allow us to study the performance of a fairly wide range of programs.
Messages are assumed to consist of a header flit, plus address flits (type msg 1), data flits (type resp 1), address and data flits (type msg 2), or acknowledgement flits (type resp 2). These interpretations of the message contents and the associated message lengths in Table 4 .1 are intended to represent a shared memory workload.
Message-passing programs could be expected to exchange larger messages between processes, though less frequently [8] . The models can be modified to study such workloads; however, that is beyond the scope of this paper.
The message sizes also reflect the assumption that the channels in the unidirectional torus are twice as wide as in the bidirectional networks with an equal number of wires per switch. Finally, we set P 1 = 0.8, P 2 =0.2, and D mem,r = D mem,w = 4 for all experiments. We do not expect moderate changes in these parameters to significantly alter our results.
Validation of the Models Against Simulation
We used event-driven simulators to validate the analytical models, for both the single-flit and infinite buffer cases. The simulators use a statistical workload identical to that of the analytical models, but implement the wormhole routing of the flits and the deadlock-free routing algorithm exactly. We present representative results of the validations of the single-flit buffer model in Tables 4.2 At low to moderate network loads, the average processor efficiency from the analytical model agrees closely with the value obtained by simulation (less than 3% error). Thus our finite buffer model has accuracy similar to the very accurate, less complex models of the infinite buffer case; validations of the infinite buffer model gave results very similar to the model in [22] and are not shown here.
In cases of high network contention (eg., with N out ≥ 4 and τ < 50), the analytical single-flit buffer model tends to be somewhat optimistic. In these cases some links are nearly saturated and the absolute value of the processor efficiency tends to be very low. The maximum error in average processor efficiency across all our validation experiments was 20%, which is shown for N=16, τ=5 and N out = 8 in Table 4 .2(a). In all cases, the predicted efficiencies are qualitatively correct.
We also examined more detailed performance measures, including estimates of the individual, asymmetric processor efficiencies. The maximum, minimum, and ratio of maximum to minimum processor efficiency predicted by the analytical model and simulation for the 8×8 torus are shown in Table 4 .3. Again, agreement is very good, particularly for the max/min efficiency ratio. Note also that the ratio of the two efficiencies is always underestimated by the analytical model. Thus, the imbalance estimates, discussed in section 4.8, are generally
conservative.
---- Bidirectional torus, infinite buffers, uniform traffic, N out = 1.
Baseline System Performance
We choose as our "baseline system" (which we will use as a reference point for studying further network design issues), the bidirectional torus with uniform traffic, and processors that block after each request (i.e., N out = 1). In Figure 4 .1, the solid lines show the average processor efficiency as a function of request rate (1 /τ) for the baseline system with single-flit channel buffers, and system sizes of 16 and 64 processors. The performance of this system is low at moderate or high request rates (1 /τ>0.03). The poor performance in this system is chiefly caused by inherent latency of communication rather than by contention in the network. To show this, we also give the efficiency curves assuming there is no contention in the network (the dashed lines in Figure 4 .1). Comparing the two sets of curves, we see that the absolute loss in efficiency due to contention is about 5-10%. Thus, the system performance is latency-rather than bandwidth-limited when processors block after each request.
Furthermore, because communication latency is the chief cause of low efficiency, increasing buffer sizes per switch yields very little performance improvement for these system sizes. In fact, for these systems, the average processor efficiency with infinite channel buffers (shown in Figure 4 .2 and discussed below) is almost identical to the performance with single-flit channel buffers.
To examine how system performance scales with increasing system size, in Figure 4 .2 we plot the average processor efficiency as a function of mesh radix (√ N ) for different request rates (1 /τ), for the baseline system with infinite channel buffers. The figure shows that the performance of the baseline system scales well (i.e. average ----processor efficiency decreases slowly) with increasing system size, even though the absolute performance is low.
These curves also show that the decrease in efficiency with increasing radix is primarily due to higher latency rather than due to network contention. Specifically, the decrease in efficiency is close to linear, showing that it is primarily due to the increasing number of hops a message must travel, rather than an increase in the delay (due to contention) at each hop. We conjecture that a baseline system with small channel buffers will also show good scalability, based on the low network contention seen in all the cases studied above. (The space complexity Bidirectional torus, uniform traffic, infinite buffers.
Multiple Outstanding Requests
Since baseline system performance is chiefly limited by communication latency rather than contention, a plausible technique for improving processor efficiency is to allow processors to make multiple requests before blocking. The impact of multiple outstanding requests on system performance and scalability are as follows. per processor are sufficient in systems that are being prototyped today [1, 21] .
The figure also shows that larger channel buffers become increasingly important as N out is increased, because of the increasing contention. The performance difference between single-flit and infinite channel buffers is significant even for N out = 2 and becomes quite large for N out ≥ 4.
Because of the increased contention, it is important to re-evaluate the scalability of the network with multiple outstanding requests. In 
Alternate Mesh Network Topologies
We next compare the performance of the different network topologies under uniform communication. In this sub-section, we use the term "mesh" specifically to refer to the network without end-around connections. Our first comparison is between the two bidirectional networks: the torus and the mesh. The average number of hops a message must travel assuming uniform traffic is about 33% larger without the end-around connections. On the other hand, as explained in Section 2.6, the mesh does not require multiple virtual channels per physical link, as required by the deadlock-prevention algorithm for the torus. We therefore use a buffer size of 2 flits per physical link in the mesh network, to ensure a fair comparison with the torus with a single-flit buffer per virtual channel. We next compare the bidirectional torus with the unidirectional torus. For the former, we use message lengths of L msg 1 = 4, L resp 1 = 10, L msg 2 = 12 and L resp 2 = 4, rather than 3, 9, 11 and 3 used in all other experiments. This allows us to halve these message lengths for the unidirectional torus, according to our assumption that its links are twice as wide as those in the bidirectional torus. 7. Section 4.8, however, shows that performance imbalance between different parts of the system is higher with the unidirectional network, which may exacerbate the difference in performance at larger system sizes. 
Nearest-Neighbor Workloads
The previous experiments assumed uniformly distributed inter-node communication, i.e., each node is equally likely to communicate with each other node. In this section, we investigate how near-neighbor communication locality affects system performance and our previous conclusions about system design trade-offs. We consider near-neighbor traffic patterns in which some fraction, F nn , of the traffic generated by each processor is equally divided among its four nearest neighbors, while its remaining traffic is uniformly distributed to all nodes (including the four neighbors). The uniformly distributed traffic represents non-near-neighbor communication required by the near-neighbor application, as well as other activity on the system such as operating system traffic. Locality of communication influences many of the design issues that were examined in previous sections assuming uniform traffic distribution. These must now be re-evaluated.
The set of points for single-flit buffers in Figure 4 .6(b) shows mean processor efficiency for the 8×8 mesh with single-flit buffers and N out = 4. We observe that, as for uniform traffic, the infinite buffer case is significantly better than the single-flit buffer case even up to F nn = 70-80%.
The results for 1/τ = 0.04 in Figures 4.6 (a) and (b) show that the improvement when N out goes from 1 to 4
is much stronger at higher levels of locality, for N > 64. Thus, locality increases the benefit of multiple outstanding requests for large systems. This is true because network contention is reduced for high values of F nn , so that increasing N out does not cause much higher contention, but does improve performance by overlapping communication with computation.
We can also re-evaluate how system performance scales when communication locality is present. Under uniform traffic we concluded that the mesh network scales well when N out is 1, but scales poorly for N out = 4. In Figure 4 .6, we see that increasing locality has a positive effect on the scalability of the network (as expected), but nevertheless, for N out = 4, system performance only scales well for F nn ≥80%. It may be unrealistic to expect such high levels of locality for real workloads.
Finally, the relative performance of the various mesh topologies may differ under near-neighbor workloads.
In particular, we showed in Section 4.5 that the bidirectional torus has a significant performance advantage over the unidirectional torus with multiple outstanding requests. The performance advantage of the bidirectional torus will increase in the presence of locality since, in the unidirectional torus, a round-trip message to a near-neighbor requires √ N hops as compared with 2 hops.
To summarize the results of this section, locality of communication improves system performance particularly in large systems with multiple outstanding requests, and increases the benefit of multiple outstanding requests, but only marginally improves the ability of the mesh to support larger system sizes. In particular, the case of N out = 4 does not scale well for F nn <80%. Other conclusions of the experiments with uniform workloads are also not altered for workloads with communication locality.
Hot-spot effects.
Hot-spots are a form of non-uniform communication that can strongly impact system performance. Hotspots can arise, for example, when a number of processors make a significant fraction of their requests to a single memory module or to a single node in a multiprocessor. The issue has been studied using open queueing models (i.e., assuming non-blocking processors) in the context of multistage interconnection networks [13, 17, 23] .
We examine the effect of hot-spots in mesh networks by assigning some fraction, F hot , of requests from each processor to a particular node in the system, while the remaining fraction 1−F hot is distributed uniformly across all processors. The above hot-spot experiments assumed a single node-to-switch link, as in all previous experiments. This link in the hot node is a bottleneck in the system, and substantial queues build up at the link because we have assumed unlimited buffer space for it. Hence, traffic in the rest of the network sees almost no contention. Using multiple links from node to switch (for example, one node-to-switch link per outgoing virtual channel) alleviates this bottleneck. The average response times for this case are shown in Figure 4 .8. 8 The figure shows that the average round-trip time has reduced considerably for the cases that showed non-negligible increases in response time due to hot-spots in Figure 4 .7. Although Figure 4 .8 assumes one link per outgoing channel, we would expect to see approximately the same performance if the eight processor-switch channels were multiplexed onto 2-3 physical links since a very high fraction of the outgoing traffic at each node uses only two or three of the eight outgoing virtual channels. (7/8 of the total traffic out a node must go out first on the column, and is further restricted to only two or three of these four outgoing virtual channels by the deadlock-avoidance algorithm.)
When the bottleneck on the node-to-switch link in the hot node is alleviated, the switch-to-node link becomes the new bottleneck in the system. Now, channel buffers on paths leading to this bottleneck link can get filled up, affecting messages to non-hot nodes as well. Hot-spot studies in indirect networks have shown that traffic to memory modules other than the hot module is slowed down as much as traffic to the hot module itself [17] . This phenomenon has been called tree saturation. To analyze the corresponding effect in mesh networks, we plot in Figure 4 .9 the average response time for messages to the hot processor (dashed lines) and to all other processors (solid lines), assuming multiple node-to-switch links. 9 For N out = 1, we see that traffic to non-hot processors does not see significant increase in response time even when F hot is as high as 20%. For N out = 4, mean response time to the non-hot processors actually decreases slightly when F hot is increased from 0 − 5%. In this case, contention at the hot node has significantly decreased overall network throughput, offsetting any treesaturation effect.
The above results suggest that the presence of hot-spots in mesh networks does not significantly increase response times for non-hot traffic, in systems of this size. This is different from the conclusions of Pfister and Norton [17] for systems of the same size based on multistage interconnection networks. The principal reason for the 8 . The curves for N out = 4 and F hot = 10% and 20% in Figure 4 .8(b) were plotted using data from simulations because the analytical model did not converge in this case. (The results of the analytical models for lower values of F hot were in good agreement with simulations.) 9. Note that for N out = 4, with uniform traffic (F hot = 0%) the round-trip time to the "hot" processor is actually lower than to other processors. This is a direct result of the asymmetric loads on the virtual channels described in Sections 2.4 and studied in Section 4.8. The hot processor chosen for these experiments (processor [2, 2] ) is located at a point in the mesh where loads on the outgoing virtual channels are balanced. The hot-spot experiments described in this section have focused on 64-processor systems. The degradation due to a hot-spot will become more severe with increasing system size. However, finite-buffer models are necessary for realistic hot-spot studies and we have been unable to use our analytical finite-buffer model to study large systems quantitatively. (Simulating these systems is even more difficult.) Nevertheless, we believe the results of this section provide insights that would be valuable in studying large systems. Qualitatively, we expect multiple node-to-switch links to significantly alleviate the effect of a hot-spot for larger systems as well. We also expect that larger systems with blocking processors (finite N out ) will be able to support much higher levels of hot-spot traffic without introducing tree-saturation than would be predicted using open system models (i.e., assuming N out = ∞).
Performance Imbalance Caused by the Deadlock-Free Routing Algorithm
The analyses of torus performance using the single-flit buffer analytical model show significant differences among processor efficiencies at different locations in the system, and these observations are corroborated by simulation (see Section 4.1). To illustrate the imbalance, Table 4 To quantify the performance imbalance at particular parameter settings, we use the ratio of the maximum processor efficiency to the minimum processor efficiency. Figure 4 .11 plots this ratio as a function of the request rate for the unidirectional and bidirectional 8×8 tori, for N out = 1, 2, and 4. The figure shows that the imbalance becomes significant when network contention is moderately high, but includes cases that represent reasonable operating points (i.e., average efficiencies greater than 50%). For example, the 64-processor system with N out = 4 has average efficiency greater than 50% at most request rates, as shown in balanced virtual channels versus nodes that place somewhat greater load on high-and low-load channels are not likely to account for the fairly large observed imbalance in processor efficiencies.
More careful consideration of message pipelining and blocking behavior reveals a different and potentially substantial impact of the asymmetric loads that can also explain the particular patterns of imbalance observed in Table 4 .4. Specifically, certain nodes' outgoing messages (both requests and responses to other nodes) experience ----relatively severe blocking due to high-load virtual channels, after leaving the node. Such nodes will see significantly higher contention for their node-to-switch link because of the pipelining and blocking of messages.
Since each processor is the heaviest user of its own node-to-switch link, increased contention on this link results in lower efficiency for the processor at such a node. Furthermore, in the unidirectional torus, it is the nodes near the center of the network whose outgoing messages experience most severe blocking due to high-load channels, whereas in the bidirectional torus it is the nodes near the edges. This leads to the different patterns of imbalance in the two cases.
To demonstrate the phenomenon quantitatively, consider the nodes on some fixed column (as in Figure 2 Thus, the ultimate effect of high-load channels is the same in both networks: they cause more severe blocking for outgoing messages of some nodes, which produces much greater contention for the node-to-switch link at these nodes. However, the pattern of use of the high-load channels by outgoing messages is different in the two networks. In the bidirectional case, blocking of outgoing messages is more severe for nodes near the edges because these nodes' outgoing messages make much greater use of high-load virtual channels compared to nodes near the center. For example, a node on row 7 in an 8×8 bidirectional network sends all its outgoing messages on channels that carry 100% of their links' traffic (NL and SL), whereas the outgoing messages for a node on row 3
are mostly concentrated on channels with low to moderate load (NL and SH). In the unidirectional case, however, the nodes near the center, which have poor performance, make only slightly greater use of high-load virtual channels than nodes near the edges. In these networks, it appears more significant that outgoing messages from nodes near the center travel from channels with a low to moderate load into channels with a high load, whereas for nodes near the edges the opposite is true (refer to Figure 2 .4). The pipelined routing causes significantly greater blocking ----for the former than for the latter.
As the preceding discussion indicates, the precise explanation of the relationship between the asymmetric channel loadings and the performance imbalance is fairly subtle and non-intuitive. Formulating and validating the explanation required significant insight as well as analysis detailed metrics obtained from the analytical model.
Furthermore, the imbalance cannot be detected or analyzed using models or simulations that ignore the virtual channel loadings or the finite switch buffers. Finally, note that the waiting times for the node-to-switch links, and hence the imbalance itself, might be reduced by the use of multiple physical node-to-switch links and/or multiple virtual node-to-switch channels.
Conclusions.
We simulation. We are not aware of any previous work that has used similar models of blocking in these networks.
We believe the validation results are important evidence that approximate mean value analysis is a viable technique for modeling complex systems.
We used the models to analyze various issues that arise in the design of 2-dimensional (mesh) networks.
These results (summarized below) should prove useful for engineering high-performance systems based on lowdimensional k-ary n-cube networks.
Some of our results confirm and quantify existing intuition about mesh interconnection networks. With processors that block on every request we have shown that contention in the network is low, and the three network topologies (bidirectional and unidirectional torus and bidirectional mesh) show little difference in performance.
Multiple outstanding requests can help increase performance, but also cause increased contention. Thus, in this case, substantial performance gain is achievable by increasing buffer size.
We also gained new intuition from some of our results, including:
Under uniform workloads absolute performance is higher with multiple outstanding requests, but network performance does not scale well with increasing system size.
Communication locality improves system performance, particularly for multiple outstanding requests, but at least 70-80% of each processor's traffic must be directed to its nearest neighbors before the case of 4 outstanding requests scales well.
With multiple outstanding requests, the bidirectional torus performs significantly better than the other two topologies. Furthermore, it exhibits much lower performance imbalance (see below) due to the deadlock-free routing algorithm than the unidirectional torus.
Open system models can yield extremely pessimistic results in hot-spot studies. When processors block after making a few requests, only high fractions of hot-spot traffic cause significant performance degradation in 64-processor systems with single-flit buffers. Furthermore, traffic to the non-hot processors is not much affected by hot-spot traffic in these systems (i.e., tree-saturation is not observed).
At some plausible operating points (i.e., in cases where average processor efficiency is reasonably high), there is perceptible difference in the efficiencies of processors at different locations in the mesh. This imbalance is due to asymmetric loads on the virtual channels by the deadlock-avoidance algorithm.
A number of related issues for k-ary n-cube networks remain to be studied. The models developed in this paper can be used to study network performance for message-passing and hierarchical systems. The conclusions from the experiments need to be examined for 3-dimensional networks. The result that a communication hot-spot does not significantly slow down other traffic in the system needs to be re-examined for larger systems. A related question that needs to be answered is what buffer sizes are required to approximate infinite buffer performance, under various workload assumptions. However, for interconnection networks with pipelined routing in particular, modeling the performance with larger finite buffers is a difficult problem. (Previous analytical models of interconnection networks that allow finite buffer sizes are based on a decomposition approximation in which each queue is analyzed in isolation, thus ignoring the dependencies between network stages caused by the blocking and pipelining of messages; for example, see [15] and the references therein.) Finally, it would be worthwhile to develop a deadlock-free routing algorithm for the mesh network that does not lead to the imbalance in processor efficiencies that we have observed.
Acknowledgements.
We thank Amarnath Mukherjee and Derek Eager for valuable discussions during the development of the model. We also thank Sarita Adve, Anant Agarwal, Derek Eager, Mark Hill, Susan Owicki, Gurinder Sohi and an anonymous referee for valuable comments on earlier drafts of this paper. the second term in equation 6 is no longer required and the first term does not have a summation over input ports. In both cases the reason is that the buffer capacity for this link is unbounded.
ii.
The third term in equation 6 used the probability that there is a waiting request, for each input port i. Now, however, we require an actual queue length, q node, j,s | q , denoting the average number of requests of type j waiting for the processor link in node s when a request of class q arrives. This can be calculated as follows: 
----Queueing at the processor.
As defined in Appendix A, r proc (i,n) is the average residence time at the processor for a customer of class i, when there are n customers in its class. Hence r proc [i ] =r proc (i,N out ) by definition. r proc (i,n) is calculated by recursion on n.
r proc (i,n) = [q proc (i,n −1) − u proc (i,n −1)] × τ + u proc (i,n −1) × res proc , n > 1,
q proc (i,n) = r proc (i,n) + r network [i ] + r remote [i ] n × r proc (i,n) , 
res proc = (τ−1) × ( P 1 γ L resp 1 −1 + P 2 γ L resp 2 −1 ), γ = 1 − (1/τ).
Residence time at the remote node.
The equations for the residence time at the remote node are developed assuming that the request is serviced at the memory of the remote node without interrupting the remote processor (we denote r remote as r mem here). For example, in a shared-memory system remote memory accesses could be of this type, with msg 1, msg 2, resp 1 and resp 2 corresponding to read, write, data and acknowledgement messages respectively. The time to access one word is D mem, 1 for a read request and D mem, 2 for a write request. We assume that a request is queued for a memory module only when its last flit is received at the node. We also assume that the memory at each node is interleaved and, to simplify the analysis, all accesses read or write the first byte from the first module, the second byte from the second module and so on. This implies that D mem, 1 (D mem, 2 ) cycles after a read (write) request begins service, the next memory request can begin service. Further, for a read request the response (data message)
is queued up to be transmitted as soon as the first word of data is read out from the first memory module, with subsequent words being transmitted one per cycle. An acknowledgement in response to a write request is queued when the last byte has been written, i.e. after L msg 2 cycles. We do not limit the number of requests that can simultaneously be queued up for a given module.
The method for calculating the mean residence time at memory is the same as in [22] . 
Finally, just as in the processor queueing equations, the residual life of a memory request in service has to be calculated as seen by the tail flit rather than as seen by the head. Defining res mem, j′ | j to be the residual service time of a type j′∈{1,2} request as seen by the tail flit of a message of type j∈{msg1,msg2}, we have:
Waiting time for the first virtual channel in the network, with multiple processor-switch channels.
The equations of the model described so far have been the same for single or multiple processor → switch channels. The only exception is (10) for w node,s | q , which now would be denoted by w node,c | q , and must be calculated separately for each outgoing channel c from node s. Similarly, q node, j,c | q , b node, j,c | q [k ] , and r node, j,c [k ] have to be calculated separately for each c; however, in all cases the equations remain essentially the same.
The waiting time for the buffer in the first switch is the only part that needs to be calculated somewhat differently, as described in Section 3.5. The equations are as follows. The first term in Equation (24) has been explained in Section 3.5, and the remaining two terms are similar to the second and third terms in Equation (6) . Then, as explained in Section 3. 
The first line of (26) corresponds to waiting for messages that were blocked on input port i≠PROC by the preceding message on the processor-to-switch link, when the processor-to-switch link is found busy. When it is found idle but channel c is busy serving a message that arrived from input port i′≠i,i′≠PROC, the tagged message also has to wait for messages on input port i that were blocked by the message occupying c. This is the second line of (26).
This completes the equations for the case with multiple processor switch channels, and the description of the model.
