Many studies of advanced router features consider only their benefits, not their costs. In this paper, we examine the cost in router complexity of adaptivity and virtual lanes in a class of wormhole routers by examining a set of router designs. Increased router complexity affects achievable router latency and bandwidth. Our studies establish the cost of adaptivity and virtual lanes, allowing cost to be compared to performance benefit.
Introduction
In concurrent computers, interconnection networks are used by the processing nodes to exchange data and synchronize with each other. Network performance is often critical, as the performance of large-scale parallel machines is sensitive to network latency and throughput. While multicomputers have been touted as scalable parallel architectures, their scalability is limited by the performance of their interconnection networks.
An interconnection network is defined by its topology, routing, and flow control. The topology is the pattern of network node interconnection via physical communication channels. The routing algorithm specifies how packets choose paths through the network. Flow control deals with the allocation of channel and buffer resources to packets as they proceed through the network. This paper focuses on evaluating the cost of a variety of router features involving routing and flow control.
Deterministic, dimension-order routers are used in a variety of multicomputers because they are exceedingly simple and provide low latency and high bandwidth. However, deterministic routers have two significant disadvantages: poor performance under non-uniform traffic loads and poor fault tolerance. Adaptive routing is a promising approach to alleviate these problems. But because adaptive routers can be more complex, and this complexity leads to legitimate concerns about their speed and tangible benefits, adaptive routers have yet to gain acceptance in many commercial machines.
Recently, adaptive routing and virtual lanes have been touted as practical approaches for improving network performance. A number of dramatically simpler adaptive routing algorithms have been proposed [28, 22, 5]. This breakthrough makes adaptive routing feasible, but not without cost. Deciding whether or not to incorporate adaptive routing into a router is still a complex cost-performance tradeoff, with the cost side of the equation still largely undefined. Virtual lanes have been proposed as a mechanism to improve router performance [13, 7]. However, virtual lanes not only increase channel utilization, they also increase router complexity, slowing implementations. While virtual lanes are also attractive, they have yet to attain widespread acceptance in commercial machines.
In this paper, we examine the cost of adaptivity and virtual lanes based on a series of gate level router designs. The studies provide a basis for assessing the cost of adaptivity and virtual lanes, allowing that cost to be weighed against their benefit. These studies show that both adaptivity and virtual lanes can incur significant penalties in router latency and attainable network cycle time. Thus, the benefits of these router features must be carefully weighed against their costs before deciding to include them. This paper makes two significant contributions. First, it gives a detailed description of an adaptive wormhole router and characterizes the functionality and speed of each module, providing a basis for estimating router speeds. Second, it examines the speed of the baseline router and a variety of enhanced routers with increased routing freedom and numbers of virtual lanes. This not only allows the speed of adaptive routers to be compared to existing deterministic router designs, it also provides a basis for assessing the cost of adaptivity and virtual lanes, admitting a cost-performance tradeoff.
To assess the cost of adaptivity, we examine a series of router designs which range from one to eight degrees of routing freedom. Our studies show that the cost of two degrees of routing freedom, planar-adaptive routing, can be modest. However, when higher degrees of adaptivity are used, the cost increases dramatically. The designs show that 50% or greater increases in channel utilization are required to justify each additional degree of routing freedom.
To assess the cost of virtual lanes, we examine router designs with one to sixteen virtual lanes. There are several proposed router architectures for virtual lanes, so we first examine each of these and select the most attractive, a fully expanded crossbar. Our design studies show that virtual lanes are expensive, but less so than increased adaptivity. Each additional virtual lane requires an increase in channel utilization of 30% or more to be cost effective. The large cost of each virtual lane is due to larger crossbars and much larger virtual channel controllers. Given published studies of the benefits of virtual lanes, a few virtual lanes might give enough throughput increase to justify this cost, but large numbers of virtual lanes are likely to be unacceptably expensive.
Overview. The remainder of the paper is organized as follows. Section 2 describes the context of this work: direct networks, wormhole routing, and planar-adaptive routing. Section 3.1 presents cost-performance metrics which are used throughout the paper to evaluate router designs. Our baseline router design, a planar-adaptive router, is described in Section 3.2. With a baseline clearly established, Sections 5 and 6 consider the cost of adaptive routing and virtual lanes. The overall performance results are summarized in Section 7. Section 8 discusses related work, and in Section 9, we summarize the results presented in the paper.
Background
Communication performance depends critically on a network's topology, flow control, and routing. We focus on k-ary n-cubes, direct networks with radix k and dimension n [12]. By varying the choice of k and n, this family of networks represents a wide range of choices in density of interconnection. We also focus only on routers that use wormhole routing, a low cost approach to flow control that allows small, simple routers. Wormhole routers for k-ary n-cubes have been used in a variety of commercial and research machines [11, 27, 14, 2, 1, 24, 3]. Communication performance also depends critically on the routing algorithm used to map communications to hardware resources. Routing approaches can be divided into two categories: deterministic and adaptive routing. In deterministic routers, each message is routed along a fixed path, determined by the source and destination of the message. Deterministic routing's main advantage, hardware simplicity, is directly tied to its primary disadvantage, a lack of routing flexibility which limits network performance and fault tolerance. Any particular fixed choice of routes will produce poor performance for some communication patterns.
Adaptive routing can alleviate such problems by mapping communications to paths flexibly, based on network loading. The flexibility in routing improves performance on non-uniform workloads [22] and can provide a measure of fault tolerance [8]. The major disadvantage of adaptive routing is the greater complexity required to support the additional routing flexibility while assuring deadlock-freedom. This increase in hardware complexity can significantly reduce router speed, decreasing total network performance. To reduce the cost of adaptive routing, many approaches based on limited adaptivity have been proposed. In particular, we focus on one family of deadlock-free adaptive routing algorithms with a range of adaptivity and hardware complexity.¹ This family allows routing freedom to be traded off against router speed while assuring deadlock-freedom. The simplest adaptive router in this family, a planar-adaptive router, is described below.
¹ Many different approaches could have been chosen. We selected this one due to the apparent simplicity of hardware implementation.
Planar-adaptive Routing
Planar-Adaptive Routing (PAR) is a limited adaptivity routing algorithm. PAR has many implementation advantages, most notably hardware simplicity. PAR uses only three virtual channels for deadlock prevention and small crossbar switches regardless of the number of dimensions [8, 21, 4].
The idea in planar-adaptive routing is to provide limited adaptivity by routing adaptively in a series of two-dimensional planes. As the packet progresses towards its destination, it passes through a series of adaptive planes; eventually, the packet completes routing in all dimensions and is delivered to the destination. By limiting adaptivity to two dimensions and structuring the passage from one adaptive plane to the next, we reduce network cost while maintaining deadlock-freedom. Within each network, traffic is routed adaptively towards its destination along any of the productive channels. When the $d_i$ address is correct, routing is completed in plane $A_i$, and the packet proceeds to the next high-level step.
In the high-level routing, the basic idea is to route successively in the adaptive planes.² Routing in adaptive plane $A_i$ reduces the distance in $d_i$ to zero. After routing in all of the adaptive planes, the packet has reached its destination. For $d_{n-1}$, there cannot be any adaptivity left for a minimal router, so the packet is routed directly to its destination. In the low-level routing, the scheme is adaptive, as multiple paths can be chosen within each adaptive plane. The plane $A_i$ is divided into two virtual planes, increasing $A_i^+$ and decreasing $A_i^-$. They are completely decoupled.
² The order of dimensions is arbitrary.
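The two routing levels described above can be sketched as follows. This is an illustrative sketch only, not the router's actual logic: it assumes a mesh without wraparound channels, fixes the dimension order as 0, 1, ..., n-1, ignores virtual channel selection, and the function name `productive_channels` is hypothetical.

```python
def productive_channels(current, dest):
    """Return the (dimension, direction) hops a planar-adaptive router may take.

    current, dest: coordinate tuples of equal length n.
    Assumption: mesh without wraparound; adaptive plane A_i spans d_i and d_(i+1).
    """
    n = len(current)
    # High-level routing: the active plane A_i is given by the first
    # dimension whose offset has not yet been reduced to zero.
    for i in range(n):
        if current[i] != dest[i]:
            break
    else:
        return []  # every offset is zero: the packet has arrived

    hops = [(i, 1 if dest[i] > current[i] else -1)]
    # Low-level routing: within A_i the packet may also move productively
    # in d_(i+1). The last dimension has no partner, so no adaptivity is left.
    j = i + 1
    if j < n and current[j] != dest[j]:
        hops.append((j, 1 if dest[j] > current[j] else -1))
    return hops
```

For a three-dimensional example, a packet at (0, 0, 0) bound for (2, 1, 3) may choose between dimensions 0 and 1; once dimension 0 is correct it chooses between dimensions 1 and 2, and the final dimension is routed deterministically.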
F-flat Adaptive Routing
The planar-adaptive routing algorithm can be generalized to support higher degrees of adaptivity. The basic idea is to increase the degree of routing freedom at each low-level routing step (adaptive planes to adaptive cubes, etc.), producing the class of f-flat routers. F-flat adaptive routers are deadlock-free and allow a range of routing freedom choices. An f-flat adaptive router allows routing in the f-flat subspace of the n-dimensional space, giving f degrees of adaptivity. Thus, a planar-adaptive router is a 2-flat adaptive router. The composition of adaptive spaces is handled in an analogous fashion to planar-adaptive routing. Any deadlock-free adaptive routing algorithm can be used within the f-flat. Increasing the routing freedom by increasing f can improve channel utilization, but each such increase requires additional hardware, incurring increases in router latency and router clock periods.
Methodology
To evaluate the performance impact of adaptive routing on multiprocessor routing networks, we first define performance metrics for the network router. These metrics are affected by topology, routing algorithm, and implementation technology. We first focus on the internal router issues where speed is determined largely by routing algorithm and technology, producing a characterization of the internal router delay for a variety of router designs. These estimates quantify the performance impact of including advanced router features such as adaptivity and virtual lanes. Subsequently, we consider the system level issues (clocking scheme, clock synchronization, and channel delay), combining them with the internal router delays to estimate the network router performance metrics defined below.
Cost and Performance Metrics
Router performance can be characterized by several metrics: channel utilization, router setup latency, and achievable clock rate. These metrics are defined below.
Channel utilization measures the router's ability to make productive use of the physical channel resources. It is characterized by the fraction of channels utilized for a given traffic load.
Router setup latency is the delay from router input to output. Combined with the channel delay, router setup latency determines the network's zero-load delay. For wormhole-routed networks, this is often close to the typical network delay.
Achievable clock rate is the maximum rate at which the router operates correctly. For asynchronous designs, this is the maximum operation rate. For synchronous designs, it is the maximum clock rate. In both cases, this is the primary determinant of the channel clock rate.
Most previous studies of router enhancements have focused on channel utilization, a measure of performance improvement [28, 22, 15, 5, 6]. One reason for this is that channel utilization can be studied independent of implementation issues. In this paper, we focus on the cost of adaptive routing and how it affects the router setup latency and achievable clock rate. Increases in setup latency and achievable clock period can also be viewed as the cost of router enhancements. Setup latency depends on intra-router delay (logic delays within the router) and inter-router delay (channel latency determined by topology, packaging, and synchronization time). Achievable clock rate depends on the clocking scheme and on both inter- and intra-router latency. For now, we focus on two internal router performance measures: path setup, the intranode setup latency, and data through, the intranode flow control time, which is closely tied to achievable clock rate.
The Baseline Router
Our baseline router is a planar-adaptive router (PAR) as described in Section 2.1. In this section, we describe the architecture of the router, carefully describing the functionality of each module. This gives the reader insight as to how and why router changes produce decreases in performance. In following sections, the cost of each additional router feature is calculated by comparing the performance of the enhanced router to this baseline design. A planar-adaptive router consists of a series of composable modules, one for each adaptive plane. The external interface of one such plane consists of four bidirectional links and a pair of ports to allow composition (see Figure 2). The bidirectional connections to neighbors, denoted L1, L2, L3, and L4, are each implemented with dual unidirectional channels, so the planar-adaptive router has inputs and outputs as shown in Figure 2. The y-dimension links use two virtual channels to support deadlock-free adaptive routing (labeled iy and dy for increasing and decreasing y). One router input and output is connected to other adaptive planes or to the local processing node.
Design and Technology Assumptions. Because pin limitations are a concern for routers, our designs throughout use data channels with 16 bits of data and 7 additional control signals. This produces a router with well below 250 data pins, well within the range feasible for a pin grid array or more advanced packaging technology. Our designs are based on a 0.8 micron CMOS gate array library from Mitsubishi Electric Corporation. All timing estimates are based on conservative routing estimates, nominal processing, and nominal operating temperature.
Crossbar Switch (CB) connects input channels to output channels.
Virtual Channel controller (VC) multiplexes virtual channels onto the physical channels.
Overall Design

Because the router interfaces are asynchronous, the VC also synchronizes the flow control signal. The VC and the IFC manage intranode data flow cooperatively to minimize internal delays.
The router is synchronous internally, but asynchronous externally. Internal synchrony makes internal coordination, particularly fair arbitration and selection for virtual channels, inexpensive. The interface between routers is asynchronous because of the difficulty of distributing a high speed, low skew clock to a large system. We assume that all routers operate with a clock of identical frequency but with differing and even slightly variable skew. Differences in clock phase between nodes are handled by synchronizers.
Path setup and data through are the internal router operations which affect router setup latency and achievable clock rate in the network router, respectively. A path setup operation involves the following steps: the AD generates requests based on the header, the RD assigns a path and sets up the switch, and data flows through the switch to the output (VC if appropriate). The data through operation determines the achievable channel bandwidth because it defines the rate at which flits can move through the router. A data through operation consists of the following steps: the IFC sends data forward, the data moves through the CB, and the data is accepted for transmission at the VC. To ensure correct operation, our planar-adaptive wormhole router must manage the following tasks: flow control, routing, and virtual channel multiplexing. In the following sections, we discuss how each of these tasks is accomplished.
Flow Control
Wormhole routers can have extremely low buffer requirements. However, one consequence of wormhole routing is that flow control must be rapid and must be performed on small units of data to prevent buffer overflow. A competing goal is to maximize the channel bandwidth usable by a single packet. To achieve these goals, our design fully pipelines flow control operations, allowing a single packet to use the full bandwidth of a physical channel.
Between nodes, both the data and flow control signals are asynchronous, so the synchronization time increases the effective delay in both directions. The XFC synchronizes the incoming data, and the VC synchronizes backward flow control signals using a synchronizer based on a Muller C element [26] to sample the input signal.
Pipelined flow control, synchronization penalty, channel delay, and clock skew all increase the buffer requirements of wormhole routing. Synchronous pipelined flow control with unit delay channels requires two flit buffers, and the synchronization delay increases the buffer requirement to four flits. Thus, the minimum configuration for unit delay channels is four flit buffers per channel. Adding one more flit buffer dramatically simplifies buffer control, so our designs all include five flit buffers.
Routing Decision
Routing decisions in an adaptive router are based on a packet's destination address and the current state of the router. If several messages arrive simultaneously, several connections may be set up in the same cycle. For each channel, the AD decodes the message header and generates requests for the permissible paths; this is trivial for a planar-adaptive router. The RD arbitrates between simultaneous requests and enforces resource constraints (no more than one packet connected to each output). Our RD design uses the straight-first selection policy [21] when there is no contention, giving the packets going straight priority over those turning. Fair arbitration is used to prevent starvation when a packet has already been forced to wait.
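The selection policy above can be illustrated with a small sketch. The text does not specify the fair arbitration scheme, so the sketch assumes an oldest-first rule among packets that have already waited; that rule, the request tuple layout, and the name `grant_output` are all hypothetical.

```python
def grant_output(output_dim, requests):
    """Pick which packet is connected to one output this cycle.

    requests: list of (packet_id, input_dim, cycles_waited) tuples.
    Assumption: fairness is modelled as oldest-first among packets that
    have already been forced to wait; the source only says arbitration is fair.
    """
    if not requests:
        return None
    waiting = [r for r in requests if r[2] > 0]
    if waiting:
        # A packet that has already waited takes precedence (assumed rule).
        return max(waiting, key=lambda r: r[2])[0]
    # Otherwise straight-first: a packet continuing in the same dimension
    # has priority over packets that are turning.
    straight = [r for r in requests if r[1] == output_dim]
    return (straight or requests)[0][0]
```

At most one packet is granted per output, which reflects the resource constraint that no more than one packet is connected to each output.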
Switching
A router must switch the packets, conveying data from appropriate inputs to outputs. This is done by the crossbar (CB), which forms not only the forward path for data but also the reverse path for flow control signalling.
Multiplexing Physical Channels
Virtual channels share physical channels, so the virtual channel controller (VC) multiplexes virtual channel traffic onto the physical channel. A good VC design utilizes the channel efficiently and prevents starvation for any virtual channel. However, to achieve these goals, a VC must coordinate the movement of data through the crossbar and the scheduling of virtual channels onto the physical channel. Figure 4 shows three virtual channels sharing a single physical channel.
The VC and IFCs cooperate to move data through the crossbar. When the channel is about to become idle, the VC requests data from all virtual channels which have not been halted by flow control (there are empty downstream buffers). Subsequently, the VC sequences each virtual channel's data over the physical channel. To achieve this collection and sequencing without losing cycles, the VC needs IFC buffer status and flow control information. The ready signals from the IFCs indicate the buffer states, allowing the VC to schedule only those virtual channels with data to send. The empty signals allow the VC to request additional data only from virtual channels which are not blocked downstream. In Figure 5, the inputs to the VC come from the left, and the physical channel is to the right. Our VC uses two levels of arbitration in a collection phase followed by a delivery phase. First, arbitration at the collection phase (when data is accepted from all of the virtual channels into the staging buffers) decides which virtual channel sends first. Second, arbitration at the delivery phase sequences the data from the staging buffers (losers in the first round arbitration) over the channel. Fixed priority arbitration in both cases simplifies the VC. Starvation is prevented by assuring that all collected flits are transmitted before another collection phase.
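The collection and delivery phases can be modelled behaviourally as below. This is a simplified sketch rather than the gate level design: it assumes a lower virtual channel index means higher fixed priority, holds the downstream empty signals constant for the whole run, and ignores cycle timing. The name `vc_transmit_order` is hypothetical.

```python
from collections import deque

def vc_transmit_order(queues, downstream_empty):
    """Return the order in which flits reach the physical channel.

    queues: one list of flits per virtual channel (index = fixed priority).
    downstream_empty: one bool per virtual channel; False means the channel
    is halted by flow control. Assumption: these signals do not change.
    """
    pending = [deque(q) for q in queues]
    sent = []
    while True:
        # Collection phase: accept at most one flit from every virtual
        # channel that has data ready and is not blocked downstream.
        staging = [(vc, pending[vc].popleft())
                   for vc in range(len(pending))
                   if pending[vc] and downstream_empty[vc]]
        if not staging:
            break
        # Delivery phase: drain the staging buffers in fixed priority order.
        # All collected flits go out before the next collection phase, so a
        # low priority virtual channel cannot be starved.
        sent.extend(flit for _, flit in staging)
    return sent
```

With three active virtual channels the flits interleave one per channel per round, even though the arbitration itself uses fixed priority.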
Performance of the Base Router
In this section, we discuss the performance metrics defined in Section 3 and how to relate them to the internal delay measures for our base router design. Router setup latency includes both internode and intranode delay, each of which can be broken down into component delays, and we estimate both contributions. The achievable clock rate depends on the clocking scheme and the data through delay, which characterizes the basic rate at which flits can move internally. We first discuss internode latency, which affects router setup latency, and then intranode latency, which affects both router setup latency and achievable clock rate.
Internode Delay. Internode delay contributes to router setup latency. Internode delay includes the time to get off chip, across the wires, and onto the destination chip (buffer, propagation, input latch, synchronizer, and synchronization delays). For standard output buffers and input latches, the nominal performance of the gate array library gives the delays shown in Figure 6. The output buffer delay includes line charging time, characterized by the loading. Our analysis makes no attempt to account for long channel delays. The synchronizer delay is due to gate delay in the synchronizer. This is in addition to the synchronization delay, which depends on the clock skew. Based on these numbers, we can estimate the best case and worst case skew.
Based on the fixed components of internode delay, if skew is less than $T - 4.9$ ns along the forward path (where $T$ is one half of a clock period and is greater than 4.9 ns), this is the best case, and the channel crossing takes only one cycle. If the skew is greater than $T - 4.9$ ns but less than $2T - 4.9$ ns, the crossing will take two cycles. If $T$ is less than 4.9 ns, the channel crossing may take two or more cycles. Even for clock rates of several hundred megahertz, $2T - 4.9$ ns is an achievable skew for a large scale system.
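The skew thresholds above amount to a small case analysis, sketched here. The function name `crossing_cycles` and the choice to return `None` for the cases the text leaves open are assumptions made for illustration; the 4.9 ns figure is the fixed internode delay quoted above.

```python
def crossing_cycles(skew_ns, half_period_ns, fixed_delay_ns=4.9):
    """Cycles needed for a channel crossing, following the thresholds above.

    half_period_ns is T, one half of a clock period.
    Returns None where the text gives no definite cycle count
    (T not greater than the fixed delay, or skew beyond 2T - fixed delay).
    """
    if half_period_ns <= fixed_delay_ns:
        return None  # "two or more cycles": not pinned down by the text
    if skew_ns < half_period_ns - fixed_delay_ns:
        return 1     # best case: crossing fits in one cycle
    if skew_ns < 2 * half_period_ns - fixed_delay_ns:
        return 2
    return None

# Example with T = 5.7 ns: the one-cycle skew margin is about 0.8 ns
# and the two-cycle margin is about 6.5 ns.
one_cycle = crossing_cycles(0.5, 5.7)
two_cycles = crossing_cycles(3.0, 5.7)
```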
Intranode Delay. There are two important types of intranode delay: path setup and data through delay, which contribute to router setup latency and achievable clock rate respectively. While the internode delay depends primarily on topology and packaging, the intranode delay depends strongly on router features. The data through delay determines the flow control rate, thereby affecting the maximum achievable clock rate. In this section, we characterize the intranode delay for the base router, using these delays as a point of reference for the remainder of the paper.
Parts (a) and (b) of Figure 7 show the critical path and timing of a planar-adaptive router at path setup. Figure 7 breaks the overall delay down into constituent module delays. The majority of the setup delay is in the AD, which latches the header from the XFC and then generates route requests. Most of the delay in the AD is due to the data latch, L. The RD arbitrates the request signals, generates crossbar control signals, and tells the IFC which path was chosen. With knowledge of which path will be taken, the IFC selects the appropriate new header (all possible updated headers are waiting). Simultaneously, the CB is set up, and a data ready signal from the IFC passes through the CB, arriving at the VC. The crossbar setup and data ready signal operations are not on the critical path; their delay is masked by the larger header selection time. The updated header flit passes through the CB, arriving at the VC where it is latched at ph1.
Figure 8 shows the critical path and components of delay of a planar-adaptive router at data through.⁵ The data from the XFC is latched inside the IFC (in L), and then sent through the CB to the VC. In the VC, the data must wait for the arbitration amongst the virtual channels, even if no others are trying to send at this time. The arbiter's output controls selector S. After passing through the selector, the data is latched in L in the VC by ph1. It can now be transmitted to the next node.
⁵ In our simplified reporting of performance figures, the VC appears to have a different delay in the two situations: path setup and data through. This is because the overlap of operations is slightly different in each case.
Discussion
Determining the router setup latency is fairly straightforward, but determining achievable clock rate based on data through delay involves consideration of internode and intranode delays as well as clock skew margins. In our design, a flit crosses the channel and the router in a single clock period (channel and synchronization delay is a half period from ph1 to ph2, and router delay is the other half period from ph2 to ph1). If we assume a two-phase clock with equal length phases, then whichever delay is larger, the internode delay or the intranode delay, determines the network clock rate. A more thorough description of the assumptions for determining achievable clock rate is given in Section 7.
For our baseline router, the data through delay of 5.7 ns dominates likely internode delays and skew margins, and thus a clock period of 2 × 5.7 = 11.4 ns is achievable. Such a clock period would allow a generous clock skew margin of 0.8 ns. Channel delays of two cycles would allow an easily achievable skew margin of 6.5 ns.
While our adaptive router can sustain high speed and has low latency, it is slower than our design of a deterministic router. Based on the same assumptions, a dimension-order router (DOR) design has delays of 5.68 ns and 3.0 ns at path setup and data through, respectively. The major reasons for the speed difference are the lack of serialization between routing decision and header selection (routing choices and header updates are fixed) and the absence of virtual channel controllers on the critical path. In terms of intranode speed, the DOR is nearly twice as fast as our baseline adaptive router. However, since the intranode delay at data through in the DOR (3.0 ns) is less than the internode delay (4.9 ns), the router clock rate is dominated by internode delay, limiting the achievable clock period for the DOR to 9.8 ns, only 16% faster than the planar-adaptive router.
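The clock period arithmetic in this discussion can be checked with a short calculation. It is a sketch under the stated two-phase assumption (each half period must cover the larger of the internode and intranode delays); the function name `clock_period_ns` is hypothetical, and the delay values are the ones quoted in the text.

```python
INTERNODE_NS = 4.9  # fixed internode delay quoted in the text

def clock_period_ns(data_through_ns, internode_ns=INTERNODE_NS):
    """Two equal phases: each half period covers the larger delay."""
    return 2 * max(data_through_ns, internode_ns)

par_period = clock_period_ns(5.7)   # planar-adaptive router: about 11.4 ns
dor_period = clock_period_ns(3.0)   # dimension-order router: about 9.8 ns

# Skew margin with a one-cycle channel crossing: half period minus fixed delay.
par_skew_margin = par_period / 2 - INTERNODE_NS   # about 0.8 ns
# Skew margin if the channel crossing is allowed two cycles.
par_skew_margin_two = par_period - INTERNODE_NS   # about 6.5 ns

# Relative clock rate advantage of the deterministic router.
dor_advantage = par_period / dor_period - 1       # about 0.16
```

The deterministic router's intranode advantage (3.0 ns against 5.7 ns) is largely hidden because its clock is bound by the internode delay instead.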
The Cost of Adaptivity
In this section, we characterize the cost of adaptivity by examining router designs with a range of adaptivity. These routers are all taken from the class of f-flat adaptive routers [9]. The f-flat adaptive routing framework can be used with any deadlock-free adaptive routing algorithm within each f-flat; we assume a Linder-Harden router [25] with fully adaptive minimal routing.
Increasing routing freedom not only increases the complexity of individual router modules, many more modules are needed. The hardware module requirements are generalized to f-flat routers below:
To give the reader an idea of how the resource requirements (crossbars and virtual channels) increase, consider a 3-flat adaptive router. Each 3-flat (or cube) corresponds to a plane in planar-adaptive routing and is divided into four virtual subnetworks (x+, y+, z), (x+, y-, z), (x-, y+, z) and (x-, y-, z) for 3-flat deadlock-free routing. This requires two virtual channels in the first two dimensions and four virtual channels in the third dimension. If a series of 3-flats are composed for a higher-dimensional network, eight virtual channels per physical channel are needed.
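The virtual subnetworks of the 3-flat example can be enumerated mechanically. The sketch below fixes a direction in every dimension of the flat except the last, which reproduces the four subnetworks listed above and the two virtual planes of the planar case; extending the same pattern to larger f is an assumption here, and the name `virtual_subnetworks` is hypothetical.

```python
from itertools import product

def virtual_subnetworks(dims):
    """List the virtual subnetworks of one f-flat.

    dims: dimension names of the flat, e.g. ("x", "y", "z").
    Assumption: one subnetwork per sign choice in all but the last
    dimension, i.e. 2 ** (f - 1) subnetworks.
    """
    *signed, last = dims
    return [tuple(d + s for d, s in zip(signed, signs)) + (last,)
            for signs in product("+-", repeat=len(signed))]

cube = virtual_subnetworks(("x", "y", "z"))
plane = virtual_subnetworks(("x", "y"))
```

Under this assumption the subnetwork count doubles with each added degree of routing freedom, which is the growth in virtual channels that the delay estimates in this section attribute most of the cost to.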
As routing freedom is increased, not only does the number of router modules required increase, some of the router modules become more complex, thereby becoming slower. We consider each router module in turn. IFC and AD delays do not change because their designs require only modest changes for higher degrees of routing freedom.
The delay in the RD with f-flat adaptivity may be estimated as follows:
$T_{gate}$ denotes the basic gate delay. The basic structure of the RD consists of f + 2 connection controllers, whose inputs feed into (f + 2)-input priority encoders. Each controller controls the i-th priority signal, and the outputs of the priority controller are used to determine the cnct signals.⁶ The term on the right comes from the lowest priority connection controller, and is proportional to f because the controllers are daisy-chained together. The term on the right arises from the combining logic which grows in proportion to the number of cnct signals.
The CB delay increases because even with partitioned crossbars, f + 2 ports are required per CB. The CB delay is described below:
The VC delay for f-flat routers increases because the number of virtual channels to be multiplexed on each physical channel increases, requiring larger (deeper) arbitration circuits and selectors. $T_{VC(3)}$ denotes the basic delay of a virtual channel controller for three virtual channels.
In the VC, there are two arbiters. The delay of each arbiter increases in discrete jumps with the number of virtual lanes in a VC. This increase is represented by the last term in the equation. For path setup, the overlap of the IFC delay causes the last term to be irrelevant when f is smaller than four. Combining these terms gives overall formulas for router delay with f-at adaptivity:
At path setup:
⁶ The cnct signals are used to grant output port requests from the ADs.
Figure 9: Delay and delay ratio at path setup and at data through, by number of f-flats.
Based on our cost model we can estimate the speed of routers with a range of adaptivity (see Figure 9). From these estimates, it is clear that router delay increases significantly with adaptivity, though much more slowly than anticipated. In general, each f-flat can be partitioned, producing much smaller crossbars. For example, the CB in the router with 3-flat is only 5x5. However, the number of virtual channels required for deadlock prevention is primarily responsible for the large increase in delay. The 3-flat adaptive router requires eight virtual channels just to prevent deadlock. The increased crossbar sizes and large numbers of virtual channels make routers with higher adaptivity much slower. The 3-flat router is 50% and the 4-flat is 190% slower than the planar-adaptive router at data through, with much of the additional delay coming from exponential increases in the number of virtual channels for deadlock prevention. It should be noted that these increases are much larger than the throughput benefits claimed by most adaptive routers.
In this section, we characterize the cost of virtual lanes by examining router designs with one to sixteen virtual lanes. Virtual lanes can increase channel utilization in a network by multiplexing the physical channels, allowing packets to pass one another [13, 23]. Though both virtual channels and virtual lanes use additional hardware buffers, virtual lanes require greater connectivity in the router as each virtual lane is interchangeable within its virtual channel class. Essentially, this means that the crossbars cannot be partitioned. In this section, we first consider the pros and cons of several proposed architectures for virtual lanes, then estimate the speed and cost of the most attractive architecture. In [13], Dally proposes three alternatives for implementing virtual lanes, which differ primarily in the size of the crossbar switch and how it is multiplexed (Figure 10). A 2-input, 2-output CB is used for illustrative purposes. Adding virtual lanes to a planar-adaptive router would require beginning with a 4-input, 4-output CB. The basic options for adding m virtual lanes to a basic $p \times p$ crossbar are:
Architectural Alternatives for Virtual Lanes
A. An m p x m p crossbar switch. B. A fully multiplexed p x p crossbar switch. C. A partially multiplexed m p x p crossbar switch.
Option A uses no multiplexing, adding crossbar ports for each virtual lane. Because there is no multiplexing, this approach requires the simplest control and arbitration. Option B shares the switch amongst the virtual lanes. Option C represents a compromise, providing inputs for each virtual lane, but sharing the CB outputs. The major distinguishing characteristics of these options are internal blocking and switching time for flits.
Figure 11: An internal data conflict using Option B. At time t3, one of the flits on the lower port is blocked because the CB input is busy.
Internal Blocking
Because the physical channels are often a critical limiting resource, router designs minimize internal blocking. Both options A and C are internally nonblocking. Option B, on the other hand, does suffer internal blocking (see Figure 11 for an example). Blocking in option B arises from the interaction of flow control and crossbar multiplexing, and can cause performance losses. Because of the importance of minimizing internal blocking, we rule out option B.
Switching Speed
How the switch is multiplexed affects the critical path length for data through and thus the achievable router speed. Option B has been eliminated on the basis of blocking, so we consider options A and C. As shown in Figure 12, in option A each data transmission requires only one pass through the switch, while option C requires several passes. In option A, the crossbar configuration is fixed, and the fixed connections operate identically to the base router. In option C, because the outputs are shared, the switch settings for each cycle are determined by which virtual channels will use the physical channels in that cycle. Thus, the VC and IFCs must collaborate to control the switch based on data status information from the IFCs as well as the empty signals from downstream nodes. This approach requires three passes through the switch: the first to get the data status information to the VCs, the second to set up the switch and send the enable signals to the chosen IFCs, and finally one for the data to pass through the crossbar.
In addition, option C requires extra arbiters, as shown in Figure 13, to manage the switch multiplexing: one for each switch output, managing the virtual lanes in a single virtual channel class. These additional arbitration steps are more expensive for larger numbers of virtual lanes and increase not only path setup time, but also data through delay (switching speed), directly reducing network throughput.
Because latency and bandwidth are first priorities, option A is the most attractive. Gate count is not a major constraint for most router designs, and for networks of modest dimension with modest numbers of virtual lanes, the required crossbar switches are feasible. For example, going from one virtual lane (base router) to two virtual lanes produces a router design with 8x8 crossbar switches and 6-input virtual channel controllers (see Figure 14). The alternatives, options B and C, may be attractive for routers with large numbers of virtual lanes.
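As a rough illustration of how the three options scale, the following sketch computes the crossbar dimensions for m virtual lanes on a p x p base crossbar. The function names and the use of crosspoint count as a crude size proxy are assumptions made for illustration only; they are not part of the router designs described here. The base size of 4x4 is the planar-adaptive CB mentioned above.

```python
def crossbar_shape(option, p, m):
    """Return (inputs, outputs) of the crossbar for m virtual lanes.

    option: "A" (no multiplexing), "B" (fully multiplexed),
            "C" (partially multiplexed: per-lane inputs, shared outputs).
    p:      ports of the base crossbar (p x p).
    m:      number of virtual lanes.
    """
    if p < 1 or m < 1:
        raise ValueError("p and m must be positive")
    if option == "A":
        return (m * p, m * p)
    if option == "B":
        return (p, p)
    if option == "C":
        return (m * p, p)
    raise ValueError(f"unknown option: {option!r}")


def crosspoints(shape):
    """Crosspoint count, used here only as a hypothetical size proxy."""
    inputs, outputs = shape
    return inputs * outputs


P_BASE = 4  # planar-adaptive base router: 4-input, 4-output CB

for lanes in (1, 2, 4):
    for opt in ("A", "B", "C"):
        shape = crossbar_shape(opt, P_BASE, lanes)
        print(f"m={lanes} option {opt}: {shape[0]}x{shape[1]} "
              f"({crosspoints(shape)} crosspoints)")
```

For two virtual lanes, option A gives the 8x8 crossbar quoted in the text, while options B and C keep the output side at four ports at the price of multiplexing.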
Performance of Routers with Virtual Lanes
In this section, we characterize the speed of routers supporting virtual lanes with architectural alternative option A. Overall, adding virtual lanes using option A requires minor modifications to the RD, CB and VC. First, the RD must connect messages to virtual lanes, not just physical channels. Second, adding one virtual lane requires VCs that can support six virtual channels (the former three virtual channels multiplied by two virtual lanes). Finally, the crossbar size is increased. A modified base router with two virtual lanes is shown in Figure 14. Based on this architecture and our designs, the speed of a router with m virtual lanes can be estimated as follows. The RD delay at path setup increases linearly in m, the number of virtual lanes, because there are m times more inputs to the crossbar switch. The CB and VC delays increase slowly with the number of virtual lanes due to the increasing depth of the switching and arbitration circuits. The second term in the VC delay for path setup is a factor of two smaller than that for data through because of the different overlap at path setup (see Section 5). These delays are summarized for a range of virtual lanes from one to sixteen in Figure 15. Clearly, adding virtual lanes significantly increases router setup and data through latency. For example, if data through latency determines throughput, going from one virtual lane to two virtual lanes requires more than a 30% throughput improvement to be worthwhile.
Overall Cost Summary
Our studies show clearly that adaptive routing and virtual lanes can have a significant impact on router delays. To relate these results to existing channel utilization studies for routers, we translate the delays into achievable clock rates, allowing us to estimate how much better the channel utilization numbers would have to be to justify each type of router feature.
We convert the router delays to achievable network clock rates based on four assumptions: two phase clocking, equal length phases (approximately 50% duty cycle clocks), path setup in one and a half clock cycles, and the achievable clock rate determined by the delay at data through. The first two conditions match the intranode delay (at data through) and internode delay. The third condition assumes we can achieve a router setup latency of two clock periods. The final condition assumes a simple router architecture with identical flit and phit (physical transfer unit) size.
Figures 16 and 17 summarize the intranode delay, achievable clock rate, delay per hop and required channel utilization improvement rate of routers with f-flat adaptivity and m virtual lanes, respectively. The required channel utilization figures show the performance increase required to justify addition of the feature.
Using the collected information, one can compare the cost of adaptivity and virtual lanes. Previous simulation studies show that network channel utilization benefits from a mix of the two features [22]. The figures show that adaptivity based on the Linder-Harden algorithm is more expensive than virtual lanes, and higher degrees of adaptivity based on this approach are probably not feasible. A 3-flat adaptive router incurs an increase in delay as large as that of a router with three virtual lanes. The 4-flat router's delay is nearly as large as that of a router with seven virtual lanes (footnote 7). One conclusion is that routers with modest adaptivity and larger numbers of virtual lanes are most attractive. Further, higher degrees of adaptivity or large numbers of virtual lanes are probably not viable, as their cost-effectiveness depends on four-fold increases in channel utilization.
Footnote 7: The cost of a 2-D router is nearly equal to that of higher-D routers with the same adaptivity and number of virtual lanes in terms of latency, because higher-D routers are built from 2-D routers by combining them in parallel.
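The break-even reasoning behind the required channel utilization figures can be sketched as follows. The sketch assumes that delivered throughput is proportional to channel utilization times clock rate, and that clock rate is inversely proportional to the data through delay; under that assumption, a feature that multiplies the delay by some ratio must multiply utilization by the same ratio just to break even. The function name is hypothetical, and the delay ratios are the slowdowns quoted in the text (50% and 190% for 3-flat and 4-flat adaptivity, roughly 30% for a second virtual lane).

```python
def required_utilization_gain(delay_ratio):
    """Fractional channel-utilization gain needed to offset a slowdown.

    delay_ratio: data-through delay of the enhanced router divided by
                 that of the baseline router (1.0 means no slowdown).
    Assumes throughput ~ utilization / delay (an illustrative model).
    """
    if delay_ratio <= 0:
        raise ValueError("delay_ratio must be positive")
    return delay_ratio - 1.0


# Delay ratios relative to the planar-adaptive baseline, from the text.
examples = [
    ("3-flat adaptivity", 1.5),
    ("4-flat adaptivity", 2.9),
    ("two virtual lanes", 1.3),
]

for name, ratio in examples:
    gain = required_utilization_gain(ratio)
    print(f"{name}: needs at least {gain:.0%} more channel utilization")
```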
Related Work
In this section, we survey the related router design studies. Because we know of no published studies on adaptive router implementation, all of the work surveyed here involves deterministic routers. While differences in router functionality and implementation technology make comparisons difficult, the comparison shows that our router design is competitive.
Caltech Routing Chips
The Torus routing chip (TRC) is a dimension-order router for k-ary n-cube networks [17] which implemented wormhole routing and used virtual channels to prevent deadlock. Channel throughput was 8 MB/s with byte-wide self-timed communication channels (8 MHz). This was about an order of magnitude better than contemporary communication networks.
The J-Machine Router
The J-Machine is a fine-grained concurrent computer developed at MIT [16, 29]. The J-Machine network is a three-dimensional mesh, with bidirectional 9-bit channels and dimension-order, wormhole routing. The J-Machine network uses two virtual channels to support two logically independent message priorities and a globally synchronous clock. The data throughput is 36 MB/s (32 MHz). The routing latency is 62.5 ns per hop.
Recent Router Designs
The latest Caltech EMRC routing chips are also dimension-order, wormhole routers [31]. These chips are self-timed and use byte-wide channels to achieve 166 MB/s. The typical path formation latency for the head of a packet is approximately 30 ns. The Intel Paragon router is descended from the original Caltech MRCs [11, 18]. The Paragon router is a deterministic router implemented in a similar technology (0.8 micron CMOS gate array), and its performance is comparable to that of our designs. Published figures for its delay and channel bandwidth are 40 nanoseconds and 200 megabytes/second respectively.
Summary and Future Work
In this paper, we have described the design of a planar-adaptive router, and used that design to analyze the cost of a basic adaptive router. Our router is internally synchronous and externally asynchronous. Based on a 0.8 micron gate array technology, we characterized the speed of our design. The intranode router delay is 10.3 ns and 5.7 ns for path setup and data through respectively, supporting a maximum signalling rate of 87 MHz or 174 MB/s per physical channel (sixteen-bit channels). Our design could be improved by using a single-phase clock and edge-triggered latches, potentially raising performance to 348 MB/s.
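The quoted signalling rate and bandwidth can be approximately reproduced from the data through delay. The sketch below assumes, as one reading of the two-phase, equal-phase clocking assumption, that each clock phase must be as long as the data through delay, so the clock period is twice that delay; a single-phase design is assumed to need only one delay per cycle. The function and parameter names are hypothetical.

```python
def clock_rate_mhz(data_through_ns, phases=2):
    """Achievable clock rate in MHz if each phase lasts data_through_ns."""
    if data_through_ns <= 0 or phases < 1:
        raise ValueError("delay must be positive and phases >= 1")
    period_ns = phases * data_through_ns
    return 1000.0 / period_ns  # 1 / ns = GHz, so scale to MHz


def channel_bandwidth_mb_s(rate_mhz, channel_bits=16):
    """Peak bandwidth of one physical channel in MB/s."""
    return rate_mhz * channel_bits / 8.0


DATA_THROUGH_NS = 5.7  # intranode delay at data through, from the text

two_phase = clock_rate_mhz(DATA_THROUGH_NS, phases=2)
single_phase = clock_rate_mhz(DATA_THROUGH_NS, phases=1)

print(f"two-phase:    {two_phase:.1f} MHz, "
      f"{channel_bandwidth_mb_s(two_phase):.1f} MB/s")
print(f"single-phase: {single_phase:.1f} MHz, "
      f"{channel_bandwidth_mb_s(single_phase):.1f} MB/s")
```

Under these assumptions the results land within about 1% of the quoted 87 MHz, 174 MB/s and 348 MB/s, which appear to have been rounded down.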
Using the planar-adaptive router design as a baseline, we explored the cost of adaptivity and virtual lanes. Our studies show that higher degrees of adaptivity can be extremely expensive. Justifying the increased cycle time due to adaptivity requires that it deliver huge increases in channel utilization. For example, in going from two-flat adaptivity to 3-flat adaptivity, the increased adaptivity must improve channel utilization by at least 50%. In going from two-flat adaptivity to four-flat adaptivity, the required improvement is extremely large, 190%. Simulation studies show that improvements in channel utilization due to adaptive routing are likely to be more modest.
Our studies show that virtual lanes are less expensive than adaptivity, but still quite expensive. To justify the increased cycle time, the first additional virtual lane has to provide at least a 30% increase in channel utilization. Justifying the second and third virtual lanes requires a further 30% increase in channel utilization for each. Published simulation studies show that such increases are possible for a modest number of virtual lanes, but performance increases sufficient to justify larger numbers of virtual lanes appear unlikely [13, 22].
By examining the implementation complexity of adaptive routing and virtual lanes, we seek to balance their cost and benefit. While much research has been published on the advantages of these features, we hope to provoke debate on their cost and real benefits. For example, some proponents of adaptive routing have claimed lower latency at low loads as a performance advantage. Our design studies show that increases in router complexity and intrarouter latency are likely to overwhelm such benefits. Both adaptive routing and virtual lanes can increase network throughput, but our design studies show that their complexity can produce compensating reductions in network bandwidth. Measuring the cost of network features allows us to weigh their benefit against their cost and make informed tradeoffs.
There are still many avenues open for future work in this area. Though our design presents a basic evaluation of the cost of adaptivity and virtual lanes, it is based on a single technology point and a particular class of router architectures. Other technology points and router architectures should be examined to see if they give qualitatively different results. This study also examined essentially one approach to routing; other studies of this type [10] will certainly explore the cost of alternative approaches to adaptive routing. The ultimate goal is to integrate the optimization of routing algorithm, network topology, routing freedom, and virtual lanes, allowing the major choices which affect network cost and performance to be related in a global perspective on network design.
