Considering continuous routing, we analyze the transient behavior of n × n routers with input buffering, split input buffering, output buffering, and central buffering with dedicated virtual circuits, one for each source-destination pair in a network. Assuming similar buffer space requirements, output buffering has the highest throughput. Split input buffering and central buffering have comparable performance; split input buffering slightly outperforms central buffering for large switches. Input buffering is known to saturate at packet generating rates above 0.586. By extending these models, we compare two 1024-node, unique-path multistage networks configured with (approximately modeled) input buffered STC104 32 × 32 switches or central buffered Telegraphos 4 × 4 switches (Telegraphos I version). Surprisingly, the network configured with smaller switches performs better. This is due to the higher peak bandwidth of the Telegraphos switch and saturation of the input buffered STC104 switch.
Introduction
Parallel and distributed systems consist of processors communicating through an interconnection network. The network is usually configured with communication switches (also called routers), responsible for reliable and efficient message transmission. Routers are usually compared to the ideal crossbar switch; this switch never delays a packet waiting for transmission over an idle link. Additional switch functionality, such as packet multicasting or combining, is sometimes provided.
We analyze n × n switches which may simultaneously send (and receive) one packet to (from) each of their n output (input) ports. We consider a) one FIFO buffer at each input port, b) FIFO buffers at each input port, partitioned per output port, c) one FIFO buffer at each output port, and d) a central data buffer. We assume fixed size packets, preventive flow control (no packet loss or retransmission), and similar switch startup time ($T_s$ clock cycles), spent in buffering and prerouting before packet transmission. A packet is transmitted only if there is free buffer space at the subsequent switch. If we were to adopt Jenq's model, packets headed to a full buffer could also be accepted if another packet is simultaneously transmitted out of this buffer. However, flow control propagation time becomes significant, especially for large, sparse networks, making this model inefficient. [4,5]
The n × n input data buffered switch (IDB) has n FIFO buffers. The packet to be transmitted through an output port is selected from all input buffers in round robin fashion. Any packet directed to a full buffer is stalled.
The n × n split input data buffered switch (SIDB) has n² FIFO buffers, one for each input-output port combination. The SIDB switch can preroute a packet to determine if it is directed to a full buffer. If all packets corresponding to a given output port are directed to a full buffer, traffic is temporarily stalled. Otherwise, the packet to be transmitted is selected using round robin.
The n × n output data buffered switch (ODB) implements FIFO buffering at the output ports. All packets arriving simultaneously can be wired to the appropriate output buffer(s) within $T_s$ clock cycles. The ODB switch also uses prerouting to determine whether packets are directed to a full buffer. All such packets are temporarily stalled.
The n × n central data buffered switch (CDB) provides one dedicated buffer for each source-destination pair in the network. Flow control is per virtual circuit (VC), where the VC header information uniquely identifies the packet's source and destination. Round robin is implemented at each output port for selecting the packet to be transmitted.
In continuous routing, a constant packet arrival rate is applied at each input port. Generated packets head to independently chosen random output ports. This rate is reduced (and may drop to zero) whenever buffers become full, thus taking into account the preventive flow control (back pressure) of the switches. We focus on the following performance criteria. Switch throughput $S$ is the average number of packets routed per clock cycle through an output port of the switch. Switch latency $L$ is the average packet delay in the switch, measured from packet header arrival to the departure time of the last packet bit.
Network throughput $S_{Net}$ is the average number of packets routed per clock cycle through an output link of a last stage switch. Network throughput is usually normalized by the maximum possible throughput, corresponding to the network bisection throughput.
Network latency $L_{Net}$ is the average delay, measured from packet header arrival to departure of the last packet bit from the network. It is computed by adding the corresponding switch latencies for all network stages.
In Section 2, we present performance models for all four n × n switches under continuous routing assumptions. First, we review the Markov analysis for the IDB switch. Then, we extend existing analyses for the SIDB and ODB switches, and develop a Markov model for the CDB switch.
In Section 3, assuming similar buffer space requirements, we compare switch throughput and latency measures for all four switches. While the IDB switch saturates for packet generating rates above 0.586, both the SIDB and CDB switches maintain slightly lower throughput than the ODB switch; SIDB slightly outperforms CDB for large switches. Furthermore, for large generating rates, the CDB and SIDB switches have smaller latencies than the ODB switch.
In Section 4, we extend our models to compare (approximately) two 1024-node multistage networks configured with STC104 32 × 32 and Telegraphos 4 × 4 switches. Surprisingly, the Telegraphos network performs better, due to its higher link bandwidth and saturation of the input buffers of the STC104 switch.
In Section 5, we provide conclusions and general remarks on our methods and switch approximations. In particular, we discuss the effect of cut-through packets and the design of a more accurate STC104 model.
Continuous Routing Models
Continuous routing forms a reasonable traffic model for parallel computers and communication networks. Each processor generates packets at a fixed rate, with each created packet assigned to an independently chosen random destination. Packets are routed in parallel to their final destinations. We adapt the same discrete-time model for switches, taking into account finite buffering and preventive flow control.
We assume that the buffer size is $k$ packets and buffers are initially empty. The packet generating rate at each input port is held constant ($0 < p \le 1$), and each generated packet heads to an independently chosen random output port. Since preventive flow control is used, the packet arrival rate is reduced appropriately (it may drop to zero) whenever the corresponding buffers become full.
Markov chains provide a probabilistic method for studying variations which occur in an ordered sequence. The model consists of a finite number of states and a collection of transition probabilities between states. By applying Markov chain theory to continuous routing, we can derive the probability that the queue length is $i$, $0 \le i \le k$, and compute the throughput and average latency for each switch. In the following analysis, the switch startup time is $T_s = 1$ clock cycle, and the packet size is one unit.
(Footnote: network latency may sometimes also include processor arbitration delay.)
Input Buffered Switch
The n × n IDB switch has been analyzed. [4,11,12] We include this standard analysis here for completeness; an extension is discussed in Section 4.
The packet arrival rate to an input buffer is $p$. This rate drops to zero when the buffer is full. Let $R$ be the probability that a packet at the front of a buffer can move forward, and let $\pi_i$ represent the probability that the queue length equals $i$. Fig. 1 (b) shows the Markov chain which models the queue length of any input buffer in Fig. 1 (a). The queue-length equations of this chain, Eq. (1), are stated for $1 < i < k - 1$.
We now compute the throughput at a given output port. The probability that an input buffer is nonempty is $1 - \pi_0$. Since traffic from each input port is equally distributed to all output ports, the probability that a packet from this buffer is directed towards a given output port is $(1 - \pi_0)/n$. Hence, the probability that no packet (from the $n$ input buffers) selects the given output port is $(1 - (1 - \pi_0)/n)^n$, and the throughput is
$$S = 1 - \left(1 - \frac{1 - \pi_0}{n}\right)^{n} \qquad (2)$$
The probability that a packet at the front of a buffer moves forward is equal to the number of packets transmitted out of the switch per clock cycle ($n S$) divided by the number of nonempty buffers, $n (1 - \pi_0)$. Hence,
$$R = \frac{n S}{n (1 - \pi_0)} = \frac{S}{1 - \pi_0} \qquad (3)$$
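The two relations just described in words can be evaluated directly. The following is a minimal Python sketch, assuming the throughput is one minus the probability that no input buffer selects the output port, and that R is the ratio n S / (n (1 - π_0)); the function names are illustrative and not part of the original analysis.

```python
def idb_throughput(pi0: float, n: int) -> float:
    """Throughput S at one output port of an n x n IDB switch.

    pi0 is the probability that an input buffer is empty. Each nonempty
    buffer (probability 1 - pi0) targets a given output with probability 1/n,
    so the output is idle with probability (1 - (1 - pi0)/n) ** n.
    """
    if not 0.0 <= pi0 <= 1.0:
        raise ValueError("pi0 must be a probability")
    return 1.0 - (1.0 - (1.0 - pi0) / n) ** n


def idb_forward_probability(pi0: float, n: int) -> float:
    """R = n*S / (n*(1 - pi0)): chance that a head-of-buffer packet advances."""
    if pi0 >= 1.0:
        raise ValueError("R is undefined when every buffer is empty")
    return idb_throughput(pi0, n) / (1.0 - pi0)


if __name__ == "__main__":
    # With buffers never empty (pi0 = 0) the throughput cannot grow further.
    for n in (4, 16):
        print(n, round(idb_throughput(0.0, n), 4))
```

With π_0 = 0 the sketch gives about 0.68 for n = 4 and 0.64 for n = 16, which agrees with the IDB saturation rates reported for these switch sizes in the comparison section.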
Let $C$ be the event that a packet can move forward. A packet can move forward only if a) it leads its buffer (event $D$), and b) this buffer is selected for transmitting a packet. Hence, $\mathrm{Prob}[C] = R \cdot \mathrm{Prob}[D]$.
The probability that the packet leads its buffer depends on the average queue length. Since that buffer contains at least one packet (the given packet), we obtain
The latency is L =
Split Input Buffered Switch
Our analysis of the n × n SIDB switch is a simple extension of the analysis of the 2 × 2 statically-allocated fully-connected switch previously considered. [4,6] Since the packet arrival rate at each input port is $p$, the packet arrival rate at each of the $n$ partitioned buffers is $P = p/n$. This rate $P$ drops to zero when the buffer is full. Let $R$ be the probability that a packet at the front of a buffer can move forward. Fig. 2 (b) represents the Markov chain which models the queue length of any buffer in Fig. 2 (a). Let $\pi_i$ represent the probability that the queue length equals $i$. The equations for the first two states (away from the full-buffer boundary) read
$$\pi_0 = (1 - P)\,\pi_0 + (1 - P)\,R\,\pi_1$$
$$\pi_1 = P\,\pi_0 + \bigl(P R + (1 - P)(1 - R)\bigr)\,\pi_1 + (1 - P)\,R\,\pi_2$$
The throughput is
The probability that a packet at the front of a buffer moves forward ($R$) is
Let $C$ be the event that a given packet can move forward, and $D$ the event that this packet leads its buffer. Notice that $\mathrm{Prob}[C] = R \cdot \mathrm{Prob}[D]$. The probability that the packet leads its buffer depends on the average queue length. Since this buffer contains at least one packet (the given packet), we obtain
Output Buffered Switch
Since the packet arrival rate at each input port is $p$, the probability $c_j$ ($j = 0, 1, \ldots, n$) that a particular output buffer receives $j$ new packets follows the binomial distribution $c_j = B(n, j, p/n)$, with $j$ successes in $n$ trials and success probability $p/n$ for a single trial. Fig. 3 (b) represents the Markov chain describing the queue length of any buffer of an n × n ODB switch with maximum buffer size $k = 5$, shown in Fig. 3 (a). Notice the transitions controlled by binomial coefficients, due to simultaneous arrivals of packets headed to the same output port. Let $\pi_i$ denote the probability that the queue length is $i$, $0 \le i \le k$. We also extend the notation by setting $\pi_l = 0$ for $l < 0$, and $c_l = 0$ for $l$ outside $0, \ldots, n$.
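As a small illustration of the arrival probabilities c_j, the sketch below evaluates the standard binomial probability mass function with n trials and success probability p/n. The function name is illustrative only; the Markov chain transitions themselves are not reproduced here.

```python
from math import comb


def arrival_distribution(n: int, p: float) -> list[float]:
    """Return [c_0, ..., c_n] for one output buffer of an n x n ODB switch.

    c_j is the probability that j of the n input ports send a packet to this
    output in one cycle, each doing so independently with probability p/n.
    """
    if n < 1 or not 0.0 <= p <= 1.0:
        raise ValueError("need n >= 1 and 0 <= p <= 1")
    q = p / n
    return [comb(n, j) * q**j * (1.0 - q) ** (n - j) for j in range(n + 1)]


if __name__ == "__main__":
    c = arrival_distribution(4, 1.0)
    print([round(x, 4) for x in c])
    # The mean number of arrivals per cycle equals the generating rate p.
    print(sum(j * cj for j, cj in enumerate(c)))
```

Because the mean of this distribution is n (p/n) = p, the average load offered to each output buffer equals the per-port generating rate.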
The throughput represents the probability that an output buffer is busy. Hence,
Let $C$ be the event that a given packet leads a buffer, and therefore can move forward. The probability of event $C$ depends on the average queue length. Since this buffer must contain at least one packet (the given packet), we obtain
The latency is L =
Central Buffered Switch
By using a flow model and two types of traffic, Katevenis et al. proved that the expected performance of a CDB switch is good. [19] In our model, incoming VCs are evenly distributed to output ports, with a fixed number of VCs per port (written $v$ here). We may view each VC as a buffer of size one. Only if this "buffer" is empty can a packet with this VC header be accepted. Since the packet arrival rate at each input port is $p$, the corresponding arrival rate for a given VC is $P = p/v$ if the VC "buffer" is empty; otherwise it is zero. Let $R$ denote the probability that a packet with a given VC can move forward. Fig. 4 (b) represents the Markov chain describing the queue length of any VC "buffer" of the switch shown in Fig. 4 (a). Let $\pi_i$ represent the probability that the queue length equals $i$. We have,
The throughput $S$ corresponds to at least one of the VC "buffers" of an output port being nonempty.
$$S = 1 - \pi_0 \qquad (13)$$
$R$ is related to the event that the correct VC "buffer" is selected. We obtain,
Performance Comparisons for Generic Switches
For a single IDB switch, the definition of $R$ as the probability that a packet at the front of a buffer moves forward hides dependencies introduced by head of the line (HOL) blocking. However, the analytical approximation for IDB is usually within a small range (1-2%) of simulation results. Notice that such HOL phenomena do not occur with the other three switches.
We now assume that all switches have the same buffer space and startup time, and process packets of equal (fixed) length. We compute throughput and latency for various packet arrival rates by solving systems of non-linear equations using an iterative (trial and error) approach; if $\pi_0$ is chosen in $[0, 1]$ with at least 4 digits of accuracy, then the error on $S$ and $L$ is always less than 0.5%. Because the involved functions are monotonic with respect to $\pi_0$, it is much faster to binary search the solution space for the best $\pi_0$ value.
We compare the throughput and latency for various packet generating rates and switch sizes. In Figure 5 (a) and (b), we consider 4 × 4 switches which can store 48 packets. Hence, for the SIDB switch, each of the 16 input buffers has size $k = 3$, while for the IDB and ODB switches each of the four buffers has size $k = 12$, and for the CDB switch there are $v = 12$ VCs per output port. Similarly, in Figure 6 (a) and (b), we consider 16 × 16 switches which can store 512 packets. Hence, for the SIDB switch, each of the 256 input buffers has size $k = 2$, while for the IDB and ODB switches each of the 16 buffers has size $k = 32$, and for the CDB switch there are $v = 32$ VCs per output port. From the figures, we observe that the 4 × 4 (and 16 × 16) IDB switch saturates at input traffic rates above 0.68 (respectively, 0.64). The saturation rate for IDB approaches 0.586 for large switches. For large input rates, the ODB switch maintains the highest throughput, while the SIDB and CDB switches have comparable throughput; SIDB seems to outperform CDB for large switches.
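The equal-buffer-space configurations quoted above follow from dividing one packet budget among the buffers of each organization. The sketch below reproduces that bookkeeping; the function name and the (number of buffers, packets per buffer) return layout are illustrative choices, and each CDB virtual circuit is treated as a one-packet buffer as in the model of Section 2.

```python
def buffer_allocation(n: int, total_packets: int) -> dict[str, tuple[int, int]]:
    """Split a total buffer budget across the four n x n switch organizations.

    Returns, per organization, (number of buffers, packets per buffer).
    """
    if total_packets % (n * n) != 0:
        raise ValueError("budget must divide evenly among n*n split buffers")
    per_port = total_packets // n  # packets available per input or output port
    return {
        "IDB": (n, per_port),           # one FIFO per input port
        "ODB": (n, per_port),           # one FIFO per output port
        "SIDB": (n * n, per_port // n), # one FIFO per input-output pair
        "CDB": (n * per_port, 1),       # per_port one-packet VCs per output port
    }


if __name__ == "__main__":
    print(buffer_allocation(4, 48))
    print(buffer_allocation(16, 512))
```

For the 4 × 4 switch with 48 packets this yields k = 12 for IDB and ODB, k = 3 for each of the 16 SIDB buffers, and 12 VCs per output port for CDB; for the 16 × 16 switch with 512 packets it yields k = 32, k = 2, and 32 VCs per output port.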
For large arrival rates, both SIDB and CDB offer smaller latencies than the ODB switch. However, in all four cases, the average latency does not consider packets stalled due to a full buffer. In reality, all such packets would be retransmitted, slightly increasing the average latency. [23] For a packet generating rate approaching one, the average latency increases by $(1 - S)/S$, expressed as a percentage; e.g., for the 4 × 4 switches the increase is 3%, 11%, 15% and 46% for the ODB, CDB, SIDB, and IDB switch, respectively.
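The (1 - S)/S correction is simple to evaluate, and inverting it shows which throughput each quoted percentage corresponds to. This is a minimal sketch with illustrative function names; the implied throughputs are derived here from the quoted percentages and are not values stated in the text.

```python
def latency_increase(s: float) -> float:
    """Relative latency increase (1 - S)/S caused by stalled packets."""
    if not 0.0 < s <= 1.0:
        raise ValueError("throughput S must lie in (0, 1]")
    return (1.0 - s) / s


def implied_throughput(increase: float) -> float:
    """Invert (1 - S)/S = increase, giving S = 1 / (1 + increase)."""
    if increase < 0.0:
        raise ValueError("increase must be non-negative")
    return 1.0 / (1.0 + increase)


if __name__ == "__main__":
    quoted = {"ODB": 0.03, "CDB": 0.11, "SIDB": 0.15, "IDB": 0.46}
    for name, inc in quoted.items():
        print(name, round(implied_throughput(inc), 3))
```

For the IDB switch, a 46% increase corresponds to a throughput of roughly 0.68, in line with the saturation rate observed for the 4 × 4 IDB switch.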
Case Study: Real Communication Routers
So far, we have assumed unit packet size and switch startup time. For real routers, the packet size is expressed in bits, the throughput in megabits per second, and the latency in seconds. Thus, real bandwidth and latency measures must be appropriately adjusted.
Assuming no congestion, the minimum packet latency through a real switch is $L_0 = T_s + K_1 b$ seconds, where $T_s$ is the switch startup time (the interval between header arrival and header departure from the switch), $K_1$ is the number of bits after the header (including any termination bits), and $b$ is the transmission delay for one bit along the outgoing link. The switch bandwidth takes into account only the number of data bits ($K_2$); we ignore the overhead from header, parity/CRC, and termination bits (these bits are used in computing latency). Thus, the actual bandwidth ($S_0$) in bits per second, and the average latency ($L_0$) in seconds, can be obtained by scaling the previously computed values of $S$ and $L$: the bandwidth is proportional to $K_2/(T_s + K_1 b)$ and the latency to $T_s + K_1 b$.
Next, we analyze two routers, the STC104 32 × 32 switch [16,21] and the Telegraphos I 4 × 4 switch. [17,18] The STC104 is used in building parallel systems by connecting together T9000 transputers. [20] The STC104 and T9000 use the Data-Strobe (DS) 100 Mb/sec link protocol (this protocol is now included in the IEEE standard P1355). Similarly, the Telegraphos switch connects DEC Alpha workstations into a parallel system, supporting distributed shared memory by communicating remote access and atomic instructions to the network over Alpha's I/O turbo-channel. Both switches support fixed size packets. Long messages are supported on the STC104 by sending a series of packets.
Design parameters for the two switches are examined in Sections 4.1 and 4.2. Using these parameters, the actual bandwidth $S_0$ and latency $L_0$ for each switch are estimated: (a) $S_0$ is derived by ignoring overhead from header, parity/CRC error checking bits, and termination bits, while (b) $L_0$ corresponds to the average packet delay when no congestion is present; this is measured from packet header arrival until departure of the last packet bit. Performance comparisons based on a large multistage network are presented in Section 4.3.
STC104 Routing Parameters
Let us define an STC104 byte as 10 bits, including 2 parity bits. Thus, an STC104 packet has a one byte (10 bit) or two byte header, a 32 byte body, and 4 termination bits. The switch startup time is $T_s = 500$ nsec. The number of bits after the header is $K_1 = s + 32 \times 10 + 4$, where $s = 0$ or $10$ for one byte or two byte headers, respectively. Finally, the useful data is $K_2 = 32 \times 8 = 256$ bits, and the link delay per bit is $b = 10$ nsec, assuming links operating at 100 Mb/sec. Thus, assuming one byte headers, the minimum packet latency is $L_0 = 500$ nsec $+ 324 \times 10$ nsec $= 3.74$ µsec.
The STC104 provides one input data buffer at each of its 32 input ports. Each such buffer can store only one packet (one VC in the STC104's terminology). Hence, we use the input data buffering model of Section 2.1 (with buffer size $k = 1$) to obtain throughput ($S$) and average latency ($L$) for various packet generating rates. Then, the actual bandwidth ($S_0$) and average latency ($L_0$) are computed from $S$ and $L$ using these parameters, Eq. (19). Notice that the link speed of 100 Mb/sec allows a peak, bi-directional bandwidth through the STC104 of 270 MB/sec.
Telegraphos Routing Parameters
Let us define a Telegraphos byte as 8 bits. A packet on the 4 × 4 Telegraphos switch consists of a header byte and eight body bytes. The switch startup delay (including header transmission) is $T_s = 160$ nsec (4 clock cycles). The number of bits after the header is $K_1 = 8 \times 8 = 64$ bits, and the number of useful data bits is $K_2 = 59$ bits (17 memory address bits, 32 data bits, 3 epoch counter bits, and 7 opcode bits). The link delay per bit is $b = 5$ nsec, since one byte is transmitted in one cycle (40 nsec). Hence, for the Telegraphos, the minimum packet latency is $L_0 = 160$ nsec $+ 8 \times 40$ nsec $= 480$ nsec.
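The parameter sets of the two switches can be checked with a few lines of arithmetic. The sketch below evaluates the minimum latency T_s + K_1 b and the ratio K_2/(T_s + K_1 b), which the paper identifies as the constant that absolute bandwidth is proportional to. The class and method names are illustrative, and reading that ratio as a peak bandwidth (model throughput S equal to one) is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchParams:
    name: str
    t_s_ns: float  # switch startup time T_s, in nanoseconds
    k1_bits: int   # bits after the header, K_1
    k2_bits: int   # useful data bits, K_2
    b_ns: float    # link delay per bit, b, in nanoseconds

    def min_latency_ns(self) -> float:
        """Minimum (uncongested) packet latency T_s + K_1 * b."""
        return self.t_s_ns + self.k1_bits * self.b_ns

    def peak_bandwidth_mbps(self) -> float:
        """K_2 / (T_s + K_1 * b); bits per nanosecond times 1000 gives Mb/sec."""
        return 1000.0 * self.k2_bits / self.min_latency_ns()


# One byte header assumed for the STC104 (s = 0, so K_1 = 324 bits).
STC104 = SwitchParams("STC104", 500.0, 32 * 10 + 4, 32 * 8, 10.0)
TELEGRAPHOS = SwitchParams("Telegraphos I", 160.0, 8 * 8, 59, 5.0)

if __name__ == "__main__":
    for sw in (STC104, TELEGRAPHOS):
        print(sw.name, sw.min_latency_ns(), round(sw.peak_bandwidth_mbps(), 1))
```

Under these assumptions the sketch gives 3740 nsec and about 68 Mb/sec for the STC104, against 480 nsec and about 123 Mb/sec for the Telegraphos, which is consistent with the higher peak bandwidth attributed to the Telegraphos switch.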
The Telegraphos switch operates as a central data buffer switch. In current early implementations (1994-1995), only 32 VCs per port are available because of limitations in the turbo-channel I/O ports. With current or new implementations (e.g. using DEC's PCI bus), 1024 VCs can be supported without significant changes in the packet format, by either limiting the existing CRC bits, or condensing the opcode field (7-8 local/remote instructions are supported) and utilizing another two bits in the packet format (currently padded with zeros). Next, we will assume that there
Switch models are not sufficient for selecting a particular buffer strategy. Instead, it is necessary to consider complete networks. In this Section, we consider larger switches with a unique-path multistage topology derived from the binary butterfly (or the shuffle-exchange) by grouping buddies of 2 × 2 switches. For example, Figure 7 (a) shows a decomposition of the 64-node binary butterfly into a three stage network configured with sixteen 4 × 4 switches per stage. Grouping of switches is based on the network "buddy" property that the four output links from two stage $j$ switches constitute the four inputs for two stage $j + 1$ switches; these four switches can be replaced by a more "powerful" 4 × 4 switch. An advantage of unique-path networks, besides their incremental network design, is their deadlock-free routing property [2]; essentially, one needs only allocate separate buffers (or VCs) for request and reply packets. Using the circuit switching model and similar switch performance characteristics, Gottlieb and Schwartz have shown that the newly obtained network has higher performance. [8] However, as we shall show later in this Section, the network configured with larger switches is not always better.
With a similar decomposition, the $N = 2^{10}$-node binary butterfly can be configured as a two stage network with 32 STC104 32 × 32 switches per stage, or a five stage network with 256 Telegraphos 4 × 4 switches per stage. The unique-path property of both network configurations, along with the symmetric traffic assumptions of the continuous routing model, implies that buffers assigned to different network paths are independent. However, buffers on a given network path are interdependent, since a packet can move forward only if the next buffer is not full.
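The stage and switch counts quoted for these configurations follow from the network size and the switch radix: an N-node unique-path network built from r × r switches has log_r N stages with N/r switches each. A minimal sketch, with an illustrative function name, is:

```python
def butterfly_layout(num_nodes: int, radix: int) -> tuple[int, int]:
    """Return (stages, switches per stage) for an N-node unique-path
    multistage network built from radix x radix switches.

    Requires num_nodes to be an exact power of radix.
    """
    if num_nodes < 2 or radix < 2:
        raise ValueError("need num_nodes >= 2 and radix >= 2")
    stages, reach = 0, 1
    while reach < num_nodes:
        reach *= radix  # each extra stage multiplies the reachable outputs
        stages += 1
    if reach != num_nodes:
        raise ValueError("num_nodes must be a power of radix")
    return stages, num_nodes // radix


if __name__ == "__main__":
    print(butterfly_layout(64, 4))     # the 64-node example of Figure 7 (a)
    print(butterfly_layout(1024, 32))  # STC104 configuration
    print(butterfly_layout(1024, 4))   # Telegraphos configuration
```

This reproduces the three configurations mentioned in the text: three stages of sixteen 4 × 4 switches for 64 nodes, two stages of 32 switches for the STC104 network, and five stages of 256 switches for the Telegraphos network.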
Hence, network throughput (the traffic rate at the output link of a last stage switch) and network latency (the sum of individual-stage latencies) are obtained by considering a queuing system with buffers connected in tandem. The number of buffers equals the number of network stages, e.g. five buffers for the Telegraphos network and two buffers for the STC104 network. Analytical models for networks configured with IDB and CDB switches correspond to systems of network equations; there is one such system for each network stage. Furthermore, for all stages but the last, the network equations are appropriately adjusted: a) the probability of a packet moving forward ($R$) is reduced by multiplying the right hand side of Eq. (3) or Eq. (14) by the probability of a nonfull buffer at the next stage, $(1 - \pi_k)$, and b) throughput rates at one stage are mapped to input traffic rates at the next stage.
We now consider network performance. For each buffer in Figure 7 (b) (corresponding to a network stage), we obtain a system of nonlinear equations. There are dependencies between different stages. Our solution method iterates until convergence of the queue length, throughput, and average latency measures is established at all stages; convergence, within a tolerance of 0.001, was always achieved after at most 5 iterations. In Figures 8 (a) and (b) we plot absolute network bandwidth and latency vs. packet generating rate; throughput and latency measures are adjusted, as in Sections 4.1 and 4.2, to reflect actual switch parameters.
From Figure 8 (a) we notice that the Telegraphos network bandwidth is higher, especially for large arrival rates (122 Mb/sec vs. 44 Mb/sec for the STC104 network). The main reasons are saturation of the STC104 input buffers and the higher peak bandwidth of the Telegraphos switch (compare the constants in Eq. (19) and Eq. (21)). Furthermore, the Telegraphos network does not show any signs of saturation (linear bandwidth); the main reason for this is the large number of VCs (1024) available at each Telegraphos switch.
Network latency is also smaller for the Telegraphos network. For large arrival rates, network latency is 15 µsec for the Telegraphos network vs. 25 µsec for the STC104 network. These delays refer to packets not succeeding in cut-through (both switches allow for virtual cut-through). For cut-through packets, the basic latency at each stage ($L_0$) is reduced by a portion of the switch startup delay: the time spent in buffering and prerouting before the header is output. Hence, the basic latency ($L_0$) becomes 100 nsec $+ 324 \times 10$ nsec $= 3.34$ µsec for the STC104 switch, or 40 nsec $+ 8 \times 40$ nsec $= 360$ nsec for the Telegraphos. Thus, successful cut-through in all network stages reduces latency (Figure 8 (b)) by 10% for STC104 switches, or 25% for Telegraphos switches. The overall latency, considering cut-through and non-cut-through packets, could be easily modeled using the short-circuit approach previously described. [15] Although the hardware-implemented cut-through condition depends on the exact timing of packets, we also expect that the overall latency would be affected only for low generating rates.
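The quoted cut-through savings can be reproduced from the per-stage latencies. The sketch below compares the basic latency T_s + K_1 b with and without the reduced startup delay; the function name is illustrative, and the reduced startup values (100 nsec and 40 nsec) are taken from the text.

```python
def cut_through_saving(t_s_ns: float, t_s_cut_ns: float,
                       k1_bits: int, b_ns: float) -> float:
    """Fractional reduction of the basic latency when a packet cuts through.

    t_s_ns is the full startup delay, t_s_cut_ns the startup delay that
    remains for a cut-through packet; the transmission term K_1 * b is
    unchanged in both cases.
    """
    full = t_s_ns + k1_bits * b_ns
    cut = t_s_cut_ns + k1_bits * b_ns
    return (full - cut) / full


if __name__ == "__main__":
    print(round(100 * cut_through_saving(500.0, 100.0, 324, 10.0), 1))  # STC104
    print(round(100 * cut_through_saving(160.0, 40.0, 64, 5.0), 1))     # Telegraphos
```

The STC104 figure comes out slightly above 10% (400 nsec saved out of 3740 nsec) and the Telegraphos figure at exactly 25% (120 nsec out of 480 nsec), matching the rounded percentages given above.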
Notice that the relative shape of the network bandwidth and latency curves remains unaffected when the constants $T_s$, $K_1$, $K_2$ and $b$ are changed. The absolute bandwidth is always proportional to $K_2/(T_s + K_1 b)$, while the average latency is proportional to $T_s + K_1 b$. A switch designer could therefore consider optimizing the bandwidth, or the bandwidth over latency ratio.
Conclusions and Final Remarks
We have examined analytical performance models for communication switches and networks under continuous traffic assumptions. We have considered input buffering, split input buffering, output buffering, and central buffering with virtual circuits for each source-destination pair in the network. Assuming similar buffer requirements, output buffering offers the highest throughput. Input buffering is known to saturate at generating rates above 0.586, while split input buffered and central buffered switches have comparable performance; split input buffering slightly outperforms central buffering for large switches.
We have extended these generic models to a 1024-node, unique-path multistage network configured with STC104 32 × 32 switches or Telegraphos I 4 × 4 switches. The STC104 is modeled as an input buffer switch, while the Telegraphos switch is modeled as a central buffer switch. Surprisingly, the Telegraphos network, although configured with smaller switches, performs better. This is attributed to the higher maximum link bandwidth of the Telegraphos switch and saturation of the input buffered STC104 switch. Continuous routing has been implemented on a logic (Verilog) simulator of a stand-alone 4 × 4 Telegraphos switch for verifying the results of our models. The results show no sign of saturation except for very high input rates. [10] The analytical results tend to show slightly better performance than simulation results; in reality, a queued packet contends for access to a given output port, while in analytical models, this packet contends for access to any port with the same probability.
We have not considered adaptive/randomized routing; this is possible only on the STC104, but quite unlikely to help with any unique-path (or dense) network. In this study, switch cost has not been an issue; the Telegraphos is a prototype switch and not a commercial router. However, central buffering is an expensive option for high performance interconnections; for symmetric systems with shortest-path routing, buffer space is grossly underutilized; e.g., 0.21% of all VCs are used in a 7-dimensional hypercube, and 6.25% are used in a 16 × 16 mesh. Split input or output buffer switches could achieve similar or better performance at a fraction of the buffer cost. For better buffer utilization, more complicated switches should be considered, e.g. the dynamically allocated fully connected (DAFC) switch. [6]
For very high speed link protocols, non-FIFO crossbar switches are attractive. A crossbar matches incoming packets to output links, avoiding bottlenecks associated with shared bus lines and centralized shared memory switches. The saturation rate of a non-FIFO crossbar asymptotically approaches 63% if packets are served on a totally random basis. A simple iterative approach, based on distributed random scheduling, achieves performance close to an ODB switch with just 3 or 4 iterations. [1]
Analytical modeling is a powerful technique for studying the performance of networks and computing systems. However, any real system is too complex to be modeled accurately. Certain features in our STC104 model could be improved. A better, but more complicated, STC104 model could consider wormhole routing (i.e. cut-through packets) and an input-output abstraction of the STC104's switching fabric; the output buffer size would also be one packet, and the model could use link speeds and internal chip delays for all STC104 input buffers, output buffers and the intermediate crossbar switch. [22] A small variation from our first-order, input buffering STC104 approximation is expected; e.g., refer to the analysis for large n × n input-output switches. [14]
An object-oriented simulation framework for large communication networks configured with the STC104 (DS links) and RCube (HS links) is currently being developed and calibrated. [7,13] Continuous routing can be used to evaluate buffered switches with retransmissions, by comparing cell loss probabilities ($\pi_k$). It would be very interesting to extend this model to hot-spot and bursty traffic [3], and to continuous multicasting traffic. [9]
