Abstract-Because of the unscheduled nature of arrivals to a packet switch, two or more packets may arrive on different inputs destined for the same output. The switch architecture may allow one of these packets to pass through to the output, but the others must be queued for later transmission. We study the performance of four different approaches for providing the queueing necessary to smooth fluctuations in packet arrivals to a high-performance packet switch. They are 1) input queueing, where a separate buffer is provided at each input to the switch; 2) input smoothing, where a frame of b packets is stored at each of the N input lines to the switch and simultaneously launched into a switch fabric of size Nb × Nb; 3) output queueing, where packets are queued in a separate first-in first-out (FIFO) buffer located at each output of the switch; and 4) completely shared buffering, where all queueing is done at the outputs and all buffers are completely shared among all the output lines. Input queues saturate at an offered load that depends on the service policy and the number of inputs N, but is approximately 0.586 with FIFO buffers when N is large. At the expense of an increase in the switch fabric size and latency, the lost packet rate for input smoothing can be made small by increasing the frame size b. Output queueing and completely shared buffering both achieve the optimal throughput-delay performance for any packet switch. However, compared to output queueing, completely shared buffering requires less buffer memory at the expense of an increase in switch fabric size.
I. INTRODUCTION
In the move toward high-performance packet switching for integrated service networks [1] and multiprocessor interconnects [2], attention is focusing on packet-switching architectures that provide many simultaneous input/output paths through the switch fabric and allow the internal paths to be time-multiplexed in a statistical rather than deterministic fashion. Such architectures provide the capability for high-speed transmission (1-200 Mbits/s) on each input/output with a total switch capacity of 1-200 Gbits/s. Because of the high-speed operation of the switch, the processing of packets is largely hardware-based, with packet headers containing address information that is used by the switch fabric to route packets from inputs to outputs on the switch. Depending on its design, the switch fabric may be blocking. That is, the switch fabric may be unable to provide simultaneous, independent paths between arbitrary pairs of inputs and outputs. However, even if the switch fabric is nonblocking, congestion in the switch will still arise because, unlike a circuit switch, arrivals to a packet switch are unscheduled: two or more packets may arrive simultaneously on different inputs destined for the same output. One of the packets contending for an output may be allowed to pass through the switch, but the others must be queued for later transmission on the output. This form of congestion is unavoidable in a packet switch, and dealing with it often represents the greatest source of complexity in the switch architecture.
Manuscript received November 2, 1987; revised June 11, 1988. This paper was presented at INFOCOM '88, New Orleans, LA, March 1988. This work was performed while M. G. Hluchyj was with AT&T Bell Laboratories.
M. G. Hluchyj is with Codex Corporation, Mansfield, MA 02048. M. J. Karol is with AT&T Bell Laboratories, Holmdel, NJ 07733. IEEE Log Number 8824400.
In this paper, we examine the performance of four different approaches for providing the queueing necessary to smooth the statistical fluctuations in packet arrivals to the switch. The switch fabric in all cases is assumed to be nonblocking and, as illustrated in Fig. 1, operates synchronously with fixed-length packets arriving on the N inputs in a time-slotted fashion. The four different approaches to packet queueing in the switch are described in Section II, and the performance of each is analyzed in Section III.
II. PACKET QUEUEING ARCHITECTURES
Fig. 2 illustrates four approaches to providing the queueing for a high-performance packet switch. In this section, we describe how each functions to smooth (in time) the packet arrivals destined to a common output.
A. Input Queueing
With input queueing, illustrated in Fig. 2(a), a separate buffer is placed on each input to the switch. Each arriving packet enters, at least momentarily, the buffer on its input where it awaits access to the switch fabric. Initially, we assume the buffers are served first-in first-out (FIFO), so that at the beginning of each time slot only the packets at the heads of the FIFO's contend for access to the switch outputs. If every packet is addressed to a different output, the nonblocking switch fabric allows each to pass through to its respective output. If k packets at the heads of the input FIFO's are addressed to a particular output, one is allowed to pass through the switch fabric, while the other k - 1 must wait until the next time slot, when a new selection is made among the packets that are then waiting. Note that while a packet is waiting its turn for access to an output, other packets may be queued behind it in the FIFO and, consequently, blocked from reaching possibly idle outputs on the switch. As we shall see in Section III-A, this results in a maximum throughput, for large N, of $2 - \sqrt{2} \approx 0.586$ for input queueing with FIFO buffers.
The throughput can be increased by relaxing the strict first-in first-out queueing discipline of the input buffers. Each input still sends, at most, one packet into the switch fabric per time slot, but not necessarily the first packet in its queue, and no more than one packet is allowed to pass through the switch fabric to each output in a time slot. For example, at the beginning of each time slot, suppose the first w packets in each input queue sequentially contend for access to the switch outputs. The packets at the heads of the input queues contend first for access to the switch outputs. Those inputs not selected to transmit the first packets in their input queues then contend with their second packets for access to any remaining idle outputs (i.e., outputs not yet assigned to receive packets in this time slot). The contention process is repeated up to w times at the beginning of each time slot, sequentially allowing the w packets in an input buffer's "window" to contend for any remaining idle outputs, until the input is selected to transmit a packet. A window size of w = 1 corresponds to input queueing with FIFO buffers.
Fig. 2. Four approaches to providing the queueing for a high-performance packet switch.
B. Input Smoothing
Fig. 2(b) illustrates an arrangement where the arriving packets are not so much queued at each input but smoothed; hence, the name input smoothing. Specifically, the packets within a frame of b time slots are stored at each of the N inputs (i.e., demultiplexed) and simultaneously launched into a switch fabric of size Nb × Nb. At most, Nb packets enter the fabric, of which b can be simultaneously received at each output where the packets are then multiplexed onto the output line. Any more than b packets destined for an output are dropped (i.e., lost) within the switch fabric. In Section III-B, we show that the probability of dropping a packet can be made small by making the frame size b large. This is analogous to fixed-length source coding in information theory where code words are only assigned to a subset of likely source sequences. By making the source sequence sufficiently long, the probability of generating a source sequence for which there is no assigned code word can be made arbitrarily small [3].
Note that although the switch fabric has been enlarged from N × N to Nb × Nb, the speed at which each input to the fabric operates can be reduced by a factor of b. The Starlite Digital Switch [4] uses demultiplexing¹ to reduce the required switch fabric speed relative to the incoming line speed. Its use as a means to smooth traffic arrivals does not seem to have been exploited in any proposed switch architecture.
¹With Starlite, fixed-length packets arrive to the switch multiplexed bit-by-bit (i.e., the first b bits of the frame correspond to the first bit of each of the b packets, the next b bits correspond to the second bit, and so on). This has the same smoothing effect as the "packet multiplexed" approach analyzed here. However, bit-by-bit multiplexing reduces the latency through the switch since only b bits, rather than an entire frame of b packets, have to be accumulated at the input before entering the switch fabric.
C. Output Queueing
With output queueing, shown in Fig. 2(c), all queueing is done at the outputs of the switch with a separate b-packet FIFO provided for each output. One can think of the switch fabric as operating N times as fast as the inputs and outputs, so that if k (k = 1, ..., N) packets arrive in a time slot on different inputs all addressed to the same output, all k can be routed through the switch fabric and into the proper output FIFO within one time slot. Only one packet, however, can be transmitted on the output line in a time slot; the remaining k - 1 packets must wait in the output FIFO for transmission during subsequent time slots.
Note that with output queueing, unlike input queueing, arriving packets addressed to one output do not interfere with (i.e., block or delay) packets going to different outputs. It is only at each output that one finds the unavoidable congestion caused by multiple packets simultaneously arriving on different inputs addressed to the same output. The waiting time performance for output queueing represents the best achievable by any approach.
It is possible to implement output queueing without the N times speed-up of the switch fabric. The Knockout Switch [5], having a fully interconnected switch fabric topology, uses an N-to-L concentrator at each output to reduce the number of buffers needed to receive simultaneously arriving packets. Packet loss is inevitable in any packet network; with L = 8, the probability of losing a packet in the concentrator is under $10^{-6}$ for an arbitrarily large switch size N. A novel buffering scheme, combining L separate FIFO's into the equivalent of a single FIFO with L inputs and one output, is then used to queue at the output. Hence, the Knockout Switch achieves output queueing without requiring a speed-up of the switch fabric.
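The concentrator figure quoted above can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the Bernoulli traffic model of Section III in its large-N (Poisson) limit, assumes that any packets beyond the first L addressed to one output in a time slot are dropped, and assumes an offered load of p = 0.9, which the text above does not state. The function name `concentrator_loss` and its parameters are hypothetical.

```python
import math

def concentrator_loss(num_kept, p, terms=200):
    """Fraction of packets dropped by an N-to-L concentrator as N grows large.

    Assumption: the number of packets addressed to one output in a slot is
    Poisson with mean p, and only `num_kept` (= L) of them are accepted.
    """
    lost = 0.0
    for k in range(num_kept + 1, num_kept + terms):
        # Poisson probability of k arrivals, evaluated in log space.
        pmf = math.exp(k * math.log(p) - p - math.lgamma(k + 1))
        lost += (k - num_kept) * pmf      # k - L packets are dropped
    return lost / p                       # normalise by the mean offered per slot

if __name__ == "__main__":
    for kept in (4, 6, 8, 10):
        print(kept, concentrator_loss(kept, 0.9))
```

Under these assumptions the loss for L = 8 falls below $10^{-6}$, and each additional concentrator output lowers it further.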
D. Completely Shared Buffering
The buffer architecture shown in Fig. 2(d) still provides for output queueing, but rather than have a separate buffer for each output, all memory is pooled into one completely shared buffer. If k (k = 1, 2, ..., N(b + 1)) packets are addressed to the same output, the switch fabric will route one to the output and the remaining k - 1 will be routed to k - 1 of the Nb inputs to the shared buffer. These k - 1 packets will wait until the beginning of the next time slot before reentering the switch fabric along with the other stored packets and any new arrivals on the inputs. The packets continue to recirculate through the switch fabric and shared buffer, with the output removing one packet from the group each time slot.
*We assume no input smoothing as described in Section II-B.
Effectively, a separate queue is formed for each output of the switch, but physically, all queued packets in the switch share the same buffer space. We shall see in the next section that this sharing allows one to reduce the total amount of buffering in the switch, but at the expense of an increase in the size of the switch fabric.
III. PERFORMANCE ANALYSIS
In this section, we analyze and compare the performance of the four queueing architectures described in the previous section. In each case, we determine the probability of packet loss and the expected packet waiting time in the switch. In all cases, we model the packet arrivals on the N inputs by independent and identical Bernoulli processes. That is, in any given time slot, the probability that a packet will arrive on a particular input is p; each packet has equal probability 1/N of being addressed to any given output, and successive packets are independent.
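As a concrete reading of this traffic model, the following sketch generates one time slot of arrivals and estimates the mean number of packets per slot addressed to a tagged output, which should be close to p. The helper names (`bernoulli_slot`, `tagged_output_mean`) and the choice of output 0 as the tagged output are illustrative assumptions, not part of the paper.

```python
import random

def bernoulli_slot(n_ports, p, rng):
    """Destinations of the packets arriving in one slot (None = no arrival)."""
    return [rng.randrange(n_ports) if rng.random() < p else None
            for _ in range(n_ports)]

def tagged_output_mean(n_ports, p, slots, seed=0):
    """Average number of arrivals per slot addressed to output 0."""
    rng = random.Random(seed)
    total = 0
    for _ in range(slots):
        # Each input holds a packet with probability p; its destination is
        # uniform over the outputs, so output 0 is hit with probability p/N.
        total += sum(1 for dest in bernoulli_slot(n_ports, p, rng) if dest == 0)
    return total / slots

if __name__ == "__main__":
    print(tagged_output_mean(n_ports=16, p=0.8, slots=5000))
```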
A. Input Queueing
In this section, we concentrate primarily on the performance of input queueing with FIFO buffers. Unlike the other three architectures, the first-in first-out queueing discipline of input queueing limits the maximum throughput of the switch. Specifically, packets within the input FIFO's are prevented from reaching idle outputs, blocked by packets at the heads of the FIFO's contending for common outputs. At the end of this section, we show that the throughput can be increased by relaxing the strict first-in first-out queueing discipline of the input buffers.
To determine the maximum throughput of the switch, we examine the case where all the input queues are saturated. That is, packets are always waiting in every input FIFO, and whenever a packet is transmitted through the switch, a new packet immediately replaces it at the head of the input queue. We assume that if there are k packets waiting at the heads of input queues addressed to the same output, the selection of one to pass through the switch is done at random, each having equal probability (1/k) of being selected.
Following the analysis in [6], we define $B_m^i$ as the number of packets at the heads of the input queues destined for output i in the mth time slot, but not selected to pass through the switch. We define $A_m^i$ as the number of packets moving to the heads of the input queues during the mth time slot and destined for output i. Note that a packet can only move to the head of an input queue if, in the previous time slot, a packet was removed from that queue for transmission on an output. It follows that

$$B_m^i = \max(0,\, B_{m-1}^i + A_m^i - 1). \quad (1)$$

Although $B_m^i$ does not represent the occupancy of any physical queue, notice that (1) has the same form as the fundamental queueing relation for a single-server queueing system [7]. For small values of N, a Markov chain analysis of the system throughput can be done, yielding the results given in Table I [6], [8]. From Table I and the simulation results shown in Fig. 3, note the rapid convergence to the asymptotic throughput of 0.586.
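The asymptotic value 0.586 can be recovered from relation (1) with a short numerical check. The sketch assumes the usual large-N argument, which is not spelled out in the text above: the arrivals to each virtual queue become Poisson with rate equal to the saturation throughput $\rho_0$, so the mean backlog is $\rho_0^2 / [2(1 - \rho_0)]$, and because every saturated input always holds a head-of-line packet, the mean backlog per output must also equal $1 - \rho_0$. The function name `saturation_throughput` is hypothetical.

```python
def saturation_throughput(tol=1e-12):
    """Solve rho^2 / (2 (1 - rho)) = 1 - rho for rho in (0, 1) by bisection."""
    def excess(rho):
        # Mean backlog implied by relation (1) under Poisson arrivals,
        # minus the mean backlog implied by saturation.
        return rho * rho / (2.0 * (1.0 - rho)) - (1.0 - rho)

    lo, hi = 0.0, 0.999999          # excess(lo) < 0 < excess(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    print(saturation_throughput())   # compare with 2 - sqrt(2)
```

Solving the same equation by hand gives $\rho_0^2 - 4\rho_0 + 2 = 0$, whose root in (0, 1) is $2 - \sqrt{2}$.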
Before saturation, a discrete-time Geom/G/1 queueing model is used to determine an exact formula for the expected waiting time for the limiting case N = ∞ [6]. The arrival process to each input queue is Bernoulli: a packet arrives independently in each time slot with probability p, equally likely destined for each output. The "service time" for a packet at the head of an input queue addressed to output j consists of the wait until it is randomly selected among all packets at the heads of input queues contending for output j, plus one time slot for its transmission through the switch. As N → ∞, the steady-state number of packet "arrivals" to the heads of input queues, and addressed to output j, becomes Poisson with rate p. Hence, the service time distribution for the discrete-time Geom/G/1 model is itself the packet delay distribution of another queueing

(N and w, respectively). The values were obtained by simulation. Note that a big increase in the achievable throughput is possible by increasing the window size w from w = 1 (i.e., FIFO buffers) to w = 2, 3, and 4, with diminishing improvements thereafter. However, input queueing with even an infinite window (w = ∞) does not attain the optimal throughput-delay performance of output queueing and completely shared buffering. Input queueing limits each input to send at most one packet into the switch fabric per time slot, which can prevent packets from reaching idle outputs.
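The effect of the window size on the saturation throughput can be reproduced with a small Monte Carlo experiment. The sketch below is a minimal simulation under the saturation assumption (every input always has at least w packets waiting, with independent, uniformly distributed destinations) and random selection among contenders. It is not the simulator used to produce the paper's values, and the function name, slot count, and seed are arbitrary choices.

```python
import random

def saturated_throughput(n_ports, window, slots, seed=0):
    """Estimate saturation throughput of input queueing with window size w."""
    rng = random.Random(seed)
    # Only the destinations of the first `window` packets of each saturated
    # input queue need to be tracked.
    queues = [[rng.randrange(n_ports) for _ in range(window)]
              for _ in range(n_ports)]
    delivered = 0
    for _ in range(slots):
        output_busy = [False] * n_ports
        chosen = [None] * n_ports        # window position each input transmits
        for depth in range(window):      # sequential contention rounds
            contenders = {}
            for i in range(n_ports):
                if chosen[i] is None:
                    dest = queues[i][depth]
                    if not output_busy[dest]:
                        contenders.setdefault(dest, []).append(i)
            for dest, inputs in contenders.items():
                winner = rng.choice(inputs)   # random selection among contenders
                chosen[winner] = depth
                output_busy[dest] = True
        for i, depth in enumerate(chosen):
            if depth is not None:
                queues[i].pop(depth)                     # packet leaves the window
                queues[i].append(rng.randrange(n_ports)) # next packet moves in
                delivered += 1
    return delivered / (slots * n_ports)

if __name__ == "__main__":
    for w in (1, 2, 4, 8):
        print(w, round(saturated_throughput(32, w, 4000), 3))
```

With w = 1 the estimate should sit near the FIFO limit discussed above, and it should rise as w increases.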
B. Input Smoothing
With input smoothing, the packets within a frame of b time slots are stored at each input and then enter the switch fabric together on separate input ports [Fig. 2(b)]. With each output connected to exactly b output ports on the switch fabric, if k > b packets enter the switch fabric destined for a given output, then k - b packets will be lost. Defining the random variable A as the number of packets entering the switch fabric destined for a given output, we have

$$a_k \triangleq \Pr[A = k] = \binom{Nb}{k} \left(\frac{p}{N}\right)^{k} \left(1 - \frac{p}{N}\right)^{Nb - k}, \quad k = 0, 1, \ldots, Nb. \quad (8)$$

Hence, the probability that a packet is lost within the fabric is given by

$$\Pr[\text{packet loss}] = \frac{1}{pb} \sum_{k=b+1}^{Nb} (k - b)\, a_k \quad (9)$$

which, for N = ∞, becomes

$$\Pr[\text{packet loss}] = \frac{1}{pb} \sum_{k=b+1}^{\infty} (k - b)\, \frac{(pb)^k e^{-pb}}{k!}. \quad (10)$$

The packet loss probability increases with increasing N, and so (10) represents an upper bound on the lost packet performance for all finite N. As illustrated in Fig. 5(a) and (b), the bound is tight for N > 16. Fig. 6 shows, for N = ∞, the lost packet performance of input smoothing as a function of the frame size b for offered loads between 0.7 and 0.95. The y axis has been scaled to make it easier to compare the performance of input smoothing to the other packet queueing approaches. Note from Fig. 6 that the decrease in the packet loss probability with increasing frame size b is slow. For example, to achieve a lost packet probability of at an offered load of 85 percent (i.e., p = 0.85) requires the frame size b > 100.
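The slow decay of the loss probability with b can be explored numerically. The sketch below assumes that the loss probability is the expected number of packets dropped per output per frame, E[max(A - b, 0)], divided by the expected number offered, pb, with A binomial with parameters Nb and p/N for finite N and Poisson with mean pb in the limit N = ∞. The function names are hypothetical, and the probabilities are evaluated in log space so that large frames do not overflow.

```python
import math

def log_binom_pmf(k, n, prob):
    """Logarithm of the binomial(n, prob) probability of k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(prob) + (n - k) * math.log1p(-prob))

def smoothing_loss_finite(n_ports, b, p):
    """Loss probability for an N-input switch with frame size b (needs N >= 2)."""
    n = n_ports * b
    prob = p / n_ports
    lost = sum((k - b) * math.exp(log_binom_pmf(k, n, prob))
               for k in range(b + 1, n + 1))
    return lost / (p * b)

def smoothing_loss_limit(b, p):
    """Loss probability in the Poisson limit N = infinity."""
    lam = p * b
    k_max = int(lam + 40 * math.sqrt(lam) + 50)   # far beyond the Poisson tail
    lost = sum((k - b) * math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
               for k in range(b + 1, k_max + 1))
    return lost / lam

if __name__ == "__main__":
    for b in (1, 10, 50, 100):
        print(b, smoothing_loss_finite(16, b, 0.85), smoothing_loss_limit(b, 0.85))
```

Printing both columns side by side gives a quick check that the finite-N values stay below the N = ∞ values and that both fall only gradually as b grows.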
For those packets not dropped in the switch fabric, the mean waiting time is given by (11). Equation (11) follows from the timing diagram in Fig. 7. The first term on the right-hand side of (11) is the expected amount of time a packet has to wait while the frame is being stored at the inputs. The second term is the delay resulting from the fabric running at 1/b the speed of the inputs and outputs. The last term represents the expected waiting time in the multiplexing operation at the outputs. Using (8), (11) may be rewritten as (12). Taking the limit as N → ∞, we obtain (13).
The mean waiting time for input smoothing is plotted in Fig. 8 against the offered load p for N = ∞ and various values of the frame size b. The mean packet waiting time curves for finite N ≥ 2 are only slightly below those shown in Fig. 8. Note from Fig. 8 and (12) and (13) that the mean waiting time increases proportionally with b.
In summary, achieving a low packet loss probability with input smoothing requires a large frame size b. Unfortunately, a large frame size increases the size of the Nb × Nb switch fabric and also the packet delay through the switch. Hence, although intellectually interesting, input smoothing does not seem to have much practical value, other than allowing the switch fabric to run b times slower than the input and output lines.
C. Output Queueing
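With output queueing, all queueing is done at the outputs with a separate b-packet FIFO at each output of the switch fabric [Fig. 2(c)]. In the analysis, we fix our attention on a particular (i.e., tagged) output queue. Defining the random variable A as the number of packet arrivals destined for the tagged output in a given time slot, we have

$$a_k \triangleq \Pr[A = k] = \binom{N}{k} \left(\frac{p}{N}\right)^{k} \left(1 - \frac{p}{N}\right)^{N - k}, \quad k = 0, 1, \ldots, N \quad (14)$$

which, for N = ∞, becomes

$$a_k \triangleq \Pr[A = k] = \frac{p^k e^{-p}}{k!}, \quad k = 0, 1, 2, \ldots \quad (15)$$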
Letting $Q_m$ denote the number of packets in the tagged queue at the end of the mth time slot, and $A_m$ denote the number of packet arrivals during the mth time slot, we have

$$Q_m = \min\{\max(0,\, Q_{m-1} + A_m - 1),\, b\}. \quad (16)$$

For finite b, (16) defines a finite-state Markov chain whose transition probabilities are determined by $a_k$, where $a_k$ is given by (14) and (15) for N < ∞ and N = ∞, respectively. The steady-state queue size distribution, $q_n \triangleq \Pr[Q = n]$, can be obtained directly from the Markov chain balance equations. A packet will not be transmitted on the tagged output line during the mth time slot if, and only if, $Q_{m-1} = 0$ and $A_m = 0$. Therefore, letting $\rho_0$ denote the normalized switch throughput, we have

$$\rho_0 = 1 - q_0 a_0. \quad (21)$$
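The steps of this analysis can be traced numerically. The sketch below is one possible way to carry them out, not necessarily the authors' procedure: it assumes the queue evolves as $Q_m = \min\{\max(0, Q_{m-1} + A_m - 1), b\}$, solves the balance equations by forward recursion from an unnormalised $q_0 = 1$, and then takes the throughput as $1 - q_0 a_0$ and the loss probability as one minus throughput divided by p. All function names are hypothetical, and b is assumed small enough (below roughly 160) that the factorials fit in floating point.

```python
import math

def arrival_pmf(k, p, n_ports=None):
    """a_k = Pr[A = k]: binomial for finite N, Poisson when n_ports is None."""
    if n_ports is None:
        return math.exp(-p) * p ** k / math.factorial(k)
    if k > n_ports:
        return 0.0
    return (math.comb(n_ports, k) * (p / n_ports) ** k
            * (1 - p / n_ports) ** (n_ports - k))

def queue_distribution(b, p, n_ports=None):
    """Steady-state probabilities q_0..q_b of the tagged output queue."""
    a = [arrival_pmf(k, p, n_ports) for k in range(b + 2)]
    q = [1.0]                                    # unnormalised q_0
    if b >= 1:
        q.append((1 - a[0] - a[1]) / a[0])       # balance equation for state 0
    for n in range(2, b + 1):
        # Balance equation for state n - 1, solved for q_n.
        tail = sum(a[k] * q[n - k] for k in range(2, n + 1))
        q.append(max(0.0, ((1 - a[1]) * q[n - 1] - tail) / a[0]))
    total = sum(q)
    return [x / total for x in q]

def output_queue_loss(b, p, n_ports=None):
    """Packet loss probability 1 - rho_0 / p, with rho_0 = 1 - q_0 a_0."""
    q = queue_distribution(b, p, n_ports)
    rho0 = 1 - q[0] * arrival_pmf(0, p, n_ports)
    return 1 - rho0 / p

if __name__ == "__main__":
    for b in (1, 10, 20, 50):
        print(b, output_queue_loss(b, 0.8), output_queue_loss(b, 0.8, n_ports=16))
```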
A packet will be lost if, when emerging from the switch fabric, it finds the output queue already containing b packets. Dividing the utilization of the output line $\rho_0$ by the arrival rate p, we obtain the packet success probability. Therefore,

$$\Pr[\text{packet loss}] = 1 - \frac{\rho_0}{p}.$$

D. Completely Shared Buffering
Completely shared buffering achieves the optimal throughput-delay performance of output queueing, but saves on the total amount of buffering needed to achieve a desired packet loss probability. Because of the statistical nature of packet arrivals, more efficient use is made of the Nb buffer locations when they are shared by all outputs, rather than dedicating b to each of the N outputs.
Packets that enter the buffer will recirculate through the switch fabric and shared buffer until they are transmitted. Letting $Q_m^i$ denote the number of packets in the shared buffer destined for output i at the end of the mth time slot, with an infinite buffer size we have

$$Q_m^i = \max(0,\, Q_{m-1}^i + A_m^i - 1) \quad (25)$$

where $A_m^i$ is the number of packets addressed to output i that arrive during the mth time slot. With a finite buffer size, packet arrivals destined for some outputs may fill the shared buffer at the expense of other arrivals in the same time slot; the resulting buffer overflow invalidates (25). We will use (25), however, since it is a good approximation in the region of interest: the low packet loss probability region (e.g., less than packet loss probability). For finite N, $A^i$, the steady-state number of packet arrivals destined for output i, is unfortunately not independent of $A^j$ (j ≠ i). At most N packets arrive to the switch, so a large number of packets arriving for one output implies a small number for the remaining outputs. As N increases, however, the $A^i$ become independent Poisson random variables (each with mean value p), and the steady-state number of packets in the buffer that are destined for output i, $Q^i$, becomes independent of $Q^j$ (j ≠ i). We will use the Poisson and independence assumptions even for finite N, and show that the approximations are good for N ≥ 16. Our approach, therefore, is to model $Q = \sum_i Q^i$, the steady-state number of packets in the buffer, by the N-fold convolution of the queue-size distributions of N M/D/1 queues. With the assumption of an infinite buffer size, we then approximate the packet loss probability by $\Pr[Q \ge Nb]$. Fig. 12(a) and (b) show the packet loss probability for completely shared buffering as a function of b, the buffer size per output, for various values of N, and offered loads p = 0.8 and 0.9, respectively. The results converge to the asymptotic limit of $p^2 / [2(1 - p)]$ recirculation ports per output.
In this section, we have computed packet loss probabilities for uniform traffic models. A potential problem with completely shared buffering is that one heavily loaded output might monopolize use of the shared buffer, thereby adversely affecting the performance of other outputs.
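The approximation described in this section can be sketched in a few lines. The code below assumes the Poisson and independence approximations stated above: each per-output queue follows the infinite-buffer recursion Q_m = max(0, Q_{m-1} + A_m - 1) with Poisson arrivals of mean p, its steady-state distribution is obtained from the balance equations starting from $q_0 = (1 - p)/a_0$, and the overflow probability is the tail of the N-fold convolution at Nb. The function names are hypothetical, Nb is assumed to stay below roughly 160 so that the factorials fit in floating point, and results much smaller than about $10^{-13}$ are dominated by rounding error.

```python
import math

def md1_distribution(p, size):
    """First `size` steady-state probabilities of one M/D/1-type queue."""
    a = [math.exp(-p) * p ** k / math.factorial(k) for k in range(size + 1)]
    q = [(1 - p) / a[0]]                          # from throughput p = 1 - q_0 a_0
    if size > 1:
        q.append(q[0] * (1 - a[0] - a[1]) / a[0])
    for n in range(2, size):
        tail = sum(a[k] * q[n - k] for k in range(2, n + 1))
        q.append(max(0.0, ((1 - a[1]) * q[n - 1] - tail) / a[0]))
    return q

def convolve_truncated(x, y, size):
    """Distribution of the sum of two independent counts, kept below `size`."""
    out = [0.0] * size
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue
        for j, yj in enumerate(y):
            if i + j >= size:
                break
            out[i + j] += xi * yj
    return out

def shared_buffer_overflow(n_ports, b, p):
    """Approximate loss probability Pr[Q >= N b] for a shared buffer of N b."""
    size = n_ports * b
    single = md1_distribution(p, size)
    total = single
    for _ in range(n_ports - 1):
        total = convolve_truncated(total, single, size)
    return max(0.0, 1.0 - sum(total))

def dedicated_overflow(b, p):
    """Pr[Q^i >= b] for a single queue, for comparison with dedicated buffers."""
    return max(0.0, 1.0 - sum(md1_distribution(p, b)))

if __name__ == "__main__":
    for b in (2, 4, 6, 8):
        print(b, shared_buffer_overflow(16, b, 0.8), dedicated_overflow(b, 0.8))
```

Comparing the two columns for the same b gives a rough picture of the buffer saving from sharing. The dedicated column is only a tail probability of the infinite-buffer queue, not the exact finite-buffer loss of Section III-C.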
IV. CONCLUSION
Figs. 13, 14, and 15 summarize the results of this paper. The throughput of input queueing with FIFO buffers is limited to 0.586, but can be increased by relaxing the strict first-in first-out queueing discipline. Input smoothing increases the throughput, but at the expense of a large increase in switch fabric size and latency. Completely shared buffering requires less buffer memory than output queueing, but requires a larger switch fabric size. Both output queueing and completely shared buffering, however, achieve the optimal throughput-delay performance.
