Abstract-For advanced packet switches, output queueing has received increased attention owing to its performance advantages. However, practical output queue size limitations may require additional queueing at the inputs. This paper considers a single-stage nonblocking $N \times N$ packet switch with both output and input queueing. The limited queueing at the output ports resolves output port contention partially. Overflow at the output queues is prevented by a backpressure mechanism and additional queueing at the input ports. This paper analyzes the impact of the backpressure effect on the switch performance for arbitrary output buffer sizes and for large $N$ ($N \to \infty$). Two different switch models are considered: an asynchronous model with Poisson arrivals and a synchronous model with Bernoulli arrivals. The investigation is based on two performance measures: the average delay and the maximum throughput of the switch. Closed-form expressions for these measures are derived for both models in the case of operation with fixed-size packets. The obtained results demonstrate that a modest amount of output queueing, in conjunction with appropriate switch speedup, provides significant delay and throughput improvements over pure input queueing. With justifiable output buffer sizes, the ideal performance of infinite output queueing can be closely approached. The maximum throughput is the same for the synchronous and the asynchronous switch model, although the delay is different.
I. INTRODUCTION
THE traffic offered to the input ports of a packet switch has a statistical nature. Hence, it is possible that several packets arriving at different input ports are destined simultaneously for the same output port. The resulting contention problem is usually referred to as output port contention. This phenomenon is inherent to any switch. It is obvious that, in addition to the routing function of the switch, some queueing function has to be applied to resolve the output port contention. Several basic approaches for packet queueing in packet switches are known [1]. The two classical alternatives are referred to as input or output queueing [2].
In the case of classical input queueing, a simple queue is provided at each input port of the switching fabric. There, packets have to wait until they can be transmitted to the output port. From the implementation point of view, input queueing has the advantage of relatively low complexity and, consequently, low cost. However, there is a performance problem associated with it which is referred to as head of the line blocking. This means that a packet which waits at the head of an input queue for the resolution of output port contention may block other packets waiting in the same queue which may be destined for momentarily idle output ports. Actually, the head of the line blocking phenomenon limits the maximum throughput of the switch to about 58.6 percent for symmetrical loads [2]. In the case of asymmetrical loads, this phenomenon becomes even more pronounced and may influence low-load traffic streams as severely as high-load traffic streams.
On the other hand, the output queueing approach assumes that individual queues are located at each output port of the switching fabric. This concept only makes sense if all packets arriving simultaneously at the input of a specific output queue can be accepted by that queue immediately. This assumes that each output queue logically represents a multiple-input port queue. This can be realized by operating a simple queue at appropriately higher speed or by a highly parallel implementation. Both methods have the disadvantage of being complex and expensive. However, output queueing provides optimal performance.
The idea behind the model presented in this paper is the combination of both classical approaches of input and output queueing (see Fig. 1). This concept has been proposed, for example, as one of the alternatives of the high-performance switch fabric published in [3]. Certainly, the primary aim is to achieve the best possible performance by output queueing. However, economic considerations might require restrictions on output queueing in favor of additional input queueing. The input queues do not necessarily have to be viewed as part of the switching fabric. They could be located within the transmitting switch adapter and might be based on relatively cheap memory technology. However, the output queues have to be viewed as part of the fabric, since they require the multiport property. If this buffered fabric is assumed to be implemented in VLSI technology, the buffer space within the fabric, i.e., the output buffer space, is the cost-sensitive issue. However, it turns out that output buffers of relatively small capacity, say, $b$ packets, are sufficient for closely approaching the ideal delay performance of infinite output queues. Thus, the required speedup or parallelism for the output queues can be limited to a factor $b$, which can be much smaller than the factor $N$ (the switch size) required in the case of the ideal output queueing. Some switch fabrics are capable of transferring only a limited number of packets to any output queue in a packet transmission time. Therefore, the speedup factor may be smaller than the capacity of the output queues (see, e.g., [4]). This paper considers the case where the speedup factor is greater than or equal to the output queue size. Nevertheless, an acceptable loss performance would probably not be obtained without additional input queues.
0090-6778/93$03.00 © 1993 IEEE
In this way, packet losses due to output buffer overflow are avoided by a backpressure mechanism which causes packets to wait in additional input queues if the output buffers are full. Packet losses due to overflow at the input queues can be reduced to negligibly low probabilities by generously dimensioning the less critical input buffer space within the switch adapters.
The performance behavior of both pure input queueing and pure output queueing is well understood, especially for a synchronous or slotted mode of operation [2]. In the present paper, an analytical performance study of the behavior of the combined input and output queueing model is presented. In Section II, closed-form expressions for the average delay through the switch are derived for both an asynchronous model and a synchronous model, assuming that the number of input/output ports $N$ is large ($N \to \infty$). In Section III, the maximum throughput of the switch is derived for both models, and it is shown that it is the same in both cases for a given output buffer size. The extent to which contention at the head of the line occurs is evaluated in Section IV. Finally, Section V presents some numerical results.
II. SYSTEM ANALYSIS
We consider a single-stage $N \times N$ packet switch as shown in Fig. 1. The number of input/output ports $N$ is assumed to be large ($N \to \infty$). Previous studies have shown that the performance of space-division packet switches depends upon their size. However, the difference in maximum throughput and average delay becomes negligible when the size grows beyond $16 \times 16$ [2], [3]. It turns out that the difference is also negligible for other quantities of interest, such as the queue length distribution at the input queues. This has been verified, both by means of simulation and by experimental results obtained from the prototype of a switch developed in the Zurich Research Laboratory [3]. Therefore, the results presented here can be considered a very good approximation of the performance of switches with 16 or more inputs/outputs.
Packets are assumed to have a fixed length and a fixed transmission time $h$. Henceforth, we select this transmission time $h$ as the unit of time. Furthermore, it is assumed that the traffic is symmetric and randomly distributed. That means that the destination of an arbitrary packet can be any of the $N$ output ports with equal probability $1/N$.
The output queues are provided to resolve output port contention up to a certain degree. It is assumed that each output queue is operated on a first-come first-served (FCFS) basis and has a finite holding capability of at most b packets. At the input side, when a packet reaches the head of an input queue, the transfer process to its destination port is initiated. If the output port is idle, then the packet flows through the switch, in a cut-through fashion, without experiencing any delay. If the output port is busy and the output buffer associated with this port is not full, then the packet is transferred and stored in it. However, if the buffer is full, then a backpressure signal is applied which causes the packet to wait at the head of its input queue. At this moment, there may be other packets waiting at the heads of input queues and contending to be dispatched to the same particular output buffer. We refer to this as head of the line contention. The arbitration and resolution of the head of the line contention is discussed in more detail in the following sections.
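The three possible outcomes for a packet at the head of an input queue (cut-through, storage in the output buffer, or backpressure) can be sketched as a small decision function. This is only an illustrative sketch: the names `OutputPort` and `dispatch` are hypothetical and do not come from the paper, and the port state is reduced to a busy flag and a counter.

```python
from dataclasses import dataclass


@dataclass
class OutputPort:
    """Hypothetical state of one output port with a finite FCFS buffer."""
    b: int               # output buffer capacity, in packets
    busy: bool = False   # True while a packet is being transmitted
    queued: int = 0      # packets currently stored in the output buffer


def dispatch(port: OutputPort) -> str:
    """Decide what happens to a packet at the head of an input queue."""
    if not port.busy:
        # Idle port: the packet flows through without any delay.
        port.busy = True
        return "cut-through"
    if port.queued < port.b:
        # Busy port, buffer not full: store the packet at the output.
        port.queued += 1
        return "buffered"
    # Busy port, buffer full: the packet waits at the head of its input queue.
    return "backpressure"
```

With `b = 1`, three consecutive calls on a fresh port return the three outcomes in the order listed above.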
The objective of this paper is to analyze the impact of the backpressure effect on the performance of the switch. The degree to which this effect depends upon the size $b$ of the output buffers is demonstrated. Our development considers two different switch models. First, we consider an asynchronous model where the arrival as well as the transmission of packets to the output ports occur in an asynchronous fashion. The arrival process is assumed to be Poisson. Then we consider a synchronous model where packets arrive and are transmitted synchronously, i.e., in a time-slotted fashion. The arrival process is assumed to be Bernoulli.
Our analytical approach follows the one used in [2] for the analysis of a pure input queueing switch. The efficiency of the switching fabric is assessed based on two measures: the average delay and the maximum throughput of the switch. The packet delay consists of three components: 1) waiting time within the queues until the head of the queue is reached, 2) waiting time at the head of the input queues due to head of the line contention, and 3) waiting time at the output queues due to output port contention.
The second and third delay components are explicitly calculated based on a specific study of an equivalent single-server queueing system. Then, the first delay component is obtained from another single-server queueing system, whereby the service characteristics are provided by the previous study. The analysis is carried out for both the asynchronous and the synchronous model.
A. An Asynchronous System
In this section we derive the exact mean delay of a packet through the system. It is assumed that packets arrive in the $N$ input trunks according to independent and identical Poisson processes at a rate of $\lambda$ packets per unit of time. Under the uniform destination assumption, the destination of an arbitrary packet can be any of the $N$ output ports with equal probability $1/N$. Each output port has a buffer capable of holding up to $b$ packets. When a packet reaches the head of its input buffer, it requests to be transferred to its destination output. If at that instant the corresponding output buffer is not full, the packet can leave the input queue. However, if at that instant the buffer is full, then a backpressure signal is applied and the packet has to wait. At this given instant, there can be many packets waiting to be dispatched to this particular buffer. In this paper we assume that these packets are arbitrated according to the sequence in which the backpressure signals were applied to them. In other words, they are transferred to the output buffer in FCFS order. It is shown that both the waiting time due to head of the line contention and the waiting time at the output queue can be obtained by a specific study of an equivalent M/D/1 queueing system. Let us turn our attention to a particular output queue (the tagged queue) and introduce some definitions.
$x(t)$: The number of packets in the output queue at time $t$. From this definition and from the system operation description presented in the preceding section, it follows that $x(t) \le b$.
$c(t)$: The number of packets at the heads of input queues at time $t$ waiting to be transferred to the tagged queue.
Consider the instants at which packets destined to the tagged queue appear at the head of a given input queue. As $N$ goes to infinity, the intervals between these instants become statistically independent, which implies that these instants form a renewal process with intensity $\lambda/N$. Also, as $N$ goes to infinity, the $N$ processes of this kind, which correspond to the $N$ input queues, become independent. Their superposition forms the process of the instants at which packets destined to the tagged queue appear at the head of their input queues. By virtue of Palm-Khintchine's theorem [6, p. 156], this process becomes Poisson with parameter $\lambda$ as $N$ goes to infinity. A more rigorous proof of this claim goes beyond the scope of this paper.
At a typical instant $t$, when a packet appears at the head of an input queue, it will continue its transmission to the output port provided that this port is idle. If the port is busy, then it will either be transferred to the output buffer if $x(t) < b$, or it will have to wait. Let $W_b$ be the waiting time due to backpressure, i.e., the delay a packet experiences from the instant it appears at the head of its input queue until the instant it starts its transmission to the corresponding output port or output queue. Let us also consider the delay, $D_2$, from the instant that the packet appears at the head of its input queue until the instant it begins its transmission at the output port. From the above discussion, it follows that the measures of interest can be evaluated by studying an equivalent M/D/1 queueing system. The average delay $\bar{D}_2$ is given by

$$\bar{D}_2 = \frac{\lambda}{2(1-\lambda)}. \qquad (2.1)$$

This is a well-known result obtained for an M/D/1 queue with arrival rate $\lambda$, service time one unit, and, consequently, utilization factor $\lambda$.
Proposition 1: For $\lambda < 1$, the first two moments of the waiting time $W_b$ are given by

$$\bar{W}_b = \frac{\bar{Q}_b}{\lambda} \qquad (2.2)$$

and

$$\overline{W_b^2} = \frac{\overline{Q_b^2} - \bar{Q}_b}{\lambda^2} \qquad (2.3)$$

where

$$\bar{Q}_b = \sum_{i=b+2}^{\infty} (i-b-1)\, p_i \qquad (2.4)$$

and

$$\overline{Q_b^2} = \sum_{i=b+2}^{\infty} (i-b-1)^2\, p_i. \qquad (2.5)$$

Here $p_i$ is the steady-state probability of having $i$ packets in the equivalent M/D/1 system, with $p_0 = 1-\lambda$, $p_1 = (1-\lambda)(e^{\lambda}-1)$, and, for $i \ge 2$,

$$p_i = (1-\lambda) \sum_{j=1}^{i} (-1)^{i-j}\, e^{j\lambda} \left[ \frac{(j\lambda)^{i-j}}{(i-j)!} + \frac{(j\lambda)^{i-j-1}}{(i-j-1)!} \right] \qquad (2.6)$$

where the second term in $p_i$ is ignored for $i = j$. The corresponding p.g.f. is

$$P(z) = \sum_{i=0}^{\infty} p_i z^i = \frac{(1-\lambda)(1-z)}{1 - z\, e^{\lambda(1-z)}}.$$

Proof: See Appendix A.
Now let us examine the input queues. Owing to the traffic symmetry, all of the queues have identical behavior. As the number of queues is infinite, it turns out that they also behave independently; therefore, it suffices to consider only one input queue [2]. The arrival process is Poisson with parameter $\lambda$. The service time of a packet at a typical input queue consists of two components, the waiting time $W_b$ until it is selected for transfer plus one unit for its transmission to the output port or the output queue. Consequently, the waiting time of a packet at the input buffer until the head of the buffer is reached, $W_i$, is that of an M/G/1 queue where the service time $T$ is equal to $W_b + 1$.
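Proposition 2: The first moment of the waiting time $W_i$ is given by

$$\bar{W}_i(\lambda) = \frac{\lambda\, \overline{T^2}(\lambda)}{2\,[1 - \lambda\, \bar{T}(\lambda)]} \quad \text{for } \lambda\, \bar{T}(\lambda) < 1 \qquad (2.9)$$

where

$$\bar{T}(\lambda) = \bar{W}_b + 1 \qquad (2.10)$$

$$\overline{T^2}(\lambda) = \overline{W_b^2} + 2\,\bar{W}_b + 1. \qquad (2.11)$$

Proof: Equations (2.10) and (2.11) are direct consequences of the fact that $T = W_b + 1$. Equation (2.9) is the expression of the mean waiting time of an M/G/1 queue as a function of the arrival rate and the first two moments of the service distribution $T$.
Q.E.D.
From the above discussion, it follows that for a given load $\lambda$ the total delay $D(\lambda)$ of a packet through the system is given by

$$D(\lambda) = W_i + D_2 \qquad (2.12)$$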
where the first term of the summation accounts for the waiting time at the input queue until the head of the queue is reached, and the second term accounts for both the waiting time due to the head of the line contention, and the waiting time at the output queues due to the output contention. A closed-form expression for the average total delay is obtained with the following proposition.
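Proposition 3: The average total delay $\bar{D}(\lambda)$ is given by

$$\bar{D}(\lambda) = \frac{\overline{Q_b^2} - \bar{Q}_b + 2\lambda\,\bar{Q}_b + \lambda^2}{2\lambda\,(1 - \lambda - \bar{Q}_b)} + \frac{\lambda}{2(1-\lambda)} \qquad (2.13)$$

where $\bar{Q}_b$, $\overline{Q_b^2}$, and $p_i$ are given by (2.4), (2.5), and (2.6), respectively.
Proof: The above expression is obtained by substitution of the quantities involved in (2.12) using (2.9), (2.1), (2.6), (2.10), (2.11), (2.2), and (2.3).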
Q.E.D.
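The M/G/1 mean-wait expression referred to as (2.9), combined with the relation $T = W_b + 1$, can be evaluated numerically along the following lines. This is a minimal sketch assuming the standard Pollaczek-Khinchine form of the M/G/1 mean waiting time; the function names are hypothetical, and the moments of $W_b$ are taken as inputs rather than computed.

```python
def service_moments(wb_mean: float, wb_second: float) -> tuple[float, float]:
    """First two moments of T = W_b + 1, given the first two moments of W_b."""
    t_mean = wb_mean + 1.0
    t_second = wb_second + 2.0 * wb_mean + 1.0
    return t_mean, t_second


def mg1_mean_wait(lam: float, t_mean: float, t_second: float) -> float:
    """Pollaczek-Khinchine mean waiting time lam*E[T^2] / (2*(1 - lam*E[T]))."""
    rho = lam * t_mean
    if not 0.0 <= rho < 1.0:
        # The expression only holds for a stable queue (lam * E[T] < 1).
        raise ValueError("queue is unstable: lam * E[T] must be below 1")
    return lam * t_second / (2.0 * (1.0 - rho))
```

As a sanity check, setting both moments of $W_b$ to zero (no backpressure at all) reduces the input queue to an M/D/1 queue with unit service time, whose mean wait is $\lambda / [2(1-\lambda)]$.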
Corollary 1: For a given load $\lambda$, the average total delay $D(\lambda, b)$ is a decreasing function in $b$.
Proof: The second term of the sum of the right-hand side of (2.13) does not depend on $b$. One can easily show that $\overline{Q_b^2}$, $\bar{Q}_b$, as well as their difference $\overline{Q_b^2} - \bar{Q}_b$, are decreasing functions in $b$. Consequently, the numerator of the first term of the sum of the right-hand side of (2.13) is a decreasing function in $b$ and the denominator is an increasing function in $b$. Therefore, the right-hand side of (2.13) is a decreasing function in $b$.
Q.E.D.
The extent to which the head of the line contention occurs can be assessed from the backpressure probability $P_b$, i.e., the probability that a packet reaching the head of its input queue will experience delay due to backpressure. From the above analysis, it follows that the probability $P_b$ is equal to the probability that a packet will find the output link busy, for $b = 0$, or that it will find the output buffer full, for $b \ge 1$.
Thus,

$$P_b = 1 - \sum_{i=0}^{b} p_i. \qquad (2.14)$$

Corollary 2: For a given load $\lambda$, the proportion of packets that experience backpressure decreases in $b$.
Proof: This is a direct consequence of (2.14) and of the fact that $p_i > 0$, $\forall i \ge 0$, $0 < \lambda < 1$.
Q.E.D.
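The backpressure probability can be evaluated numerically once the M/D/1 state probabilities are available. The sketch below assumes that (2.14) reads $P_b = 1 - \sum_{i=0}^{b} p_i$, with $p_i$ the probability of $i$ packets in the equivalent M/D/1 system. Instead of a closed-form expression, it uses the standard embedded-chain recursion for the M/D/1 queue, which is a swapped-in technique rather than the paper's own derivation. The function names are hypothetical.

```python
import math


def md1_probs(lam: float, n_max: int) -> list[float]:
    """Probabilities p_0..p_n_max of the number in an M/D/1 system (unit service).

    Uses the recursion p_{n+1} = (p_n - p_0*a_n - sum_{k=1..n} p_k*a_{n-k+1}) / a_0,
    where a_k is the Poisson probability of k arrivals during one service time.
    """
    a = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(n_max + 1)]
    p = [1.0 - lam]
    for n in range(n_max):
        acc = p[n] - p[0] * a[n]
        acc -= sum(p[k] * a[n - k + 1] for k in range(1, n + 1))
        p.append(acc / a[0])
    return p


def backpressure_probability(lam: float, b: int) -> float:
    """Assumed form of (2.14): probability of finding at least b+1 packets."""
    return 1.0 - sum(md1_probs(lam, b))
```

For `b = 0` the function returns the load itself, which matches the statement that a packet is backpressured exactly when it finds the output link busy.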
B. A Synchronous System
Here, we consider the switch under synchronous operation. Time is divided into units called time slots with duration equal to the transmission time of a packet. It is assumed that in any time slot the probability that a packet will arrive in a given input is $p$. Therefore, the average traffic of any input is $p$ packets per unit of time. Successive packets of an input port and packets of different ports are assumed to be independent. Under the uniform destination assumption, the destination of an arbitrary packet can be any of the $N$ output ports with equal probability $1/N$. Each output port has a buffer capable of holding up to $b$ packets. Packets are transferred from the input queues to the output queues or ports only at predetermined time instants, e.g., at the beginning of time slots. In general, at a given time slot there may be more than $b$ packets trying to access a particular output port. In this case, some of these packets will remain at the head of their input queues, causing head of the line blocking. If the output port is idle, one of these packets will flow through the switch without suffering any delay, and another $b$ packets will be transferred to the output buffer associated with the particular port. These packets will be transmitted in FIFO order in the subsequent time slots. However, the selection of these packets, as well as the order in which they are stored in the output buffer, is assumed to be random. The packets that remain at the head of the input queues will be transferred to the output buffer in the subsequent slots according to a random selection policy. Packets that appear at the head of their queues at a given time slot and are destined for a particular output are transferred before all the packets appearing at the head of the queues at subsequent time slots and having the same output port destination.
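The selection policy described above (older groups of head-of-line packets first, random choice within a group) can be sketched for a single output port and a single slot. This is an illustrative sketch only: the function name `arbitrate` and the representation of contenders as lists of packet identifiers grouped by the slot in which they reached the head of their queues are assumptions, and the number of free positions is passed in rather than derived from the port state.

```python
import random


def arbitrate(cohorts, free_slots, rng=random):
    """Select up to free_slots packets, oldest cohort first, random within a cohort.

    cohorts: list of lists of packet ids, ordered from oldest to newest.
    Returns (selected, remaining), where remaining keeps the cohort order.
    """
    selected, remaining = [], []
    for cohort in cohorts:
        room = free_slots - len(selected)   # positions still available
        shuffled = list(cohort)
        rng.shuffle(shuffled)               # random selection within the cohort
        selected.extend(shuffled[:room])
        if shuffled[room:]:
            remaining.append(shuffled[room:])
    return selected, remaining
```

For example, with cohorts `[[1, 2], [3, 4, 5]]` and three free positions, packets 1 and 2 are always selected, together with one randomly chosen packet of the newer cohort.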
It is shown that both the waiting time due to head of the line contention and the waiting time at the output queue can be obtained by a specific study of an equivalent discrete-time queueing system with bulk arrivals.
Let us turn our attention to the ith output queue. We introduce the following definitions.
$A_i(m)$: the number of packets destined for the $i$th output port which appeared at the head of their queues during the $m$th time slot.
Let $q_{j,i}$ be the steady-state probability of having a total number of $j$ packets waiting for transmission to the $i$th output port, and let $a_{j,i}$ be the steady-state probability of having $j$ new packets appearing at the head of the queues during an arbitrary slot and destined for output port $i$. The corresponding p.g.f.'s are defined as follows:

$$Q_i(z) = \sum_{j=0}^{\infty} q_{j,i}\, z^j, \qquad A_i(z) = \sum_{j=0}^{\infty} a_{j,i}\, z^j.$$

The derivative $A_i'(1)$ is the expected number of packets appearing at the head of queues at an arbitrary slot and destined to output port $i$. Due to the symmetry of the traffic, it holds that $A_i'(1) = p$ for every $i$. Subscript $i$ can also be dropped, so that $A'(1) = p$. It can be proven that for $N$ approaching infinity, the distribution of $a_{j,i}$ becomes Poisson:

$$A_i(z) = A(z) = e^{-p(1-z)}. \qquad (2.22)$$
It is worth pointing out that this claim has been proven in [2] for the special case where $b = 0$. Interestingly, (A1) and (A2) of that proof hold also in the case where $b > 0$, and so does the proof.
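The Poisson limit can be illustrated numerically under a simplifying assumption that is not the paper's proof: suppose each of the $N$ input queues independently presents a new head-of-line packet for the tagged output with probability $p/N$ in a slot, so that the count is binomial. The sketch below measures how far that binomial distribution is from the Poisson distribution with mean $p$. All function names are hypothetical.

```python
import math


def binom_pmf(n: int, q: float, k: int) -> float:
    """Probability of k successes in n independent trials with success probability q."""
    return math.comb(n, k) * q**k * (1.0 - q) ** (n - k)


def poisson_pmf(mu: float, k: int) -> float:
    """Poisson probability of k events with mean mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)


def max_gap(n_ports: int, p: float, k_max: int = 10) -> float:
    """Largest pointwise gap between Binomial(N, p/N) and Poisson(p) for k <= k_max."""
    top = min(k_max, n_ports)
    return max(
        abs(binom_pmf(n_ports, p / n_ports, k) - poisson_pmf(p, k))
        for k in range(top + 1)
    )
```

Calling `max_gap` with increasing `n_ports` at a fixed `p` shows the gap shrinking, in line with the limiting form (2.22).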
At a typical time slot $m$, when a packet appears at the head of an input queue, it will continue its transmission to the output port provided that this port is idle. If the port is busy, then it will either be transferred to the output buffer, or it will have to wait in the input queue. Let $W_{b,s}$ be the waiting time due to backpressure, i.e., the number of slots that a packet has to wait from the instant it appears at the head of its input queue until the instant it starts its transmission to the corresponding output port or output queue. Let us also define, by $W_s$, the number of slots that the packet has to wait from the time it appears at the head of its input queue until the instant it begins its transmission at the output port. This delay is due to both the head of the line contention and the output contention. Expressions for the measures of interest in terms of the system parameters are given by the following propositions, where $\bar{Q}_b$ and $\overline{Q_b^2}$ are given by (2.4) and (2.5), respectively.
Proof: See Appendix B.
Now let us examine the input queues. As in the asynchronous system, it turns out that it suffices to consider only one input queue since it behaves independently from the others. The arrival process is Bernoulli with the probability $p$ of a packet arriving at any arbitrary time slot. The service time of a packet consists of two components, the waiting time $W_{b,s}$ until it is selected for transfer plus one unit for its transmission. Consequently, the waiting time of a packet at the input buffer until the head of the buffer is reached, $W_{i,s}$, is that of a discrete-time Geom/G/1 system with service time $T_s$ equal to $W_{b,s} + 1$. From the above discussion, it follows that for a given probability $p$, the total delay, $D_s(p)$, of a packet through the system is given by

$$D_s(p) = W_{i,s} + W_s \qquad (2.30)$$
where the first term of the summation accounts for the waiting time at the input queue until the head of the queue is reached, and the second term accounts for both the waiting time due to the head of the line contention and the waiting time at the output queues due to the output contention. A closed-form expression for the average total delay is obtained in the following proposition.
Proof: The above expression is obtained by substitution of the quantities involved in (2.30) using (2.23), (2.6), (2.27), (2.28), (2.29), (2.25), and (2.26).
Q.E.D.
Corollary 3: For a given probability $p$, the average total delay $D_s(p, b)$ is a decreasing function in $b$.
Proof: The second term of the sum of the right-hand side of (2.31) does not depend on $b$. It holds that $\overline{Q_b^2}$ and $\bar{Q}_b$ are decreasing functions in $b$. Consequently, the numerator of the first term of the sum of the right-hand side of (2.31) is a decreasing function in $b$, and the denominator is an increasing function in $b$. Therefore, the right-hand side of (2.31) is a decreasing function in $b$.
Q.E.D.
From the above analysis it follows that the probability that a packet which appears at the head of its input queue will experience delay due to backpressure is given by (2.32).
The above derivation is based on the results of Appendix B.
Corollary 4: For a given probability p, the proportion of packets that experience backpressure decreases in b.
Proof: This is a direct consequence of (2.32). Q.E.D.
III. SATURATION ANALYSIS
In this section we examine the impact of the different output buffer sizes on the maximum switch throughput for both the asynchronous and synchronous model. It is shown that for a given buffer size, the maximum throughput is the same for both models. It is also found that the maximum throughput increases as the buffer size increases. To prove that $\lambda_b^* < \lambda_{b+1}^*$, it suffices to show that $f(\lambda) = 1 - \lambda - \bar{Q}_b(\lambda)$ increases with $b$ for every fixed $\lambda$.
One can easily show that $\bar{Q}_b$ is a decreasing function in $b$.
Thus, the root of $f(\lambda) = 0$ increases with $b$.
The saturation loads, $\lambda_b^*$, for various buffer sizes are listed in Table I. As $b \to \infty$, $\bar{Q}_b \to 0$; consequently, the maximum throughput is the root of the equation $f(\lambda) = 1 - \lambda = 0$, i.e., $\lambda_\infty^* = 1$.
Note that in the synchronous case, (2.31) shows that the system is stable as long as $1 - p - \bar{Q}_b(p) \ge 0$. Therefore, the saturation probabilities are obtained as the roots of the equation $f(p) = 0$. As a result, the obtained values are the same as those found in the case of asynchronous operation. The values listed in Table I represent the maximum switch throughput when the input buffers are assumed to be saturated. The system is stable as long as the average number of packets arriving in the input buffers is less than the saturation load. The process according to which packets arrive at the input queues has no effect on the stability of the system, although it does affect the delay characteristics. Observe that for a system with no output waiting space, the saturation load is equal to 0.5857, which is the same as the one obtained in [2] for a synchronous system.
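The saturation load can be approximated numerically by solving $1 - \lambda - \bar{Q}_b(\lambda) = 0$ with bisection. The sketch below rests on two labeled assumptions: $\bar{Q}_b$ is taken as the mean number of head-of-line packets waiting for the tagged output, i.e., $E[\max(n - 1 - b, 0)]$ with $n$ the number of packets in the equivalent M/D/1 system, and the M/D/1 probabilities are obtained from the standard embedded-chain recursion rather than from a closed form. The function names are hypothetical.

```python
import math


def md1_probs(lam: float, n_max: int) -> list[float]:
    """Probabilities p_0..p_n_max of the number in an M/D/1 system (unit service)."""
    a = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(n_max + 1)]
    p = [1.0 - lam]
    for n in range(n_max):
        acc = p[n] - p[0] * a[n]
        acc -= sum(p[k] * a[n - k + 1] for k in range(1, n + 1))
        p.append(acc / a[0])
    return p


def mean_hol_backlog(lam: float, b: int) -> float:
    """Assumed Q_b = E[max(n - 1 - b, 0)], computed from p_0..p_b only.

    Uses E[(n - c)^+] = E[n] - c + E[(c - n)^+] with c = b + 1.
    """
    p = md1_probs(lam, b)
    mean_n = lam + lam * lam / (2.0 * (1.0 - lam))  # mean number in M/D/1 system
    return mean_n - (b + 1) + sum((b + 1 - i) * p[i] for i in range(b + 1))


def saturation_load(b: int, tol: float = 1e-12) -> float:
    """Root of 1 - lam - Q_b(lam) = 0 on (0, 1), found by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if 1.0 - mid - mean_hol_backlog(mid, b) > 0.0:
            lo = mid   # still stable at this load
        else:
            hi = mid
    return lo
```

For `b = 0` the root is $2 - \sqrt{2}$, which agrees with the value 0.5857 quoted above, and the computed loads grow with `b`.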
IV. HEAD OF THE LINE CONTENTION
In this section we examine the dependence of the head of the line contention upon the load and the output buffer size of the switch. 
Fig. 4. Backpressure probability versus load for different output buffer sizes (synchronous model).
The degree of the head of the line contention can be assessed based on the backpressure probability. This probability is defined as the probability that when a packet reaches the head of its input queue, it will experience delay due to backpressure. A plot of the backpressure probability versus the load for different output buffer sizes is given in Figs. 3 and 4 for the asynchronous and synchronous models, respectively. Both figures demonstrate that the head of the line contention is reduced by increasing the output buffer size. This observation has been proven in Corollaries 2 and 4. It is also interesting to note that at saturation, i.e., when the load is equal to or greater than the maximum throughput, the backpressure probability has a fixed value less than one. For example, in the case of the asynchronous model, at saturation, for b = 10 the backpressure probability is 0.038. The fact that the switch saturates although the backpressure probability is low can be interpreted as follows. When a packet reaches the head of its input queue, it has a small probability of being backpressured. However, if this happens, then it will block its input queue for a relatively long period of time. As a result, the input queue will eventually saturate. In other words, in order to determine the maximum throughput, one needs to know both the backpressure probability and the average amount of time that a blocked packet remains at the head of its input queue.
V. NUMERICAL RESULTS
We consider a single-stage asynchronous switch, as presented in Section II-A, with a large number of input and output ports. In Fig. 5, we plot the total system delay $D$ as a function of the load $\lambda$ for various output buffer sizes using (2.13). The curves obtained are in agreement with the simulation results presented in [3], and illustrate that significant delay and throughput improvements can be achieved by increasing the buffer size of the output queues. When $b$ goes to infinity, a packet appearing at the head of its input queue can immediately be transferred to its destination output port or output queue. Consequently, at the limit when $b = \infty$, the total system delay is two times the delay of an M/D/1 queue.
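The limiting value can be written out explicitly. The following worked equation assumes that "the delay of an M/D/1 queue" refers to the mean waiting time of an M/D/1 queue with unit service time and load $\lambda$, which is how the delay components are defined in Section II; the notation $D(\lambda, \infty)$ is introduced here for illustration.

```latex
% With b = \infty there is no backpressure, so W_b = 0 and T = 1:
% the input queue and the output queue each behave as an M/D/1 queue.
D(\lambda, \infty)
  = \underbrace{\frac{\lambda}{2(1-\lambda)}}_{\text{input queue}}
  + \underbrace{\frac{\lambda}{2(1-\lambda)}}_{\text{output queue}}
  = \frac{\lambda}{1-\lambda}
```

For instance, at $\lambda = 0.5$ this gives a total mean delay of one packet transmission time.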
Fig. 5. Total delay versus load for various output buffer sizes (asynchronous model).
Next we consider a single-stage synchronous switch, as presented in Section II-B, with a large number of input and output ports. In Fig. 6, we plot the total system delay $D_s$ as a function of the probability $p$, for various buffer sizes using (2.31). The curves obtained illustrate that significant delay improvements can also be achieved in the synchronous case by increasing the buffer size of the output queues. As $b$ goes to infinity, when a packet appears at an input port, in the following slot it is transferred to the corresponding output port or buffer without incurring any input or head of the line delay. Consequently, at the limit $b = \infty$, the total system delay is equal to the waiting time at the output queue alone. It can be proven that for fixed buffer size $b$, and for the same average load ($\lambda = p$), it holds that

$$D_s(p, b) < D(p, b).$$
This difference can be interpreted from the fact that, in the asynchronous model, the generated Poisson type of traffic is more "bursty" than the Bernoulli traffic considered in the synchronous model. In the synchronous case there can be at most one packet arrival per port per slot, whereas in the asynchronous case there can be more than one. 
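The burstiness argument can be made concrete by comparing the per-port arrival counts in one packet time under the two arrival models at the same mean load. This is a small illustrative sketch using textbook properties of the Poisson and Bernoulli distributions; the function names are hypothetical.

```python
import math


def poisson_variance(lam: float) -> float:
    """Variance of the number of Poisson arrivals in one unit of time."""
    return lam


def bernoulli_variance(p: float) -> float:
    """Variance of the number of Bernoulli arrivals in one slot."""
    return p * (1.0 - p)


def prob_multiple_poisson_arrivals(lam: float) -> float:
    """Probability of more than one Poisson arrival in one unit of time."""
    return 1.0 - math.exp(-lam) * (1.0 + lam)
```

For any load strictly between 0 and 1, the Bernoulli variance is smaller than the Poisson variance, and only the Poisson model assigns positive probability to more than one arrival per port per packet time.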
VI. CONCLUSIONS
A single-stage packet-switching fabric with queueing capabilities both at the input ports and at the output ports has been considered. The output buffers were assumed to be finite and were prevented from overflowing by having packets wait at the input buffers. Two models of operation were studied: an asynchronous model with Poisson arrivals and a synchronous model with Bernoulli arrivals. For both models, closed-form solutions have been derived for the average delay through the switching fabric and for its maximum throughput, assuming an infinite number of input/output ports. The dependency on the output buffer size has been shown by numerical results which also demonstrate that nearly ideal performance can be achieved with modest output buffer sizes and appropriate speedup. Furthermore, both models saturate at the same maximum switch fabric throughput, although the average delays are different.
In this paper we studied the performance of the switch operating under uniform traffic and with fixed-length packets. The impact of asymmetric loads, bursty traffic, and packets of variable length on the switch performance is a subject of further study. It would also be of interest to consider switches with different memory management policies, such as sharing the output buffers fully or partially, and to explore their potential benefits.
APPENDIX A WAITING TIME ANALYSIS-ASYNCHRONOUS SYSTEM
Let us introduce some definitions.
$W_b$: This is the delay of an arbitrary packet from the time it appears at the head of its input port until the time it starts its transmission to the corresponding output port or output queue. This delay is due to the head of the line contention.
$D_2$: This is the delay of an arbitrary packet from the time it appears at the head of its input port until it starts its transmission at the output port. This delay is due to both head of the line contention and output contention.
From the discussion presented in Section II-A, it follows that the delay $D_2$ is that of an M/D/1 system with arrival rate $\lambda$, as shown in Fig. 7.
The total number of waiting customers in this M/D/1 queueing system is equal to the sum $x(t) + c(t)$. Since packets are transferred from the input queues to the output buffer in FIFO order, and are subsequently transmitted in the same order, we deduce that the system operates according to a first-come first-served discipline. It also follows that the waiting time $W_b$ is the time interval from the instant a packet arrives until the instant it crosses point C (the instant at which its transfer process from the input buffer to the corresponding output port or output buffer begins).
Proof of Proposition 1: Let $Q_b$ and $Q$ be the number of packets in the buffer positions left of the points C and A, respectively, as indicated in Fig. 7. Let $W_b$ and $W$ be the time from the instant a packet arrives until it crosses points C and A, respectively. From Little's theorem, we have the following relations for the first moments of the waiting times:

$$\bar{W}_b = \frac{\bar{Q}_b}{\lambda}, \qquad \bar{W} = \frac{\bar{Q}}{\lambda} = \frac{\lambda}{2(1-\lambda)}.$$

The second moments are given by [7, ...]
APPENDIX B WAITING TIME ANALYSIS-SYNCHRONOUS SYSTEM
Let us introduce some definitions.
$W_{b,s}$: This is the number of slots that an arbitrary packet has to wait from the time it appears at the head of its input port until it starts its transmission to the corresponding output port or output queue. This delay is due to the head of the line contention.
$W_s$: This is the number of slots that an arbitrary packet has to wait from the time it appears at the head of its input port until it starts its transmission at the output port. This delay is due to both head of the line contention and output contention.
Let us also define Qb) = p(0) +
