Abstract-It has been shown that maximal weight matching algorithms for input-queued switches are stable under any admissible traffic conditions with a speedup of 2. As link speeds increase, the computational complexity of these algorithms limits their applicability in high port-density switches and routers. In Frame-Based Maximal Weight Matching (FMWM), a new scheduling decision is generated once every several packet times, as opposed to on a packet-by-packet basis. Between new scheduling decisions, the configuration of the crossbar switch remains unchanged. We show that the FMWM algorithm is also stable with a speedup of 2 and obtain the speedup required for stability in systems with non-negligible reconfiguration delays. Simulation results illustrate the impact of the frame size on the delay experienced by arriving packets under different traffic scenarios.
I. INTRODUCTION
Input-queued packet switching architectures are commonly utilized in Internet routers as they offer pragmatic scalability while requiring moderate memory bandwidth. In such architectures, arriving packets are buffered at the ingress ports before traversing a crossbar switch en route to their destination (egress) ports. Typical switching fabrics partition variable-size packets into fixed-size cells, each corresponding to a single internal time slot, which are later reassembled into packets prior to departing the router.
A common technique for overcoming potential blocking and congestion at the input ports is called virtual output queueing (VOQ). In VOQ, a separate queue is maintained at the ingress port for each of the N output destinations.
Since each queue contains packets designated for a distinct output, the head-of-line blocking phenomenon is avoided.
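The VOQ arrangement can be illustrated with a minimal sketch. The class name `VOQInput` and its methods are hypothetical and chosen only for illustration; the point is that each ingress port keeps one FIFO per output, so a cell waiting for a busy output does not hold back cells bound for other outputs.

```python
from collections import deque


class VOQInput:
    """One ingress port holding a separate FIFO per output (illustrative only)."""

    def __init__(self, num_outputs):
        self.queues = [deque() for _ in range(num_outputs)]

    def enqueue(self, cell, output):
        self.queues[output].append(cell)

    def dequeue(self, output):
        # Returns None when the VOQ for this output is empty.
        return self.queues[output].popleft() if self.queues[output] else None

    def occupancy(self):
        # Queue lengths, usable as weights by a scheduler.
        return [len(q) for q in self.queues]


port = VOQInput(num_outputs=3)
port.enqueue("a", output=0)
port.enqueue("b", output=2)
port.enqueue("c", output=0)
# Even if output 0 is busy, the cell for output 2 can still be served.
first_for_2 = port.dequeue(2)
```

With a single FIFO per input, cell "b" would have waited behind cell "a"; here it is reachable directly through its own queue.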
A scheduler, whether implemented in a centralized or distributed manner, is responsible for receiving transmission requests from the virtual output queues and determining a matching (or scheduling) configuration between the inputs and the outputs, whereby at most one input is matched to one output at any given time, and vice versa.
A switch with a speedup of 1 allows at most one packet from each input to traverse the crossbar during one time slot. If a switch has a speedup of $s$, where $s \in \{1, \ldots, N\}$, it issues $s$ scheduling decisions, and correspondingly $s$ transmissions of packets from input buffers to output ports, during a single time slot. When $s > 1$, buffers are required at the output ports as well. Such architectures are commonly referred to as combined input-and-output-queued (CIOQ) [1]. Naturally, the need for a speedup greater than 1 introduces stringent hardware constraints, as it necessitates high memory bandwidth resources. Many scheduling algorithms have been proposed in recent years, with the common goal of offering scalability (with respect to port densities and link rates) as well as high performance. In the context of performance, a fundamental requirement of any scheduling algorithm is stability. Stated coarsely, a switch is said to be stable if all its queues are bounded and, hence, never backlog indefinitely.
This paper aims to address these important questions.
The rest of the paper is structured as follows. Section II is dedicated to the stability proof of the FMWM scheduling algorithm. In Section III a bursty traffic model, which can generate 100% traffic loads, is presented. Section IV discusses simulation results, while in Section V the conclusions are drawn.
II. STABILITY OF THE FMWM ALGORITHM
Consider a CIOQ switch with $N$ ports, as illustrated in figure 1. Let $Q_{ij}(t)$ denote the size at time $t$ of the VOQ at input $i$ holding packets destined to output $j$. We define the corresponding random arrival process, $A_{ij}(t) \in \{0, 1\}$, with a mean rate of packet arrivals from input $i$ to output $j$ of $E[A_{ij}(t)] = \lambda_{ij} \leq 1$. We consider the simple FMWM, which consists of an iterative process whereby in each iteration the maximal weight is found and a match is registered between its associated input-output pair. Each time a match is generated, the respective input and output are removed from contention during the following iterations. Assuming the weight matrix is not completely null, the number of iterations ranges from 1 to $N$. The configuration of the crossbar, which is the outcome of the FMWM algorithm, can be represented by the permutation matrix $S(t) = \{S_{ij}(t)\}$, where $S_{ij}(t) = 1$ if input $i$ is matched to output $j$ at time $t$, and $S_{ij}(t) = 0$ otherwise.
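The iterative procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fmwm_greedy_matching`, the tie-breaking rule (lowest indices first) and the choice to stop once only zero weights remain are assumptions made for the example.

```python
def fmwm_greedy_matching(weights):
    """Greedily match inputs to outputs by repeatedly taking the largest weight.

    weights[i][j] is the weight (here, queue length) of the VOQ at input i
    for output j. Returns an N x N 0/1 matrix S with S[i][j] = 1 when input i
    is matched to output j.
    """
    n = len(weights)
    match = [[0] * n for _ in range(n)]
    free_inputs = set(range(n))
    free_outputs = set(range(n))
    while free_inputs and free_outputs:
        # Largest remaining weight; ties go to the lowest indices (assumption).
        i, j = max(
            ((a, b) for a in free_inputs for b in free_outputs),
            key=lambda p: (weights[p[0]][p[1]], -p[0], -p[1]),
        )
        if weights[i][j] == 0:
            break  # only empty VOQs remain, so no further match is useful
        match[i][j] = 1
        # Remove the matched row and column from later iterations.
        free_inputs.remove(i)
        free_outputs.remove(j)
    return match


queue_lengths = [[3, 5, 0], [4, 1, 0], [0, 2, 0]]
S = fmwm_greedy_matching(queue_lengths)
```

In this example the first iteration selects the weight 5 at (input 0, output 1), the second selects 4 at (input 1, output 0), and the procedure then stops because the only remaining pair has an empty queue.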
Without loss of generality, let us assume that at time $t$ we have explicit knowledge of $Q_{ij}(t)$. We thus assign a weight value, equal to the queue length, to each VOQ, based on which the scheduler establishes the maximal weight matching for $k$ consecutive time slots. As a result, at time $t + k$, we have a new matching (scheduling) matrix $S_{ij}(t + k)$. As depicted in figure 2, the matching matrix remains unchanged during the following $k$ time slots. It should be noted that although we restrict our attention to a weighting scheme which reflects only the queue occupancies, a broader definition of queue weights may be applied.
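The frame-based operation can be sketched as a small simulation loop. The sketch assumes a speedup of 1 and queue-length weights for simplicity; the names `greedy_match` and `simulate` and the arrival format are hypothetical. The point it illustrates is that a new matching is computed only at frame boundaries and is held fixed for the $k$ slots in between.

```python
def greedy_match(weights):
    """Greedy maximal weight matching; returns a dict {input: output}."""
    n = len(weights)
    pairs = sorted(
        ((weights[i][j], i, j) for i in range(n) for j in range(n)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_in, used_out, match = set(), set(), {}
    for w, i, j in pairs:
        if w > 0 and i not in used_in and j not in used_out:
            match[i] = j
            used_in.add(i)
            used_out.add(j)
    return match


def simulate(queues, arrivals, k):
    """Serve the VOQ length matrix `queues` in place, one slot per entry of `arrivals`.

    arrivals[t] lists the (input, output) pairs of cells arriving in slot t.
    Returns the matching that was in force during each slot.
    """
    match, history = {}, []
    for t, slot_arrivals in enumerate(arrivals):
        if t % k == 0:
            match = greedy_match(queues)  # one decision per frame of k slots
        for i, j in match.items():
            if queues[i][j] > 0:
                queues[i][j] -= 1  # matched VOQ sends one cell per slot
        for i, j in slot_arrivals:
            queues[i][j] += 1
        history.append(dict(match))
    return history


queues = [[3, 0], [2, 0]]
history = simulate(queues, arrivals=[[], [], [], []], k=2)
```

Both inputs contend for output 0 here. With $k = 2$, input 0 holds the output for the first frame and input 1 for the second, so the unmatched VOQ waits a full frame before it can be served.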
There have been many crossbar technologies introduced in the literature with applications to input-queued packet switching fabrics. These range from electronics-based to optical technologies, such as wavelength division multiplexing (WDM) passive stars [3]. In most practical systems, there is a non-negligible reconfiguration delay that is introduced when the crosspoint switch is configured. Such delay can range from several nanoseconds to a few tens of microseconds. In addition, high-speed serializer/deserializer (SerDes) devices typically require somewhere in the order of 100 bit-times to lock on to a new signal. This locking
Proof: We will derive the sufficient speedup value, $\eta$, as follows. Since at most $k$ packets may arrive during $k$ time slots, when applying the FMWM algorithm the following inequality holds
from which we can write
$$L(t + k - 1) - L(t) = \sum_{i,j} \left( Q_{ij}(t + k - 1) - Q_{ij}(t) \right) \left( Q_{ij}(t + k - 1) + Q_{ij}(t) \right).$$
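The factorization above has the form of a difference of squares. As a hedged reading, if the Lyapunov function is taken to be the sum of squared queue lengths (an assumption made here, since its definition is not stated in this section), the step can be written out as:

```latex
L(t) = \sum_{i,j} Q_{ij}^2(t)
\quad\Longrightarrow\quad
L(t+k-1) - L(t)
  = \sum_{i,j} \bigl( Q_{ij}^2(t+k-1) - Q_{ij}^2(t) \bigr)
  = \sum_{i,j} \bigl( Q_{ij}(t+k-1) - Q_{ij}(t) \bigr)
               \bigl( Q_{ij}(t+k-1) + Q_{ij}(t) \bigr)
```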
By partitioning the above into the cases $Q_{ij}(t) < \eta k$ and $Q_{ij}(t) \geq \eta k$, we deduce the following
where $\Lambda = \|\lambda_{ij}\|$ denotes the admissible arrival rate matrix, which is doubly stochastic. We further observe that for all
which stems from the fact that FMWM guarantees that $S_{ij}(t) \neq 0$ always points to the largest value in row $i$ and column $j$, respectively. Since FMWM removes row $i$ and column $j$ after each iteration, (8) holds for all iterations.
Hence, since $Q_t$ is referred to identically on both sides of the inequality in (7), we conclude that $\langle \Lambda, Q_t \rangle < 2 \langle S, Q_t \rangle$.
However, the latter does not take into account the additional speedup needed to overcome the reconfiguration delays. If $\gamma_R$ denotes the portion of the frame that is "wasted" on reconfiguration, then speeding up the transmission of actual payload bits by $1 + \gamma_R$ compensates for the dead-times. Since the algorithm speedup requirement of 2 is independent of the additional speedup needed to compensate for reconfiguration dead-times, we conclude that for $\eta \geq 2(1 + \gamma_R)$ we have $\langle \Lambda, Q_t \rangle < \eta \langle S, Q_t \rangle = \eta W^{FMWM}(t)$. This suggests that there exists a value $\delta < 1$ for which $\langle \Lambda, Q_t \rangle < \delta \eta W^{FMWM}(t)$. Applying the latter to (7) yields
which concludes the stability proof.
III. MULTI-QUEUE ON/OFF ARRIVAL PROCESS WITH
In order to better evaluate the behavior of the FMWM
To overcome the maximal traffic load constraint, we introduce the ON/OFF/$\Omega$ Markov-modulated arrival process, whereby in a transition from the ON state the process visits an $\Omega$ state while generating an additional packet to the same destination. It is only from the $\Omega$ state that the process can transition back to the OFF state. Consider a general case where an arriving packet can be destined to each of the $N$ possible destinations. As shown in figure 3, an arrival destined to output $i$ is generated for each time slot that the Markov chain spends in the $ON_i$ state, $i = 1, 2, \ldots, N$, while no arrivals are generated when in the OFF state. Instead of directly transitioning from $ON_i$ states to OFF, a transition from $ON_i$ to $\Omega$ is pre-
It can be shown that all transition probabilities can be obtained by solving a set of linear equations which are determined by the mean burst sizes and arrival rates for each destination, and vice versa. The mean arrival rate per output can be expressed as $\lambda_i = \pi_{ON_i} + \pi_{\Omega_i}$, while the mean burst size for output $i$ is given by MBSi = 1 + I-a,. For the case of uniform distribution and identical mean burst sizes ($\beta_i \equiv \beta$, $\lambda_i = \lambda / N$), we have 7 . . = s, $\forall i \neq j$. The key attribute of this model is that given any set of mean burst sizes and any traffic load distribution, the Markov chain in figure 3 can be constructed so as to yield the desired traffic generation engine. More importantly, the latter can achieve 100% traffic load.
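A traffic source of this kind can be sketched as below. The transition structure is a simplified assumption, not the chain of figure 3: the parameters `alpha` (probability of remaining in $ON_i$), `p_off` (probability of leaving an $\Omega$ state for OFF) and `p_start` (probability of leaving OFF), the uniform choice of the next destination, and the choice to start in an ON state are all hypothetical. What the sketch does preserve is the behaviour described in the text: every burst ends with one extra packet from an $\Omega$ state, and OFF is reachable only from $\Omega$.

```python
import random


def on_off_omega_source(num_outputs, alpha, p_off, p_start, slots, seed=0):
    """Return a list with one entry per slot: a destination index or None.

    Assumed transitions (illustrative only):
      OFF     -> ON_i with probability p_start, i chosen uniformly
      ON_i    -> ON_i with probability alpha, otherwise OMEGA_i
      OMEGA_i -> OFF with probability p_off, otherwise ON_j, j chosen uniformly
    """
    rng = random.Random(seed)
    state = ("ON", rng.randrange(num_outputs))  # assumed initial state
    trace = []
    for _ in range(slots):
        kind, dest = state
        # One packet to `dest` per slot spent in ON or OMEGA, none in OFF.
        trace.append(dest if kind != "OFF" else None)
        if kind == "OFF":
            if rng.random() < p_start:
                state = ("ON", rng.randrange(num_outputs))
        elif kind == "ON":
            if rng.random() >= alpha:
                state = ("OMEGA", dest)
        else:  # OMEGA: the only state from which OFF can be reached
            if rng.random() < p_off:
                state = ("OFF", None)
            else:
                state = ("ON", rng.randrange(num_outputs))
    return trace


trace = on_off_omega_source(
    num_outputs=4, alpha=0.0, p_off=0.0, p_start=1.0, slots=10, seed=1
)
```

Setting `p_off` to 0 keeps the chain out of the OFF state, so a packet is generated in every slot, which is how a source of this shape can reach 100% load. With `alpha` set to 0 as well, every burst consists of exactly one ON slot followed by one $\Omega$ slot to the same destination.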
IV. SIMULATION RESULTS
In order to evaluate the performance of the FMWM algorithm under different traffic conditions and interval durations, three simulation sets were carried out. In all simulations an 8-port switch was considered with a speedup of $2(1 + \gamma_R)$. In the first simulation set, the arrival process was Bernoulli i.i.d. with uniformly distributed destinations. Figure 4 depicts the mean delay when employing FMWM with different switching frame sizes ($k$).
As can be intuitively appreciated, the longer the frame the larger the mean delay, which stems from the fact that during many switching intervals fewer than $k$ consecutive packets are transmitted. Moreover, it is noted that larger frame sizes exhibit faster delay growth (steeper slope). This is because once a matching matrix is generated, the unmatched VOQs will not transmit any cells during the following $k$ time slots, yet still buffer newly arriving cells (which contribute to the average waiting time).
The second set of simulations examined the impact of bursty traffic on the performance of the FMWM algorithm. Uniformly distributed bursty traffic with identical mean burst size (MBS) was employed. Figure 5 depicts the resulting mean delay for different frame sizes ($k = 1, 2, 4, 8$) but the same mean burst size (MBS = 8). An interesting observation here is that the difference in delay between the first three frame sizes ($k = 1$, 2 and 4) is small relative to the delay increase shown for $k = 8$. This can be explained by the fact that bursts which are smaller than or equal to the frame size result in fully utilized transmission intervals, while packets in bursts that are larger than the frame size experience higher average delay. The last set of simulations was targeted at examining the impact of the mean burst size on the delay performance. Once again, an 8-port switch was considered whereby bursts are uniformly distributed across the outputs. Figure 6 shows the mean delay as a function of the mean burst sizes for a fixed frame size of 8 cells. A clear difference in relative performance is observed between the lower load and higher load regions. It appears that the larger the mean burst size, the slower the relative increase in delay due to higher loads. This could be explained by the fact that the large mean burst sizes already introduce considerable delays at lower loads due to "overflowing" the frame boundaries. As such, the increase in the average delay is not as drastic as that of traffic which carries smaller mean burst sizes.
V. CONCLUSIONS
This paper studies the frame-based maximal weight matching algorithm as a scalable scheduling scheme for large port-density input-queued switches. Through the use of Lyapunov functions, a sufficient speedup needed to guarantee stability has been obtained. Since the service discipline governing FMWM is inherently correlated, it has been shown that packets belonging to bursty traffic often experience lower average delay than that experienced by packet arrivals which are Bernoulli i.i.d. This is an important observation in view of the fact that real-life data traffic tends to be correlated on different levels. Moreover, the frame switching analysis framework presented here can be broadened to address a range of input-queued scheduling algorithms.
0-7803-8924-7/05/$20.00 © 2005 IEEE.
Figure 5. Mean cell delay as a function of the offered load for uniformly distributed arrivals with mean burst size of 8 cells and different frame sizes (k).
