Abstract-The scheduler and the switching fabric are two major hardware components of a cell switch. For a switch using a non-blocking switching fabric, the performance of the switch depends on the performance of its cell scheduler. We introduce the concepts of relative and universal scheduler scalabilities. Informally, a scheduler is relatively scalable with respect to a switching fabric if its structure is not more complex than the structure of its associated non-blocking switching fabric. A scheduler is universally scalable if its structural complexity is not larger than the structural complexity of any non-blocking switching fabric. Based on algorithm-hardware codesign, we present a universally scalable scheduler with O(N log N) interconnection complexity. We show by simulation that the performance of the proposed scheduler is almost the same as that of non-scalable schedulers.
I. INTRODUCTION
Most high-speed packet switches employ cell switches as their cores. In such a switch, variable-length packets are segmented into fixed-size cells as they arrive, transferred across the switching fabric, and reassembled into the original packets before they depart. Different queuing schemes have been proposed for such a switch. Because they can achieve 100% throughput and provide quality of service (QoS) guarantees, output queuing (OQ) switches are commonly employed in many commercial switches and routers. However, OQ is impractical for switches with high line rates and/or large numbers of ports, since it requires that the switching fabric and memory run as fast as N times the line rate for an N x N switch. With the switching fabric and memory running at the line rate, input queuing (IQ) switches are scalable to high line rates and/or large numbers of ports. However, due to the problem of head-of-line (HOL) blocking, the maximum throughput of an IQ switch with first-in-first-out (FIFO) queues is limited to 58.6% under uniform traffic [1]. Virtual output queues (VOQs) have been proposed for IQ switches to remove HOL blocking while retaining the scalability of IQ switches. Fig. 1 shows an N x N IQ switch with VOQs, in which each input port maintains N VOQs, with $Q_{i,j}$ buffering cells from input port $I_i$ destined for output port $O_j$.
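For reference, the 58.6% figure quoted above for FIFO input queuing is the well-known saturation throughput under uniform i.i.d. traffic in the limit of a large number of ports. Writing $\rho_{\max}$ for that saturation throughput (a symbol introduced here only for illustration, not taken from the paper), the limit is

\[
\lim_{N \to \infty} \rho_{\max} = 2 - \sqrt{2} \approx 0.586 .
\]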
We assume that time is slotted and that a cell slot equals the transmission time of a cell on an input/output line. For an IQ switch with VOQs, performance depends heavily on the scheduling algorithm, which decides which N cells out of the $N^2$ HOL cells are sent across the switching fabric in each cell slot. A high-performance switch tends to have a large number of input/output ports. It is well known that wiring takes the most chip area in a large digital circuit/system. Thus, the implementability, cost, and performance of such a large switch depend on the interconnection complexities of its switching fabric and scheduler. We define the interconnection complexity (also called wiring density) of a circuit as the [...] and $C_{sch}$ to denote the interconnection complexity of the non-blocking switching fabric and the scheduler of a switch, respectively. [...] The interconnection complexity of such a scheduler is even larger than that of some existing non-blocking and rearrangeable non-blocking switching fabrics [...], respectively. For such a switching fabric, the scheduler shown in Fig. 2 is non-scalable, and its use cannot be well justified.
Clearly, $\Omega(N \log N)$ is the lower bound for the interconnection complexity of any N x N switching fabric. Thus, any scheduler A with interconnection complexity $C_{sch}(N, A) = O(N \log N)$ is universally scalable. This motivates our study of universally scalable schedulers. In this paper, we propose a universally scalable scheduler architecture based on the proposed round-robin priority matching (RRPM) algorithm and a multiprocessor system. RRPM is an iterative algorithm with each iteration consisting of two steps, Request and Grant. Combining the features of DRRM and PIM, in RRPM each input port selects one request according to the round-robin discipline and each output port grants one request randomly. We show that using a hypercube and a linear array, each iteration of [...]

The rest of the paper is organized as follows. Section II presents the RRPM algorithm and the hardware scheduler used to implement it. Section III gives the simulation results of RRPM and a comparison with other scheduling algorithms. Section IV concludes the paper.
II. THE RRPM ALGORITHM AND ITS IMPLEMENTATION
In this section, we first present the RRPM algorithm and then discuss its hardware implementation architecture.
A. The RRPM Algorithm
Designing a scalable scheduler is an algorithm-architecture co-design problem. An algorithm-structured scheduler is designed to fully exploit the parallelism inherent in the cell scheduling problem. By combining the features of DRRM and PIM, we propose the round-robin priority matching (RRPM) algorithm, which uses the round-robin discipline to select a request at each input (port) and random selection to decide a grant at each output (port). As we will discuss in Section II-B, the random selection is implemented by selecting the smallest among a subset of random numbers (or priorities) generated by the inputs. Assuming that each input port $I_i$ is associated with a request pointer $r_i$, which indicates the request starting point, RRPM operates iteratively, with each iteration consisting of the following two steps.
Step 1: Request. If an unmatched $I_i$ has at least one request, it selects one request in a round-robin manner, starting from the VOQ that $r_i$ points to, and sends the request to the corresponding output. The pointer $r_i$ is updated to one beyond the requested output if and only if the request is granted in Step 2 of the first iteration.
Step 2: Grant. If an unmatched $O_j$ receives at least one request, it grants one request at random.

RRPM stops either after a predetermined number of iterations or when no more matches can be found, which means that a maximal size matching has been found. Fig. 3 shows an example of RRPM with one iteration for a 4 x 4 switch, assuming that each input has no empty VOQ. Initially, we assume all the request pointers point at 1. In the first cell slot, all inputs send requests to output 1, which randomly grants one request (shown as a dark edge in Fig. 3), say from input 2. Then only the request pointer at input 2 is updated, to 2. In the second cell slot, input 2 requests output 2 and is granted, while the other inputs continue requesting output 1, which randomly grants one request, say from input 4. Then the request pointer at input 2 is updated to 3 and the request pointer at input 4 is updated to 2. In the third cell slot, inputs 2 and 4 request outputs 3 and 2, respectively, and are granted, while inputs 1 and 3 continue requesting output 1, which randomly grants one request, say from input 1. Then the request pointers at inputs 1, 2, and 4 are updated to 2, 4, and 3, respectively. In the fourth slot, the request pointers are fully desynchronized: each output receives a request from a different input, so every request is granted. A maximal size matching of size 4 is found in this slot.

We can run RRPM for multiple iterations to enlarge the matching found in each cell slot. For an N x N switch, it takes up to N iterations to find a maximal size matching. In practice, however, because of the desynchronization effect of the request pointers, RRPM needs far fewer iterations to find a maximal size matching. By simulation, we show that log N iterations are adequate to achieve satisfactory performance.

An important objective in designing a scheduling algorithm is simplicity of implementation. In the following, we discuss a unique hardware implementation architecture for RRPM.

[...] OUT(i) = j if a request $Q_{i,j}$ is selected for scheduling according to the round-robin scheme, and OUT(i) = $\infty$ otherwise; S(i) indicates whether $I_i$ is selected in an iteration; and "|" stands for the concatenation operation. Each iteration of RRPM is implemented on an MH by the following steps.
Step 1: Sort. Sort the W(i)'s using key $f_2 | f_3$ = OUT(i) | P(i) [...]
Step 4: Spread. Send each request word with $f_4 = 1$ to the PE whose index is equal to its $f_2$ value.
Step 5: Grant. If $PE_i$ receives a request word in Step 4, do the following: if SO(i) = 0, then set SO(i) = 1; otherwise, set $f_4 = 0$ in its received request word.
Step 6: Sort. Sort the request words with $f_4 = 1$ using key $f_1$ in non-increasing order.
Step 7: Spread. Send each request word with $f_4 = 1$ to the PE with index equal to its $f_1$.
Step 8: Accept. If $PE_i$ receives a request word in Step 7, [...]
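Independently of the hardware steps, the Request and Grant steps of RRPM described in this subsection can be sketched behaviourally in Python. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `rrpm_slot`, the 0-based indexing, the `voq` occupancy matrix, and the injectable `choose` function are all introduced here for illustration. In particular, letting inputs skip outputs that were matched in earlier iterations is an assumption of this sketch; it has no effect when a single iteration is used.

```python
import random


def rrpm_slot(voq, pointers, iterations=1, choose=random.choice):
    """Schedule one cell slot with RRPM and return {input: output}.

    voq[i][j]  -- number of cells queued at input i for output j (0-based)
    pointers   -- request pointers r_i, updated in place
    choose     -- picks one input from a list of requesters (random by default)
    """
    n = len(voq)
    match = {}            # input -> output
    matched_out = set()
    for it in range(iterations):
        # Step 1: Request. Each unmatched input scans its VOQs round-robin,
        # starting at its pointer, and requests the first eligible output.
        # (Skipping already matched outputs is an assumption of this sketch.)
        requests = {}     # output -> list of requesting inputs
        for i in range(n):
            if i in match:
                continue
            for k in range(n):
                j = (pointers[i] + k) % n
                if voq[i][j] > 0 and j not in matched_out:
                    requests.setdefault(j, []).append(i)
                    break
        if not requests:
            break         # nothing left to match: the matching is maximal
        # Step 2: Grant. Each requested output grants one request at random.
        for j, inputs in requests.items():
            i = choose(inputs)
            match[i] = j
            matched_out.add(j)
            # The pointer moves one beyond the granted output only when the
            # grant happens in the first iteration.
            if it == 0:
                pointers[i] = (j + 1) % n
    return match
```

The function does not dequeue cells; the caller is expected to remove the matched cells from `voq` after each slot. Passing a scripted `choose` that reproduces the grants of the Fig. 3 example (inputs 2, 4, and 1 at output 1) leaves the 0-based pointers at `[1, 3, 0, 2]` after three slots, which corresponds to the 1-based values 2, 4, 1, and 3 in the text.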
B. Scalable Hardware Implementation
Our hardware implementation of RRPM is based on a modified hypercube (MH) with N simple processing elements (PEs). The inter-processor interconnection of the MH is a combination of a linear array and a hypercube, as shown in Fig. 4. We define a scheduling cycle as the process of finding a matching, which is not necessarily a maximal size matching.

One can verify that this implementation of algorithm RRPM is correct. After sorting in Step 1, at most one request word is selected for each output in Step 2. Steps 3 and 4 are used to check whether the selected outputs have been matched in previous iterations. Step 5 updates the status of matched outputs. Steps 6 and 7 inform the inputs whether or not their requests are granted. Step 8 updates the status of matched inputs.

Fig. 5 illustrates how the MH implements RRPM, using an example for an 8 x 8 switch. Each rectangle represents a PE, and the four numbers inside each rectangle represent the four fields of the request word in that PE. Assume that at the beginning of a scheduling cycle, we have the initial request words in each PE as shown in the first row. All SI's and SO's are initialized to 0. After Step 1, the list is sorted by fields $f_2$ and $f_3$ (i.e., OUT(i) and P(i)). The sorted list contains five segments, which have $f_2 = 1$, $f_2 = 4$, $f_2 = 5$, $f_2 = 6$, and $f_2 = 8$, respectively. The first request word of each segment is selected, as indicated by the change in field $f_4$ after Step 2. In Step 3, these request words are packed into the first five PEs so that they can be spread to the PEs associated with their $f_2$ fields in Step 4. In Step 5, those PEs that receive request words in Step 4 change their SO variables to 1. These request words are sorted again according to their $f_1$ fields in Step 6 and spread to PEs associated with [...]
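The correctness argument above can be mirrored by a sequential emulation of one iteration. The sketch below is an assumption-laden illustration, not the MH hardware: it runs on a single processor, so the pack and spread steps reduce to direct indexing, and the meaning of field $f_1$ (taken here to be the input index), the use of `None` in place of OUT(i) = infinity, and the function name `sorted_iteration` are all hypothetical choices made for this example.

```python
import random


def sorted_iteration(out, si, so, rng=random):
    """Emulate one sort-select-grant iteration sequentially.

    out[i] -- output requested by input i, or None if it has no request
    si, so -- 0/1 matched flags for inputs and outputs, updated in place
    Returns the (input, output) pairs matched in this iteration.
    """
    n = len(out)
    # Request word fields: f1 = input index (assumed), f2 = OUT(i),
    # f3 = random priority P(i), f4 = selected flag.
    words = [[i, out[i], rng.random(), 0]
             for i in range(n) if out[i] is not None and not si[i]]
    # Step 1: sort on f2 | f3 so that the requests for each output form one
    # contiguous segment ordered by priority.
    words.sort(key=lambda w: (w[1], w[2]))
    # Step 2: select the first word of each segment (smallest priority).
    prev = None
    for w in words:
        if w[1] != prev:
            w[3] = 1
            prev = w[1]
    # Steps 3 to 5: deliver each selected word to its output and grant it
    # only if that output was not matched in an earlier iteration.
    for w in words:
        if w[3] == 1:
            if so[w[1]] == 0:
                so[w[1]] = 1
            else:
                w[3] = 0
    # Steps 6 to 8: return the granted words to their inputs and mark those
    # inputs as matched.
    granted = []
    for w in words:
        if w[3] == 1:
            si[w[0]] = 1
            granted.append((w[0], w[1]))
    return sorted(granted)
```

Because the first word of each sorted segment carries the smallest random priority, selecting it is equivalent to granting one contending request at random, which is how the text describes the random selection being implemented.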
III. PERFORMANCE EVALUATION
In this section, we evaluate the performance of RRPM in terms of the average cell latency under both Bernoulli and bursty arrivals. The cell latency is the time that a cell spends in a switch measured in number of cell slots. We consider a 16 x 16 switch assuming first that the arrival at each input is independent and identically distributed (i.i.d.).
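The uniform Bernoulli i.i.d. arrival model mentioned above can be sketched as follows. This is a minimal illustration, assuming that in each cell slot every input independently receives a cell with probability equal to the offered load and that the destination is drawn uniformly over the outputs; the function name `bernoulli_arrivals` and its parameters are hypothetical and not taken from the paper.

```python
import random


def bernoulli_arrivals(n, load, rng=random):
    """Return one slot of arrivals for an n x n switch.

    Entry i is the destination output (0-based) of the cell arriving at
    input i in this slot, or None if no cell arrives at that input.
    """
    return [rng.randrange(n) if rng.random() < load else None
            for _ in range(n)]
```

Recording the arrival slot of each cell when it is enqueued, and subtracting it from the slot in which the cell departs, gives the cell latency in cell slots as defined above.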
We first show the performance of RRPM under Bernoulli arrivals. Fig. 6 compares the performance of RRPM, DRRM, iSLIP, and PIM, all with one iteration, under uniform Bernoulli arrivals. As shown in the figure, the average cell latency of PIM increases rapidly when the offered load exceeds 0.64. The average cell latency of RRPM is identical to that of iSLIP and DRRM when the offered load is lower than 0.5, slightly higher than that of iSLIP and DRRM when the offered load is between 0.5 and 0.85, and slightly lower than that of iSLIP and DRRM when the offered load is larger than 0.85. This indicates that when line rates become very high and there is not enough time for multiple iterations, RRPM is a better choice than DRRM and iSLIP over a large load range.

[...] (1 - p)/p, where p is the offered load of each input source. We assume that the destination of each burst is uniformly distributed. Fig. 8 illustrates the performance of RRPM with four iterations under Bernoulli arrivals and bursty arrivals with average burst lengths of 16, 32, and 64. It shows that, as the average burst length increases, the average cell latency increases correspondingly, as expected. Fig. 9 shows that, under bursty traffic, RRPM with four iterations achieves better performance than DRRM with four iterations.
As discussed in Section II-B, we can extend RRPM by allowing each input to issue multiple requests per iteration. We observe that by allowing each input to issue only log N requests, RRPM achieves high performance comparable to that of iSLIP. Fig. 10 shows that the extended RRPM achieves almost the same performance as iSLIP when the offered load [...]

Fig. 10. Delay performance of RRPM extended with log N requests per [...]

[...] switches. Using algorithm-architecture co-design, we proposed a universally scalable scheduler, which is based on a new scheduling algorithm, RRPM, and a multiprocessor system. We showed that it has scheduling performance comparable to that of non-scalable schedulers. It remains a great challenge to design faster universally scalable schedulers that have good performance.
