Abstract-This paper presents and analyzes a high-performance, robust, and scalable scheduling algorithm for input-queued switches called distributed sequential allocation (DISA). In contrast to pointer-based arbitration schemes, the proposed algorithm is based on a synchronized output reservation process, whereby each input selects a designated output while taking into consideration both local transmission requests and the availability of global resources. The distinctiveness of the algorithm lies in its ability to offer high performance when multiple cells are transmitted within each switching interval. Relaxed switching-time requirements allow for the incorporation of commercially available crosspoint switches. The result is a pragmatic and scalable solution for high port-density switching platforms. The efficiency of the scheme and its robustness in the presence of admissible traffic, without the need for speedup, is established through analysis and computer simulations. Performance results are shown for various traffic scenarios including nonuniform destination distribution, correlated arrivals and multiple classes of service.
I. INTRODUCTION

Scalable packet scheduling algorithms that offer high performance for input-buffered switches and routers have been the focus of many academic and industry studies in recent years. The ever-growing need for additional bandwidth and more intelligent service provisioning in next-generation networks necessitates the introduction of scalable packet scheduling solutions that go beyond legacy schemes. Recent switching applications, such as those introduced by next-generation storage area network (SAN) platforms, involve aggregate bandwidth requirements in the Tbit/s range. As a result, the challenges of designing high-performance and scalable switching architectures drive the need for innovative and efficient scheduling schemes.
As port densities and line rates increase, input-buffered switch architectures are acknowledged as a pragmatic approach for implementing scalable switches and routers. In these architectures, arriving packets or cells are stored in queues at the ingress ports until scheduled for transmission. With the introduction of commercially available high port-density crosspoint switches [1], [2], it has become more compelling to propose packet scheduling techniques that interoperate with such switches to offer a low chip count, high-performance, and cost-effective switch fabric solution.
The majority of the proposed scheduling algorithms targeting high port-density switches are based on an iterative request-accept-grant process [3]-[5]. Although such algorithms offer ease of implementation, their performance typically degrades when the traffic is correlated or nonuniformly distributed among the destinations. The primary reason is that the pointer-based mechanisms employed by these algorithms tend to synchronize and, thus, limit the throughput and overall performance. State information is passed through consecutive time slots, whereby independent decisions are made at each port. As a means of overcoming the inherent limitations of pointer-based schemes, speedup is usually deployed such that the internal fabric bandwidth is $S$ times higher than that of the incoming traffic, where $1 \le S \le N$ and $N$ is the number of ports. A speedup of $N$ would yield performance equivalent to that of an output queued switch. It has been shown [6] that a speedup of two is sufficient in virtual output queue (VOQ) architectures to accurately emulate the performance of an output queued switch. However, speeding up the fabric carries enormous ramifications in terms of chip count, power, flexibility, cost, and feasibility, which motivates the search for switching architectures and scheduling algorithms that would overcome the need for a large speedup.
An additional drawback of many pointer-based schemes is their connectivity complexity, which is known to be $O(N^2)$. Consequently, these algorithms have been proposed primarily for switches with a small number of ports [3], and are generally unsuitable for next-generation switches, where hundreds and even thousands of ports are considered.
In this paper, we propose a scheduling algorithm called distributed sequential allocation (DISA) that represents a shift from legacy scheduling schemes: it is a non-pointer-based approach in which the contention resolution process takes into consideration the availability of both local and global resources. Moreover, the DISA algorithm requires only $O(N)$ connectivity, yielding a pragmatic solution for high-performance, crosspoint-based, single-stage switching architectures.
The paper is organized as follows. Section II outlines the queueing notation and model formulations utilized throughout the paper for performance analysis. Section III describes the DISA scheduling algorithm and its corresponding switch architecture. Section IV presents analysis for Bernoulli independent and identically distributed (i.i.d.) arrival patterns, while Sections V and VI focus on nonuniform destination distribution and multiple classes of service, respectively. In Section VII, the algorithm is evaluated in the presence of bursty traffic. Section VIII addresses the scheme's hardware implementation considerations followed by a summary in Section IX.
II. NOTATION AND QUEUEING MODEL FORMULATION
We begin by establishing the notation and queueing formulation framework that will be used to derive the performance metrics. We assume a discrete-time VOQ system with a single server and infinite buffer capacity. The number of queues corresponds to the number of distinct destinations in the system. All events occur at discrete time slot intervals in which at most a single arrival and a single service event may occur. Let $\lambda_{ij}(t)$ denote the probability of arrival to VOQ $j$ in port $i$ at time step $t$. We label $\lambda_{ij}$ as the corresponding mean probability of arrival. In order to guarantee admissibility of the traffic, we require that $\sum_{i} \lambda_{ij} < 1$, $\sum_{j} \lambda_{ij} < 1$. For convenience, we employ the early-arrival model, whereby an arrival will precede a service event within any given time slot. As will be explicitly shown in the following sections, the service discipline governing each VOQ system (ingress port) can be approximated by an i.i.d. Bernoulli process, resulting in geometrically distributed interservice times. In the context of the proposed architecture, the task of the scheduler is to determine which of the queues within each VOQ is to be scheduled for transmission during the subsequent switching interval.
Let $\mu$ denote the probability of service to the VOQ during each interval. Once a service event occurs, an internal arbitration scheme determines which of the queues is granted service.
$Q_{ij}(t)$ represents the occupancy of queue $j$ in port $i$ at time step $t$, such that
$$Q_{ij}(t+1) = Q_{ij}(t) + A_{ij}(t) - D_{ij}(t) \tag{1}$$
where $A_{ij}(t)$ and $D_{ij}(t)$ are the number of arrivals and departures during time step $t$, respectively. To ensure the existence of a stochastic equilibrium of the queueing system, the arrival rate for each queue should converge to the departure rate, yielding the condition
$$\lim_{t \to \infty} E[A_{ij}(t)] = \lim_{t \to \infty} E[D_{ij}(t)] \tag{2}$$
Recalling the statement of convergence between the arrival and departure rates, a generic balance equation under stability can be written as
$$\lambda = \sum_{k=1}^{\infty} \pi_k \mu_k \tag{3}$$
where $\lambda$ is the mean probability of arrival, $\pi_k$ is the steady-state queue size distribution (i.e., $\pi_k = \lim_{t \to \infty} \Pr\{Q(t) = k\}$) and $\mu_k$ is the probability of service given that the queue size is $k$. For a Geo/Geo/1 system in which $\mu_k = \mu$, a direct outcome of the steady-state balance equation for the case of early arrival can thus be written as
$$\lambda = \mu \left[ 1 - \pi_0 (1 - \lambda) \right] \tag{4}$$
where the term $\pi_0 (1 - \lambda)$ expresses the probability that the queue was empty prior to the arrival phase and no cell has arrived. When rearranged, (4) yields the well-known result for Geo/Geo/1 systems [8]
$$\pi_0 = \frac{\mu - \lambda}{\mu (1 - \lambda)} \tag{5}$$
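As a numerical illustration of (4) and (5), the following Python sketch evaluates the empty-queue probability, the mean queue size, and the mean delay of a Geo/Geo/1 queue under the early-arrival model. The function name `geo_geo_1_early_arrival` and the symbols `lam` and `mu` are illustrative choices, not notation from the paper. The sketch assumes the queue is observed at slot boundaries, in which case the standard birth-death argument gives a geometric distribution $\pi_k = (1 - \rho)\rho^k$ with $\rho = 1 - \pi_0$.

```python
def geo_geo_1_early_arrival(lam, mu):
    """Steady-state metrics of a Geo/Geo/1 queue, early-arrival model.

    lam: per-slot arrival probability; mu: per-slot service probability.
    Assumes the queue size is observed at slot boundaries.
    """
    if not 0.0 < lam < mu <= 1.0:
        raise ValueError("stability requires 0 < lam < mu <= 1")
    pi0 = (mu - lam) / (mu * (1.0 - lam))  # empty-queue probability, cf. (5)
    rho = 1.0 - pi0                        # ratio of the geometric distribution
    mean_q = rho / (1.0 - rho)             # mean of pi_k = (1 - rho) * rho**k
    mean_delay = mean_q / lam              # Little's theorem
    return pi0, mean_q, mean_delay


pi0, mean_q, mean_delay = geo_geo_1_early_arrival(lam=0.05, mu=0.1)
# Balance check, cf. (4): the arrival rate equals the departure rate.
assert abs(0.05 - 0.1 * (1.0 - pi0 * (1.0 - 0.05))) < 1e-12
```

The final assertion simply confirms that the computed $\pi_0$ satisfies the balance equation it was derived from.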
The common goal for each of the examined scenarios is to obtain a closed-form expression for the steady-state queue size distribution, $\pi_k$. Of particular interest is $\pi_0$, which represents the share of time that the queue is empty of cells. In some cases, $\pi_0$ is sufficient to derive the queue size distribution, while in others additional information is required. Based on the queue size distribution, important performance metrics can be directly derived, including the mean queue size, $\bar{Q}$, and, using Little's theorem [8], the mean queue latency, $\bar{W}$. Additionally, in any given stable queueing system $\sum_{k} \pi_k = 1$ is satisfied.
Throughout the paper we call attention to the distinction between two independent attributes of traffic patterns: the destination distribution and the arrival process. The destination distribution corresponds to the probability of an arriving cell being destined to each of the output ports. The most common and simple case is that of uniform distribution, whereby a packet has an identical likelihood of being destined to each output. However, real-life traffic has been shown to be characterized by nonuniform destination distribution throughout all levels of the network hierarchy [9], [10].
The arrival process pertains to the nature of the correlation that may or may not exist between consecutive cell arrivals. The simplest and most commonly deployed arrival process is Bernoulli i.i.d., which is a memoryless process generating an arrival with a given probability regardless of the history of arrivals. Once again, it has been extensively shown in the literature that pragmatic network traffic tends to be correlated or "bursty" [9] . Investigating performance under bursty conditions is, therefore, key for a comprehensive evaluation of any scheduling algorithm. By independently defining the destination distribution and arrival process, a wide range of fabric ingress stimuli can be generated.
III. DISA SCHEDULING ALGORITHM
A. Switch Architecture
We begin by presenting the switch architecture over which the DISA algorithm is implemented. Fig. 1 depicts a block diagram of the proposed single-stage, nonblocking switch fabric architecture. Packets received from the switch port at each line-card flow into a packet processing engine, such as a network processor or traffic manager, which performs important tasks including policing, shaping, and connection admission control. The processed packet streams are subsequently forwarded to an ingress fabric interface device, which typically segments the packets into smaller fixed-size cells and provides a buffering stage to the fabric.
In order to avoid head-of-line blocking [7], virtual output queueing is deployed. In systems where multiple classes of service are supported, several VOQs are maintained for each output port (one per class of service), such that the total number of queues is $N \cdot C$, where $N$ is the number of ports and $C$ is the number of classes of service. Cells are transmitted using high-speed data links that exist between the linecards and the crosspoint switches. The task of the scheduling mechanism is to determine the transmission configuration of the ports at any given time. The efficiency of the scheduler has a paramount impact on the amount of buffering required at the linecards and, consequently, the latency through the system.
Two types of control links are employed by the proposed architecture between the nodes and the fabric. The first is an $N$-bit bus called the offered destination set (ODS), to which all nodes have read and write access. The elements of the ODS indicate the reservation status of the outputs. We let a logical "one" denote an available (unmatched) output port, while a logical "zero" represents a previously reserved output. In addition to the ODS, each node receives a dedicated signal from a central arbitration unit (CAU) that indicates when the node is to perform reservation, as will be clarified in the following section. The switching granularity (or interval) ranges from a single time slot (cell-by-cell) to a number of time slots per interval, where a time slot represents the duration of a single cell.
B. Scheduling Algorithm
During each switching interval, the CAU signals the ports one at a time to perform output reservation. A random order of signaling the ports is drawn at the beginning of each interval. Upon receiving a signal from the CAU, each node performs output reservation according to two primary guidelines: 1) global resources (outputs) availability and 2) local considerations as reflected by its internal priority map.
The term queue weight or priority may refer either to a single bit denoting queue occupancy (empty or nonempty), or to a value pertaining to the queue occupancy. Once the last port completes the reservation process, the crosspoint switches are reconfigured and each port transmits cells to its designated output. Concurrently, a new reservation interval (for the following transmission cycle) begins, as illustrated in Fig. 3. As a result, there is no transmission "dead time" involved, since transmission and scheduling occur simultaneously.
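The reservation process described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `disa_reserve`, the list-of-lists queue representation, and the random tie-breaking among equally long queues are hypothetical choices made for the example. The ODS is modeled as a list of booleans in which `True` marks an available output.

```python
import random


def disa_reserve(queues, arbitration="random", rng=None):
    """One reservation interval of a DISA-style scheduler (illustrative).

    queues[i][j] is the occupancy of the VOQ at input i destined to output j.
    Returns a dict mapping each matched input to its reserved output.
    """
    rng = rng or random.Random()
    n = len(queues)
    ods = [True] * n                 # offered destination set: True = available
    order = list(range(n))
    rng.shuffle(order)               # random signaling order for this interval
    match = {}
    for i in order:                  # one phase per signaled port
        candidates = [j for j in range(n) if ods[j] and queues[i][j] > 0]
        if not candidates:
            continue                 # nothing to send to any available output
        if arbitration == "lqf":
            longest = max(queues[i][j] for j in candidates)
            # Assumption: ties among longest queues are broken at random.
            candidates = [j for j in candidates if queues[i][j] == longest]
        j = rng.choice(candidates)
        ods[j] = False               # mark the output as reserved
        match[i] = j
    return match


print(disa_reserve([[5, 1], [0, 3]], arbitration="lqf"))
```

Because each output is removed from the ODS as soon as it is reserved, no two inputs can select the same output within an interval.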
C. Stability Under Admissible Traffic
In this section, we provide a fundamental theorem addressing the stability of the DISA algorithm. We begin with some basic definitions.
Definition 1: A phase is a cycle within the DISA algorithm, whereby a port is signaled to select an output. There are exactly $N$ phases within each switching interval.
Definition 2: $M_k$ is defined as the mean number of nonempty queues belonging to the ODS during phase $k$.
The following two definitions pertain to the arbitration mechanism employed in each ingress module. The goal of such schemes is to select which queue is granted service.
Definition 3: Random arbitration is a selection mechanism in which each nonempty queue belonging to the ODS has an equal probability of being granted service.
Definition 4: Longest-queue-first (LQF) is a selection mechanism in which the queue with the highest occupancy, out of the set of queues belonging to the ODS, is granted service.
An important property of $M_k$ is that it forms a monotonically declining series, i.e., $M_{k+1} \le M_k$, since the queues counted by $M_k$ are, by definition, a subset of the ODS, whose size is itself monotonically declining. Consequently, under random arbitration the probability of service to a nonempty queue belonging to the ODS is $1/M_k$.
Definition 5: A scheduling algorithm is said to be strictly stable if the following holds:
where $A(t)$ and $D(t)$ correspond to arrivals and departures at time slot $t$, respectively.
When employing this notion we observe that the probability of a queue being serviced equals the probability that it belongs to the ODS multiplied by the probability of it prevailing when contending against the other nonempty queues that belong to the ODS.
Theorem 1: The DISA scheduling algorithm with random arbitration and no speedup is strictly stable for any admissible uniformly distributed traffic.
Proof: Each of the input queues can be analyzed, without loss of generality, as a GI/G/1 queueing system with $Q(t)$ denoting the queue occupancy at time step $t$. We label $\bar{a}$ as the mean interarrival time and $\bar{s}$ as the mean interservice time. It has been shown by Lindley et al. [11] that for a GI/G/1 queue, if $\bar{s} < \bar{a}$, then the limiting distribution of $Q(t)$ exists and, thus, the queueing system is stable. An alternative and identical interpretation of this stability criterion is that the mean probability of arrival is smaller than the mean probability of service $\mu$. Under the assumption of uniformly distributed arrivals, the mean probability of arrival to each queue is $\lambda/N$, where $\lambda$ is the load offered to the port. Let $S_k$ denote the mean size of the ODS at the beginning of the $k$th phase ($k = 1, \ldots, N$). By observing that at most a single output is selected by each input during the reservation process, we have $S_k \ge N - k + 1$. Accordingly, we can write a lower bound on $\Pr\{\text{service} \mid \text{phase } k\}$, given in (6).
This bound pertains to the worst case scenario, whereby in every phase an output is matched to an input and, thus, an element is removed from the ODS. If this is not the case, the probability of service increases since more destinations are offered due to a larger ODS. Since the probability that a queue contends for transmission during each of the phases depends on the port it resides in and is uniform and equal to $1/N$, $\Pr\{\text{service}\}$ follows as in (7). Applying the definitions above then leads to (8).
Since we require for admissibility that $\lambda < 1$, the system is always stable and, thus, the proof is completed. This stability holds for any admissible arrival process, be it correlated or not. The fact that no speedup is required to achieve such stability further strengthens the attractiveness of the scheme.
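The stability claim can also be probed empirically. The sketch below is a small slot-by-slot simulation, assuming uniform Bernoulli i.i.d. arrivals, a switching interval of one time slot, random arbitration, and the early-arrival model. The name `simulate_disa` and all parameters are hypothetical, and a finite simulation can only suggest, not prove, stability.

```python
import random


def simulate_disa(n, load, slots, seed=0):
    """Simulate an n-port switch with DISA-style reservation (illustrative).

    Returns (arrived, delivered, backlog) cell counts after `slots` slots.
    """
    rng = random.Random(seed)
    q = [[0] * n for _ in range(n)]      # q[i][j]: VOQ at input i for output j
    arrived = delivered = 0
    for _ in range(slots):
        # Arrival phase: at most one cell per input, uniform destination.
        for i in range(n):
            if rng.random() < load:
                q[i][rng.randrange(n)] += 1
                arrived += 1
        # Reservation phase: ports are signaled in random order.
        available = [True] * n
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            cand = [j for j in range(n) if available[j] and q[i][j] > 0]
            if cand:
                j = rng.choice(cand)     # random arbitration
                available[j] = False
                q[i][j] -= 1             # the cell departs in this slot
                delivered += 1
    backlog = sum(sum(row) for row in q)
    return arrived, delivered, backlog


print(simulate_disa(n=8, load=0.8, slots=2000))
```

If the scheduler keeps up with the offered load, the backlog stays small relative to the number of arrived cells as the simulation length grows.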
IV. UNIFORM BERNOULLI I.I.D. ARRIVALS
The most commonly examined traffic is uniformly distributed between the outputs and obeys a Bernoulli i.i.d. arrival process. Letting $S_k$ denote the mean size of the ODS at the beginning of phase $k$, we observe that since at the beginning of the first phase all elements (outputs) of the ODS are always unreserved, $S_1 = N$. Moreover, $S_k$ forms a monotonically nonincreasing series given that during each phase a node either selects an output, resulting in $S_{k+1} = S_k - 1$, or, alternatively, does not select any output, implying that $S_{k+1} = S_k$. A node selects no output exactly when all of its queues belonging to the ODS are empty. In all other cases an output is selected. Employing the early arrival model described earlier, the probability of a node not selecting any output equals the probability that all queues within the ODS were empty at the beginning of the time slot and no cell has arrived during the current time slot. Expressing the above mathematically gives us the following recursive expression for $S_k$:
$$S_{k+1} = \begin{cases} S_k - 1, & \text{with prob. } 1 - q_k \\ S_k, & \text{with prob. } q_k \end{cases} \tag{9}$$
where $q_k$ denotes the probability that all queues within the ODS are empty. Combining the probabilistic terms in (9) yields an expected value expression in the form
$$S_{k+1} = S_k - 1 + q_k \tag{10}$$
By rearranging (10), it can be shown that a good approximation of the mean ODS size, $\bar{S}$, is given by (11).
The reader may refer to the Appendix for details on this approximation. For high load conditions ($\lambda \to 1$), the probability of a queue being empty is low, resulting in the intuitive conclusion that $\bar{S} \approx (N+1)/2$, which is the mean of an arithmetic series decreasing from $N$ to one.
Our initial examination is that of random selection arbitration, where nonempty queues contend for transmission in an equal manner. There are three conditions that must be met for a queue to transmit a cell: 1) the queue must reside within the ODS; 2) the queue must be nonempty; and 3) the queue must prevail when contending against the other nonempty queues in the ODS, such that
$$\Pr\{\text{transmission}\} = \Pr\{\text{in ODS}\} \cdot \Pr\{\text{nonempty}\} \cdot \Pr\{\text{prevails for transmission}\} \tag{12}$$
Given (11), an approximation of the probability that an arbitrary queue belongs to the ODS is $\bar{S}/N$, while the probability that a queue is nonempty following the arrival phase is $1 - \pi_0 (1 - \lambda/N)$, where $\lambda/N$ is the arrival rate to each queue. Consequently, the probability of prevailing in the internal contention for transmission is
$$\Pr\{\text{prevails for transmission}\} = \frac{1}{1 + \bar{M}} \tag{13}$$
where $\bar{M}$ denotes the mean number of nonempty queues in the ODS (other than the queue analyzed), to which we add 1 to account for the analyzed (nonempty) queue. Substituting these terms in (12) yields
$$\Pr\{\text{transmission}\} = \frac{\bar{S}}{N} \cdot \left[ 1 - \pi_0 \left( 1 - \frac{\lambda}{N} \right) \right] \cdot \frac{1}{1 + \bar{M}} \tag{14}$$
The service discipline is governed by a memoryless process, primarily since within each time slot decisions are made regardless of the outcome of previous time slots. To that end, the queueing behavior may be described using a Geo/Geo/1 model, whereby both the arrival and service processes are the outcomes of Bernoulli i.i.d. trials with parameters $\lambda/N$ and $\mu$, respectively. The probability of transmission equals the probability of the queue being nonempty multiplied by a term for the probability of service. Accordingly, we isolate the probability of service
$$\mu = \frac{\bar{S}}{N (1 + \bar{M})} \tag{15}$$
Substituting (11) into (14) and rearranging, we obtain the approximation given in (16).
Utilizing the results from the Geo/Geo/1 model, we find the mean queue occupancy to be as given in (17) and, accordingly, the mean queueing delay as given in (18). The latter implies that the mean queueing delay for large values of $N$ is independent of $N$. It has been shown in [7] that the mean queueing delay of an output queued switch is $\frac{N-1}{N} \cdot \frac{\lambda}{2(1-\lambda)}$.
In practice, when $N$ is larger than 16, completing the scheduling cycle within a single-cell time slot (approximately 50 ns for 10-Gb/s links) is impractical. In view of the latter, we next investigate the performance implications of increasing the switching interval duration beyond one time slot. It is apparent from the nature of a multitime-slot service discipline that we are still considering a memoryless process, particularly since consecutive service (switching) events are independent of previous ones. Hence, we may continue to utilize the Geo/Geo/1 model provided that we find an expression for $\mu$ that reflects the lengthy switching intervals. Using similar rationale to that of (3), we write the corresponding balance equation and find $\pi_0$, from which the mean queue occupancy and mean waiting time are derived. Fig. 5 depicts the mean latency as a function of the offered load for a 128-port switch with various switching interval durations.
We next examine LQF arbitration for which, in contrast to random selection, the queue with the largest occupancy is selected for service. The significance of considering the queue size becomes apparent when the switching interval is larger than one time slot, thus allowing for larger accumulation of cells. Fig. 6 illustrates a Markov chain that describes the queueing behavior under LQF and a switching interval of $T$ time slots. The states correspond to queue occupancies, where the $T$-step forward and reverse transition probabilities are defined in (22). We take an approximation approach, whereby we assume that there are $K$ states in the system, implying that the probability of a queue occupancy being greater than $K$ is negligible.
Accordingly, given the $K$ states of the Markov chain, we obtain a set of balance equations from its stochastic equilibrium. For an arbitrary state $k$, where $0 < k < K$, a typical balance equation would be in the form (23), where $\mu_k$ denotes the probability of service to a queue given that the queue contains $k$ cells and $\pi_k$ are the stationary queue size distributions. As we can expect, $\mu_k$ form a monotonically increasing series since the larger the queue occupancy, the higher the probability of it being serviced, i.e., $\mu_{k+1} \ge \mu_k$. Hence, (23) expresses the required identity between the rate of arrivals and departures to/from each state. The variables included in this set of equations are $\pi_k$ and $\mu_k$, amounting to $2K$ independent variables. Since the Markov chain is assumed to be in equilibrium, an additional obvious equation is $\sum_{k} \pi_k = 1$. The remaining equations are given by (24), which is the expected probability that the queue is serviced multiplied by the probability that the size of all other queues in the ODS is smaller than or equal to $k$ and the queue prevails in contending against the set of other queues with size equal to $k$. The inner probabilistic term in (24) can be coarsely approximated as in (25), reflecting the probability that the sizes of the queues are independently smaller than or equal to $k$. By setting a pragmatic value for $K$ (32 was found sufficient for most cases), numerical techniques may be applied to obtain values for $\pi_k$ and $\mu_k$, from which the mean queue occupancy and mean delay are directly derived. Fig. 7 depicts the mean latency for different switching interval durations employing both LQF and random arbitration. As the load increases, replacing random selection with LQF yields a notable reduction in the delay. One explanation for the latter lies in the fact that under high loads queues accumulate more cells, for which scheduling based on queue size is more efficient.
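The recursion for the mean ODS size described by (9) and (10) can be iterated numerically. The following sketch assumes, for illustration only, that the queues of a node are independent and that each is empty after the arrival phase with the same probability `q_empty`, so that the probability of all queues in an ODS of size `s` being empty is `q_empty ** s`. The function name `mean_ods_size` and this independence assumption are not taken from the paper, and iterating on the mean is itself an approximation.

```python
def mean_ods_size(n, q_empty):
    """Average ODS size over the n phases of one switching interval.

    n: number of ports; q_empty: assumed probability that a single queue
    is empty after the arrival phase (identical and independent per queue).
    """
    s = float(n)                     # all outputs are unreserved at phase 1
    total = 0.0
    for _ in range(n):
        total += s
        # Expected change: shrink by one unless every offered queue is empty.
        s = max(s - 1.0 + q_empty ** s, 0.0)
    return total / n


# High-load limit: queues are never empty, so the ODS shrinks every phase.
assert mean_ods_size(16, 0.0) == 8.5
```

With `q_empty = 0` the ODS sizes form the arithmetic series $N, N-1, \ldots, 1$, whose mean is $(N+1)/2$, in line with the high-load observation in the text.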
V. NONUNIFORM DESTINATION DISTRIBUTION
In the interest of modeling pragmatic traffic scenarios, we explore the impact that nonuniformly distributed cell arrivals have on the performance of the algorithm. Let $p_j$ denote the probability that a cell is destined to output $j$, such that $\sum_{j=1}^{N} p_j = 1$. The service discipline is independent of the arrival process and remains the same as before. We define $Q_j$ as the size of the $j$th queue, the latter having an arrival rate of $\lambda p_j$. Utilizing the Geo/Geo/1 model for the same reasons described in the case of uniform distribution, we focus our analysis on expressing the steady-state probability of $Q_j$ being empty, i.e., $\Pr\{Q_j = 0\}$. Recalling the three conditions for transmission and replacing the generic term $\lambda/N$ with $\lambda p_j$, we have (26). Isolating the empty-queue probability yields (27).
Letting and rearranging (27) for each of the queues, we arrive at a matrix solution in the form of (28) where (29) Solving this linear system, we find expressions for which, by letting and utilizing the Geo/Geo/1 result, allows us to find the steady-state queue size distribution , the mean queue size (30) and from Little's result, the mean latency
To illustrate the impact of nonuniform destination distribution, we choose a destination distribution function called Zipf's law, which was proposed by G. K. Zipf [12]. The Zipf law states that the frequency of occurrence of some event $P$, as a function of the rank $i$, where the rank is determined by the above frequency of occurrence, is a power-law function: $P_i \sim 1/i^{a}$, with the exponent $a$ typically close to unity. The most famous example of Zipf's law is the frequency of English words in a given text. Most common is the word "the," then "of," "to," etc. When the number of occurrences is plotted as a function of the rank (1 = most common, 2 = second most common, etc.), the functional form is a power-law function with exponent close to one. It was shown that many natural and human phenomena such as Web access statistics, company size, and biomolecular sequences all obey the Zipf law with $a$ close to one.
We use the Zipf distribution to model the packet destination distribution. The probability that an arriving cell is heading to destination $j$ is given by
$$\text{Zipf}_k(j) = \frac{j^{-k}}{\sum_{i=1}^{N} i^{-k}} \tag{32}$$
where $j$ is the destination index, $k$ is the Zipf order, and $N$ is the number of switch ports. Fig. 8 shows the Zipf distribution for given values of $N$ and $k$. While $k = 0$ represents uniform distribution, as $k$ increases the distribution becomes more biased toward preferred destinations. In order to generate a stable and realistic traffic model, the average steady-state load at each input port must not exceed 100%. Similarly, the average steady-state aggregated traffic rate arriving from all input ports to any destination port must not exceed 100%.
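The Zipf destination distribution and the two admissibility conditions stated above can be illustrated with a short Python sketch. The function names `zipf_probs` and `traffic_matrix` are hypothetical. The cyclic shift of the preferred destinations from one input to the next is an assumption made here so that every output receives the same aggregate load; the text does not specify how the paper assigns preferred destinations to inputs.

```python
def zipf_probs(n, k):
    """Zipf destination probabilities of order k over n outputs, cf. (32)."""
    weights = [j ** -k for j in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def traffic_matrix(n, k, load):
    """Arrival-rate matrix rates[i][j] for input i and output j.

    Assumption: input i ranks the outputs starting from output i (cyclic
    shift), so that row sums and column sums both equal `load`.
    """
    p = zipf_probs(n, k)
    return [[load * p[(j - i) % n] for j in range(n)] for i in range(n)]


rates = traffic_matrix(n=4, k=1, load=0.9)
row_sums = [sum(row) for row in rates]
col_sums = [sum(rates[i][j] for i in range(4)) for j in range(4)]
# Both the per-input and the per-output loads stay below 100%.
assert all(s < 1.0 for s in row_sums + col_sums)
```

Setting `k = 0` recovers the uniform case, in which every entry of the matrix equals `load / n`.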
VI. MULTIPLE CLASSES OF SERVICE
Next, we address the issue of quality-of-service (QoS) support by associating multiple classes of service with each output. We deploy strict priority scheduling such that cells awaiting transmission in a given queue will always have preference in service over those in lower priority queues, regardless of the queue sizes. Assuming random arbitration, we let denote the number of classes of service (in typical high-end routers ), and define as the probability of queue in class being empty and as the probability that all queues are empty. Representing the aggregated number of cells in all class queues using a single queue, we find that the balance equation is (33) where is the size of the set of contending queues. For a given class of service considering the relative priority yields the balance equation (34) Labeling lower class indices as having higher priority, the product term component on the right side of (34) represents the probability that all of the higher priority classes are empty. Recursively dividing (34) by (33) and rearranging yields the following result for class :
where is found using (16). In particular, for , we note that which when applied to the single class of service case produces the expected result . As before, due to the memoryless nature of both arrival and service processes, the queue size distribution for each class is , from which the mean queue sizes and mean cell latencies can be directly obtained. Fig. 10 illustrates the mean latency obtained for four classes of service in a 16-port switch as a function of the offered load. The differentiation between the classes is noticeable only for high loads, suggesting that the efficiency of the scheduler allows all classes to be sufficiently served.
VII. BURSTY ARRIVALS
In an aim to evaluate the DISA algorithm under traffic patterns that more accurately portray the nature of Internet traffic, we focus our attention on the well-known two-state Markov modulated arrival process [3] , also known as an ON/OFF arrival model, which generates geometrically distributed bursts of cells. Fig. 11 illustrates the performance of the DISA scheduler for a 16-port switch with bursty traffic. The results are compared with those obtained by the iSLIP [3] algorithm with four iterations (iSLIP-4), as well as to the performance of an output queued switch. The mean burst size used is 16 cells. As is the case with iSLIP and output queueing, the latency using DISA grows in a linear proportion to the mean burst size. A similar relationship to that shown for the three curves in Fig. 11 is observed for mean burst sizes larger than 16.
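A two-state ON/OFF source of the kind referred to above can be sketched as follows. The sketch assumes geometrically distributed ON and OFF periods, one cell per slot while ON, and a single uniformly chosen destination per burst. These details, along with the name `on_off_arrivals` and its parameters, are illustrative assumptions rather than the exact model used to produce the figures.

```python
import random


def on_off_arrivals(load, mean_burst, n_ports, slots, seed=0):
    """Yield, per slot, the destination of the arriving cell or None.

    ON periods are geometric with mean `mean_burst` slots; OFF periods are
    geometric with a mean chosen so that the long-run load equals `load`.
    """
    p_end = 1.0 / mean_burst                      # probability of ending a burst
    p_start = load / (mean_burst * (1.0 - load))  # probability of starting one
    if not 0.0 < p_start <= 1.0:
        raise ValueError("load too high for this burst size")
    rng = random.Random(seed)
    on, dest = False, None
    for _ in range(slots):
        if on:
            yield dest                            # all cells of a burst share a destination
            if rng.random() < p_end:
                on = False
        else:
            yield None
            if rng.random() < p_start:
                on, dest = True, rng.randrange(n_ports)


cells = list(on_off_arrivals(load=0.5, mean_burst=16, n_ports=16, slots=20000))
print(sum(c is not None for c in cells) / len(cells))
```

The mean OFF period is `mean_burst * (1 - load) / load` slots, which makes the fraction of slots spent in the ON state equal to `load` in the long run.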
The true strength of the DISA scheduler lies in its scalability. A common limitation of distributed scheduling schemes, such as the iSLIP algorithm, is their $O(N^2)$ complexity. An interesting performance measurement is observed in Fig. 12, where bursty traffic is applied to a 128-port switch with switching intervals of eight cells and LQF arbitration. The reader will note that the performance is relatively close to that of an output queued switch. The reason for the exceptionally low delay lies in the fact that correlated traffic results in temporarily unevenly populated queues, allowing the scheduler to more efficiently utilize lengthy switching intervals. As a result, the latency obtained under bursty scenarios is lower than that of Bernoulli i.i.d. traffic. The deployment of LQF allows the scheduler to grant service to the most populated queue, hence improving the switching utilization. Since real-life traffic does tend to be correlated on several levels, the robustness exhibited by the DISA algorithm under bursty arrivals is perhaps one of its key properties.
VIII. HARDWARE IMPLEMENTATION
An important aspect of any scheduling scheme is its ease of implementation. This section considers the complexity of implementing the DISA algorithm and the corresponding switch architecture, addressing both timing and area aspects. As illustrated in Fig. 1, the connectivity requirement between the egress port modules and the centralized units is $2N$ lines, where $N$ lines are associated with the ODS, while the additional $N$ lines connect the CAU to each of the ports. The logical process performed by each input port consists of two stages: queue filtering and queue selection, as shown in Fig. 13. At the queue filtering stage, the ODS lines either forward or discard queue priorities using $w$-bit AND gates, where $w$ is the number of bits per weight. In the case of random arbitration, $w = 1$, denoting either an empty or nonempty queue. Only weights corresponding to available outputs advance to the queue selection logic.
It can be shown, in view of the above, that the aggregate gate count for each input port is . Using efficient macro logic components such as lookup tables (LUTs) and memory blocks, which are inherently available in FPGA devices and ASIC libraries, the arbitration logic can easily scale to hundreds of ports.
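The two-stage port logic, queue filtering followed by queue selection, can be mimicked in software as a rough sketch. The function name `filter_and_select` is hypothetical, and the lowest-index tie-breaking rule is an assumption made to keep the example deterministic; it is not the random or LQF selection circuit of the actual hardware.

```python
def filter_and_select(weights, ods):
    """Filter queue weights by the ODS, then pick the largest survivor.

    weights[j]: weight of the queue destined to output j (0 = empty).
    ods[j]: 1 if output j is still available, 0 if already reserved.
    Returns the selected output index, or None if nothing is eligible.
    """
    # Filtering stage: a weight passes only if its ODS bit is set,
    # which is the software analogue of gating each weight with AND.
    filtered = [w if available else 0 for w, available in zip(weights, ods)]
    # Selection stage: choose the largest forwarded weight.
    best = max(filtered, default=0)
    if best == 0:
        return None
    return filtered.index(best)      # assumption: lowest index wins ties


# The longest queue (weight 5) targets a reserved output and is filtered out.
assert filter_and_select([3, 0, 5, 2], [1, 1, 0, 1]) == 0
```

With single-bit weights the same function reduces to picking a nonempty queue whose output is still available.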
IX. CONCLUSION
This paper has presented the DISA scheduling algorithm with a complementary switch architecture, forming a unique solution for scalable switch fabrics. Analytical foundations coupled with simulation results have been provided for evaluating the performance of the algorithm under different traffic scenarios. Robustness to the statistical nature of the traffic, both in terms of the arrival process and the destination distribution, has been established. Ease of implementation in conjunction with relaxed timing requirements allows for the incorporation of commercially available crosspoint switches, further accentuating the attractiveness of the proposed scheme for a wide range of high port-density switching platforms.
APPENDIX
APPROXIMATION OF THE MEAN ODS SIZE
This appendix provides a description of the approximation presented for the mean size of the ODS. We first recall the recursive relationship between consecutive ODS terms (A1)
We label the expected value of the mean ODS size at phase as and assume a linear declining series to replace the expressions for at the exponent and numerator terms in (A1), such that (A2)
Exploiting the approximation provided above and summing for , we have
Since the ODS size at the initial phase equals , we utilize (A3) to derive the mean size of the ODS at the last phase (A4)
Utilizing the following approximation:
and the known identity for the sum of elements in an arithmetic-geometric series [13] (A6)
we obtain (for , and ) (A7) When multiplied by , the above yields (A8) and in conclusion, using (A2), we have (A9)
Next, we approximate the mean ODS size by assuming that the ODS forms an arithmetic series for which . Substituting (A9) into the latter gives us (A10)
Elaborating on (16), from the stochastic equilibrium equation for each queue, we may equate the mean rate of departures to the mean rate of arrivals, such that (A11) for which we get (A12) Finally, substituting (A10) in (A12) yields (A13) which, for large values of and , can be quite accurately approximated by (A14)
As the load increases, the probability of a queue being empty decreases, and vice versa.
