Abstract-The problem of providing quality-of-service (QoS) guarantees for multicast traffic over crossbar switches has received limited attention despite the popularity of its counterpart for unicast traffic. Providing 100% throughput to all admissible multicast traffic has been shown to be a very difficult task, and it requires a very high speedup in the switching fabric. In this paper, we introduce the concept of rate quantization and use it to show an analogy between packet scheduling in crossbar switches and circuit switching in three-stage Clos networks. We exploit the analogy to adopt circuit-switching algorithms in wide-sense and strict-sense nonblocking Clos networks in order to construct nonblocking packet schedulers for unicast and multicast traffic. We illustrate a simple multicast nonblocking packet scheduler, for which a speedup of $6 \log n / \log\log n$ is sufficient to support 100% throughput for any admissible multicast traffic in an $n \times n$ crossbar switch. Moreover, we revisit some problems in unicast switch scheduling. We illustrate that the analogy provides useful perspectives, and we give a simple proof for a well-known result.
I. INTRODUCTION
APPLICATIONS requiring quality-of-service (QoS) support for multicast traffic remain important in small- and large-scale networks. The problem of providing QoS guarantees for multicast traffic over crossbar switches has received limited attention despite the popularity of its counterpart for unicast traffic. One of the main reasons for this is the difficulty of the task. Indeed, it was shown in [1] and [2] that the problem of "optimal scheduling" of multicast packets over a crossbar switch is NP-hard, as it is dual to the fractional weighted graph coloring problem, and in [3] it was proved that the resource speedup necessary to achieve 100% throughput for all admissible multicast traffic grows unbounded with increasing switch size. These results hold even if the crossbar switch is multicast capable, i.e., it is capable of connecting an input to multiple outputs, or if the crossbar switching fabric contains internal buffers at each crosspoint [4]. Furthermore, it is stated in [2] and [3] that the numerical evaluation of the necessary speedup is prohibitive, as it is dual to the membership problem for the stable set polytope of a graph, and no scaling law has been given as to how the speedup scales with the switch size. In this paper, we present a simple algorithm to provide 100% throughput for all admissible multicast traffic and specify the sufficient speedup as a function of the switch size to achieve 100% throughput.
In the development, our main tool will be an analogy between middle-stage switch configurations of three-stage Clos networks and schedules for a crossbar switch. A similar analogy was first exploited in [5] for a certain set of time-division multiple access (TDMA) schedulers. We construct the analogy for frame-based schedulers in which the scheduler has the a priori knowledge of the matrix of real arrival rates, which is not necessarily supportable by (periodic) TDMA schedulers. We generalize the analogy to nonperiodic switch schedulers using the notion of rate quantization [6] . Note, however, that, as stated in [3] , the set of all frame-based schedulers is equivalent to the set of all slot-by-slot schedulers (no a priori information of the arrival rates), provided that slot-by-slot schedulers introduce an additional delay equal to the frame length. We make this statement precise here, also using rate quantization. We prove the main theorem of rate quantization, which was stated in [6] without a proof. Then, we generalize rate quantization to multicast rate matrices. Ultimately, the Clos network analogy will be valid not only for frame-based schedulers, but for all unicast and multicast schedulers. Indeed, we illustrate that once rate quantization is applied, the rest of the analogy is similar to the equivalence of space-time-space (STS) switching to time-space-time (TST) switching [7] for a fixed TDMA schedule.
We also discuss why achieving 100% throughput is difficult with multicast traffic in crossbar switches: We show that a Birkhoff decomposition-based scheduling (see, e.g., [8] ) is not possible since, unlike unicast scheduling, a traffic pattern is not necessarily sustainable, even if the matrix of arrival rates is within the convex hull of all possible matrices of multicast capable crossbar configurations. To further elaborate on the difficulty of multicast switch scheduling, we show that if fanout splitting of multicast packets is not allowed, an extra speedup of 2 is necessary for 100% throughput. This is true even when the arrival rates are within the admissible region for mere unicast traffic.
Next, we discuss the properties of the frame scheduling analogous to nonblocking switching in Clos networks. We specifically focus on strict and wide-sense nonblocking to reduce the complexity of circuit switching in Clos networks. For unicast traffic, strict-sense nonblocking leads to an analogous scheduler based on maximal matchings, for which we give an O time complexity algorithm on an $n \times n$ crossbar switch. Also, exploiting the analogy for unicast traffic, we show that the result [9] that finding maximal matchings is sufficient to provide 100% throughput 1 with a speedup of 2 becomes straightforward. For the multicast traffic, we provide a simple nonblocking switch scheduling algorithm, which is analogous to an existing [10] wide-sense nonblocking circuit switch for Clos networks. Combining it with rate quantization, we show that a speedup of is sufficient to achieve 100% throughput for all admissible multicast traffic. Even though this is only a (possibly tight) upper bound for the necessary speedup for 100% throughput (using any algorithm), the simplicity of the algorithm enables efficient distributed hardware implementations of the scheduler with a time complexity of O per time slot, whereas finding the optimal schedule (with minimal speedup) is NP-hard. For both unicast and multicast schedulers, we illustrate tradeoffs between packet delay and necessary speedup. Since our main focus is exploring the necessary and sufficient resources to achieve 100% throughput, the delay-speedup tradeoff is the extent to which we elaborate on the issue of delay.
The rest of the paper is organized as follows. After giving the model in Section II, we discuss rate quantization in Section III and state the basic theorems for rate quantization. We illustrate the analogy between crossbar switch schedulers and circuit-switching Clos networks in Section IV. In Section V, we provide nonblocking unicast and multicast switch schedulers motivated by strict-sense and wide-sense nonblocking Clos networks. Finally, we summarize the results and discuss some other potential directions in Section VI.
II. SWITCH MODEL AND MULTICAST TRAFFIC
We consider the combined input- and output-queued (CIOQ) switch architecture with a single crossbar fabric. We assume that the crossbar fabric is multicast capable, i.e., an input can be connected to multiple outputs, but the inverse is not allowed. We call a given set of connections a switch configuration.
We assume input and output links with identical capacities, and packets arriving over an input link are fragmented into fixed-size cells. We define a time slot as the time in which a cell can be transmitted over a link. In case an internal speedup $s$ is used, up to $s$ switch configurations can be set up in a time slot, and hence up to $s$ cells can be transferred to an output. Since more than one cell can be transferred to an output in a given time slot, output queueing as well as input queueing is necessary. Speedup $s = 1$ is also referred to as no speedup. We call the time in which a switch configuration remains active a schedule slot. Hence, a schedule slot is $1/s$ of a time slot. Each cell arriving at an input queue has a fanout set, i.e., the set of the output links to which the cell needs to be forwarded. Unicast cells have a fanout set of unit cardinality. To avoid head-of-the-line (HOL) blocking [11], we assume the presence of virtual output queueing (VOQ) at each input for every possible fanout set. As in [3], VOQ at a per-fanout-set level is referred to as multicast virtual output queueing (MC-VOQ). In an $n \times n$ switch, for any given input, there exist $2^n - 1$ possible fanout sets. Due to this exponential growth, MC-VOQ has issues of scalability, and consequently it is merely a theoretical tool used to investigate the limitations of input-queued crossbar switches under multicast traffic.
A scheduler may choose not to place an arriving cell with a certain fanout set directly in the associated MC-VOQ. It is also possible that it duplicates the cell and places one copy in the MC-VOQ of one fanout set and the other copy in the MC-VOQ of a second fanout set, such that the two fanout sets are disjoint and their union is the fanout set of the original cell. Hence, these two copies are transferred to the corresponding outputs at possibly different times. This process is called fanout splitting.
We assume that the cell arrivals are rate-ergodic and each MC-VOQ is associated with a certain cell arrival rate (before possible fanout splitting). For a given set of rates to be admissible, the total rate of cells arriving at each input link or destined to each output link cannot exceed one cell per time slot. Note that it may be possible that after fanout splitting, the total rate of cells arriving at the MC-VOQs of an input exceeds one cell per time slot. Now, let us consider frame-based schedulers, which have the information of the cell arrival rates for each MC-VOQ. A frame is a (possibly nonperiodic) collection of configurations. In an $n \times n$ switch, each configuration can be represented with an $n \times n$ configuration matrix, which has a single "1" in each column, and all "0"s otherwise. If the switch is not multicast-enabled, then each configuration matrix is a permutation matrix.
First, we focus on the case with only unicast cells. At each input, there exist virtual output queues, one for each output. These VOQ arrival rates can be written in the form of an rate matrix. 2 In the context of packet switching, throughput is defined as the fraction of the capacity of the output links that can be utilized if the input queues are completely backlogged (and hence the input links are fully utilized). Indeed, if all input queues are backlogged, then a switch is said to achieve 100% throughput if all output links can be fully utilized. It can be shown (see, e.g., [8] and [6] ) that 100% throughput is achievable if and only if lies in the convex hull of the set of permutation matrices, i.e., there exists a frame (containing possibly identical elements) of permutation matrices such that for some possibly infinite frame size . For multicast traffic, the rate matrix is such that is, as a fraction of the link capacity, the rate at which input wants to be connected to output . Hence, it is possible that . For example, suppose in every time slot only input 1 receives cells, each of which is to be broadcast to all outputs. Then, . On the other hand, for all output under all possible admissible traffic matrices since no output can be oversubscribed.
We can express a rate matrix as a sum of matrices: . Here, one of the matrices, , represents the rates for all unicast traffic (fanout set of unit cardinality), and the remaining matrices represent the multicast rates (fanout set of cardinality ), one for each fanout set. For instance, if the rate of multicast cells arriving at input , destined to outputs and , is , then the matrix associated with the fanout set has an in locations and . We call this expansion of a multicast rate matrix the fanout set expansion.
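The fanout set expansion and the admissibility condition of this section can be illustrated with a minimal Python sketch. The data layout (a dictionary keyed by an input index and a set of outputs) and the function names are hypothetical choices made for illustration only; they are not notation from the text. Inputs and outputs are 0-indexed here.

```python
from fractions import Fraction

def rate_matrix(n, flows):
    """Sum the components of a fanout set expansion into one n x n rate matrix.

    flows maps (input, frozenset_of_outputs) -> cell arrival rate.
    This layout is an assumption made for the sketch.
    """
    lam = [[Fraction(0)] * n for _ in range(n)]
    for (i, fanout), rate in flows.items():
        for j in fanout:
            lam[i][j] += rate  # the same rate appears at every output of the fanout set
    return lam

def is_admissible(n, flows):
    """Check that no input link and no output link is oversubscribed."""
    in_load = [Fraction(0)] * n
    out_load = [Fraction(0)] * n
    for (i, fanout), rate in flows.items():
        in_load[i] += rate        # a multicast cell arrives once at its input
        for j in fanout:
            out_load[j] += rate   # and leaves once on every output of its fanout set
    return all(x <= 1 for x in in_load) and all(x <= 1 for x in out_load)

# Broadcast example: one input receives a cell in every slot, destined to all outputs.
n = 3
flows = {(0, frozenset(range(n))): Fraction(1)}
lam = rate_matrix(n, flows)
print("row sum of the broadcasting input:", sum(lam[0]))
print("admissible:", is_admissible(n, flows))
```

The sketch makes the distinction in the text explicit: a row of the rate matrix may sum to more than 1 while the arrival rate at that input link, counted per cell, does not.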
One fundamental difference of the multicast traffic from the unicast traffic is that, in multicast, the existence of a frame of configuration matrices for which does not necessarily imply that 100% throughput is achievable. Consequently, even with a multicast capable crossbar, some speedup is necessary for 100% throughput. Following is an example.
Example 1: First, consider the following unicast rate matrix:
Here, since there is only unicast traffic, fanout set expansion of is itself. Since can be written as the convex combination of two permutation matrices (i.e., unicast configuration matrices) with a weight 0.5, an equal timeshare between the two associated configurations suffices to provide the desired service (and hence 100% throughput) for this traffic.
Next, consider the following multicast rate matrix:
The first component of the above fanout set expansion contains the rates of the unicast cells. Here, there is a single class of multicast cells: Half of the cells arriving at input 1 are multicast cells with a fanout set $\{1, 2\}$. Thus, there is only one other component (for fanout set $\{1, 2\}$) other than the unicast component in the fanout set expansion. Even though the sum of the rates in the first row of the rate matrix is 1.5, the total rate of the cells arriving at the first input is 1. Indeed, input links 1 and 2 are fully subscribed, and no input or output link is oversubscribed under this traffic. Therefore, to achieve 100% throughput with no speedup, at any point in time, input 2 must be connected to either output 1 or output 2, but not both, since all the cells are unicast at the second input. On the other hand, input 1 needs to be connected to these two outputs simultaneously half the time to transfer multicast cells. This implies that these two outputs can be let free by the first input only half of the time, as shown in Fig. 1, where the time period illustrated can be arbitrarily long. Consequently, whenever the first input serves a multicast cell, input 2 must remain idle. However, since input 2 is fully utilized, some speedup is necessary.
The other alternative is the fanout splitting of the multicast cells. The total rate of cell arrivals at the VOQs of input 1 exceeds 1 after fanout splitting; hence, without some speedup, it cannot be accommodated. 3 We conclude that without a speedup, this rate matrix is not supportable, with or without fanout splitting. This is valid despite the fact that the matrix can be written as a convex combination of configuration matrices (i.e., it is in the convex hull of configuration matrices) for a multicast-enabled crossbar. This example illustrates that the multicast scheduling problem can be more complicated than unicast scheduling. Even with a multicast capable crossbar, some speedup is necessary for 100% throughput.
Fig. 1. No matter where the multicast flow is served, both outputs 1 and 2 will be idle simultaneously.
In this example, speedup is necessary and sufficient for 100% throughput for the given traffic matrix. In [3] , it was proved that the speedup necessary to achieve 100% throughput for all admissible multicast traffic grows unbounded with increasing switch size. It is also stated that "the numerical evaluation of the necessary speedup is prohibitive," and no scaling law has been given for the necessary speedup for 100% throughput. Also, finding the multicast schedule that works with the minimum necessary speedup is NP-hard, as shown in [1] and [2] . There are some obvious ways of simplifying multicast scheduling, such as ruling out fanout splitting. However, extra speedup is necessary to make up for the lost flexibility, as shown in the following theorem. Note that the motivation of this theorem is not to calculate the sufficient speedup for 100% throughput without fanout splitting. It is merely to illustrate that, without fanout splitting, a larger speedup may be required to deliver 100% throughput under multicast traffic.
Theorem 1: If fanout splitting of multicast cells is not allowed in an crossbar switch, then a speedup of is necessary to achieve 100% throughput for a doubly stochastic multicast rate matrix .
Before the proof of the theorem, note that if fanout splitting is allowed, no speedup is required to support any doubly stochastic rate matrix. Indeed, with complete fanout splitting, every cell can be treated as a unicast cell. Since the row and column sums of this matrix are 1, 100% throughput is achievable by treating the problem as a unicast switch scheduling problem. Thus, ruling out fanout splitting costs us some extra speedup or a reduced throughput at a fixed speedup.
Proof: Consider the following fanout set expansion:
Here, input 1 receives all unicast traffic with an equal rate of to every output. Every other input receives cells once every time slots to be multicast to all outputs (broadcast). For all , and consequently the overall rate matrix is doubly stochastic.
Since fanout splitting of broadcast cells is not allowed, one schedule slot must be occupied for each broadcast cell arriving at inputs 2 to . Along with any broadcast cell, no unicast cell can be scheduled from any of the input 1 VOQs. Thus, extra schedule slots are necessary to accommodate input 1 traffic. As a result, to support this traffic, a total of schedule slots is necessary in a span of time slots, corresponding to a speedup of , completing the proof.
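The counting argument of the proof can be checked numerically with the short Python sketch below. The specific rates are not legible in this copy of the text, so the sketch assumes the values forced by double stochasticity in an $n \times n$ switch: input 1 sends unicast traffic at rate $1/n$ to each output, and each of the other $n - 1$ inputs receives one broadcast cell every $n$ time slots. Under that assumption, the count is taken over a span of $n$ time slots. The function name is hypothetical.

```python
from fractions import Fraction

def theorem1_counts(n):
    """Count schedule slots needed over a span of n time slots without fanout splitting.

    Assumed rates (inferred from double stochasticity, not read from the source):
    input 1 sends unicast at rate 1/n to each output; inputs 2..n each receive
    one broadcast cell every n time slots.
    """
    unicast_rate = Fraction(1, n)
    broadcast_rate = Fraction(1, n)
    # Sanity check: every column of the overall rate matrix sums to 1.
    assert unicast_rate + (n - 1) * broadcast_rate == 1
    broadcast_slots = n - 1  # each unsplit broadcast cell occupies a whole schedule slot
    unicast_slots = n        # input 1 can send at most one unicast cell per schedule slot
    total = broadcast_slots + unicast_slots
    return total, Fraction(total, n)

for n in (2, 4, 16):
    total, speedup = theorem1_counts(n)
    print(n, total, speedup)
```

Under these assumed rates, the ratio of schedule slots to time slots is $(2n - 1)/n$, which approaches 2 as the switch size grows.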
III. RATE QUANTIZATION AND PERIODIC FRAME SCHEDULERS
Every entry of a rate matrix can take on any real nonnegative value as long as no input or output link is oversubscribed. Consequently, a given frame scheduler can end up having a nonperiodic schedule of infinitely many configuration matrices. We show that using the concept of rate quantization, the frame schedule for a given can be made periodic at the cost of an arbitrarily small speedup. We first introduce rate quantization for doubly stochastic unicast rate matrices, and then state the generalization to multicast rate matrices in a follow-up corollary. Note that the following theorem for unicast rate matrices was first introduced in [6] with no proof.
Theorem 2: Let be an doubly stochastic matrix and be a rational number, which can be written as , where is an integer. There exists an matrix , where : 1) is a doubly-stochastic matrix with all entries integer multiples of ; 2) for all , and .
Our proof is constructive. First, we introduce an algorithm to construct matrix (and thus the matrix Q) for a given . Then, we prove that the algorithm always ends up with the desired matrix. The details of the algorithm and the proof are given in Appendix I. Here, we give an example that illustrates the theorem and, at the same time, provides intuition.
Example 2: Let , and consider the following $3 \times 3$ doubly stochastic matrix:
Note that each entry of in (2) is within of the corresponding entry in , and the rows and the columns sum to . First, we illustrate how rate quantization leads to a periodic switch schedule for unicast traffic, even for a speedup arbitrarily close to 1.
Corollary 1: For any doubly stochastic unicast rate matrix and any given , there exist a sequence of permutation matrices, , such that
The proof follows directly from Theorem 2. Choosing , one can find a matrix , whose rows and columns sum to such that, for all is an integer multiple of and . Thus, any sequence of unicast configuration matrices (i.e., permutation matrices) that supports rate matrix will also support . As shown in [8] and [6], the Birkhoff decomposition of terminates with permutation matrices (containing possibly identical elements) each with a coefficient . We complete the proof noting that the first permutation matrices are due to , and the remaining of them 4 are due to the constant matrix . Hence, , and periodically repeating the sequence of these switch configurations with a period of time slots suffices to provide 100% throughput to any given set of admissible rates. The required speedup is .
This corollary also illustrates the relationship between frame scheduling and cell scheduling. In particular, a cell scheduler capable of providing 100% throughput without speedup (e.g., based on maximum weight matching [15]) will choose a sequence of configurations within the next time slots such that, for , for all since VOQ arrival processes are ergodic. Moreover, for all , and hence, as , the frame scheduler and the cell scheduler serve each VOQ at identical rates and achieve the same throughput. Note that this asymptotic equality between the schedules of the periodic frame scheduler and the cell scheduler has a conceptual importance rather than a practical one. Indeed, the cell delay with the frame scheduler can be arbitrarily high, as .
We just showed that 100% throughput can be achieved for a unicast rate matrix with a periodic frame scheduler for any speedup . For multicast rate matrices, quantization is slightly more complicated. To obtain a periodic schedule with a period , the rate of the unicast cells as well as of every class of multicast cells needs to be an integer multiple of . Therefore, we quantize rates on a per MC-VOQ basis.
The following theorem illustrates how per MC-VOQ rate quantization also leads to a periodic switch schedule for multicast traffic, for any speedup .
Theorem 3: Let the fanout set expansion for a multicast rate matrix be given by . For any given , there exists a matrix such that: 1) for all and for all ; 2) there exists a fanout expansion such that, for all , and every entry of is an integer multiple of .
Proof: Here, we quantize the rate of every MC-VOQ. Let . For all , every nonzero entry in is increased by a positive amount to the closest integer multiple of to form . Clearly, (2) is satisfied after the described quantization since each entry of is an integer multiple of and is no less than the corresponding entry of . Now, we need to show (1) for the new matrix . In the fanout expansion of , there can be no more than matrices for which any given input-output pair is nonzero. After quantization, each such entry is increased by an amount no more than . Since a fanout set has a maximum cardinality of , each row of matrix is increased by a total of no more than . Hence, for all . Repeating the same argument for each column (we skip this for brevity), one can show both (1) and (2) can be satisfied simultaneously, applying rate quantization on each component of the fanout set expansion.
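The rounding step in the proof can be sketched in a few lines of Python. The sketch assumes a hypothetical frame period `T`, so that the quantum is $1/T$; the quantum actually chosen in the proof is not legible in this copy and may differ. The dictionary layout and the function name are likewise assumptions for illustration.

```python
from fractions import Fraction
import math

def quantize_up(flows, T):
    """Round every nonzero MC-VOQ rate up to the nearest integer multiple of 1/T.

    flows maps (input, frozenset_of_outputs) -> rate (hypothetical layout).
    T is an assumed frame period; the quantum used here is 1/T.
    """
    quantized = {}
    for key, rate in flows.items():
        if rate > 0:
            quantized[key] = Fraction(math.ceil(rate * T), T)
    return quantized

flows = {
    (0, frozenset({0, 1})): Fraction(1, 3),  # one multicast class at input 0
    (1, frozenset({0})): Fraction(1, 2),     # one unicast class at input 1
}
T = 4
q = quantize_up(flows, T)
for key in flows:
    print(sorted(key[1]), flows[key], "->", q[key])
```

Because every rate is rounded up, the quantized rates may oversubscribe a link by a small amount; that excess is what the speedup in the theorem pays for.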
Note that since the rate for every fanout set is made an integer multiple of some , similar to the unicast scenario, the corresponding schedule of configuration matrices is periodic. We provide such a scheduler in Section V. The period of the schedule (and hence the cell delay) for a multicast rate matrix is, though, higher than that for a unicast rate matrix by a factor of . In this section, we showed that with an arbitrarily small speedup used for rate quantization, there exists a frame scheduler that provides a periodic service of a period, inversely proportional to . We also argued that along with that delay, long enough for "sufficient averaging" of the ergodic input traffic, one can find a cell scheduler that provides identical service rates as the frame scheduler. Periodic frame scheduling also leads to the Clos network analogy, which we study in the following section.
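To make the periodic unicast frame of Corollary 1 concrete, the following Python sketch decomposes a quantized matrix into a frame of permutation matrices. It assumes the matrix has already been scaled by the frame length, so that its entries are nonnegative integers and all rows and columns sum to the same value. The matching routine is a standard augmenting-path search chosen for brevity; it is one possible way to carry out a Birkhoff-style decomposition, not necessarily the procedure used in [8] or [6]. Function names are hypothetical.

```python
def perfect_matching(A):
    """Find a perfect matching on the nonzero entries of A; returns perm[i] = j."""
    n = len(A)
    match_col = [-1] * n  # match_col[j] = row currently matched to column j

    def try_row(i, seen):
        # Augmenting-path search: try to give row i a column, rematching others if needed.
        for j in range(n):
            if A[i][j] > 0 and not seen[j]:
                seen[j] = True
                if match_col[j] == -1 or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, [False] * n):
            raise ValueError("no perfect matching: row and column sums are not all equal")
    perm = [0] * n
    for j, i in enumerate(match_col):
        perm[i] = j
    return perm

def decompose(A):
    """Split an integer matrix whose rows and columns all sum to k into k permutations."""
    A = [row[:] for row in A]  # work on a copy
    k = sum(A[0])
    frame = []
    for _ in range(k):
        perm = perfect_matching(A)
        for i, j in enumerate(perm):
            A[i][j] -= 1       # one cell from input i to output j is served in this slot
        frame.append(perm)
    return frame

A = [[2, 1, 0],
     [0, 2, 1],
     [1, 0, 2]]
for slot, perm in enumerate(decompose(A)):
    print("slot", slot, "input -> output:", perm)
```

Repeating the returned frame periodically serves each input-output pair at exactly the quantized rate.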
IV. ANALOGY BETWEEN CROSSBAR SWITCH SCHEDULERS AND CLOS NETWORKS
A three-stage Clos network is a multistage switching architecture, and it is specified using three parameters as shown in Fig. 2 . There are middle-stage crossbars of size to connect input-stage switches of size to output-stage switches of size . Clos networks have been traditionally used for circuit switching, and a detailed treatment can be found in standard switching textbooks, e.g., [7] . Now, consider a frame-based scheduler with a schedule of period . Let the speedup be , so the switch goes through configuration matrices within the frame of time slots. The above scheduler is "time-analogous" to the circuit-switching Clos network as illustrated in Fig. 3 for the case of unicast.
If input of the crossbar switch needs to send cells to output within a frame of slots, then in the associated Clos network, of the input links of input-stage switch need to make circuit connections to of the output links of the output-stage switch . If has a "1" in position , then the th middle-stage crossbar connects input-stage switch to output-stage switch . Consequently, the middle-stage crossbars are set to have configurations . In a sense, the middle-stage crossbars of the analogous Clos network replicate the entire sequence of configurations of the original crossbar switch, from the top to the bottom in space. Note that configuration matrices can contain multicast connections as well, as illustrated in Fig. 4(a).
Any periodic frame schedule with a period corresponds to a fixed circuit assignment in the Clos network. The ratio of the number of middle-stage crossbars (i.e., number of output links of each input-stage switch) to the number of input links of each input-stage switch is the speedup . The analogy is set for a periodic frame scheduler. For a nonperiodic (infinitely long) schedule, one can still set the analogy after the application of rate quantization. As shown in the previous section, for any , one can choose and to find the Clos network, analogous to the scheduler of the quantized matrix.
There is one more thing we need to describe to complete the analogy for the case of multicast. In the case of fanout splitting, different copies of a multicast cell are served over different configuration matrices within a frame. If fanout splitting is not allowed, each multicast cell can be served with only a single configuration over the frame. In the analogous Clos network, to replicate fanout splitting, multicast connections in input-stage switches are used as illustrated in Fig. 4(b). If fanout splitting is not allowed, input-stage switches are limited to only unicast configurations since each cell (input link) can go through only one middle-stage crossbar. Note that Theorem 1 also shows that middle-stage crossbars are necessary to support multicast circuit switching in a Clos network in which only point-to-point connections are allowed at the input- and output-stage switches.
Based on our analogy, instead of asking the question "What is the necessary speedup to support all admissible multicast traffic in a crossbar switch?" we ask "What is the necessary number of middle-stage crossbars to support multicast circuit switching in a Clos network?" The second question is also difficult and, to the best knowledge of the author, unanswered. However, things become simpler once we focus on different "grades" of nonblocking Clos networks, which lead to an interesting set of packet schedulers, as we shall discuss in the following section.
V. NONBLOCKING SWITCH SCHEDULING FOR MULTICAST TRAFFIC
In a switching network, blocking is the failure to satisfy a certain set of connection requirements because of the absence of nonconflicting internal paths between the input links and the output links. A switching network is nonblocking if a connection can always be set up between any idle input and any idle output. There are multiple degrees of nonblocking.
If a connection between an idle input-output pair cannot necessarily be established without rearranging the existing connections, the network is called rearrangeably nonblocking. For a Clos network to be rearrangeably nonblocking for unicast connections, the number of middle-stage crossbars need not be more than the number of input links per input-stage switch, i.e., . This implies that no speedup is necessary for a frame scheduler (given a sufficiently high delay) to provide 100% throughput for any admissible unicast traffic. Indeed, Birkhoff-von Neumann switches [8] use a frame scheduler to achieve 100% throughput with no speedup. However, if a change occurs in some entries of , a new schedule needs to be constructed. It is possible that the change cannot be accommodated with a minor modification in the schedule.
A network is strictly nonblocking if a connection between an idle input and an idle output can always be established, without the need for a rearrangement of the existing connections. If a network is strictly nonblocking, then there exists a path between any idle input-output pair regardless of the existing configuration of the middle-stage crossbars. Thus, to satisfy an incoming connection request, a simple search for that middle-stage crossbar will be sufficient. It is well known [7] that a three-stage Clos network is strictly nonblocking for unicast connections if and only if . The analogous scheduling interpretation of strictly nonblocking is interesting for unicast frame schedulers. First, consider a unicast rate matrix such that each entry is an integer multiple of some . There exists a periodic frame schedule, which repeats itself once every schedule slots to provide 100% throughput for . In the associated Clos network, the use of middle-stage crossbars "decouples" the circuit setup process for input-stage switches in that each input-stage switch is capable of finding paths to the output-stage switches independently of the paths that the other input-stage switches set up.
Similarly, in the frame scheduler, the use of a schedule frame of size decouples the scheduling of cells at distinct inputs. For instance, suppose an input link has a cell to be sent to output . For a crossbar schedule, a strictly nonblocking associated Clos network guarantees, among the frame of configuration matrices, the existence of a configuration matrix whose th output is not already reserved by some other input link. Hence, in order to achieve 100% throughput, it suffices that each input independently makes an exhaustive search over the frame of configuration matrices for all the outputs to which it has cells to send. This leads to a frame scheduler based on maximal matchings, as we will outline next. Note that the speedup necessary to quantize any given matrix so that its entries are integer multiples of is . Thus, the total speedup necessary to achieve the above decoupling is for any given unicast rate matrix . Note that the necessary speedup decreases as is increased at the expense of a longer frame and consequently a higher delay.
A. Unicast Nonblocking Switch Scheduling Algorithm
For a given unicast rate matrix and some frame period , the scheduler goes through the following steps: 1) Apply rate quantization on to obtain with all entries integer multiples of and the rows and the columns of sum to .
Repeat (2) between the inputs and the outputs since there exists no with all "0"s in an entire row and an entire column , unless at that point. Note that the nonblocking switch scheduling algorithm terminates with a matrix with all "0" entries. This is guaranteed by the fact that the Clos network, , corresponding to our frame scheduler is strictly nonblocking. The significance of the result is that we built a frame scheduler for a crossbar switch based on the intuition we drew from the corresponding circuit-switching problem in the Clos network.
The algorithm works for any frame size . As , the necessary speedup and the contribution of the portion of speedup required for quantization goes to 1. Moreover, as , the nonblocking switch scheduling algorithm becomes identical to a cell scheduling algorithm based on maximal matching. This proves a weaker version of a result, which was initially shown in [9]: For unicast input traffic, a speedup of 2 along with maximal matching is necessary and sufficient for work conservation (which is stronger than 100% throughput) as we just showed. Also, an algorithm, LOOFA, with time complexity O is provided in [9] to obtain the desired maximal matchings. Note that here the tradeoff between the speedup and the delay is clear. As the frame size increases, the delay increases proportionally, whereas the speedup goes down to 2.
Above, we described a sequential nonblocking switch scheduling algorithm in which the configuration matrices are formed one by one. One can realize that this construction can be done online as the cells are transferred from the inputs to the outputs. To construct a configuration matrix, each input simply picks no more than a single entry that had not been claimed before. Including the search for that free output for each of the inputs, the time complexity of the algorithm is O per schedule slot, and it can be implemented in a distributed fashion at different inputs.
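The decoupling property behind the unicast scheduler can be illustrated with a first-fit sketch in Python. It assumes a quantized matrix `Q` whose entry `Q[i][j]` counts cells per frame and whose rows and columns each sum to at most `T`, and it uses a frame of `2T - 1` configuration slots, which is the classical strict-sense nonblocking condition for Clos networks carried over through the analogy. The exact frame length and steps of the algorithm are not legible in this copy of the text, so these choices, like the function and variable names, are assumptions of the sketch.

```python
def first_fit_frame(Q, T):
    """Place every cell in the first slot where both its input and its output are free.

    Q[i][j] = number of cells per frame from input i to output j
              (rows and columns assumed to sum to at most T).
    Returns out_owner, where out_owner[k][j] is the input connected to
    output j in configuration slot k, or None if output j is idle.
    """
    n = len(Q)
    M = 2 * T - 1  # assumed frame length, from the strict-sense nonblocking condition
    in_busy = [[False] * n for _ in range(M)]
    out_owner = [[None] * n for _ in range(M)]
    for i in range(n):
        for j in range(n):
            for _ in range(Q[i][j]):
                # Input i is busy in at most T-1 slots and output j in at most T-1,
                # so at least one of the 2T-1 slots is free for this cell.
                for k in range(M):
                    if not in_busy[k][i] and out_owner[k][j] is None:
                        in_busy[k][i] = True
                        out_owner[k][j] = i
                        break
                else:
                    raise RuntimeError("blocked: Q violates the load assumption")
    return out_owner

Q = [[2, 1, 0],
     [0, 2, 1],
     [1, 0, 2]]
for k, config in enumerate(first_fit_frame(Q, 3)):
    print("slot", k, "output owners:", config)
```

No rearrangement of earlier placements is ever needed, which is why each input can search the frame independently of the others.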
Next, we state the multicast nonblocking switch scheduling algorithm. Then, we discuss the necessary speedup and complexity. Strictly nonblocking schedulers allow the possibility of full fanout splitting, which leads to a necessary speedup of to achieve 100% throughput for all admissible multicast traffic. Consequently, for multicast, we will exploit the notion of wide-sense nonblocking. 5 Wide-sense nonblocking in Clos networks is accompanied by a circuit-switching algorithm to guarantee that an incoming connection to a free output link can be accommodated. In what follows, we provide a scheduling algorithm, which is analogous to a circuit-switching algorithm [10] for wide-sense nonblocking Clos networks. We discuss the associated speedup later on.
5 The analogous results to wide-sense nonblocking for unicast traffic include [16], [13], and [6]. In [6], it is shown that with some speedup s > 1, the frame delay can be reduced by a factor O(n) for unicast traffic. Indeed, the delay is inversely proportional to s - 1, as the speedup s takes on values between 1 and 2.
B. Multicast Nonblocking Switch Scheduling Algorithm
The main difference between this algorithm and its unicast counterpart is the following. In the unicast version, the inputs take turns to form configuration matrices in order, one by one. Here, each input again takes a turn once every round, but the entire frame of configuration matrices is constructed, not necessarily one by one nor in any particular order. Instead, in every round, each input chooses an arbitrary fanout set for which it has a nonzero rate of cells. Then, that input searches among the available configuration matrices for one that can accommodate the highest number of connections for that particular fanout set. Note that fanout splitting is applied only if there exists no matrix that can accommodate the entire fanout set; in that case, different copies are served over different configuration matrices.
More precisely, for a given multicast rate matrix and a given frame period (we discuss how is chosen later), the scheduler goes through the following steps:
1) Quantize all entries of every nonzero matrix , , in the fanout set expansion of to make them an integer multiple of some . The quantized matrix satisfies for every column .
Repeat (2)-(3) to construct configuration matrices ( is unknown at this point) until all entries of are set to 0. Initially, all configuration matrices contain all 0 entries.
2) The inputs take turns (randomly or in any arbitrary order). The current input chooses an arbitrary element of the quantized fanout set expansion (possibly the unicast component) for which row has a nonzero entry, i.e., for some . Let be the fanout set of that and be the set of outputs , for which for all inputs . Input chooses the configuration matrix whose set of available outputs has the largest intersection with the fanout set (3). If multiple matrices solve (3), one of them is chosen randomly.
3) For all , input sets and . Also, let the matrix be the component in the fanout set expansion for which . If , it means fanout splitting has to be applied. To handle the remaining part of the fanout set that has been left unscheduled, set for all .
Even though the multicast nonblocking switch scheduling algorithm forms the entire frame before scheduling, it can still be made online to work as a cell scheduler as follows. In every time slot, received cells at each input are placed in an appropriate MC-VOQ. Then, as in step (2), each input looks through all the configuration matrices that will be served within the next (unknown at this point) schedule slots in order to find the one that has the available set of outputs with the largest intersection set with the fanout set as in (3). That portion of the fanout set is scheduled to be transmitted with that matrix. The process is repeated for the remaining part of the fanout set until the entire fanout set is covered for that cell. Therefore, any arriving cell is guaranteed to be transferred to its entire fanout set within the next schedule slots (corresponding to a worst-case delay of time slots) if is chosen appropriately. The choice of and also determines the speedup , where the first component of is required for quantization and the second component is required for the algorithm. We make use of the following existing results in circuit switching in Clos networks to evaluate that second component of .
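The selection rule in (3) and the fanout-splitting step can be illustrated with the following sketch. It is a simplified reading of the algorithm, not the authors' code: the function name `schedule_multicast`, the request format (one unit of rate per `(input, fanout set)` pair), and the policy of opening a new configuration matrix when no existing one can serve any remaining output are assumptions made here. The sketch also assumes that an input can carry only one cell per configuration matrix, as in a crossbar schedule slot.

```python
def schedule_multicast(requests):
    """Greedy largest-intersection scheduling of multicast requests.

    requests: list of (input, fanout_set) pairs, one per quantized rate unit.
    Returns a list of configuration matrices, each stored as a dict with
    "out_to_in" (output -> input) and "busy_inputs" (inputs already used).
    """
    configs = []
    for inp, fanout in requests:
        remaining = set(fanout)
        while remaining:
            best, best_cover = None, set()
            for cfg in configs:
                if inp in cfg["busy_inputs"]:
                    continue  # assumed: one cell per input per matrix
                cover = {o for o in remaining if o not in cfg["out_to_in"]}
                if len(cover) > len(best_cover):
                    best, best_cover = cfg, cover
            if best is None:
                # Assumed policy: open a new, empty configuration matrix.
                best = {"out_to_in": {}, "busy_inputs": set()}
                configs.append(best)
                best_cover = set(remaining)
            for o in best_cover:
                best["out_to_in"][o] = inp
            best["busy_inputs"].add(inp)
            remaining -= best_cover  # nonempty only if fanout splitting occurs
    return configs


# Hypothetical example: three inputs, each multicasting to two of three outputs.
requests = [(0, {0, 1}), (1, {1, 2}), (2, {0, 2})]
for k, cfg in enumerate(schedule_multicast(requests)):
    print("matrix", k, cfg["out_to_in"])
```

In this example the second request is fanout split: output 2 is served in the first matrix and output 1 in a second one, while the third request fits entirely into the second matrix. Ties between equally good matrices are broken here by taking the first one found, whereas the text allows a random choice.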
The multicast nonblocking switch scheduling algorithm is analogous to the circuit-switching algorithm for Clos networks given in [17] and later in [10]. There, it was shown that middle-stage crossbars are sufficient for the Clos network to be nonblocking with the circuit-switching counterpart of the above multicast nonblocking switch scheduling algorithm. Note that this value is a tight upper bound on the necessary speedup since it is also shown that the necessary number of middle-stage crossbars grows as O with the switch size. Thus, by choosing the frame size , for some , one can show that a speedup is sufficient for the multicast nonblocking switch scheduling algorithm to accommodate any admissible multicast rate matrix. Consequently, for each arriving cell, the scheduler needs to search within the next schedule slots to find (3). In [4], which considers an internally buffered crossbar fabric, the sufficiency of the speedup is shown for a certain class of input traffic. With the analogy, we show that the same speedup scaling is sufficient, even for the unbuffered crossbar fabric.
The time complexity of the software implementation of multicast nonblocking switch scheduling algorithm is O per schedule slot: Up to cells are scheduled, and each search goes through up to possibly schedule matrices up to times (since the cell may need to be fanout split up to times). Each input needs to keep the schedule for every cell enqueued at its MC-VOQs. Since the delay is no more than for each cell, the required amount of memory per cell is no more than O . In order to keep the schedule for a frame of cells, the total memory requirement for the switch schedule, including the possibility of fanout splitting, is O . Note, however, that the "regularity" of the algorithm enables highly efficient, parallel hardware implementations as studied in [18] for circuit switching in Clos networks. There, the authors divided the required operation of finding the desired middle-stage crossbar into some basic types of functions such as finding the cardinality of a set and comparison. A total of counters are used to keep the number of occupied output links of each middle-stage crossbar. These counters are connected to a parallel comparator array, which finds the minimum cardinality middle-stage crossbar among a candidate set. For each connection request between an input-and an output-stage switch, this process of counting and comparing to find the appropriate middle-stage crossbar is repeated for O times, requiring O gate propagations. If the same system is adopted for cell scheduling, it will have a time complexity of O per schedule slot (since cells are scheduled in each schedule slot), much lower compared to the time complexity of the software implementation. However, the required amount of memory (hardware cost), O , is slightly higher than that of the software implementation.
Last, we would like to mention that the implementation of the MC-VOQs is the main difficulty in realizing the algorithm in hardware or software. One way of implementing them is by using linked lists, i.e., all the cells arriving at an input are stored in a single buffer and each cell is assigned a pointer to keep track of its fanout set. In this implementation, each input requires a single buffer (as opposed to separate ones), but the number of possible fanout sets that each input needs to keep track of is per cell. This corresponds to an extra memory requirement of bits per cell, which becomes impractical with increasing buffer sizes. Consequently, we would like to emphasize that the value of our scheduler is conceptual rather than practical.
Discussion: Unfortunately, parallel results do not exist for rearrangeably nonblocking multicast-capable Clos networks. Using more complex schedulers compared to the given nonblocking switch scheduler, it may be possible to achieve an o scaling for the required speedup. However, since we know that the speedup necessary for 100% throughput (for any scheduler) grows unboundedly with the size of the switch, it is not clear whether the extra effort for lowering the speedup is worth it. Indeed, we also know that achieving 100% throughput using the minimum speedup necessary is NP-hard, which is not the case with the multicast nonblocking switch scheduling algorithm.
VI. SUMMARY AND CONCLUSION
In this paper, we used the analogy between packet scheduling in crossbar switches and circuit switching in a three-stage Clos network to study a number of issues involved in providing 100% throughput to all admissible unicast and multicast traffic over crossbar switches. To set up the analogy, we presented a theory of rate quantization.
We showed that for a crossbar switch of size , a speedup of O is sufficient to support 100% throughput for any admissible multicast traffic, and we provided a scheduler for this task. For the scheduler, we exploited circuit switches for wide-sense nonblocking Clos networks. The problem of multicast switch scheduling with minimum necessary speedup is NP-hard, whereas the time complexity of multicast nonblocking switch scheduling associated with the efficient hardware implementations of our scheduler is only O per time slot.
We also showed that if fanout splitting of multicast packets is not allowed, a speedup of 2 is necessary, even when the arrival rates are within the admissible region for unicast traffic, for which no speedup is necessary to provide 100% throughput. Thus, disabling the fanout splitting of multicast cells may not be an efficient solution for the complexity problem.
Based on the well-known equivalence between three-stage Clos networks and frame-based scheduling, we revisited some problems in unicast switch scheduling. We illustrated that the well-known result that "a speedup of 2 is necessary for 100% throughput for all admissible unicast traffic using maximal matching" becomes a straightforward by-product of the Clos network analogy using strict-sense nonblocking. We believe the analogy between the scheduling problem in crossbar switches and nonblocking circuit assignment in Clos networks with wide-sense nonblocking (see, e.g., [19] for various theorems on wide-sense nonblocking) can be insightful if one considers speedup values between 1 and 2.
APPENDIX I RATE QUANTIZATION ALGORITHM AND PROOF OF THEOREM 2
Our algorithm generates matrix (and thus matrix ) in two steps. In the first step [(1) in Example 2], a matrix with all entries integer multiples of is constructed. Every entry of the original matrix is increased by some nonzero amount, so they all become integer multiples of . The column sums of matrix are not necessarily identical.
In the second step [(2) in Example 2], sufficiently many entries of are reduced by to make the column sums of matrix equal 1. The challenging part of the algorithm is choosing which entries to reduce. To illustrate that this indeed is not a straightforward task, consider the above example and suppose we construct from starting with the first entry of the first row. Proceed with that row going through all the columns from left to right, reducing each entry by if the sum of the entries of that column is greater than 1, until the first row sum becomes 1. Once the first row entries sum to 1, proceed with the second row and repeat the process. After completing the second row, we end up with the following matrix, whose third row is yet to be processed:
As we proceed with the third row, the only entry that can be reduced is the final one, 0.7, since all the other column sums are already 1. However, it has to be reduced by 0.2 for the resulting matrix to be doubly stochastic. If we do so, we end up with . Hence, we cannot choose the entries to be processed arbitrarily, and we must be more careful in constructing since each entry of can be reduced by no more than once.
Rate Quantization Algorithm
We first give the algorithm formally, and then a detailed explanation of each step follows. Initial Values: Let and be such that and are the th row and th column sum, respectively, as illustrated in Fig. 5 . Let and .
Repeat (1)- (2) 
.
Setup: Given any , there exists a such that is an integer multiple of for all . Let be the matrix whose entry is . Define . All rows and columns of sum to integer multiples of . By definition, 1 is also an integer multiple of , and thus, as illustrated in Fig. 5 , we can represent the sum of the entries of the th row and the th columns and , respectively, where and are positive integers.
In the iterative step, the algorithm scans row by row, starting with the row with maximum row sum , and determines whether the entry will remain unchanged or reduced by before it is copied as the corresponding entry of the output matrix . Each row is scanned starting from the entry with the largest column sum and continuing with entries of decreasing column sums. If both and are positive for the current , that entry is reduced by , and otherwise it is copied directly as the corresponding entry of .
The described algorithm reduces the elements of each row of in the order of decreasing row sums. We prove Theorem 2 constructively by proving that the described algorithm indeed ends up with matrix of the desired form. Note that one might also randomize the procedure and work on a row randomly picked at every iteration. This modified algorithm and the proof of correctness for the modified algorithm can be found in [20] .
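The two steps of the algorithm can be sketched as follows. This is a hedged reading of the description above rather than the authors' exact procedure: the function name `quantize_rates`, the parameter `k` (so that the quantum is 1/k), and the choice of strictly rounding each entry up to the next multiple of 1/k in the first step are assumptions. The second step follows the text: rows are processed in decreasing order of row sum, each row is scanned in decreasing order of current column sum, and an entry is reduced by one quantum only if both its row and its column still exceed 1.

```python
from fractions import Fraction
import math


def quantize_rates(rates, k):
    """Quantize a doubly stochastic matrix of Fractions to multiples of 1/k."""
    n = len(rates)
    # Step 1 (assumed form): increase every entry to the next multiple of 1/k.
    units = [[math.floor(r * k) + 1 for r in row] for row in rates]
    # Excess of each row and column sum over 1, counted in quanta.
    row_excess = [sum(units[i]) - k for i in range(n)]
    col_excess = [sum(units[i][j] for i in range(n)) - k for j in range(n)]
    # Step 2: rows by decreasing row sum, columns by decreasing column sum.
    for i in sorted(range(n), key=lambda r: -row_excess[r]):
        for j in sorted(range(n), key=lambda c: -col_excess[c]):
            if row_excess[i] > 0 and col_excess[j] > 0:
                units[i][j] -= 1  # each entry is reduced at most once
                row_excess[i] -= 1
                col_excess[j] -= 1
    return [[Fraction(u, k) for u in row] for row in units]


# Hypothetical 2 x 2 example with quantum 1/4.
rates = [[Fraction(3, 10), Fraction(7, 10)],
         [Fraction(7, 10), Fraction(3, 10)]]
for row in quantize_rates(rates, 4):
    print([str(x) for x in row])
```

Exact rational arithmetic is used so that the row and column sums can be compared with 1 without rounding error. The sketch does not itself check that every row and column ends up summing to 1; that guarantee is what Lemma 1 establishes for the algorithm in the text.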
Lemma 1: Rate Quantization Algorithm successfully terminates with a matrix , which is doubly stochastic.
Before we give the proof of the lemma, note that and We can represent and as an entry of the vectors and , respectively. Let be the number of columns , for which . For example, if , then as illustrated in Fig. 6 . Proof: By induction. We shall first show that initially (4) for all . Thus, for any and for , which is the first row to be processed, the algorithm will always be able to find sufficient entries to reduce (by ) to make the row sum equal to 1. We will prove a more general version of (4) (5) namely, the vector is majorized by the vector . For the definition, see Appendix II or [21] for a complete treatment of majorization.
First, we prove that (5) holds at the beginning of the algorithm. Recall that . Hence
. Let the th column vector of be , and thus and , where . From Kemperman's theorem [22] , is majorized by any vector for which entries are , and the other entries are 0. Hence (6) Thus, the vector on the right side of (6) is the maximal vector (in the sense of majorization) of the set of vectors whose entries are between 0 and and . Let us denote the maximal vector of the th column vector by . Now, let us define a new matrix, , where each column is the maximal vector of the corresponding column of . Note that the vector of column sums for this new matrix is , and thus the corresponding distribution will be ; however, the row sums are not . Let the vector of row sums for our matrix be . Thus, is the number of columns with , i.e., is the number of columns with , i.e., and so on. More precisely, is the number of columns , for which . Thus However, the vectors , are order-symmetric (see [21] for the definition). Hence, we get the desired result using Day's theorem [23] (8)
We just showed that at the beginning of the algorithm, , and thus , for all . That is, the first step of the algorithm can be executed successfully to make the first row sum to 1. The partial sums 6 of the two sequences are illustrated in Fig. 7. Such curves are called Lorenz curves and if, for two vectors, , then the partial sum curve for will always be above that of .
Next, we will prove that a similar majorization relation holds at the beginning of every step of the algorithm. We will use induction as follows. We have shown that at the beginning of the first step. We now assume that it holds at the beginning of the th step, , and show that it still holds at the end of the th step. As a by-product, we also show that the algorithm can successfully complete each step.
Suppose the algorithm successfully constructed the first rows of . We will show that (5) still holds at the beginning of the th step, and the corresponding row of can be formed successfully.
6 The mth partial sum of a vector, ṽ, is defined to be v . Recall that v is majorized by ṽ if every partial sum of ṽ is at least as great as the corresponding partial sum of v.
First, let us focus on the two vectors and at the beginning of step . At this point, , where is the th entry in decreasing order from the largest in at the beginning of the algorithm (before any row is processed). The sum of the entries of the row that is currently being processed is . By the induction hypothesis, we assume ; therefore, there should be as many 0's in vector as there are in (verified in Appendix II). Since there are at least 0's in , we have . At the beginning of the th step, the entries of and (the decreasing rearrangement of the entries of ) can be listed as follows:
Since , there exists at least one entry in that is greater than or equal to . Let the smallest such entry be . Lemma 2: At the end of th step, the only change in is that the entries and will be replaced with and a 0. Proof: These two changes can be explained as follows. The algorithm will look into the current for the column with an entry that has not yet been reduced in step and that has the maximum column sum and reduce it by . Suppose this maximum column sum is for some . This operation will reduce the number of columns such that by 1. Thus, the only change in will be in the smallest nonzero entry, , which will decrease by 1. If that entry is greater than 1, then there were multiple entries with the maximum column sum. The algorithm continues with these other entries. Hence, if the original value of is greater than , then after processing entries, will become 0 and entries will be left to be decreased at the row currently being processed. The algorithm will continue with the entries that have not been reduced before and with highest possible column sums. At this stage, the new value of is 0, and the new value of is . Note that potential entries have already been processed, and if is greater than the second largest entry of , then will be reduced to , but no further beyond that since potential entries have already been processed. Similarly, each entry of , which is smaller than , will be replaced with the next entry in order. Finally, the first entry in that is greater than will be reduced by only . Hence, after the th row is processed, will have a 0 replacing and a replacing . Note that at the end of the th step, will be the same, except will be replaced with a 0.
This process is illustrated in Fig. 8 assuming , i.e., at the beginning of step . If , then at the end of step . Notice that 6 is the smallest entry in greater than or equal to . Hence, 6 and 4 are changed to and 0, respectively.
Now, we show that at the end of the th step of the algorithm. However, before that, we present a graphical illustration of what happens in the th step. The Lorenz curves of and are illustrated in Fig. 9. The entry is removed from . The new Lorenz curve for can be sketched from the old one by just removing the first segment of the curve and attaching the rest of the curve to the origin as illustrated in the figure. The new Lorenz curve for can similarly be sketched with some modification to the old one. The algorithm will find the segment with the smallest increment greater than . Then, it will reduce this increment by , remove , and attach the two separate parts. The two Lorenz curves intersect at 0 and at . Initially, these are the only two points at which they intersect, and the curve for is always above the curve for otherwise. We need to show that this is the case after the th step. This can be easily observed from Fig. 9. Since the removed segment in is to the left of the reduced segment of , the distance between the two curves will only increase in between these modified segments and remain the same outside this region at the end of the th step. We can prove this statement as follows. There are two regions we need to consider as shown in the following table:
At the end of the th step, the partial sums of the two sequences are as follows. In region I, at the end of the th step, will be replaced with a 0, and it will no longer be in the second region. All the entries of will be unchanged up to . Thus, the partial sums will change in favor of by an extra from the beginning all the way down to . This entry is replaced with , and the next entry, , will be replaced with a 0 and removed from the second region. The total decrease in the partial sums of in the first region is . The extra gained in favor of earlier by the removal of from vector is good enough to make up for this loss of . The second region for both and is expanded similarly, with the addition of a 0. This will not affect the partial sums, and hence the majorization is preserved.
Thus, we proved that at the beginning of each step, (5) holds and , for all . Therefore, the algorithm will always be able to find the desired number of entries to reduce, and at the end of the algorithm, , for all . However, since (11) and , for all , it is also true that , for all , completing the proof.
Lemma 3: Every entry of is an integer multiple of .
Proof: The input matrix of the algorithm already has all entries as integer multiples of . We complete the proof by noting that the change in each entry from to is an integer multiple of (either reduced by or left unchanged).
Lemma 4: Every entry of is at least as great as its counterpart in decreased by (12)
Proof: Note that (13) Since the algorithm reduces every entry by at most (14) Inequality (12) is immediate by (13) and (14).
Combining Lemmas 1 and 3, we proved that is a doubly stochastic matrix, all entries of which are integer multiples of . Since is defined as the constant matrix, composed of for all input-output pairs , all entries of matrix are integer multiples of , and the rows and columns of sum to . Moreover, Lemma 4 shows that for all input-output pairs . Since during the Rate Quantization Algorithm, no entry of is increased by more than to construct matrix , we can write . Therefore, putting Lemmas 1, 3, and 4 together, Theorem 2 is proved. In [20], we prove that Lemma 1 holds even if the Rate Quantization Algorithm processes the rows of matrix in an arbitrary order rather than processing the one with the maximum row sum in each step.
APPENDIX II BASIC DEFINITIONS IN MAJORIZATION
For any , let denote the components of in decreasing order, and let Suppose Subtracting both sides of (15) from , we get (17) which is equivalent to (15) . Hence, if and both have nonnegative entries, there are at least as many 0's in as in .
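The majorization relation used throughout Appendix I can be checked numerically with a short sketch such as the one below. The function name `is_majorized` is introduced here for illustration only. It uses the standard definition: x is majorized by y if the two vectors have equal totals and every partial sum of the decreasing rearrangement of y is at least the corresponding partial sum for x.

```python
def is_majorized(x, y):
    """Return True if vector x is majorized by vector y.

    Intended for integers or exact rationals; floating-point inputs may fail
    the equal-total check because of rounding.
    """
    if len(x) != len(y) or sum(x) != sum(y):
        return False
    xs = sorted(x, reverse=True)  # decreasing rearrangement of x
    ys = sorted(y, reverse=True)  # decreasing rearrangement of y
    partial_x = partial_y = 0
    for a, b in zip(xs, ys):
        partial_x += a
        partial_y += b
        if partial_x > partial_y:
            return False
    return True


print(is_majorized([2, 2, 1], [3, 1, 1]))  # x is more evenly spread than y
print(is_majorized([1, 1, 0], [2, 0, 0]))  # y has at least as many zeros as x
```

The second call illustrates the property stated above for nonnegative vectors: when x is majorized by y, the vector y has at least as many zero entries as x.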
