Abstract. This paper addresses the scalability problems of high-speed switches and presents an integrated scheduling algorithm that supports unicast and multicast traffic efficiently in input-queued packet switches. Considering the tradeoff between complexity and performance, the proposed integrated algorithm operates without iteration and reduces the scheduling overhead to O(N) with a two-phase (request-grant) sequential scheduling of unicast and multicast traffic. In addition, it can be implemented in a fully distributed way, which makes it more suitable for high-speed switches. Simulation results show that the proposed algorithm performs well in terms of throughput and average delay for different traffic compositions under various traffic patterns.
Introduction
The growing number of newly emerging applications on the Internet has created an increasing need for efficient multicast traffic support. As a result, with the continuous growth of bandwidth in fiber links, the need for switches/routers that are capable of switching unicast and multicast cells at very high speeds is urgent. To the best of our knowledge, the integrated scheduling algorithms presented so far are, in fact, combinations of earlier unicast and multicast algorithms unified in one integrated scheduler. The input queuing structure has likewise been a combination of a unicast queuing structure and a multicast queuing structure.
The widely used unicast queuing structure is virtual output queuing (VOQ), since it avoids the head-of-line (HoL) blocking problem, and 100% throughput can be achieved using schedulers such as iSLIP [1] and DRRM [2]. In a VOQ-based N × N switch, N queues are maintained at each input port; each queue contains packets destined for the same output. As for multicast traffic, a multicast packet can have more than one destination, known as its fan-out set. Consequently, the multicast queuing structure can vary from just one multicast first-in-first-out (FIFO) queue per input to $2^N - 1$ queues per input, where N is the number of output ports of the switch, and a considerable number of solutions based on this architecture have therefore been proposed, such as [3] [4] [5] [6] [7] [8]. The performance of such queuing structures was analyzed in [9] [10] [11]. Based on the above input queuing structures, integrated scheduling algorithms have been proposed. They were mainly proposed for the input-queued (IQ) crossbar-fabric-based switching architecture because of its scalability, low hardware requirements, and intrinsic multicast capabilities.
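To give a sense of why the full multicast queuing structure does not scale, the short sketch below counts the queues needed per input when one queue is kept for every nonempty fan-out set. The function name `full_multicast_queue_count` is hypothetical and used only for illustration.

```python
def full_multicast_queue_count(num_outputs):
    """Number of nonempty fan-out sets over `num_outputs` outputs (2^N - 1)."""
    return 2 ** num_outputs - 1


# For a 16-port switch: 16 unicast VOQs per input, but 65535 queues
# per input if every possible fan-out set had its own multicast queue.
unicast_voqs = 16
multicast_queues = full_multicast_queue_count(16)
```

This gap is the motivation for keeping only one or a small number of multicast FIFO queues per input.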
Most of these algorithms were based on input VOQ for unicast traffic and one FIFO queue for multicast traffic, such as ESLIP [12] and others [13]. However, the HoL blocking problem of multicast traffic limits the throughput achievable by such switches. Other algorithms [14] used VOQ for unicast traffic and a small number, k, of FIFO queues for multicast traffic to alleviate the multicast HoL blocking. The algorithm in [15] used a VOQ queuing structure for unicast and multicast traffic separately to deal with the HoL blocking.
In contrast to wireless sensor network protocols, whose main design constraint is the limited energy of sensors [16], high-speed switches in backbone networks have very little time to perform scheduling as link speeds grow dramatically. As a result, the iterative design and high scheduling overhead of existing integrated scheduling algorithms become the bottleneck of integrated scheduler designs, since the scheduling overhead scales up very quickly as the link speed and switch size increase, and the need for simple, high-performance switches that support unicast and multicast traffic simultaneously is urgent. For this reason, we propose a new non-iterative integrated scheduling algorithm named the Unicast and Multicast Dual Round-Robin integrated algorithm (UMDRR), which completes in only one matching cycle per time slot by scheduling unicast and multicast traffic sequentially, rather than in the traditional log(N) iterations, and reduces the multicast scheduling overhead from O(kN) to O(N), which makes it implementable at high speeds. Simulation results show that UMDRR achieves good performance under various traffic patterns.
The rest of the paper is structured as follows. In Section 2, we describe the system architecture and the proposed integrated scheduling algorithm with scheduling overhead analysis. In Section 3, we evaluate the performance of the proposed scheme by simulation. Finally, we conclude the paper in Section 4.
System architecture and the algorithm
The proposed integrated scheduling algorithm is targeted at N × N input-queued switches. We first describe the system architecture of the proposed algorithm and then elaborate on the details of the algorithm.
System architecture
The N × N switch system architecture of the proposed integrated scheduling algorithm is shown in Figure 1. We focus on a synchronous slotted switch architecture. The incoming variable-sized packets are segmented into fixed-sized packets before entering the input queues, and the segments are put back together after switching.
As illustrated in Figure 1, two sets of queues are organized separately at each input port. For unicast traffic, the VOQ technique is deployed and N VOQs are maintained at each input; for multicast traffic, a small number, k, of FIFO queues are allocated at each input port. Unicast packets are assigned to the proper queues according to their destinations, while multicast flows are partitioned into the k queues according to the modulo multicast cell assignment described in [17].
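The queue organization at one input port can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `InputPort`, the `flow_id` field, and the exact form of the modulo rule (`flow_id % k`) are assumptions, since the assignment in [17] is not reproduced here.

```python
from collections import deque


class InputPort:
    """Queues at one input: N unicast VOQs plus k multicast FIFO queues."""

    def __init__(self, num_outputs, num_mcast_queues):
        self.voqs = [deque() for _ in range(num_outputs)]
        self.mcast = [deque() for _ in range(num_mcast_queues)]

    def enqueue_unicast(self, cell, output):
        # A unicast cell goes to the VOQ of its single destination.
        self.voqs[output].append(cell)

    def enqueue_multicast(self, cell, flow_id, fanout):
        # Assumed modulo rule: the flow identifier selects one of the k queues.
        q = flow_id % len(self.mcast)
        # The fan-out set travels with the cell inside the FIFO queue.
        self.mcast[q].append((cell, frozenset(fanout)))
        return q


port = InputPort(num_outputs=4, num_mcast_queues=2)
port.enqueue_unicast("u0", output=3)
chosen = port.enqueue_multicast("m0", flow_id=5, fanout={0, 2})
```

Keeping all cells of a flow in the same queue preserves their order, while spreading flows over k queues gives the scheduler up to k head-of-line cells to choose from at each input.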
We first define the terms that are used throughout the paper. Let $\lambda$ be the average arrival rate, equal to the input load, $\mu$ be the output load, and the unicast and multicast output loads be denoted as $\mu_u$ and $\mu_m$; then the following relations hold:
$$\mu_u = \lambda P_u, \qquad \mu_m = \lambda P_m \bar{F}, \qquad \mu = \mu_u + \mu_m,$$
where $P_u$ ($P_m$) represents the probability that an arriving packet is a unicast (multicast) cell, and $\bar{F}$ the average number of destinations of multicast cells. The total lengths of the unicast and multicast traffic queues are denoted by $L_u$ and $L_m$ respectively, and the total length of the mixed traffic queues $L$ is then derived as
$$L = L_u + L_m = \sum_{i=1}^{N}\sum_{j=1}^{N} L^{u}_{i,j} + \sum_{i=1}^{N}\sum_{j=1}^{k} L^{m}_{i,j},$$
where $L^{u}_{i,j}$ ($L^{m}_{i,j}$) represents the length of the $j$th unicast (multicast) queue allocated at input $i$.
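As a numeric illustration of how the input load, the traffic composition, and the multicast fan-out combine into the output load, the sketch below assumes that destinations are spread uniformly over the outputs, that every arrival is either unicast or multicast, and that each multicast cell contributes one copy per destination. The names `lam` and `avg_fanout` are hypothetical and chosen only for this example.

```python
def output_loads(lam, p_multicast, avg_fanout):
    """Return (unicast, multicast, total) output load per output port."""
    p_unicast = 1.0 - p_multicast  # assumed: arrivals are unicast or multicast
    mu_u = lam * p_unicast                # one copy per unicast cell
    mu_m = lam * p_multicast * avg_fanout  # one copy per destination
    return mu_u, mu_m, mu_u + mu_m


# Input load 0.5, 10% multicast cells, average fan-out of 4 destinations.
mu_u, mu_m, mu = output_loads(lam=0.5, p_multicast=0.1, avg_fanout=4.0)
```

Under these assumptions a modest multicast fraction already accounts for a large share of the output load, because each multicast cell is replicated to several outputs.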
Integrated scheduling algorithm
By employing an existing unicast scheduling algorithm [2] and a new multicast scheduling algorithm, we propose a sequential integrated scheduling algorithm that supports unicast and multicast traffic efficiently. Unicast scheduling and multicast scheduling are coordinated together with a specific priority in a time slot. The scheduling procedure works as follows.
Both unicast traffic scheduler and multicast traffic scheduler are distributed at each input and output port. Each input scheduler maintains three priority pointers: a unicast pointer, a multicast primary pointer and a multicast secondary pointer. Primary pointers are designed to provide fairness among k multicast queues at each input, while secondary pointers are used to alleviate the HoL blocking and thus guarantee high performance. Each output scheduler maintains two priority pointers: a unicast pointer and a multicast pointer. All output multicast pointers point to the same input, and increase by one at each multicast time slot. This pointer update rule [4] is fundamental to guarantee that the scheduler can run in a fully distributed way. We denote the input preferred by all output multicast pointers as primary input, and the others as secondary inputs. A detailed description of the integrated algorithm follows, including three phases:
Phase 1: At the beginning of each time slot, determine the scheduling priority with the following probabilities:
A time slot identified to schedule unicast (multicast) traffic first is called a unicast (multicast) slot.
Phase 2: Serve the prioritized traffic. This process includes the following two steps:
Step 1: Request. In a unicast slot, each input sends an output unicast request corresponding to the first nonempty VOQ in a fixed round-robin order, starting from the current position of the unicast pointer. The unicast pointer of the input scheduler is incremented to one location beyond the selected output if, and only if, the request is granted in step 2. In a multicast slot, the primary input (each secondary input) sends multicast requests to all destined output ports corresponding to the first nonempty multicast queue in a fixed round-robin order, starting from the current position of the multicast primary pointer (secondary pointer). The primary pointer (each secondary pointer) of the primary input (each secondary input) is incremented to one location beyond the selected queue.
Step 2: Grant. In a unicast slot, if an output receives one or more requests, it chooses the one that appears next in a fixed round-robin schedule starting from the current position of the unicast pointer. The output notifies each requesting input whether or not its request was granted. The unicast pointer of the output scheduler is incremented to one location beyond the granted input. In a multicast slot, if an output receives one or more requests, it chooses the one that appears next in a fixed round-robin schedule starting from the current position of the multicast pointer. The output notifies each requesting input whether or not its request was granted. The multicast pointer of the output scheduler is incremented by one.
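The unicast half of Phase 2 can be sketched as a single request-grant cycle. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names and the Boolean occupancy matrix `voq_nonempty` are hypothetical, and queue contents are abstracted away.

```python
def round_robin_pick(candidates, pointer, size):
    """Return the first index in `candidates` at or cyclically after `pointer`."""
    for offset in range(size):
        idx = (pointer + offset) % size
        if idx in candidates:
            return idx
    return None


def unicast_request_grant(voq_nonempty, in_ptr, out_ptr):
    """One request-grant cycle; returns {input: output} and updates pointers."""
    n = len(voq_nonempty)
    # Step 1 (request): each input requests one output, chosen round-robin.
    requests = {}
    for i in range(n):
        nonempty = {j for j in range(n) if voq_nonempty[i][j]}
        j = round_robin_pick(nonempty, in_ptr[i], n)
        if j is not None:
            requests.setdefault(j, set()).add(i)
    # Step 2 (grant): each requested output grants one input, round-robin.
    match = {}
    for j, inputs in requests.items():
        i = round_robin_pick(inputs, out_ptr[j], n)
        match[i] = j
        out_ptr[j] = (i + 1) % n  # one beyond the granted input
        in_ptr[i] = (j + 1) % n   # input pointer moves only when granted
    return match


voq_nonempty = [
    [True, True, False],   # input 0 holds cells for outputs 0 and 1
    [True, False, False],  # input 1 holds cells for output 0
    [False, False, True],  # input 2 holds cells for output 2
]
in_ptr, out_ptr = [0, 0, 0], [0, 0, 0]
match = unicast_request_grant(voq_nonempty, in_ptr, out_ptr)
```

In this example inputs 0 and 1 both request output 0, so input 1 stays unmatched and its pointer is left unchanged. Because each input sends only one request, no accept step is needed and the exchange completes in a single cycle.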
Phase 3: Serve the nonprioritized traffic with the remaining resources. This process includes the following two steps:
Step 1: Request. In a unicast slot, each unmatched input sends multicast requests to all destined output ports corresponding to the first nonempty multicast queue in a fixed round-robin order, starting from the current position of the multicast secondary pointer. In a multicast slot, each unmatched input sends an output unicast request corresponding to the first nonempty VOQ in a fixed round-robin order, starting from the current position of the unicast pointer.
Step 2: Grant. In a unicast (multicast) slot, if an unmatched output receives one or more requests, it chooses the one that appears next in a fixed round-robin schedule starting from the current position of the multicast (unicast) pointer. The output notifies each requesting input whether or not its request was granted.
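Phase 3 in a unicast slot, where the ports left unmatched by the unicast scheduler serve multicast traffic, can be sketched as follows. The sketch makes several assumptions that the text does not fix: requests to already matched outputs are simply dropped, pointers are not updated in this phase, and a single integer `out_mcast_ptr` stands for the common value of all output multicast pointers. All function and parameter names are hypothetical.

```python
def round_robin_pick(candidates, pointer, size):
    """Return the first index in `candidates` at or cyclically after `pointer`."""
    for offset in range(size):
        idx = (pointer + offset) % size
        if idx in candidates:
            return idx
    return None


def phase3_multicast(hol_fanouts, secondary_ptr, out_mcast_ptr,
                     free_inputs, free_outputs):
    """Return {output: granted input} for the ports left unmatched by Phase 2.

    hol_fanouts[i][q] is the fan-out set of the head-of-line cell of
    multicast queue q at input i (an empty set if the queue is empty).
    """
    n = len(hol_fanouts)
    k = len(hol_fanouts[0])
    # Step 1 (request): each unmatched input picks its first nonempty
    # multicast queue, starting from its secondary pointer.
    requests = {}
    for i in sorted(free_inputs):
        nonempty = {q for q in range(k) if hol_fanouts[i][q]}
        q = round_robin_pick(nonempty, secondary_ptr[i], k)
        if q is None:
            continue
        # Only outputs that are still unmatched can serve the request.
        for j in hol_fanouts[i][q] & free_outputs:
            requests.setdefault(j, set()).add(i)
    # Step 2 (grant): each unmatched output grants round-robin, starting
    # from the shared multicast pointer.
    return {j: round_robin_pick(inputs, out_mcast_ptr, n)
            for j, inputs in requests.items()}


hol_fanouts = [
    [{1}, set()],
    [{1, 3}, set()],
    [{1}, set()],
    [set(), {0, 3}],
]
grants = phase3_multicast(hol_fanouts, secondary_ptr=[0, 0, 0, 0],
                          out_mcast_ptr=2, free_inputs={1, 3},
                          free_outputs={1, 3})
```

Here inputs 1 and 3 both request output 3; the shared pointer at 2 makes input 3 the next candidate in round-robin order, so output 3 goes to input 3 and output 1 to input 1.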
Scheduling overhead analysis
The major problem with existing iterative scheduling algorithms is that the scheduling overhead scales up very quickly as the link speed and switch size increase, which limits the scalability in high-speed switches having very short time to perform scheduling. This study overcomes the limitations and proposes a new integrated scheduling algorithm with reduced communication overhead.
We first define scheduling overhead as the information exchanged at an input port in one matching cycle. As we can see from Table 1 
Performance evaluation and analysis
In this section, we show simulation results obtained with OPNET Modeler [18]. To evaluate the performance of the proposed integrated scheduling algorithm, we consider several different traffic conditions and compare the algorithm with ESLIP [12] and fSCIA [14] for a 16×16 switch. The ESLIP algorithm is chosen for comparison because it is practical and is deployed in commercial switching products, while fSCIA also uses a number of queues for multicast traffic and exhibits good performance. The simulated switch is assumed to have sufficient buffers at the inputs. We consider a mixture of unicast and multicast traffic in this study, and all algorithms run with a single iteration for a fair comparison.
Traffic model
Two traffic scenarios are used to evaluate system performance: Bernoulli (uncorrelated) arrivals and Bursty arrivals. For Bernoulli arrivals, the probability that a new packet arrives in a time slot is independent of any other time slot. Note that when the unbalance parameter of the unicast traffic is 0, the load is uniform over all outputs, and when it is 1, the unicast traffic load is totally unbalanced.
Performance under uniform traffic
Figure 2 and Figure 3 show the average delay against the output load for ESLIP, fSCIA and UMDRR under uniform traffic. For the given traffic composition ($P_m = 0.1$), we can observe that as the output load increases, UMDRR is very effective in reducing the average delay and performs reasonably well. It has lower latency than ESLIP and fSCIA with a single iteration. We also show the simulation results of ESLIP and fSCIA with log(N) iterations for reference. At the expense of high complexity, ESLIP and fSCIA with 4 iterations achieve lower cell delays; however, for ESLIP the difference is not even significant at high load. Note that ESLIP and fSCIA perform poorly with one iteration because they suffer from inefficient matching, in which some grants are wasted because of input contention, and as a result some outputs can be left idle by the scheduling decision in the time slot.
Performance improvement by increasing k
The delay and throughput performance of the proposed integrated algorithm can be improved efficiently by increasing the number of multicast queues. Importantly, the multicast scheduling overhead of the proposed algorithm remains O(N) as k grows, which is different from the traditional O(kN) overhead. Figure 8 and Figure 9 present the improvement in delay and throughput performance obtained by increasing the number of multicast queues for Bernoulli traffic and Bursty traffic, respectively. For the given traffic composition ($P_m = 0.1$), we can observe that, with the novel update rule for multicast requesting pointers at each input, the delay and throughput of UMDRR improve efficiently as k increases. The intuition behind this is that as k increases, the update rule of the requesting pointer allows more new cells to participate in scheduling during the next time slot, which reduces the output contention and consequently alleviates the HoL blocking problem. From Figure 9 we can see that the throughput improvement is not significant when the multicast traffic fraction is small, and becomes obvious as the proportion of multicast traffic grows. We can also observe that a high throughput can be achieved when k grows to 8, which agrees with the conclusion in [9] that a small number of multicast queues (fewer than 10) is enough to obtain high switch performance.
Conclusion
In this paper, we present a scalable, fully distributed, fair, and simple integrated scheduling algorithm that supports unicast and multicast traffic simultaneously. From a practical point of view, the proposed algorithm reduces the multicast scheduling overhead from the traditional O(kN) to O(N) while providing good performance in terms of delay and throughput. Simulation results show that the algorithm is well suited to large-capacity, high-speed switches/routers that have very little time to perform scheduling, under various traffic patterns. Several issues are not discussed in this paper, including the improvement of switching performance under purely unbalanced unicast traffic and the theoretical analysis of the proposed integrated scheduling algorithm; these will be addressed in our future work.
