Abstract: Recent advances in optical networking reveal that large-scale optical networks supporting heterogeneous traffic may soon become economical as the underlying backbone in wide area networks, in which optical routers play a key role. One big challenge in the design of future large-scale optical systems is packet scheduling for the core optical routers. The optical router essentially is a delay system with packets waiting at its ingress queues. A scheduler is necessary to allocate resources so that satisfactory delay and jitter performance of different types of traffic can be achieved and system capacity is efficiently utilized. This paper develops a nonblocking scalable scheduling algorithm and presents related performance evaluations in a multi-service high capacity core optical router. The proposed algorithm is based on a heuristic approximation of a Linear Integer Programming model. It is shown that the heuristic solution is 'close' to the optimal solution 'most of the time', yet it is much easier to implement.
INTRODUCTION
Internet traffic has grown explosively in the past few years. It has triggered significant research in the design of large-scale optical systems with very high-speed core optical switches and routers (e.g., [1] [2] [3] [4]). One big challenge in the design of a large-scale high-speed optical system is packet scheduling for core routers. A dynamic scheduling mechanism is necessary to control the switching fabric of the optical router, for the purpose of providing non-blocking transmission and dynamic adaptation to varying traffic patterns and volumes over time. The adaptation must be fast enough to support fairness and Quality of Service (QoS) requirements as measured in terms of delay, Bit Error Rate (BER), throughput, etc. On the other hand, frequent schedule changes may cause network instability in bandwidth control. An effective scheduling design must strike a good balance among these factors.
Much research has been conducted on scheduling optical switches and routers. A scheduling algorithm was proposed in [5, 6] to provide best-effort services in the Birkhoff-von Neumann switch. The problem was formulated as a resource sharing problem that can be optimized with respect to efficiency and fairness. Since that work dealt with best-effort services only, the QoS issues for different types of traffic were not addressed in [5, 6]. A fair scheduler was presented in [7] that is suitable for buffer-less circuit-switched blocking networks operating with distributed, asynchronous controllers and variable length messages. The tradeoffs and performance limitations of the fair scheduler were discussed. That paper studied circuit-switched optical networks rather than packet-switched or IP based networks. In [8], a hierarchical scheduling framework was introduced in a class of photonic packet switching systems based on WDM, in which the flow scheduling was separated from the
OPTICAL SWITCH ARCHITECTURE
The proposed scheduling algorithm is presented using the core optical router architecture described in Stanford's "OR (Optical Router) Project" [1]. The base model of the proposed system architecture is characterized as a core optical router surrounded by remote edge routers. The edge routers are connected to the core using wavelength division multiplexing (WDM) links. In particular, we use the configuration in Figure 1 to describe the proposed scheduling algorithm. However, our solution is not limited to the particular architecture configuration and parameters shown below.
The core optical router consists of three stages: 1 core switch fabric, 4 (ingress, egress) edges, and 4 (ingress, egress) ports or Line Cards (LC) per edge. Each port carries OC-48 or 2.5 Gbps traffic, leading to an edge capacity of OC-192 or 10 Gbps. A port generates 16 virtual waves, each at a bandwidth of OC-3. Each edge is connected to the core using WDM links with 64 (16x4) virtual waves, for an aggregated capacity of 10 Gbps from each edge and a combined 40 Gbps for 4 edges. The core scheduler determines the scheduling patterns to grant, which change as a function of the input traffic characteristics. For this particular optical switch, a fixed length scheduling cycle consists of 64 wave slots, with each wave slot lasting 1 µs. During each of the 64 wave slots, the core switch fabric is capable of establishing a different mesh connectivity pattern from the ingress edges to the egress edges. With the given capacity and the size of the scheduling cycle, each wave slot is able to switch a payload of 1250 bytes (= OC-192 * 1 µs). The core switching granularity is OC-3, so the bandwidth for a port-to-port connection has to be allocated in multiples of OC-3. A schedule produced by the core dynamic scheduler remains in effect until a new schedule is deployed.
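The payload size and switching granularity quoted above follow from simple arithmetic. The sketch below reproduces it using the nominal rates from the text (OC-192 taken as exactly 10 Gbps); the constant and function names are illustrative assumptions, not part of the router design.

```python
# Nominal figures taken from the text; names are hypothetical.
EDGE_RATE_BPS = 10_000_000_000   # OC-192, rounded to 10 Gbps as in the text
WAVE_SLOT_NS = 1_000             # one wave slot lasts 1 microsecond
SLOTS_PER_CYCLE = 64             # wave slots per scheduling cycle


def payload_bytes(rate_bps=EDGE_RATE_BPS, slot_ns=WAVE_SLOT_NS):
    """Bytes an edge can send during one wave slot."""
    bits = rate_bps * slot_ns // 1_000_000_000
    return bits // 8


def slot_bandwidth_bps(rate_bps=EDGE_RATE_BPS, slot_ns=WAVE_SLOT_NS,
                       slots=SLOTS_PER_CYCLE):
    """Sustained bandwidth obtained from one wave slot in every cycle."""
    cycle_ns = slot_ns * slots
    return payload_bytes(rate_bps, slot_ns) * 8 * 1_000_000_000 // cycle_ns


print(payload_bytes())        # 1250
print(slot_bandwidth_bps())   # 156250000
```

One wave slot per cycle therefore carries 156.25 Mbps with these rounded rates, which is close to the nominal OC-3 rate of 155.52 Mbps and matches the stated OC-3 switching granularity.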
Ingress Processing
We assume five types of traffic are supported in the proposed optical core router: 
Figure 1. Optical router architecture
Traffic enters the optical router through a port in the ingress edge. An ingress port supports either TDM traffic (TDM port) or IP traffic (Packet over SONET or POS port), but not both. Each ingress port maintains a set of input queues, one for each ordered pair (egress port, QoS class). There are four IP QoS classes, one each for MPLS, DFS1, DFS2, and BE traffic. Under these assumptions, there are in total 64 (16 egress ports * number of QoS classes per port, e.g. 4) input queues for each POS port. For a TDM port, only 4 input queues are maintained, which enables the TDM ingress port to transport to any of the 4 egress TDM ports. Each incoming packet is inserted into one of the input queues for that ingress port, based on its egress edge/port address and its QoS index. An ingress POS port has three key elements: input queues, traffic manager and port scheduler. For MPLS and IP traffic, the Packet Classifier identifies the destination edge and port, the QoS queue, and QoS parameters based on the built-in MPLS/OSPF routing tables. The MPLS and IP packets are then inserted into the appropriate input queues based on the routing information and QoS index, where they wait to be scheduled through the core. The traffic manager periodically monitors all the input queues and collects the necessary statistics for the scheduler.
Optical switch fabric
The optical core is essentially a fast switching fabric using a 4-by-4 crossbar Time Division Multiplexing (TDM) switch with a fixed duration wave slot at 1 µs. It creates a virtual fully connected mesh between ports by periodically reconfiguring the core to allow exchange of data from one ingress port/edge to another egress port/edge. Each edge sends a payload of no more than 1250 bytes on every wave slot. The packet transmission from all ingress ports/edges is synchronized with the switching cycle of the space switch fabric in the core so that the data is switched to the appropriate egress ports without any contention in the core. Each egress edge has a copy of the current schedule, and uses it to route all received traffic to the appropriate egress port within that edge.
SCHEDULER DESIGN
The proposed scheduler provides a schedule that will serve the highest priority traffic available and guarantee that there is no starvation or lock-out during switching. Since traffic patterns change over time, the scheduler must also adapt to these changes. A new schedule is created either when new TDM/MPLS connections are accepted or when the current schedule pattern no longer performs satisfactorily. When the core scheduler determines that a new schedule is needed, it solicits the traffic demands from each ingress port. Using these traffic demands, the scheduler computes and deploys a new schedule based on the value and urgency of each port-to-port connection. The existing schedule pattern is repeated as long as the core scheduler determines that the performance of the schedule remains sufficient. The proposed scheduler design is based on a two-phase algorithm.
• Step 1: Wave Slot Definition (WSD) determines the 'best' set of port-to-port connections to make during a fixed scheduling cycle.
• Step 2: Wave Slot Assignment (WSA) determines the ordering and timing of wave slots for the schedule so that no blocking occurs in the core.
WSD Optimization Model
The problem of determining a dynamically changing schedule can be formulated as a Linear Integer Programming problem. At ingress POS port i, all the input queues destined to the same egress port j (in total there should be 4 such queues per POS card) can be virtually consolidated into one queue, indexed as virtual queue (i, j). The virtual queue is computationally re-segmented into units of 1250 bytes, with each 1250-byte unit of traffic defined as a payload. Assume the given QoS value for type m traffic is $q_m$, m = 1, …, 4. Then the total QoS value for a type m payload is $1250 \cdot q_m$. Let $v_{ijk}$ represent the QoS value of sending the k-th payload of the virtual queue (i, j). The optimal schedule with the blocking restrictions can be represented mathematically as a Linear Integer Programming (LIP) problem.
Parameters:
I: the set of ingress ports;
J: the set of egress ports (|I| = |J| in our model);
U: the number of wave slots allocated to each port;
$K_{ij}$: the set of payloads in the virtual queue (i, j);
$v_{ijk}$: QoS value for the k-th payload of the virtual queue (i, j).
Decision Variables:
$x_{ijk} = 0$ if no connection is made for the k-th payload of the virtual queue (i, j) during the scheduling cycle; $x_{ijk} = 1$ otherwise.
The ILP model:
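A formulation consistent with the parameters and decision variables above, and with the three properties of the solution listed next, can be sketched as follows. It is a reconstruction under those assumptions and not necessarily the authors' exact statement of the model.

```latex
\begin{align}
\max \quad & \sum_{i \in I} \sum_{j \in J} \sum_{k \in K_{ij}} v_{ijk}\, x_{ijk} \\
\text{s.t.} \quad
  & \sum_{j \in J} \sum_{k \in K_{ij}} x_{ijk} \le U, && \forall i \in I \\
  & \sum_{i \in I} \sum_{k \in K_{ij}} x_{ijk} \le U, && \forall j \in J \\
  & x_{ijk} \in \{0, 1\}, && \forall i \in I,\ j \in J,\ k \in K_{ij}
\end{align}
```

The objective maximizes the total QoS value of the scheduled payloads, and the two capacity constraints limit each ingress port and each egress port to at most U wave slots per scheduling cycle.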
For the router configuration given in Section 2, we have |I| = |J| = U = 16. The solution to this LIP model will:
1. Create port-to-port connectivity in an optimal way such that the maximal total QoS value is achieved.
2. Assign at most 16 wave slots to each ingress port and hence 64 wave slots to each ingress edge for a scheduling cycle.
3. Assign at most 16 wave slots to each egress port and hence 64 wave slots to each egress edge for a scheduling cycle.
The ILP problem above is a unimodular model that has integrality properties [16]. In other words, there exists an integer optimal solution to the linear model that relaxes the integer requirements to $0 \le x_{ijk} \le 1$.
Like any other pure Linear Programming model, the WSD optimization model is not NP-Complete. The WSD optimization problem is equivalent to a Weighted Bipartite Matching Problem with $2U|I|$ nodes and $(U^2|I|)^2$ arcs. Although the optimal solution can be obtained in polynomial time, the computational complexity of the WSD optimization problem is $O(U^3|I|^3)$ if Dijkstra's shortest path algorithm is applied [16] to solve the equivalent Weighted Bipartite Matching Problem. Solving for the optimal schedule is too time-consuming for real-time use, especially as the router size grows. To address the need for a faster scheduler, we define a heuristic approximation to the ILP model in the following.
WSD frozen heuristic algorithm
This algorithm constructs a schedule that transmits the highest-valued traffic possible on a port-by-port basis. It differs from the more complex and time-consuming optimal algorithm, which chooses the overall highest-valued schedule and solves a global maximization problem. Though not yielding an optimal solution, the heuristic algorithm is fast and provides a 'good' schedule most of the time.
The heuristic algorithm uses the same traffic demands to determine a high-priority schedule. At the first iteration, the algorithm considers the 16 highest priority payloads from each ingress port. TDM flows or CAC based MPLS flows are treated with the highest priority. With U=16, for example, 16 payloads from each ingress port result in an average of 16 payloads per egress port. If the first iteration results in exactly 16 payloads for each egress port, the schedule is complete. More likely, however, some egress ports will be assigned more than 16 while others will be assigned fewer than 16. For each egress port with more than 16 payloads, retain only the highest-valued 16 and delete the remaining payloads. Now all ingress and egress ports have 16 or fewer payloads assigned. Every ingress/egress port with exactly 16 payloads is "frozen": no payload will be added to or removed from a frozen port. Furthermore, the payloads associated with frozen ingress and egress ports are also frozen. Thus, a frozen payload is assigned to one of the following ingress and egress port combinations:
1. a frozen ingress port + an unfrozen egress port; 2. an unfrozen ingress port + a frozen egress port; 3. a frozen ingress port + a frozen egress port.
This represents the end of the first iteration. At the end of each iteration, check if all ports have 16 payloads assigned. If so, the schedule is complete. If not, perform the next iteration. In the new iteration, we add the highest-valued payloads among the nonfrozen payloads to the unfrozen ingress ports to bring the total, including the frozen payloads, up to 16. Then repeat the actions for the egress ports introduced before. The algorithm must freeze at least one egress port and/or one ingress port after each iteration. Therefore, it is guaranteed to end in a finite number of steps. The proof is as follows:
Proof: We assume that the input queues always have traffic to send, and thus there will be no empty wave slots in the schedule. If this is not true, we just insert empty payloads to fill up the scheduling frame. Assume there are n unfrozen ingress ports and m unfrozen egress ports at an iteration. Initially, n=m=16. The ingress side has the same number of frozen payloads as the egress side at any iteration. This can be easily understood because any payload is indexed by an ingress port and egress port pair. At the beginning of each iteration, we add new payloads with the highest QoS values to bring the total payloads, including the frozen payloads, up to 16 for each unfrozen ingress port. All the new payloads can only go to nonfrozen egress ports. Every ingress port then holds 16 payloads and every frozen egress port holds exactly 16, so the m unfrozen egress ports together hold 16m payloads, an average of 16 per unfrozen egress port. Either every unfrozen egress port holds exactly 16 payloads, or at least one holds more than 16 and is trimmed to exactly 16. So at least one unfrozen egress port can be frozen at an iteration.
The flowchart for the heuristic algorithm is shown in Figure 2 . 
[Figure 2. Flowchart of the heuristic algorithm. Recoverable steps: Start; receive demand reports from each ingress port; compile the list of $v_{ijk}$'s for each port.]
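The iterations of the frozen heuristic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `frozen_wsd` and the input layout `values[i][j]` (a list of QoS values for virtual queue (i, j)) are hypothetical, payloads deleted at an egress port are assumed not to be reconsidered, and when an egress port is trimmed the payloads already frozen by their ingress port are assumed to be retained first.

```python
def frozen_wsd(values, U):
    """Return the set of accepted payloads (i, j, k).

    values[i][j] is the list of QoS values v_ijk of virtual queue (i, j).
    """
    n = len(values)
    # Candidate payloads of each ingress port, highest QoS value first.
    pending = [
        sorted(
            ((v, j, k) for j in range(n) for k, v in enumerate(values[i][j])),
            key=lambda t: (-t[0], t[1], t[2]),
        )
        for i in range(n)
    ]
    chosen = set()
    frozen_in, frozen_out = set(), set()

    def load(side, port):
        # side 0 counts payloads per ingress port, side 1 per egress port.
        return sum(1 for p in chosen if p[side] == port)

    while True:
        added = False
        # Ingress step: top up every unfrozen ingress port to U payloads,
        # offering only payloads destined to unfrozen egress ports.
        for i in range(n):
            if i in frozen_in:
                continue
            room = U - load(0, i)
            rest = []
            for v, j, k in pending[i]:
                if j in frozen_out:
                    continue  # a frozen egress port never accepts more
                if room > 0:
                    chosen.add((i, j, k))
                    room -= 1
                    added = True
                else:
                    rest.append((v, j, k))
            pending[i] = rest
        if not added:
            return chosen  # all ports full, or no usable demand is left
        # Egress step: trim oversubscribed unfrozen egress ports to U,
        # keeping frozen payloads first, then the highest-valued ones.
        for j in range(n):
            if j in frozen_out:
                continue
            mine = sorted(
                (p for p in chosen if p[1] == j),
                key=lambda p: (p[0] not in frozen_in,
                               -values[p[0]][j][p[2]], p[0], p[2]),
            )
            for p in mine[U:]:
                chosen.discard(p)
        # Freeze every port that now holds exactly U payloads.
        frozen_in |= {i for i in range(n) if load(0, i) == U}
        frozen_out |= {j for j in range(n) if load(1, j) == U}


# Hypothetical 3-port demand with U = 1 wave slot per port.
demand = [[[10], [], []], [[9], [8], []], [[], [2], []]]
print(sorted(frozen_wsd(demand, 1)))   # [(0, 0, 0), (2, 1, 0)]
```

In this small example, egress port 1 is frozen after the first iteration with a payload of value 2, so the payload of value 8 from ingress port 1 is never considered. The sketch terminates because each iteration either accepts at least one new candidate from a finite pool or stops.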
WSD non-frozen algorithm
The need to develop a non-frozen algorithm was recognized when it was observed that freezing the ports early would prevent moderate-valued connections from being considered. The problem was even more apparent in certain 'hot spot' conditions. To address the hot spot problem, the non-frozen algorithm [2] does not freeze the ports with U assigned payloads, so that their assignments can be improved at later iterations. At the end of each iteration, unless all ingress and egress ports have U connections, the algorithm proceeds to the next iteration. For each ingress port whose number of assigned connections is less than U, it offers the highest QoS value payloads not yet considered, to bring the total number of connections up to U. These newly offered connections may be targeted to egress ports that currently have U connections from the prior iteration. This operation could replace some of the earlier accepted connections with higher-valued payloads. The WSD phase of the scheduling algorithm is considered to be complete if all ports have U connections or if no candidate connections are available at any of the undersubscribed ingress ports. The algorithm then proceeds to the WSA algorithm. The non-frozen algorithm will return either the same schedule as the frozen algorithm, or a better one. This does not imply, however, that the non-frozen schedule will necessarily be the optimal solution. Though the non-frozen algorithm in general takes more time to compute than the frozen algorithm, its worst case computational complexity is still $O(U^2|I|)$.
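The non-frozen iteration described above can be sketched as follows. As before, this is an illustrative sketch rather than the authors' code: `nonfrozen_wsd` and the layout `values[i][j]` (list of QoS values of virtual queue (i, j)) are assumed names, and a payload that is displaced at an egress port is assumed not to be offered again.

```python
def nonfrozen_wsd(values, U):
    """Return the set of accepted payloads (i, j, k) without freezing ports."""
    n = len(values)
    # Candidate payloads of each ingress port, highest QoS value first.
    pending = [
        sorted(
            ((v, j, k) for j in range(n) for k, v in enumerate(values[i][j])),
            key=lambda t: (-t[0], t[1], t[2]),
        )
        for i in range(n)
    ]
    chosen = set()
    while True:
        added = False
        # Offer new candidates from every undersubscribed ingress port,
        # even if the target egress port already holds U connections.
        for i in range(n):
            room = U - sum(1 for p in chosen if p[0] == i)
            take, pending[i] = pending[i][:room], pending[i][room:]
            for _, j, k in take:
                chosen.add((i, j, k))
                added = True
        if not added:
            return chosen  # all ports full or no candidates remain
        # Each egress port keeps only its U highest-valued payloads, so a
        # new offer can replace an earlier, lower-valued connection.
        for j in range(n):
            mine = sorted(
                (p for p in chosen if p[1] == j),
                key=lambda p: (-values[p[0]][j][p[2]], p[0], p[2]),
            )
            for p in mine[U:]:
                chosen.discard(p)


# Hypothetical hot-spot demand with 3 ports and U = 1 wave slot per port.
demand = [[[10], [], []], [[9], [8], []], [[], [2], []]]
print(sorted(nonfrozen_wsd(demand, 1)))   # [(0, 0, 0), (1, 1, 0)]
```

In this example the payload of value 8 from ingress port 1 displaces the payload of value 2 that egress port 1 accepted in the first iteration, which is the kind of moderate-valued connection that early freezing would have excluded.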
WSA algorithm
The heuristic and optimal algorithms create the port-to-port connections in a scheduling cycle. They do not, however, spread the connections into wave slots so that the following edge-to-edge restrictions are satisfied:
• During one wave slot, no more than one ingress edge (port) can be connected to an egress edge (port).
• During one wave slot, an ingress edge (port) cannot be connected to more than one egress destination edge (port).
Thus the existing connections need to be further distributed into 64 wave slots that satisfy the above constraints. The WSA process will enforce these port and edge constraints. Notice that satisfying the edge-to-edge restrictions implies satisfying the port-to-port restrictions, but not vice versa. Thus the port-to-port connectivity is first consolidated into its edge-to-edge equivalent. The port-to-port connectivity is expressed as a 16-by-16 matrix C, whose element C(i,j) represents the number of connections from port i to port j during a given scheduler cycle. Each row (column) of C adds up to 16. By combining port-to-port connections into edge-to-edge connections, we form a 4-by-4 edge connectivity matrix A, whose element A(i,j) represents the number of connections from edge i to edge j during a given scheduler cycle. Each row or column of A adds up to 64.
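The consolidation of the port matrix C into the edge matrix A amounts to summing 4-by-4 blocks. A minimal sketch, assuming that ports are numbered so that port p belongs to edge p // 4 (this numbering and the function name `consolidate` are illustrative assumptions):

```python
def consolidate(C, ports_per_edge=4):
    """Sum port-to-port connection counts into edge-to-edge counts."""
    n_edges = len(C) // ports_per_edge
    A = [[0] * n_edges for _ in range(n_edges)]
    for i, row in enumerate(C):
        for j, count in enumerate(row):
            A[i // ports_per_edge][j // ports_per_edge] += count
    return A


# Uniform example: every port pair gets one connection per cycle.
C = [[1] * 16 for _ in range(16)]
A = consolidate(C)
print(A[0])   # [16, 16, 16, 16]
```

Because each row and column of C adds up to 16 and each edge has 4 ports, each row and column of A adds up to 64.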
The WSA process begins by splitting the matrix A into two matrices A1 and A2, each having the same dimension as A and with rows and columns adding up to 32. The same WSA process is then applied to the resulting 2 matrices, then to 4, 8, 16, and 32 matrices. There are 6 separate and independent WSA iterations to produce the final 64 permutation connectivity matrices. For any WSA iteration n (n = 1, 2, …, 6), the total number of resulting matrices is $2^n$ and the rows (columns) of each of the $2^n$ matrices add up to $64/2^n$. Thus the summation for each row (column) is 1 for each of the 64 matrices at the final step, which imposes the restriction that one ingress (egress) edge can only be connected to one egress (ingress) edge within that wave slot. The 64 permutation matrices represent 64 wave slots and indicate in the time domain how the connections are established. The flow chart for the detailed WSA algorithm at iteration n is shown in Figure 3, which elaborates on how edge matrix A' is successfully divided into two matrices A1' and A2'.
[Figure 3. Flowchart of the WSA algorithm at iteration n. Input: matrix A', where A'(i,j) represents the number of connections between ingress edge i and egress edge j. Output: two same-size matrices A1' and A2'. Recoverable steps: if A'(i, j) = 2a+1, then A1'(i, j) = A2'(i, j) = a, i.e., there will be (2a+1)/2 (integer division) connections in each half of the scheduling cycle; the remaining connections form a matrix B that is traversed along Eulerian circuits, with the tests "Is there any nonzero B(i,j), i=i', j=j'?" and "i' = i0 and j' = j0?".]
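One way to realize the halving step is sketched below. It assumes, in line with the integer division and the Eulerian circuits named in Figure 3, that each entry first contributes half of its connections to each output matrix and that the leftover 0/1 matrix B is split by walking closed trails and assigning their arcs alternately to the two halves. The function names `split_half` and `wave_slot_assignment` are hypothetical, and the sketch requires the common row/column sum to be a power of two.

```python
def split_half(A):
    """Split A (even row/column sums) into A1 + A2 with halved sums."""
    n = len(A)
    A1 = [[A[i][j] // 2 for j in range(n)] for i in range(n)]
    A2 = [row[:] for row in A1]
    # B holds the odd leftovers; its row and column sums are all even.
    B = [[A[i][j] % 2 for j in range(n)] for i in range(n)]
    for start in range(n):
        while any(B[start]):
            i = start
            while True:
                # Leave ingress edge i on an unused arc: goes to A1.
                j = next((c for c in range(n) if B[i][c]), None)
                if j is None:
                    break  # the closed trail is back at its start
                B[i][j] = 0
                A1[i][j] += 1
                # Leave egress edge j on another unused arc: goes to A2.
                i = next(r for r in range(n) if B[r][j])
                B[i][j] = 0
                A2[i][j] += 1
    return A1, A2


def wave_slot_assignment(A):
    """Recursively halve A until every matrix is a permutation matrix."""
    if sum(A[0]) == 1:
        return [A]
    A1, A2 = split_half(A)
    return wave_slot_assignment(A1) + wave_slot_assignment(A2)


# Hypothetical edge matrix whose rows and columns each add up to 64.
A = [[40, 10, 8, 6], [10, 30, 14, 10], [8, 14, 22, 20], [6, 10, 20, 28]]
slots = wave_slot_assignment(A)
print(len(slots))   # 64
```

Every trail alternates between the two halves at each edge it visits, so each ingress and egress edge gives the same number of leftover connections to A1 and A2. After 6 levels of recursion, each of the 64 matrices has exactly one connection per row and per column.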
SIMULATION MODELLING AND PERFORMANCE EVALUATIONS

Traffic Modeling
The described optical core router is able to support different types of traffic. A TDM (or POS) ingress port only communicates to egress TDM (or POS) ports. Composition of traffic over each ingress POS port is distributed among the types of MPLS, DFS1, DFS2, and BE, and the total traffic added together is 100%.
The various traffic types are treated as follows:
1. TDM flow requests on each TDM port arrive as a Poisson process with a mean interarrival time of λ TDM ms. Holding time is exponentially distributed with a mean of µ TDM ms. A TDM flow that cannot fit without blocking is rejected. Each TDM flow is generated at a rate of OC-12. For the results presented in this section, λ TDM = 20 ms and µ TDM = 100 ms.
2. MPLS connection requests arrive as a Poisson process with a mean interarrival time of λ MPLS ms, and the session holding time is exponentially distributed with a mean of µ MPLS ms. For the results reported in this section, λ MPLS = 10 ms and µ MPLS = 100 ms. The MPLS connections generate IP packets at the specified rate in the appropriate self-similar patterns. Each individual MPLS Label Switched Path (LSP) is given an average rate of 250 Mbps and a peak rate of 430 Mbps, and is characterized by a Hurst parameter of 0.7, to match the router bandwidth configuration.
3. For Diffserv and Best Effort traffic, only packet level modelling is needed. DFS1, DFS2, and BE packets are generated in self-similar patterns with a Hurst parameter of 0.7.
The POS packet size distribution is given in
Simulations and analysis have been conducted to evaluate the performance of the proposed scheduling algorithms in the core optical router architecture introduced in Section 2. There are 4 ports per edge and 4 edges in total. Each edge has one TDM card and 3 POS cards. Thus there are 4 TDM cards and 12 POS cards in the router under study. The simulations have been performed in OPNET under various traffic scenarios. In the results reported in this section, the composition of traffic over each ingress POS port is: 40% of offered traffic for MPLS; 20% for DFS1; 20% for DFS2; 20% for BE. The destination of a TDM flow is uniformly selected among the 4 TDM egress ports. The traffic from a particular ingress POS port is distributed as follows: 75% is distributed uniformly to 2 H-POS egress ports, 20% is distributed uniformly to the other 4 M-POS egress ports and 5% is distributed uniformly to the remaining 6 L-POS egress ports. This distribution comes from the assumption that traffic tends to go to a few popular destination ports. Small background/best-effort traffic (5%) is distributed uniformly (e.g., emails) to all 12 POS egress cards. In order to maintain similar traffic loads on the 12 POS egress ports, the connection distribution is rotated over ingress ports, as shown in Figure 4. The rotation eliminates the effects of congestion caused by the traffic distribution itself, so the simulations can better evaluate scheduling performance. All patterns are rotated, i.e., the ingress port i distribution becomes the ingress port i+1 distribution, once every 30 ms. The variations in distribution shown in Figure 4 are introduced for the purpose of studying the adaptability of the scheduling algorithm to changes in the traffic patterns.
At the steady state, the offered load to the system and the throughput are both 23.5 Gbps. Of the 23.5 Gbps, 8.5 Gbps (85% of the 10 Gbps total TDM capacity) is TDM traffic and 15 Gbps (50% of the 30 Gbps total POS card capacity) is POS traffic. Theoretically TDM could achieve 100% throughput, or a 10 Gbps rate. The actual throughput is only 8.5 Gbps because each new request at an ingress TDM port has a random TDM destination. As more of the 4×4=16 available TDM connections are allocated, new requests are less likely to be accepted.
[Figure 4. Labels recovered: Ingress; Port 1, Port 2, Port 3, ..., Port 12]
The nonfrozen heuristic schedulers are evaluated in three slightly different versions: classical scheduler (CLS), scheduler with limited average rate allocation (LARS) and scheduler with limited peak rate allocation (LPRS).
1. CLS does not reserve any bandwidth for any POS traffic. CLS is slow in reacting to new traffic since the entire schedule computation process takes about 2 ms or 32 scheduler cycles.
2. LARS reserves average rate equivalent wave slots for each MPLS flow before acceptance. The average MPLS flow rate is 250 Mbps, which requires two wave slots. Notice that two wave slots actually correspond to a bandwidth of 312.5 Mbps. Any other traffic on the same port-to-port combination may temporarily use the two wave slots in periods of low MPLS traffic. Extra MPLS traffic beyond 312.5 Mbps will be queued and compete for the bandwidth as best effort traffic.
3. LPRS is similar to LARS, with the difference that the guaranteed number of wave slots covers the peak rate of MPLS flows. We assume the peak rate is 430 Mbps, which requires 3 wave slots. Unused reserved bandwidth can be temporarily utilized by other POS traffic.
Figure 5 displays the delay distributions for MPLS traffic under the three versions of the scheduler. Among these schedulers, LPRS achieves the lowest average delay at 40.5 µs and the smallest 90-percentile delay at 73 µs, since peak rate bandwidth is reserved ahead of time. LARS, on the other hand, results in the highest average delay at 78 µs and the highest 90-percentile delay at 180 µs. LARS guarantees MPLS bandwidth based on the average rate, and the remaining traffic is treated as best effort. Thus the delay is pushed higher by the portion of traffic treated as best effort. CLS results in an average delay of 43 µs and a 90-percentile delay of 75 µs. Although CLS always considers MPLS traffic the highest priority among POS traffic, it is slow in reacting to traffic changes. Thus CLS experiences a higher average delay than LPRS does.
In Figure 6, DFS1 traffic experiences an average delay of approximately 1200 µs and a 90-percentile delay of 2950 µs under LARS. LARS limits the bandwidth for MPLS traffic and consequently leaves more room for lower priority traffic such as DFS1. Although LPRS reserves peak rate equivalent bandwidth for MPLS, unused reserved bandwidth can be temporarily utilized by other POS traffic with the same port-to-port combination. LPRS achieves results very similar to those of LARS, with a slightly higher average delay and 90-percentile delay. However, CLS, which always favours TDM and MPLS traffic, degrades DFS1 traffic performance greatly, with an average delay of 4050 µs and a 90-percentile delay of 10,500 µs. DFS2 and BE traffic show performance trends very similar to those of DFS1 traffic.
Table 2. POS delay performance under 3 schedulers
Figure 7. Average port connectivity comparison
Figure 7 shows the average port connectivity for POS ports, i.e., the number of egress ports reached during the 16 wave slots of a port schedule, averaged over all 12 POS ports. It is an indirect indication of how well connectionless traffic, i.e., DFS1, DFS2, and BE, is served by the scheduler. Higher connectivity means more queues can be served during a scheduling cycle, resulting in smaller average delay and jitter. LARS achieves slightly higher connectivity than LPRS, and both clearly outperform CLS, with values around 7.5 out of 12 as opposed to 4.5 out of 12. The low connectivity of CLS leads to oscillating behavior and longer delay and jitter, as shown by the results presented earlier.
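The wave slot counts quoted for LARS and LPRS, and the 2 ms schedule computation time of CLS, can be checked with a short calculation. The sketch assumes the figures given earlier for the router (a 1250-byte payload per wave slot and a 64 µs scheduling cycle); the names below are illustrative.

```python
CYCLE_US = 64                        # 64 wave slots of 1 microsecond each
PAYLOAD_BITS = 1250 * 8              # one wave slot carries 1250 bytes
# Bandwidth of one wave slot per cycle, in bits per second.
SLOT_BW_BPS = PAYLOAD_BITS * 1_000_000 // CYCLE_US


def slots_needed(rate_bps):
    """Smallest number of wave slots per cycle that covers rate_bps."""
    return -(-rate_bps // SLOT_BW_BPS)   # ceiling division on integers


avg_slots = slots_needed(250_000_000)    # LARS: average MPLS rate
peak_slots = slots_needed(430_000_000)   # LPRS: peak MPLS rate
print(SLOT_BW_BPS)                       # 156250000
print(avg_slots, peak_slots)             # 2 3
print(avg_slots * SLOT_BW_BPS)           # 312500000
print(32 * CYCLE_US)                     # 2048 (microseconds, about 2 ms)
```

Two wave slots cover the 250 Mbps average rate with 312.5 Mbps of reserved bandwidth, and the 430 Mbps peak rate needs a third slot.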
CONCLUSIONS
This paper addresses a new QoS capable scheduling problem in high capacity core optical switching systems for heterogeneous traffic. The scheduling problem is formulated as a Linear Integer Programming model. Two heuristic algorithms, frozen and nonfrozen, are developed to solve the problem in much less time than the optimal algorithm. Due to the large capacity of the system and the long computation time of the scheduler, bandwidth efficiency and smooth transitions between consecutive schedules are critical to traffic performance measures such as delay and jitter. The heuristic scheduling algorithms are evaluated in three different versions: classical scheduler (CLS), scheduler with limited average rate allocation (LARS) and scheduler with limited peak rate allocation (LPRS). For the investigated core optical system, reserving bandwidth for high priority traffic ahead of traffic arrival is very beneficial to QoS support for all types of traffic in the system.
