Abstract. In this paper we propose an enhanced parallel iterative scheduler for IBWR synchronous slotted OPS switches in SCWP mode. It obtains a maximal matching of packet demands without resource conflicts. The analytical and numerical results compare favorably with previous work.
Introduction
In the Optical Packet Switching (OPS) paradigm of Wavelength Division Multiplexing (WDM), packet payloads stay in the optical domain. OPS offers high bandwidth efficiency due to statistical multiplexing, but it is well known that packet granularity and optical buffering impose extreme constraints on photonic switching, incurring unacceptable hardware costs with state-of-the-art technology.
In this paper, we focus on synchronous slotted OPS in Scattered Wavelength Path (SCWP) operational mode [1]. This mode specifies a fixed packet size (slot length) and packet alignment with slot boundaries at the input ports (and thus requires optical synchronization stages, which increase cost). However, the performance improvement due to the better contention behavior has encouraged the study of this alternative. Packet length in OPS networks is a current topic of discussion. The European DAVID project [2] selected synchronous slotted OPS with slot lengths of ~1 μs for the WDM backbone network. In WDM OPS networks, permanent end-to-end connections are mapped to link wavelengths. In the SCWP operational mode, an optical packet path (OPP) uniquely determines a fixed sequence of transmission fibers, but the transmission wavelength may change at each hop. This gives switch schedulers extra freedom in packet wavelength selection, boosting the statistical multiplexing effect. Therefore, SCWP achieves high throughput with low packet delay in OPS networks, while also lowering optical buffering requirements [3] [4].
In SCWP it is possible to simultaneously transmit several packets of the same OPP through a fiber, on different wavelengths. In this paper we adopt the round-robin packet ordering criterion in [5], which avoids the performance degradation due to unbalanced wavelength assignments. The wavelengths are assigned cyclically. Each node uses two sets of round-robin pointers to track packet sequence: one round-robin pointer per input fiber, tracking the wavelength of the next packet in the input traffic sequence, and one round-robin pointer per output fiber, determining the output wavelength of the next packet to be transmitted. Figure 1 shows an example.
This paper focuses on the Input-Buffered Wavelength-Routed (IBWR) switch architecture because of its scalability. Figure 2 shows the WDM adaptation of this architecture [6]. The switch has N input/output fibers and n wavelengths per fiber. It has a buffering section followed by a non-blocking switching section. The buffering section consists of n·N Tunable Wavelength Converters (TWCs) with a tuning range λ_0 ... λ_{K-1}, K = max(n·N, M), and two K×K Arrayed Waveguide Gratings (AWGs), which are interconnected by M delay lines of 0 to M-1 slots. Due to AWG symmetry, a packet arriving at port i leaves the buffering section at port i, regardless of the selected delay. The wavelength conversion determines the delay line for the packet. The switching section is composed of n·N TWCs followed by an nN×nN AWG. The switching AWG routes each packet to the proper output fiber/wavelength.
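The cyclic wavelength assignment adopted in this section can be illustrated with a minimal sketch. The class name `RoundRobinPointer` and its method `take` are hypothetical names chosen for illustration; only the cyclic (modulo n) advance of one pointer per fiber is taken from the text.

```python
class RoundRobinPointer:
    """Cyclic wavelength pointer for one fiber with n wavelengths (illustrative)."""

    def __init__(self, n):
        self.n = n
        self.position = 0  # wavelength of the next packet in the sequence

    def take(self):
        """Return the wavelength for the next packet and advance cyclically."""
        wavelength = self.position
        self.position = (self.position + 1) % self.n
        return wavelength


# One output fiber with n = 4 wavelengths.
pointer = RoundRobinPointer(4)
slot_a = [pointer.take() for _ in range(3)]  # three packets leave in one slot
slot_b = [pointer.take() for _ in range(2)]  # two packets leave in the next slot
print(slot_a, slot_b)  # [0, 1, 2] [3, 0]
```

Because the pointer wraps around, consecutive packets of a sequence are spread evenly over the wavelengths regardless of how many packets leave in each slot.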
The IBWR switch scheduler assigns packet delays and packet output wavelengths. These two tasks are independent.
-Packet delay assignment. Current optical switches employ Fiber Delay Lines (FDLs) due to the lack of optical RAMs. In IBWR switches, delays are assigned at packet arrivals. The scheduler discards a packet if it cannot assign a delay fulfilling two contention conditions: (i) output fiber contention: at most n packets can reach any output fiber in a given slot, (ii) input port contention: the packets that arrive at the same i-th input port (same fiber and wavelength) in different time slots cannot leave the switch in the same time slot. Otherwise they would collide at the i-th TWC of the switching section, which can only handle one packet at a time.
-Output wavelength assignment. The scheduler assigns output wavelengths to the packets when they leave the switch, according to the round-robin criterion.
Remark: Other OPS architectures, which have higher hardware costs and are less scalable than IBWR, emulate output buffering (OB) [6] [7] (the only factor limiting packet delay assignment is output fiber contention).
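The two contention conditions of the delay assignment task can be checked with a few lines of code. The following is a minimal sketch for a single packet, assuming hypothetical occupancy vectors: `x_i[t]` is 1 if input port i already has a packet leaving after t slots, and `y_j[t]` counts the packets already scheduled to reach output fiber j after t slots. Picking the shortest feasible delay greedily is a simplification for illustration, not the matching computed by the scheduler.

```python
def feasible_delays(x_i, y_j, n):
    """Delays t that satisfy both contention conditions for one packet."""
    return [
        t
        for t in range(len(y_j))
        if x_i[t] == 0      # (ii) input port contention: port i is free at delay t
        and y_j[t] < n      # (i) output fiber contention: fewer than n packets at delay t
    ]


def assign_shortest_delay(x_i, y_j, n):
    """Assign the shortest feasible delay, or return None if the packet is discarded."""
    delays = feasible_delays(x_i, y_j, n)
    if not delays:
        return None
    t = delays[0]
    x_i[t] = 1      # input port i is now busy at delay t
    y_j[t] += 1     # one more packet reaches output fiber j at delay t
    return t


n = 2                   # wavelengths per fiber
x_i = [1, 0, 0]         # input port i is busy at delay 0
y_j = [0, 2, 1]         # output fiber j is full at delay 1
first = assign_shortest_delay(x_i, y_j, n)
second = assign_shortest_delay(x_i, y_j, n)
print(first, second)  # 2 None
```

The second packet is discarded because delay 0 violates input port contention, delay 1 violates output fiber contention, and delay 2 was just taken by the first packet on the same input port.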
Previous work has characterized IBWR delay assignment as a matching problem in bipartite graphs [4]. At every slot, the scheduler seeks a feasible assignment maximizing the number of packet delay assignments (i.e. minimizing packet losses). If there are several alternatives, it minimizes the average packet delay. The sequential IBWR scheduler for the SCWP mode in [8] is infeasible in practice (for ~1 µs slots). Conversely, our proposal is parallel, like Virtual Output Queuing (VOQ) schedulers.
The rest of this paper is organized as follows: in Section 2 we describe the Parallel Desynchronized Block Matching Scheduler (PDBM), which is the basis for this proposal. In Section 3 we present the Insistent PDBM (I-PDBM) algorithm. In Section 4 we discuss simulation results. Section 5 concludes the paper.
At each input fiber controller, a round-robin grant pointer WG_f, f = 0, …, N-1, indicates the first wavelength to scan in input fiber f.
PDBM Algorithm
At system initialization, x_i(t), y_{jt}, WG_f and CW_{jt} are set to 0. All FG_{jt} grant pointers associated with the same output fiber are initialized by maximizing the minimum distance between pointer positions:
Algorithm iterations consist of three steps (request, grant, and accept):
Step 1. Request: Each input module i with a packet for output fiber j sends a request signal to every output module of fiber j whose associated delay satisfies the input contention constraint, that is, to the output modules (j,t) such that x_i(t) = 0.
Step 2. Grant: Each output module (j,t) scans the request signals from the input modules, starting with the input module indicated by grant pointers FG_{jt} and WG_f. The scan of the remaining input modules proceeds clockwise or counter-clockwise, according to the alternating bit CW_{jt}. The first n − y_{jt} scanned request signals are acknowledged, and a grant signal is sent to each associated input module.
Step 3. Accept: Each input module i receives at most M grants, one from each of the M delays associated with the destination output fiber. The shortest granted delay t is accepted and assigned to the packet present at input i. If the input does not receive any grant during algorithm execution, the packet is discarded. Otherwise, an accept signal is sent to the accepted output module and the x_i(t) and y_{jt} state vectors are updated to reflect packet allocation. Once a packet is granted, its input port does not participate in subsequent algorithm iterations.
At each time slot, after the last iteration, x_i(t) and y_{jt} are updated and shifted as described above to consider the allocation and the propagation of the packets in the delay lines. The CW_{jt} bits are negated to alternate request scanning directions each time slot. The FG_{jt} grant pointers are incremented by one (modulo N) every two time slots, and the WG_f round-robin grant pointers are incremented by the number of packets received at fiber f in the current slot (modulo n).
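The end-of-slot bookkeeping can be sketched as follows. This is only an illustration under stated assumptions: the function name `end_of_slot` is hypothetical, the shift is assumed to move the entry for delay t+1 to delay t (packets advance one slot in the delay lines), and the FG pointers are assumed to advance after every odd-numbered slot, since the text only states "every two time slots".

```python
def end_of_slot(x, y, FG, CW, WG, received, slot, N, n):
    """Update scheduler state after the last iteration of a slot (illustrative).

    x[i][t], y[j][t]: occupancy vectors indexed by delay t
    FG[j][t], CW[j][t]: grant pointer and direction bit of output module (j,t)
    WG[f]: round-robin grant pointer of input fiber f
    received[f]: number of packets received at input fiber f in this slot
    """
    # Assumed shift: what was at delay t+1 is now at delay t; the longest
    # delay becomes free again.
    for row in x + y:
        row.pop(0)
        row.append(0)

    for j in range(len(CW)):
        for t in range(len(CW[j])):
            CW[j][t] ^= 1                       # alternate scanning direction
            if slot % 2 == 1:                   # assumed parity: every two slots
                FG[j][t] = (FG[j][t] + 1) % N

    for f in range(N):
        WG[f] = (WG[f] + received[f]) % n


N, n = 2, 4
x = [[0, 1, 0]]
y = [[2, 1, 0], [0, 0, 1]]
FG = [[0, 1, 0], [1, 0, 1]]
CW = [[0, 0, 0], [0, 0, 0]]
WG = [3, 0]

end_of_slot(x, y, FG, CW, WG, received=[2, 1], slot=0, N=N, n=n)
state_after_slot_0 = (x[0][:], [row[:] for row in y], [row[:] for row in FG], WG[:])
end_of_slot(x, y, FG, CW, WG, received=[0, 0], slot=1, N=N, n=n)
print(state_after_slot_0)
print(FG, CW, WG)
```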
Algorithm justification
PDBM converges in at most min(M,nN) iterations [9]. Since this bound never exceeds the number of delay lines M, the worst-case convergence speed does not grow with switch size.
The initialization of the pointers and their evolution are inspired by the desynchronizing scheme of the RDSRR algorithm [10], in order to minimize the grant block overlapping effect: if an output module (j,t) receives more requests than it has available wavelengths, it only acknowledges the request signals from the modules whose indexes are closest to grant pointer FG_{jt}. If the grant pointers take the same value, "close" input modules receive several grants, and "far" input modules receive no grants at all. In PDBM, all grant pointers of a given output fiber get initial values that maximize the minimum distance (modulo N) between the input nodes they point to. The scheduler keeps the desynchronization by increasing (modulo N) all pointers every two time slots. The scanning direction is inverted at each time slot to enforce fairness in case of nonuniform packet arrivals.
Although PDBM does not guarantee packet sequence, input modules are scanned following the round-robin criterion to mitigate mis-sequencing. 
Insistent PDBM (I-PDBM) scheduler
The basic PDBM scheduler may assign longer delay lines than strictly necessary, ignoring shorter ones even in the absence of contention. Specifically, it converges to a maximal size matching (no more connections can be established without replacing existing connections) with suboptimal aggregate delay, i.e., some connections could be individually removed and reassigned to an output port with a shorter delay. We call this effect "PDBM impatience". We illustrate it with an example:
Let us assume a switch with two input (output) fibers, two wavelengths per fiber and three delay lines (N=2, n=2, M=3). Two packets arrive at input fiber 0 requesting output fiber 1. The state of the switch is:
• FG_{1t} and CW_{1t} are irrelevant because there are no packets in input fiber 1.
• WG_0 = 0. The round-robin grant pointer of input fiber 0 points to input module 0, so the first input wavelength of fiber 0 to be scanned is 0 in all iterations.
• x_0(t) = x_1(t) = 0 ∀t. There is no input contention.
• y_{1,0} = 1, y_{1,1} = 1, y_{1,2} = 0. There is one free wavelength for t ∈ {0,1} and two free wavelengths for t = 2.
Figure 4 summarizes the state of the node. From that state, the algorithm iteration evolves as follows (Figure 5):
Request: input modules 0 and 1 send request signals to all output modules (1,t), since there is no input contention at them.
Grant:
• t=0. Output module (1,0) scans the request signal of input module 0 (WG_0 = 0) and acknowledges it. The request signal of input module 1 is not acknowledged because there is a single available wavelength, n − y_{jt} = 1 (2 − y_{1,0} = 1).
• t=1. Output module (1,1) scans the request signal of input module 0 (WG_0 = 0) and acknowledges it. The request signal of input module 1 is not acknowledged because there is a single available wavelength, n − y_{jt} = 1 (2 − y_{1,1} = 1).
• t=2. Output module (1,2) scans the request signals of input modules 0 and 1 and acknowledges both, because there are two available wavelengths, n − y_{jt} = 2 (2 − y_{1,2} = 2).
Accept: Input module 0 receives three grants (t = 0, 1, 2) and accepts the best one (delay 0). Input module 1 receives a single grant (t = 2) and accepts it.
The packet at input 1 is therefore assigned to delay line 2. However, once input module 0 has accepted delay 0, delay line 1 still has a free wavelength, and input module 1 has no input contention for it. Thus, the assignment is suboptimal. To solve the impatience problem we propose a new algorithm: Insistent PDBM or I-PDBM.
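The example above can be reproduced with a minimal sketch of one PDBM time slot. The function names `scan_order` and `pdbm_slot` are hypothetical. The exact scan order is an assumption: input modules are indexed as i = f·n + w, fibers are visited starting at FG_{jt}, and the wavelengths of each fiber f starting at WG_f, both in the direction given by CW_{jt}. In this example the result does not depend on that choice, because only input fiber 0 carries packets.

```python
def scan_order(fg, WG, cw, N, n):
    """Assumed scan order of input modules for one output module (j,t)."""
    step = 1 if cw == 0 else -1     # clockwise or counter-clockwise
    order = []
    for k in range(N):
        f = (fg + step * k) % N
        for m in range(n):
            w = (WG[f] + step * m) % n
            order.append(f * n + w)
    return order


def pdbm_slot(dest, x, y, FG, WG, CW, N, n, M):
    """Run the PDBM iterations of one slot; return {input module: delay}."""
    y = [row[:] for row in y]       # work on a copy of the output occupancy
    assigned = {}
    for _ in range(min(M, n * N)):
        grants = {}
        for j in range(N):
            for t in range(M):
                free = n - y[j][t]
                for i in scan_order(FG[j][t], WG, CW[j][t], N, n):
                    if free == 0:
                        break
                    # Step 1 (request): unassigned packet for fiber j, no input contention
                    if i not in assigned and dest[i] == j and x[i][t] == 0:
                        grants.setdefault(i, []).append(t)   # Step 2 (grant)
                        free -= 1
        if not grants:
            break
        for i, delays in grants.items():                     # Step 3 (accept)
            t = min(delays)
            assigned[i] = t
            y[dest[i]][t] += 1
    return assigned


N, n, M = 2, 2, 3
dest = [1, 1, None, None]           # input modules 0 and 1 request output fiber 1
x = [[0] * M for _ in range(n * N)] # no input contention
y = [[0, 0, 0], [1, 1, 0]]          # y_{1,0} = 1, y_{1,1} = 1, y_{1,2} = 0
FG = [[0] * M for _ in range(N)]
CW = [[0] * M for _ in range(N)]
WG = [0, 0]

result = pdbm_slot(dest, x, y, FG, WG, CW, N, n, M)
print(result)  # {0: 0, 1: 2}
```

Input module 1 ends up at delay 2 although a wavelength at delay 1 remains free, which is the impatience effect described in the text.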
I-PDBM algorithm
In the PDBM accept step, a granted input module confirms the received grant to update the state vector y_{jt} and deactivates its other request signals so that input ports with lower priority can be granted. The algorithm can be simplified to execute a single accept step after the last iteration: it suffices for each input module to keep the request signal active for the "accepted" grant and to deactivate all others. Since the number of available wavelengths does not decrease and the pointers do not change until the accept step at the end of the slot, each granted input module that keeps an active request signal is granted again, whereas the granted delays that are no longer requested are released and reassigned to other input modules.
The previous simplified scheme easily solves PDBM impatience if each granted input module stops requesting longer delays but keeps its request signals active for the granted delay and for shorter ones. By withdrawing the requests for longer delays, it releases wavelengths that can be granted to other modules. Subsequent iterations may increase the number of packet assignments and further reduce the delay of previously assigned packets. Thus, grants are provisional until the very last iteration, when the accept stage takes place.
Therefore, the differences with PDBM are:
Step 1. Request: each input module i with a packet destined to output fiber j sends a request signal to every output module of fiber j whose associated delay satisfies the input contention constraint and is not longer than any delay granted to the same input module in the previous iteration, i.e. the input module sends request signals to output modules (j,t) such that x_i(t) = 0 and t ≤ p, where p is the shortest granted delay.
End of the algorithm. Accept: the accept step takes place after the last iteration. Thus, state vectors and pointers are not updated until the end of the time slot, and the granted input modules participate in subsequent algorithm iterations.
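The modified request rule can be sketched on the impatience example (N=2, n=2, M=3, two packets from input fiber 0 to output fiber 1). The function names `scan_order` and `ipdbm_slot` are hypothetical, and the scan order (modules indexed as i = f·n + w, fibers visited from FG_{jt}, wavelengths from WG_f) is an assumption. An input module with no grant yet is treated as having p = M−1, so that it requests every delay without input contention.

```python
def scan_order(fg, WG, cw, N, n):
    """Assumed scan order of input modules for one output module (j,t)."""
    step = 1 if cw == 0 else -1
    order = []
    for k in range(N):
        f = (fg + step * k) % N
        for m in range(n):
            order.append(f * n + (WG[f] + step * m) % n)
    return order


def ipdbm_slot(dest, x, y, FG, WG, CW, N, n, M):
    """Run the I-PDBM iterations of one slot; return {input module: delay}."""
    best = {}   # provisional shortest granted delay p of each input module
    for _ in range(min(M, n * N)):
        new_best = {}
        for j in range(N):
            for t in range(M):
                free = n - y[j][t]      # y is not updated until the final accept
                for i in scan_order(FG[j][t], WG, CW[j][t], N, n):
                    if free == 0:
                        break
                    p = best.get(i, M - 1)
                    # Request: no input contention and t <= shortest granted delay
                    if dest[i] == j and x[i][t] == 0 and t <= p:
                        new_best[i] = min(t, new_best.get(i, t))   # provisional grant
                        free -= 1
        if new_best == best:            # signals are stable: converged
            break
        best = new_best
    return best                         # accepted once, after the last iteration


N, n, M = 2, 2, 3
dest = [1, 1, None, None]
x = [[0] * M for _ in range(n * N)]
y = [[0, 0, 0], [1, 1, 0]]
FG = [[0] * M for _ in range(N)]
CW = [[0] * M for _ in range(N)]
WG = [0, 0]

result = ipdbm_slot(dest, x, y, FG, WG, CW, N, n, M)
print(result)  # {0: 0, 1: 1}
```

In the second iteration input module 0 no longer requests delays 1 and 2, so the free wavelength at delay 1 is granted to input module 1, which then drops its request for delay 2.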
Algorithm justification
I-PDBM converges when the signals stabilize, i.e. there are neither new packet allocations nor assignments of shorter delays to granted packets. Like PDBM, I-PDBM converges to a maximal size matching in at most min(M,nN) iterations. Proof: i) An output port only changes a grant signal if a preceding input port (according to the grant pointers) releases a request signal. An input port only releases a request signal if, in the previous iteration, it received a grant from an output port associated with a shorter delay. Since the grants from delay 0 do not change after the first iteration, the algorithm converges in M iterations at most. ii) An input port is granted a shorter delay only if another input port was granted a shorter delay in the previous iteration. Since there are nN input ports, the algorithm converges in nN iterations at most.
I-PDBM avoids PDBM impatience and it is simpler to implement. It has two steps (request and grant), whereas PDBM needs three (request, grant and accept).
Results
In this section we present simulation results to compare I-PDBM (in terms of average delay, buffer requirements and practical convergence) with OB architectures and the previous PDBM algorithm, under benign or bursty traffic conditions. (Figure 6(c)) and β = 64 (Figure 6(d)). Switch sizes were N = 4, n = {2,8,32,64}, and buffer sizes were the same as above. For ON/OFF input traffic, the average delay of both algorithms is very similar. Bursty traffic affects I-PDBM performance as in the case of PDBM and OB architectures [11]. However, for Bernoulli traffic, the average delay of I-PDBM is lower in all configurations. We conclude that packet delay decreases by avoiding PDBM impatience and thus I-PDBM outperforms PDBM.
Table 1 shows buffer requirements for a packet loss probability of 10^-7 under Bernoulli traffic (simulations with 10^9 packets). This is a good feasibility metric for OPS nodes, because FDL length is a serious bottleneck nowadays. As we would expect, reducing packet delay leads to lower buffer requirements. We observe that the I-PDBM buffer length is very small, and it is close to the ideal OB case.
Tables 2 and 3 compare the theoretical convergence bound with the number of iterations K needed to converge with a probability above 1 − 10^-6 (90% input load). PDBM and I-PDBM behave similarly. Under Bernoulli traffic, they only need extra iterations when there are few wavelengths (n = 2). However, in all cases the number of iterations is quite low.
Conclusions
In this paper we have proposed the enhanced I-PDBM parallel iterative matching scheduler for IBWR optical packet switches [6], which offers significant advantages over PDBM [9] in terms of performance and hardware complexity.
