Abstract: We consider traffic scheduling in an N×N packet switch with an optical switch fabric, where the fabric requires a reconfiguration overhead to change its switch configurations. To provide 100% throughput with bounded packet delay, a speedup in the switch fabric is necessary to compensate for both the reconfiguration overhead and the inefficiency of the scheduling algorithm. To reduce the implementation cost of the switch, we aim at minimizing the required speedup for a given packet delay bound. Conventional Birkhoff-von Neumann traffic matrix decomposition requires $N^2-2N+2$ configurations in the schedule, which leads to a very large packet delay bound. The existing DOUBLE algorithm requires a fixed number of only 2N configurations, but it cannot adjust its schedule according to different switch parameters. In this paper, we first design a generic approach to decompose a traffic matrix into an arbitrary number of $N_S$ configurations ($N^2-2N+2>N_S>N$). Then, by taking the reconfiguration overhead into account, we formulate a speedup function. Minimizing the speedup function results in an efficient scheduling algorithm, ADAPT. We further observe that the algorithmic efficiency of ADAPT can be improved by better utilizing the switch bandwidth. This leads to a more efficient algorithm, SRF (Scheduling Residue First). ADAPT and SRF can automatically adjust the number of configurations in a schedule according to different switch parameters. We show that both algorithms outperform the existing DOUBLE algorithm.
power consumption for each rack, and makes the resulting switch architecture more scalable.
However, switches with optical fabrics suffer from a significant reconfiguration overhead when they update their configurations [5]. The reconfiguration overhead includes the time needed for a) interconnection pattern update (ranging from 10 ns to several milliseconds for different optical switching technologies [1]-[4], [6]); b) optical transceiver resynchronization (10 to 20 ns or higher [5]); and c) extra clock margin to align optical signals from different input modules. With most fast optical switching technologies [1]-[4] available nowadays, the reconfiguration overhead is still more than 1 slot for a system with a slot duration of 50 ns (64 bytes at 10 Gbps).
During the reconfiguration period, packet switching is prohibited. To achieve 100% throughput with bounded packet delay (or performance guaranteed switching [6]), the fabric has to transmit packets at an internal rate higher than the external line-rate, resulting in a speedup. The amount of speedup S is defined as the ratio of the internal packet transmission rate to the external line-rate. Speedup is directly associated with the implementation cost in practical switching systems: it affects not only the internal optical transmission rate, but also the memory access time. In this paper, we focus on minimizing the speedup requirement for a given packet delay bound. The goal is to achieve a cost-effective solution while maintaining guaranteed QoS performance.

Assume each switch reconfiguration takes an overhead of δ slots. Conventional slot-by-slot scheduling methods may severely cripple the performance of optical switches due to frequent reconfigurations. Hence, the scheduling rate has to be slowed down by holding each configuration for multiple time slots. Time slot assignment (TSA) [6]-[10] is a common approach to achieve this.

Assume time is slotted and each time slot can accommodate one packet. The switch works in a pipelined four-stage cycle: traffic accumulation, scheduling, switching and transmission, as shown in Fig. 2. Stage 1 is for traffic accumulation. Its duration T is a predefined system constant. Packets arriving within this duration form a batch, which is stored in a traffic matrix $C(T)=\{c_{ij}\}$. Each entry $c_{ij}$ denotes the number of packets that arrived at input i and are destined to output j. Assume the traffic has been regulated to be admissible before entering the switch, i.e., the entries in each line (row or column) of C(T) sum to at most T (T is called the maximum line sum).
In Stage 2, the scheduling algorithm is executed in H time slots to compute a schedule consisting of (at most) $N_S$ configurations for the accumulated traffic. Each configuration is given by a permutation matrix $P_n=\{p^{(n)}_{ij}\}$ ($1\le n\le N_S$), where $p^{(n)}_{ij}=1$ means that input port i is connected to output port j. A weight $\phi_n$ is assigned to each $P_n$; it denotes the number of slots for which $P_n$ is held for packet switching in Stage 3. To achieve 100% throughput, the set of $N_S$ configurations must cover C(T), i.e., $\sum_{n=1}^{N_S}\phi_n p^{(n)}_{ij}\ge c_{ij}$ for any $i, j\in\{0,\dots,N-1\}$. Because C(T) has $N^2$ entries and each configuration can cover at most N of them, the number of configurations $N_S$ must be no less than the switch size N. Otherwise, the $N_S$ configurations are not sufficient to cover every entry of C(T) [6], [8], [9]. In essence, this scheduling problem is equivalent to a traffic matrix decomposition problem, where the traffic matrix is decomposed into a set of $N_S$ weighted configurations (or permutations). For optical switches, this decomposition is constrained by the reconfiguration overhead δ, and the scheduling algorithm needs to determine a proper number of configurations $N_S$ to minimize speedup.

In Stage 3, the switch fabric is configured according to the $N_S$ configurations and packets are switched to their designated output buffers. Without speedup, Stage 3 requires $\sum_{n=1}^{N_S}\phi_n+\delta N_S$ slots, where $\sum_{n=1}^{N_S}\phi_n$ is the total holding time for the $N_S$ configurations and $\delta N_S$ is the total amount of reconfiguration overhead. Since $\sum_{n=1}^{N_S}\phi_n+\delta N_S$ is generally larger than the traffic accumulation time T, speedup is needed to ensure that Stage 3 takes no more than T slots. During the holding time of a configuration, some input-output connections become idle (earlier than others) once their scheduled backlog packets have all been sent. As a result, the schedule will contain empty slots [7], which causes bandwidth loss, or algorithmic inefficiency.
In general, this bandwidth loss increases with the holding time φ n . But a short holding time implies frequent switch reconfigurations, or large hardware inefficiency (due to large δN S ). A good scheduling algorithm should compromise between hardware and algorithmic inefficiency, and achieve a balanced tradeoff to minimize the speedup requirement.
At a speedup of S, the slot duration for a single packet transmission in Stage 3 is shortened by a factor of S. Then 100% throughput can be ensured if

$$\frac{1}{S}\sum_{n=1}^{N_S}\phi_n+\delta N_S\le T. \tag{1}$$

The values of $N_S$ and $\sum_{n=1}^{N_S}\phi_n$ in (1) are determined by the scheduling algorithm. Rearranging (1), the minimum required speedup is

$$S=\frac{1}{T-\delta N_S}\sum_{n=1}^{N_S}\phi_n=S_{reconfigure}\times S_{schedule}, \tag{2}$$

where

$$S_{reconfigure}=\frac{T}{T-\delta N_S} \tag{3}$$

compensates for the reconfiguration overhead, and

$$S_{schedule}=\frac{1}{T}\sum_{n=1}^{N_S}\phi_n \tag{4}$$

compensates for the algorithmic inefficiency. A feasible schedule requires $T>\delta N_S$.

Without loss of generality, we define a flow as a series of packets coming from the same input port and destined to the same output port of the switch. Since packets in each flow follow first-in-first-out (FIFO) [11] order in passing through the switch, there is no packet out-of-order problem within the same flow. (But packets in different flows may interleave at the output buffers.) Stage 4 takes another T slots to dispatch the packets from the output buffers to the output lines in FIFO order.

Consider a packet that arrives at the input buffer in the first slot of Stage 1. It suffers a delay of T slots in Stage 1 for traffic accumulation (i.e., the worst-case accumulation delay), and another delay of H slots in Stage 2 for algorithm execution. In the worst case, this packet will experience a maximum delay of 2T slots in Stages 3 & 4 (assuming it is sent onto the output line in the last slot of Stage 4). Taking all four stages into account, the delay experienced by any packet at the switch is bounded by 3T+H slots, as shown in Fig. 2a. Note that the value of H depends on how the scheduling hardware is engineered. For H≤T, a single set of scheduling hardware can schedule all incoming batches. For H>T, $\lceil H/T\rceil$ sets of scheduling hardware can be used in Stage 2 for pipelined scheduling of multiple batches at the same time. This is feasible because each batch is independently processed. Fig. 2b shows an example of H=2T, where two sets of scheduling hardware are used.
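As a rough illustration of the Stage 3 timing budget and the delay bound discussed above, the sketch below computes the smallest speedup for which the shortened holding times plus the reconfiguration overhead fit into T slots, together with the 3T+H bound and the number of scheduling hardware sets. The function names are hypothetical, and the throughput condition is assumed to take the form (1/S)·Σφ_n + δ·N_S ≤ T, as the Stage 3 description suggests.

```python
import math

def required_speedup(weights, T, delta):
    """Smallest S such that sum(weights)/S + delta*len(weights) <= T.

    weights: holding times (in slots) of the N_S configurations.
    Assumed form of the throughput condition; raises if even an
    infinite speedup cannot fit the reconfigurations into T slots.
    """
    n_s = len(weights)
    if T <= delta * n_s:
        raise ValueError("infeasible: reconfiguration overhead alone fills T")
    return sum(weights) / (T - delta * n_s)

def delay_bound(T, H):
    """Worst-case packet delay over the four pipelined stages: 3T + H slots."""
    return 3 * T + H

def scheduler_sets(T, H):
    """Sets of scheduling hardware needed so that batches can be pipelined."""
    return max(1, math.ceil(H / T))

# Example: 8 configurations held 4 slots each, T = 16 slots, delta = 1 slot.
print(required_speedup([4] * 8, T=16, delta=1))      # 32 / (16 - 8) = 4.0
print(delay_bound(T=16, H=32), scheduler_sets(T=16, H=32))
```

With H=2T the sketch reports two sets of scheduling hardware, matching the H=2T case of Fig. 2b.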
Several TSA-based scheduling algorithms have been proposed to achieve performance guaranteed switching [6], [8], [9]. Some of them aim at minimizing the packet delay bound by using the minimum number of $N_S=N$ configurations in the schedule. These algorithms are called minimum-delay scheduling algorithms. Examples include MIN [6] and QLEF [8], [9]. They generally have low algorithmic efficiency and thus require a very large $S_{schedule}$. Other algorithms favor a larger number of configurations to achieve higher algorithmic efficiency. For example, EXACT [6] adopts the classic Birkhoff-von Neumann decomposition [12]-[15].

Minimizing the speedup function leads to the proposal of a novel ADAPT algorithm. ADAPT works by converting the traffic matrix C(T) into a weighted sum of a quotient matrix Q and a residue matrix R. Then, Q is covered by $N_S-N$ configurations and R is covered by the other N configurations (detailed in Sections II & III). We further show that the performance of ADAPT can be enhanced by sending more packets in the $N_S-N$ configurations devoted to Q. This leads to another algorithm, SRF (Scheduling Residue First), which requires an even lower speedup than ADAPT. Both ADAPT and SRF outperform DOUBLE [6], because they always construct a schedule with a proper number of $N_S$ configurations (instead of fixing $N_S=2N$) to minimize speedup. In other words, both ADAPT and SRF can automatically adjust the schedule according to different switch parameters T, N, and δ. Note that ADAPT and SRF are based on a generic matrix decomposition approach proposed in this paper. This matrix decomposition technique may also find applications in other networks, such as SS/TDMA [16], [17], TWIN [18] and wireless sensor networks [19]. It can also be applied to unconstrained switches [6] (e.g., electronic switches) to reduce the number of configurations in the schedule (compared to Birkhoff-von Neumann decomposition) [14].
The rest of the paper is organized as follows. In Section II, we derive a generic approach to decompose a traffic matrix into $N_S$ configurations ($N^2-2N+2>N_S>N$). Based on the traffic matrix decomposition, our speedup function $S=f(N_S)$ is formulated. The ADAPT algorithm is designed in Section III and the SRF algorithm is proposed in Section IV. Section V gives some discussions. The paper is concluded in Section VI.
II. TRAFFIC MATRIX DECOMPOSITION AND SPEEDUP FUNCTION

A. Traffic Matrix Decomposition
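To generate a schedule consisting of at most $N_S$ switch configurations, we divide each entry $c_{ij}\in C(T)$ by $T/(N_S-N)$. For simplicity, we first assume $T/(N_S-N)$ is an integer. The traffic matrix C(T) is then converted into a weighted sum of a quotient matrix $Q=\{q_{ij}\}$ and a residue matrix $R=\{r_{ij}\}$. That is,

$$c_{ij}=\frac{T}{N_S-N}\times q_{ij}+r_{ij},\qquad q_{ij}=\left\lfloor\frac{c_{ij}}{T/(N_S-N)}\right\rfloor. \tag{5}$$

Since the maximum line sum of C(T) is T, we have

$$\sum_{i=0}^{N-1}c_{ij}\le T\quad\text{and}\quad\sum_{j=0}^{N-1}c_{ij}\le T. \tag{6}$$

From (5) & (6), we get

$$\sum_{i=0}^{N-1}q_{ij}\le\frac{N_S-N}{T}\sum_{i=0}^{N-1}c_{ij}\le N_S-N \tag{7}$$

and

$$\sum_{j=0}^{N-1}q_{ij}\le\frac{N_S-N}{T}\sum_{j=0}^{N-1}c_{ij}\le N_S-N. \tag{8}$$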
With (7) & (8), it is easy to see that the maximum line sum of Q is at most N S -N. According to Lemma 1 below, the quotient matrix Q can be covered by N S -N configurations.
Lemma 1: An N×N matrix with maximum line sum N S -N can be covered by N S -N configurations, each with a weight of 1.
Proof: We first construct a bipartite multigraph [7], [20], [21] $G_Q$ from Q, as illustrated by the example in Fig. 3. Rows and columns of Q translate to two sets of vertices A and B in $G_Q$, and each entry $q_{ij}$ translates to $q_{ij}$ edges connecting vertices i∈A and j∈B. Since the maximum vertex degree is at most $N_S-N$, $G_Q$ can be edge-colored [20], [21] using $N_S-N$ colors, such that the edges incident on the same vertex have different colors. Each color is then mapped back to generate a configuration, where the edges in this color translate to 1s at the corresponding entries and all other entries are 0s. As a result, the quotient matrix Q can be covered by the $N_S-N$ configurations obtained, with a weight of 1 for each configuration. ∎

On the other hand, from (5) we have

$$r_{ij}<\frac{T}{N_S-N}. \tag{9}$$

Therefore, the residue matrix R can be covered by any N non-overlapping configurations with a weight of $T/(N_S-N)$ each. "Non-overlapping" means that no two of the N configurations cover the same entry of R. Mathematically, these N non-overlapping configurations sum to an all-1 matrix. They can be chosen (or predefined) arbitrarily without any explicit computation.
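The decomposition of this subsection can be sketched in a few lines. The code below is an illustration under stated assumptions: it takes the quotient as the integer part of each entry divided by T/(N_S−N), uses cyclic shifts as one arbitrary choice of N non-overlapping configurations, and runs on a made-up 4×4 matrix with T=16 and N_S=8. All function names are hypothetical.

```python
def decompose(C, T, N_S):
    """Split C into quotient Q and residue R using the unit T/(N_S - N)."""
    N = len(C)
    if N_S <= N:
        raise ValueError("N_S must exceed N")
    unit, rem = divmod(T, N_S - N)
    if rem != 0:
        raise ValueError("this sketch assumes T/(N_S - N) is an integer")
    Q = [[c // unit for c in row] for row in C]
    R = [[c % unit for c in row] for row in C]
    return Q, R, unit

def max_line_sum(M):
    """Largest row or column sum of a square matrix."""
    return max([sum(row) for row in M] + [sum(col) for col in zip(*M)])

def cyclic_configurations(N):
    """N non-overlapping permutations: configuration k maps input i to output (i + k) % N."""
    return [[(i + k) % N for i in range(N)] for k in range(N)]

# Made-up admissible traffic matrix: every row and column sums to T = 16.
C = [[5, 3, 6, 2],
     [4, 7, 1, 4],
     [3, 2, 5, 6],
     [4, 4, 4, 4]]
Q, R, unit = decompose(C, T=16, N_S=8)
print(Q, max_line_sum(Q))   # line sums of Q stay within N_S - N = 4
print(R, unit)              # every residue entry is below the unit weight
```

Because the cyclic shifts together touch every (input, output) pair exactly once, giving each of them the unit weight is enough to cover R without inspecting its entries.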
B. Speedup Function
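With the above traffic matrix decomposition, $S_{schedule}$ in (4) can be formulated as follows (note that $S_{schedule}$ is further reduced in Section IV by (19)). All $N_S$ configurations carry the same weight $T/(N_S-N)$, so

$$S_{schedule}=\frac{1}{T}\sum_{n=1}^{N_S}\phi_n=\frac{1}{T}\times N_S\times\frac{T}{N_S-N}=\frac{N_S}{N_S-N}. \tag{10}$$

If $T/(N_S-N)$ is not an integer, each weight can be rounded up to $\lceil T/(N_S-N)\rceil$. From (10), this would increase $S_{schedule}$ by at most $N_S/T$. When $T\gg N_S$, it can be ignored. For simplicity, we assume $T/(N_S-N)$ is an integer.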
Note that T, N and δ are predefined switch parameters. From (2) & (10), the overall speedup S can be expressed in terms of $N_S$ using the speedup function below.

$$S=f(N_S)=\frac{T}{T-\delta N_S}\times\frac{N_S}{N_S-N}. \tag{11}$$
The importance of the above speedup function can be summarized as follows: a) it formulates how speedup S changes with the number of configurations $N_S$ (in the range of $N^2-2N+2>N_S>N$); and b) it allows us to study how the switch-dependent parameters T, N and δ can affect speedup S.
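To make the tradeoff in the speedup function concrete, the sketch below evaluates it over all admissible integer values of N_S. It assumes the function is the product T/(T−δN_S) × N_S/(N_S−N) suggested by (2) and (10); the function names and the sample parameters (N=16, T=256, δ=1) are illustrative only.

```python
def speedup(N_S, T, N, delta):
    """Overall speedup for a schedule of N_S equally weighted configurations."""
    if not (N < N_S and delta * N_S < T):
        return float("inf")                  # outside the feasible range
    s_reconfigure = T / (T - delta * N_S)    # grows with the number of reconfigurations
    s_schedule = N_S / (N_S - N)             # shrinks as more configurations are used
    return s_reconfigure * s_schedule

def best_integer_configurations(T, N, delta):
    """Exhaustive search over N < N_S < N^2 - 2N + 2."""
    candidates = range(N + 1, N * N - 2 * N + 2)
    return min(candidates, key=lambda n_s: speedup(n_s, T, N, delta))

T, N, delta = 256, 16, 1
for n_s in (20, 32, 64, 128, 200):
    print(n_s, round(speedup(n_s, T, N, delta), 3))
print("best N_S:", best_integer_configurations(T, N, delta))
```

Few configurations inflate the scheduling term and many configurations inflate the reconfiguration term, so the product has an interior minimum.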
C. An Example Based on DOUBLE
The traffic matrix decomposition in the existing DOUBLE algorithm [6] can be regarded as a special case of our proposed decomposition with $N_S=2N$. As mentioned in Section I, DOUBLE uses $N_S=2N$ to achieve $S_{schedule}=2$. This is obtained by replacing $N_S$ in (10) by 2N. In DOUBLE, each $c_{ij}\in C(T)$ is divided by $T/(N_S-N)=T/N$ to get the quotient matrix Q and the residue matrix R. Then, Q and R are each covered by N configurations. In particular, the $N_S-N=N$ configurations devoted to Q are obtained from edge-coloring, and the other N non-overlapping configurations devoted to R can be chosen arbitrarily [6]. Each configuration is equally weighted by $\phi_n=T/(N_S-N)=T/N$. Fig. 4 gives an example of DOUBLE execution.
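The configurations devoted to Q come from edge-coloring. As a hedged illustration, the sketch below reaches the same kind of cover by a different route: it pads Q with dummy units until every line sums to exactly D = N_S − N, then peels off D perfect matchings with Kuhn's augmenting-path algorithm. This is a swapped-in technique, not the edge-coloring routine of [20], [21]; the function names and the sample Q are made up for illustration.

```python
def pad_to_regular(Q, D):
    """Copy of Q with dummy units added so that every row and column sums to D."""
    n = len(Q)
    M = [row[:] for row in Q]
    row_def = [D - sum(row) for row in M]
    col_def = [D - sum(col) for col in zip(*M)]
    if min(row_def + col_def) < 0:
        raise ValueError("maximum line sum of Q exceeds D")
    i = j = 0
    while i < n and j < n:
        if row_def[i] == 0:
            i += 1
        elif col_def[j] == 0:
            j += 1
        else:
            add = min(row_def[i], col_def[j])
            M[i][j] += add
            row_def[i] -= add
            col_def[j] -= add
    return M

def perfect_matching(M):
    """Kuhn's algorithm on the non-zero entries of M; perm[i] is the column matched to row i."""
    n = len(M)
    row_of_col = [-1] * n

    def augment(i, seen):
        for j in range(n):
            if M[i][j] > 0 and not seen[j]:
                seen[j] = True
                if row_of_col[j] == -1 or augment(row_of_col[j], seen):
                    row_of_col[j] = i
                    return True
        return False

    for i in range(n):
        if not augment(i, [False] * n):
            raise ValueError("no perfect matching found")
    perm = [0] * n
    for j, i in enumerate(row_of_col):
        perm[i] = j
    return perm

def cover_quotient(Q, D):
    """Return D permutations (each with weight 1) that together cover Q."""
    M = pad_to_regular(Q, D)
    configs = []
    for _ in range(D):
        perm = perfect_matching(M)
        for i, j in enumerate(perm):
            M[i][j] -= 1          # the remaining matrix stays regular
        configs.append(perm)
    return configs

Q = [[1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 1], [1, 1, 1, 1]]
for perm in cover_quotient(Q, D=4):
    print(perm)
```

Padding makes the matrix regular, so a perfect matching exists at every round; the dummy units only add spare capacity and never remove coverage of a real entry.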
III. ADAPT ALGORITHM
A. ADAPT Algorithm
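Based on the speedup function $S=f(N_S)$ in (11), we can design a scheduling algorithm to minimize the overall speedup S. Let

$$\frac{dS}{dN_S}=0. \tag{12}$$

Solving (12) for $N_S$, we get

$$N_S=\lambda N, \tag{13}$$

where

$$\lambda=\sqrt{\frac{T}{\delta N}}. \tag{14}$$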
Since $T>\delta N_S\ge\delta N$, we have λ>1. λ normalizes the switch-dependent parameters T, N and δ. Without this normalization, it is difficult to compare the speedup requirement among switches with different parameters: each parameter would have to be varied one at a time, with the others kept fixed, for a fair comparison. From (12)-(14), we can see that the overall speedup S is minimized if $N_S=\lambda N$ configurations are used for scheduling. The above analysis leads to our ADAPT algorithm, as summarized in Fig. 5. The number of configurations $N_S$ required by ADAPT is self-adjusted to the switch parameters T, N, and δ using (13). The traffic matrix C(T) is then covered by the $N_S$ configurations obtained from the decomposition in Part A of Section II. In practice, if λN and $T/(N_S-N)$ are not integers, we can set
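Without loss of generality, let $S_{ADAPT}$ be the overall speedup required by ADAPT. Substituting (13) into (11), we have

$$S_{ADAPT}=\frac{T}{T-\delta\lambda N}\times\frac{\lambda N}{\lambda N-N}=\left(\frac{\lambda}{\lambda-1}\right)^2. \tag{15}$$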
We can see that the minimized speedup (S ADAPT ) only depends on the value of λ. In other words, switches with the same value of λ require the same speedup. This theoretical insight cannot be easily seen without λ.
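The parameter selection of ADAPT can be sketched as follows. The code assumes λ = sqrt(T/(δN)) and N_S = λN, and rounds both N_S and the per-configuration weight up to the next integer; that rounding rule is an assumption made for this sketch, as are the function name and the sample parameters.

```python
import math

def adapt_parameters(T, N, delta):
    """Return (lambda, N_S, weight per configuration, resulting speedup)."""
    lam = math.sqrt(T / (delta * N))
    if lam <= 1:
        raise ValueError("T must exceed delta * N")
    n_s = max(N + 1, math.ceil(lam * N))     # assumed rounding rule
    weight = math.ceil(T / (n_s - N))        # assumed rounding rule
    if T <= delta * n_s:
        raise ValueError("no time left for packet switching after rounding")
    speedup = n_s * weight / (T - delta * n_s)
    return lam, n_s, weight, speedup

# Two different switches that share the same lambda = 3.
print(adapt_parameters(T=288, N=16, delta=2))
print(adapt_parameters(T=1152, N=32, delta=4))
```

Both parameter sets yield the same speedup, which illustrates that the minimized speedup depends only on λ when no rounding is needed.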
B. Performance Comparison with DOUBLE
Let $S_{DOUBLE}$ denote the overall speedup required by DOUBLE. Since DOUBLE uses $N_S=2N$ configurations to achieve $S_{schedule}=2$, from (2) we have

$$S_{DOUBLE}=\frac{2T}{T-2\delta N}=\frac{2\lambda^2}{\lambda^2-2}. \tag{16}$$
Input:
An N×N traffic matrix $C(T)=\{c_{ij}\}$ with maximum line sum no more than T, and the reconfiguration overhead δ.
Output:
At most $N_S$ configurations $P_1, \dots, P_{N_S}$ and the corresponding weights $\phi_1, \dots, \phi_{N_S}$.
Step 1. Calculate $N_S$:
Calculate $N_S$ by (13).
Step 2. Calculate the quotient matrix Q:
Construct an N×N matrix $Q=\{q_{ij}\}$ such that $q_{ij}=\lfloor c_{ij}/(T/(N_S-N))\rfloor$, as in (5).
Step 3. Color Q:
Construct a bipartite multigraph $G_Q$ from Q. Rows and columns of Q translate to two sets of vertices A and B in $G_Q$, and each entry $q_{ij}\in Q$ translates to $q_{ij}$ edges connecting vertices i∈A and j∈B. Find a minimal edge-coloring of $G_Q$ to get at most $N_S-N$ colors, such that the edges incident on the same vertex have different colors. Set $n\leftarrow 1$.
Step 4. Schedule the quotient matrix Q:
For a specific color in the edge-coloring of $G_Q$, construct a configuration $P_n$ from the edges in that color by setting the corresponding entries to 1 in $P_n$ (all other entries in $P_n$ are set to 0). Set the weight $\phi_n=T/(N_S-N)$.

If $T\le 2\delta N$, the reconfiguration overhead of the 2N configurations in DOUBLE alone takes up the whole traffic accumulation time T, leaving no time for packet switching. In this case, DOUBLE cannot generate a feasible schedule even if the speedup is infinite.
Recall that the packet delay bound is given by 3T+H time slots, and T is directly related to the QoS that the switch can provide. ADAPT can achieve a tighter packet delay bound than DOUBLE. For example, if we set T=2δN, a delay bound of 6δN+H slots can be achieved by ADAPT at a proper speedup. However, this is impossible in DOUBLE as the minimum delay bound it can provide must be larger than 6δN+H slots.
Even for the region where both DOUBLE and ADAPT are feasible (i.e., $\lambda>\sqrt{2}$), from (15) & (16), it is easy to prove that $S_{ADAPT}\le S_{DOUBLE}$ is always true. Therefore, for the same set of switch parameters T, N and δ, the overall speedup required by ADAPT is always smaller than that required by DOUBLE (except at λ=2, i.e., T=4δN, where $S_{ADAPT}=S_{DOUBLE}=4$). For example, with T=16δN, we have $S_{ADAPT}=1.78$ and $S_{DOUBLE}=2.29$.
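The comparison can be checked numerically. The sketch below assumes the closed forms S_ADAPT = (λ/(λ−1))² and S_DOUBLE = 2λ²/(λ²−2), which follow from using N_S=λN and N_S=2N respectively with T=λ²δN; DOUBLE is reported as infeasible (infinite speedup) for λ ≤ √2. The function names are hypothetical.

```python
import math

def s_adapt(lam):
    """Speedup of ADAPT as a function of lambda (requires lambda > 1)."""
    return (lam / (lam - 1)) ** 2

def s_double(lam):
    """Speedup of DOUBLE; infeasible when T <= 2*delta*N, i.e. lambda <= sqrt(2)."""
    if lam <= math.sqrt(2):
        return math.inf
    return 2 * lam ** 2 / (lam ** 2 - 2)

for lam in (1.2, 1.5, 2.0, 4.0):          # lambda = 4 corresponds to T = 16*delta*N
    print(lam, round(s_adapt(lam), 2), round(s_double(lam), 2))
```

The gap follows from 2(λ−1)² − (λ²−2) = (λ−2)² ≥ 0, which vanishes only at λ=2.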
The running time of ADAPT is dominated by the edge-coloring of at most $(N_S-N)\times N=(\lambda-1)N^2$ edges. This gives a time complexity of
IV. SRF ALGORITHM
In both DOUBLE and ADAPT, the N non-overlapping configurations devoted to the residue matrix R are chosen arbitrarily without explicit computation. In this section, we design an SRF (Scheduling Residue First) algorithm which schedules R more carefully to further reduce S schedule . Since DOUBLE is a special case of ADAPT at N S =2N, for simplicity, we first design SRF based on DOUBLE.
A. Observation and Motivation
DOUBLE assigns an equal weight of T/N to each of its 2N configurations (see the example in Fig. 4) . We observe that a schedule generated by DOUBLE may be inefficient. In particular, the bandwidth in the N configurations devoted to Q is generally not well utilized (due to the bandwidth loss). If such otherwise wasted bandwidth can be used to transmit some packets in R, the remaining packets in R may be sent in a shorter amount of time. That is, some configurations devoted to R may require a weight less than T/N. Then the overall switch speedup requirement can be further reduced.
In DOUBLE, C(T)=[T/N]×Q+R and r ij <T/N for any entry r ij ∈R. If r ij >T/(2N), we call the entry an LER (large entry in R). Otherwise it is an SER (small entry in R). We have the following Lemma 2 (proved in Appendix A).
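Lemma 2: In DOUBLE, if a line (row i or column j) of R contains k LERs, then in Q we have

$$\sum_{j=0}^{N-1}q_{ij}\le N-\left\lceil\frac{k}{2}\right\rceil\ \text{for row } i,\quad\text{or}\quad\sum_{i=0}^{N-1}q_{ij}\le N-\left\lceil\frac{k}{2}\right\rceil\ \text{for column } j.$$

Fig. 6 gives an example based on the same Q and R as in Fig. 4. The third row of R contains k=1 LER (>T/(2N)=2). Then the entries in the third row of Q sum to at most $N-\lceil k/2\rceil=3$.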
Based on Lemma 2, we can move some packets from R to Q, while still keeping the maximum line sum of Q no more than N.
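The headroom that Lemma 2 leaves in Q can be checked on a small instance. The sketch below assumes the bound has the form N − ⌈k/2⌉ for a line with k LERs, decomposes a made-up 4×4 matrix with T=16 as DOUBLE would (unit weight T/N), and tests every row and column. The function names are hypothetical.

```python
import math

def count_lers(line, threshold):
    """Number of large entries (strictly above the threshold) in one line of R."""
    return sum(1 for r in line if r > threshold)

def lemma2_holds(Q, R, T):
    """Check sum(line of Q) <= N - ceil(k/2) for every row and column."""
    N = len(Q)
    threshold = T / (2 * N)
    lines_q = Q + [list(col) for col in zip(*Q)]
    lines_r = R + [list(col) for col in zip(*R)]
    return all(sum(q) <= N - math.ceil(count_lers(r, threshold) / 2)
               for q, r in zip(lines_q, lines_r))

T = 16
C = [[5, 3, 6, 2],
     [4, 7, 1, 4],
     [3, 2, 5, 6],
     [4, 4, 4, 4]]
unit = T // len(C)                                   # DOUBLE divides by T/N
Q = [[c // unit for c in row] for row in C]
R = [[c % unit for c in row] for row in C]
print([count_lers(row, T / (2 * len(C))) for row in R])
print(lemma2_holds(Q, R, T))
```

Each LER in a line accounts for more than half a unit of that line's budget T, which is why the quotient entries of the same line must sum to strictly less than N − k/2.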
Note that all $\phi_n$ in DOUBLE are equal to T/N. If a line in R contains k LERs, we can move half (i.e., $\lceil k/2\rceil$) of them to Q. This is achieved by setting the moved LER entries to 0 in R, and at the same time increasing the corresponding entries in Q by 1. Fig. 6 shows this operation. We use Q´ and R´ to denote the updated Q and R. In Fig. 6, because the maximum line sum of Q´ is at most 4, Q´ can still be covered by N=4 configurations, each with the same original weight $\phi_n=T/N=4$. Compared to the example in Fig. 4, more packets are scheduled in the N configurations devoted to the quotient matrix.
Note that each line of R can contain at most N LERs. If half of them are moved to Q (without increasing its maximum line sum), then each line of R´ will contain at most N/2 LERs (if N is even). Though we still need N non-overlapping configurations to cover R´, it is possible to reduce the weight for some of them. Specifically, we may find half (N/2) of the non-overlapping configurations, each with the original weight $\phi_n=T/N$, to cover all the remaining LERs in R´. The other half of the non-overlapping configurations only need a reduced weight of $\phi_n=T/(2N)$ to cover the remaining SERs. If this can be achieved, then $S_{schedule}$ can be reduced to

$$S_{schedule}=\frac{1}{T}\left(N\times\frac{T}{N}+\frac{N}{2}\times\frac{T}{N}+\frac{N}{2}\times\frac{T}{2N}\right)=1.75. \tag{17}$$
The above observation motivates us to explore a more efficient scheduling algorithm than DOUBLE. The key is to find a proper residue matrix R´, such that R´ contains at most N/2 LERs in each line, and each line sum of Q´ is not larger than N. Generally, this is not easy. For example, assume that all the non-zero entries (3s) in Fig. 7a are LERs. The number next to each line of R gives the number of LERs that can be moved from that line to Q (its quota). In Fig. 7a, if the four circled entries are moved to Q, then we cannot move any other LERs without violating the quota of the corresponding lines. At this point, the last row of R´ still contains 3 LERs, where 3>N/2=2. This makes it impossible to cover the remaining LERs by N/2 configurations. For a larger switch size N, it will be more difficult to figure out a proper set of LERs to move.
B. SRF Algorithm
For simplicity, we first consider even switch size N (an example for odd N is given in Appendix B). We map the residue matrix R={r ij } into a 0/1 reference matrix Ω={ω ij } such that ω ij =1 if r ij is an LER and ω ij =0 otherwise. Fig. 7b gives an example of Ω for R in Fig. 7a (where all the 3s are LERs). A horizontal line h and a vertical line v are used to partition Ω into four (N/2)×(N/2) zones A, B, C and D as shown in Fig. 8 . The partitioning line v (or h) separates each row (or column) into two parts, and each part contains N/2 entries. In addition, we define a set of N/2 non-overlapping predefined partial configurations (PPCs) in a cyclic manner to cover all the entries in zone A. Specifically, if an entry (i, j) 3 in zone A is covered by PPC p (where N/2≥p≥1), then
From (18) 4 , each PPC p covers N/2 entries in zone A. Fig. 9 gives an example where four non-overlapping PPCs (i.e., PPC 1 ～PPC 4 ) are used to cover all the entries in zone A of an 8×8 matrix. For easy reading, we use an italic number at an entry to denote the index number p if PPC p covers that entry. For example, the circled entries in Fig. 9 are covered by PPC 2 .
For an arbitrary entry ω ij ∈Ω, we define its line images and diagonal image as shown in Fig. 10 . Particularly, ω i(N -j -1) and ω (N -i -1)j are line images of ω ij , because they are symmetrical to ω ij with respect to the two partitioning lines v and h. ω (N -i -1)(N -j -1) is the diagonal image as it is symmetrical to ω ij with respect to the cross-point of v and h.
Without loss of generality, we consider $PPC_p$. For each entry $\omega_{ij}$ (in zone A) covered by $PPC_p$, we find its line and diagonal images. The 4-tuple $\{\omega_{ij}, \omega_{i(N-j-1)}, \omega_{(N-i-1)j}, \omega_{(N-i-1)(N-j-1)}\}$ has 16 possible values (or combinations), as shown in Fig. 11. The tuples in Figs. 11a-11l are defined as diagonal dominant tuples (DD tuples). The two circled diagonal entries in each DD tuple are dominant entries, whereas the other two diagonal entries are non-dominant. The remaining tuples are non-DD tuples, which are either row isomorphic or column isomorphic.

Footnote 4: Note that the PPCs are not necessarily predefined in the cyclic manner as in (18). In fact, any set of N/2 non-overlapping permutation sub-matrices in zone A can be used as PPCs. We use (18) only to facilitate the presentation.

Fig. 10. Images.

Fig. 12. Butterfly mechanism (showing the most recent row isomorphic tuple that occurred in the same row pair, and the current row isomorphic tuple).

To cover the residue matrix using N non-overlapping configurations $P_1$ to $P_N$, we first initialize $P_1$ to $P_N$ to all-0 matrices. Then, based on the reference matrix Ω, each $PPC_p$ ($1\le p\le N/2$) is sequentially examined to construct two configurations $P_p$ and $P_{p+N/2}$. Specifically, for each $\omega_{ij}$ covered by $PPC_p$, we find its line and diagonal images to form a 4-tuple. The four entries {(i, j), (i, N-j-1), (N-i-1, j), (N-i-1, N-j-1)} in $P_p$ and $P_{p+N/2}$ are then set according to the three cases below.
Case 1: If the 4-tuple is a DD tuple, we set the two dominant entries to 1 in P p , and set the other two non-dominant entries to 1 in P p+N/2 .
Case 2: If the 4-tuple is a row isomorphic tuple, we check whether there exist other row isomorphic tuples in the same row pair {i, N-i-1} (which may occur when examining an earlier PPC t , t<p.) If no, we set either pair of the diagonal entries to 1 in P p , and set their line images to 1 in P p+N/2 . For example, if the two entries (i, j) and (N-i-1, N-j-1) are set to 1 in P p , then the entries (i, N-j-1) and (N-i-1, j) are set to 1 in P p+N/2 . On the other hand, if some row isomorphic tuples occurred earlier in the same row pair, we set the entries in P p and P p+N/2 according to the most recently occurred one. Assume that the most recent row isomorphic tuple in this row pair occurred in examining PPC t (where p>t). Note that two configurations P t and P t+N/2 have been obtained in examining PPC t . If P t was set to cover a "1" in row i and a "0" in row N-i-1 of Ω, then we set P p to cover a "1" in row N-i-1 and a "0" in row i, and vice versa. Fig. 12 gives an example. Assume that the two dash-circled entries have been set to 1 in P t . Then, the two solid-circled entries are set to 1 in P p . At the same time, their line images (i.e., the two entries in the triangles) are set to 1 in P p+N/2 . The goal is to let P t and P p cover the 1s in row i and row N-i-1 of Ω in an alternating manner. We call this butterfly mechanism.
Case 3: If the 4-tuple is a column isomorphic tuple, we use the same mechanism as in Case 2 to set P p and P p+N/2 , but we operate on the corresponding column pair instead of row pair.
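The image positions and the three cases can be illustrated with a short sketch. The classification rule below is an assumption inferred from the text: a tuple is treated as DD when its two diagonals carry different numbers of 1s (or when all four entries are equal, where either diagonal can serve as dominant), as row isomorphic when one row of the tuple is all 1s and the other all 0s, and as column isomorphic in the transposed situation. The function names are hypothetical.

```python
from itertools import product

def four_tuple_positions(i, j, N):
    """Entry (i, j), its two line images and its diagonal image."""
    return [(i, j), (i, N - j - 1), (N - i - 1, j), (N - i - 1, N - j - 1)]

def classify(a, b, c, d):
    """Classify a 0/1 4-tuple.

    a = w[i][j], b = w[i][N-j-1], c = w[N-i-1][j], d = w[N-i-1][N-j-1].
    The diagonals are {a, d} and {b, c}. Assumed rule, see the lead-in.
    """
    if a + d != b + c:
        return "DD"                      # one diagonal covers more 1s
    if a == b and a != c:
        return "row isomorphic"          # rows are (1, 1) and (0, 0)
    if a == c and a != b:
        return "column isomorphic"       # columns are (1, 1) and (0, 0)
    return "DD"                          # all-0 or all-1: either diagonal works

counts = {}
for tup in product((0, 1), repeat=4):
    kind = classify(*tup)
    counts[kind] = counts.get(kind, 0) + 1
print(counts)
print(four_tuple_positions(0, 0, 4), four_tuple_positions(1, 1, 4))
```

Under this rule, 12 of the 16 combinations are DD tuples, which agrees with the twelve panels Figs. 11a-11l; the remaining four split evenly between the row and column isomorphic cases.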
A 4×4 example is given in Fig. 13 . For simplicity, we only discuss the construction of P 1 and P 2 . In examining PPC 1 (which covers entries (0, 0) and (1, 1) in zone A), {ω 00 , ω 33 } are first identified as dominant entries in the 4-tuple {ω 00 , ω 03 , ω 30 , ω 33 }. Therefore, the two entries (0, 0) and (3, 3) are set to 1 in P 1 . Then, we can see that {ω 11 , ω 12 , ω 21 , ω 22 } is a row isomorphic tuple. Since no other row isomorphic tuples precede it so far, we simply set the two solid-circled entries (1, 1) and (2, 2) to 1 in P 1 . In examining PPC 2 (which covers entries (0, 1) and (1, 0) in zone A), {ω 01 , ω 02 , ω 31 , ω 32 } is also a row isomorphic tuple but it resides in a row pair different from that of {ω 11 , ω 12 , ω 21 , ω 22 } (which occurred earlier in examining PPC 1 ). Since no other row isomorphic tuples precede it in the same row pair, we can set either pair of the diagonal entries to 1 in P 2 . In Fig. 13 , the two diagonal entries (0, 1) and (3, 2) are set to 1 in P 2 . After that, the row isomorphic tuple {ω 10 , ω 13 , ω 20 , ω 23 } is considered. Because another row isomorphic tuple {ω 11 , ω 12 , ω 21 , ω 22 } precedes it in the same row pair, we need to set P 2 according to the butterfly mechanism. As a result, the two dash-circled entries (1, 0) and (2, 3) are set to 1 in P 2 .
Obviously, the above process generates N non-overlapping configurations P 1 ～P N to cover every entry of the residue matrix. This is because the entries covered by the PPCs in zone A and their corresponding images do not overlap each other. On the other hand, P 1 ～P N/2 usually cover more than half of 1s for each and all of the lines in Ω. This is because the dominant entries in each DD tuple are always covered by a configuration among P 1 ～P N/2 . For non-DD tuples, the number of 1s covered by P 1 ～P N/2 and P N/2+1 ～P N in each line of Ω are well balanced by the butterfly mechanism.
However, for a particular line in Ω, the number of 1s covered by P 1 ～P N/2 (denoted by x) may be one less than that covered by P N/2+1 ～P N (denoted by y). This is due to the odd number of isomorphic tuples in the corresponding line pair and the butterfly mechanism. Assume there are z isomorphic tuples in a particular row or column pair. For simplicity, we only consider these isomorphic tuples when counting x and y. If z is even, then x and y can be perfectly balanced by the butterfly mechanism, and thus x=y. On the other hand, if z is odd, then the butterfly mechanism leads to |x-y|=1. Generally, if DD tuples are also taken into account, we have either x≥y or x+1=y for each and all of the lines in Ω. In other words, among all the x+y 1s in each line of Ω, the N/2 configurations P 1 ～P N/2 always cover "more than half" and the other N/2 configurations P N/2+1 ～P N always cover "less than half" of them (quotes are used here because the case x+1=y is an exception).
Recall that ω ij =1 means r ij is an LER. Therefore, if P N/2+1 ～ P N cover some 1s in Ω, we can move the corresponding LERs from R to Q. This gives us R´ and Q´. From Lemma 2, the maximum line sum of Q´ will not exceed N. This is also true if x+1=y for some lines in Ω. Note that k in Lemma 2 equals to x+y. If x+1=y, then k is odd and the roof function in Lemma 2 can handle this case. Consequently, all the remaining LERs in R´ can be covered by P 1 ～P N/2 with a weight φ n =T/N for each configuration. Besides, P 1 ～P N/2 may also cover some SERs in R´. For the remaining SERs in R´ that are not covered by P 1 ～ P N/2 , they can be covered by P N/2+1 ～P N with a reduced weight of φ n =T/(2N). This leads to S schedule =1.75 as conjectured in (17), instead of S schedule =2 in DOUBLE. The above discussion is based on DOUBLE algorithm, and DOUBLE is a special case of ADAPT at N S =2N. In general, we have the following theorem. 
Proof: By using $T/(N_S-N)$ to divide each entry $c_{ij}\in C(T)$, we convert C(T) into a weighted sum of Q and R as in (5). The maximum line sum of Q is at most $N_S-N$ (see (7) & (8)), and each entry $r_{ij}\in R$ is not larger than $T/(N_S-N)$ (see (9)).

Proof: Compared to $S_{schedule}$ in (10), $S_{schedule}$ in (19) is reduced. We can replace $S_{schedule}$ in (2) by (19). This gives us a refined speedup function (still denoted by $S=f(N_S)$ for simplicity). The resulting $N_S$ can be converted into an integer using (21), where λ is defined in (14) and λ>1 must be true. Formula (21) also prevents $N_S=N$. This ensures that the traffic matrix decomposition can be carried out properly. ∎

Based on the above analysis, the SRF (Scheduling Residue First) algorithm is designed and summarized in Fig. 14. SRF guarantees a better algorithmic efficiency than both DOUBLE and ADAPT by scheduling the residue matrix first. Accordingly, SRF adopts a refined matrix decomposition which is slightly different from that given in Part A of Section II.

Proof: The SRF algorithm in Fig. 14 can achieve performance guaranteed switching with a packet delay bound of 3T+H slots (see Fig. 2a). From (2), (3), (19) and (21), we can formulate the overall speedup required by SRF as below.

We now compare $S_{DOUBLE}$ in (16), $S_{ADAPT}$ in (15) and $S_{SRF}$ in (22). For simplicity, we focus on an optical switch with a given switch size N and a given reconfiguration overhead δ. From (14), varying λ now corresponds to varying the traffic accumulation time T, and thus the packet delay bound 3T+H (assuming the algorithm's execution time H is constant). Accordingly, Fig. 15 shows the tradeoff between the overall speedup and the packet delay bound for the three algorithms. Since DOUBLE is a special case of ADAPT at $N_S=2N$, in Fig. 15 we can see that $S_{DOUBLE}=S_{ADAPT}=4$ at λ=2. For other λ values, we have $S_{ADAPT}<S_{DOUBLE}$. At the same time, $S_{SRF}<S_{ADAPT}$ is always true. For example, at λ=1.5, we have $S_{DOUBLE}=18$, $S_{ADAPT}=9$ and $S_{SRF}=7.49$. SRF outperforms ADAPT and DOUBLE by 16.78% and 58.39% respectively. We can see that ADAPT and SRF dramatically cut down the required speedup for delay-sensitive traffic with small λ values. As λ increases, the speedup gap between DOUBLE and ADAPT/SRF diminishes.
ADAPT and SRF can also be used to minimize the required traffic accumulation time and the packet delay bound for a given speedup requirement. Assume we can only tolerate a speedup of up to 3. Consider a 64×64 fast optical switch with a reconfiguration overhead of 100 ns [5]. Let the duration of a time slot be 50 ns (64 bytes at 10 Gbps). We have $\delta=2$ slots. Further let $T_{DOUBLE}$, $T_{ADAPT}$ and $T_{SRF}$ denote the traffic accumulation times required by DOUBLE, ADAPT and SRF under the given speedup $S=3$, and let $\lambda_{DOUBLE}$, $\lambda_{ADAPT}$ and $\lambda_{SRF}$ denote their corresponding $\lambda$ values. From (16), (15) and (22), we get $\lambda_{DOUBLE}=2.45$, $\lambda_{ADAPT}=2.37$ and $\lambda_{SRF}=2.19$. From (14), we get $T_{DOUBLE}=768$ slots (38.4 μs), $T_{ADAPT}=719$ slots (35.95 μs) and $T_{SRF}=614$ slots (30.7 μs), so that $T_{DOUBLE}>T_{ADAPT}>T_{SRF}$. Note that both the speedup and the packet delay bound take realistic values in this example (e.g. the packet delay bound for SRF is $3T_{SRF}+H=92.1$ μs $+H$, where $H$ is the algorithm's execution time). Compared to DOUBLE, the required traffic accumulation time is cut down by 6.38% in ADAPT and by 20.05% in SRF.
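The inverse problem in this example (find the smallest $\lambda$ for a given speedup) can be sketched by bisection. As before, the speedup expressions are assumptions inferred from the quoted numbers rather than the exact forms of (15), (16) and (22), the relation $T=\lambda^2\delta N$ is the assumed reading of (14), and the SRF expression uses the stationary point $N_S/N=(1+\sqrt{12\lambda^2-3})/4$ of the assumed continuous speedup function. All names are hypothetical.

```python
import math

N, DELTA, TARGET = 64, 2, 3.0   # switch size, overhead in slots, speedup budget


def s_double(lam):
    return 2 * lam**2 / (lam**2 - 2)


def s_adapt(lam):
    return (lam / (lam - 1)) ** 2


def s_srf(lam):
    # Minimizer of the assumed SRF speedup over x = N_S / N (continuous relaxation).
    x = (1 + math.sqrt(12 * lam**2 - 3)) / 4
    return lam**2 / (lam**2 - x) * (1 + 3 / (4 * (x - 1)))


def solve_lambda(speedup, target, lo, hi=100.0, iterations=200):
    """Bisection; each speedup function is strictly decreasing in lam on (lo, hi)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if speedup(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def accumulation_time(lam):
    """Assumed form of (14): T = lam**2 * delta * N, in slots."""
    return lam**2 * DELTA * N


lam_double = solve_lambda(s_double, TARGET, math.sqrt(2))
lam_adapt = solve_lambda(s_adapt, TARGET, 1.0)
lam_srf = solve_lambda(s_srf, TARGET, 1.0)

for name, lam in (("DOUBLE", lam_double), ("ADAPT", lam_adapt), ("SRF", lam_srf)):
    lam2 = round(lam, 2)
    print(name, lam2, round(accumulation_time(lam2)))
```

The slot counts quoted in the text correspond to $\lambda$ rounded to two decimals, which is why the sketch rounds before computing $T$. With unrounded $\lambda$ the same assumed formulas give slightly smaller accumulation times for ADAPT and SRF.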
From our discussion in Section I, $T>\delta N_S \ge \delta N$ (and thus $\lambda>1$) must be ensured in any feasible schedule. Since DOUBLE enforces $N_S=2N$, its traffic accumulation time $T$ must be larger than $2\delta N$ ($T>\delta N_S=2\delta N$). Therefore, DOUBLE is only
SRF ALGORITHM
Input:
Output: At most $N_S$ configurations $P_1, \ldots, P_{N_S}$ and the corresponding weights $\varphi_1, \ldots, \varphi_{N_S}$.
Step 1: Divide the entries in $C(T)$:
Calculate $N_S$ by (21). Use $T/(N_S-N)$ to divide each entry $c_{ij} \in C(T)$ and separate $C(T)$ into a quotient matrix $Q=\{q_{ij}\}$ and a residue matrix $R=\{r_{ij}\}$ as in (5).

Step 2: Schedule the residue matrix:
a) Define a reference matrix $\Omega=\{\omega_{ij}\}$ based on $R$, such that $\omega_{ij}=1$ if $r_{ij}$ is an LER and $\omega_{ij}=0$ otherwise. Define a set of predefined partial configurations $PPC_p$ ($N/2 \ge p \ge 1$) to cover every entry $\omega_{ij} \in \{\omega_{ij} \mid N/2>i, j \ge 0\}$, where the value of $p$ is calculated from $i$ and $j$ according to (18). Initialize …
b) … Otherwise, if the most recent row isomorphic tuple occurred in examining $PPC_t$ ($p>t$), invoke the butterfly mechanism to set the corresponding entries in $P_p$, such that $P_p$ and $P_t$ can cover the 1s in the row pair $\{i, N-i-1\}$ of $\Omega$ in an alternating manner. If a pair of diagonal entries is set to 1 in $P_p$, then set the other pair of diagonal entries to 1 in $P_{p+N/2}$. If the 4-tuple is a column isomorphic tuple in Figs. 11m～11n, use the same mechanism as in the row isomorphic case to set $P_p$ and $P_{p+N/2}$, but operate on the column pair $\{j, N-j-1\}$ instead of the row pair $\{i, N-i-1\}$.
c) Repeat Step 2b) until all the entries $\omega_{ij} \in \{\omega_{ij} \mid N/2>i, j \ge 0\}$ covered by $PPC_p$ are considered and the two configurations $P_p$ and $P_{p+N/2}$ are obtained.
d) … Step 2b). Otherwise continue.
e) Set $\varphi_1$～$\varphi_{N/2}$ to $T/(N_S-N)$ and $\varphi_{N/2+1}$～$\varphi_N$ to $T/[2(N_S-N)]$. If some 1s in $\Omega$ are covered by $P_{N/2+1}$～$P_N$, increase the corresponding entries in $Q$ by 1. Denote the updated $Q$ by $Q'$.

Step 3: Schedule the quotient matrix:
a) Construct a bipartite multigraph $G$ from $Q'$. Rows and columns of $Q'$ translate to two sets of vertices $A$ and $B$ in $G$, and each entry $q'_{ij} \in Q'$ translates to $q'_{ij}$ edges connecting vertices $i \in A$ and $j \in B$. Find a minimal edge-coloring of $G$ to get at most $N_S-N$ colors, such that the edges incident on the same vertex have different colors. Set $N+1 \to n$.
b) For a specific color in the edge-coloring of $G$, construct a configuration $P_n$ from the edges in that color by setting the corresponding entries to 1 in $P_n$ (all other entries in $P_n$ are set to 0). Set $\varphi_n$ to $T/(N_S-N)$ and $n+1 \to n$. Repeat Step 3b) for each color in the edge-coloring of $G$.

Fig. 15. In comparison, both ADAPT and SRF are feasible for $\lambda>1$. This is because $\lambda>1$ (or $T>\delta N$) always ensures $T>\delta N_S$ in ADAPT and SRF. In ADAPT, the number of configurations in a schedule is optimized to be
, and thus
. (23) Similarly, with λ>1 and N S in (21) for SRF, T>δN S is ensured in SRF because
On the other hand, if we consider the feasibility region in terms of speedup $S$, then DOUBLE is only feasible for $S>2$ (i.e., $S=S_{reconfigure} \times S_{schedule}=2 \times S_{reconfigure}>2$). In comparison, both ADAPT and SRF are feasible for $S>1$. Fig. 16 gives a simple example to compare the execution of DOUBLE, ADAPT and SRF. Although DOUBLE produces the smallest $S_{schedule}$ (but the largest $N_S$), it requires the largest overall speedup of $S_{DOUBLE}=14$, whereas ADAPT requires $S_{ADAPT}=8.4$ and SRF only requires $S_{SRF}=7$.
In this paper, we presented a generic approach to decompose a traffic matrix into an arbitrary number $N_S$ ($N^2-2N+2>N_S>N$) of configurations/permutations. In other networks, a reconfiguration overhead constraint may or may not exist. If such a constraint exists, the corresponding system can be modeled as a constrained switch [6] (e.g. SS/TDMA [16] [17] and TWIN [18]), and the ADAPT/SRF algorithms can be directly applied. If such a constraint does not exist, our generic matrix decomposition can still be applied. For example, computing a schedule for an electronic switch (which has negligible reconfiguration overhead) becomes difficult as the switch size increases [6, 14, 24]. This is because the $O(N^2)$ configurations required by Birkhoff-von Neumann decomposition limit the scalability of the switch [14]. In this case, our generic matrix decomposition can be applied to generate a schedule with fewer configurations, at the cost of speedup.
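The quotient-matrix part of the decomposition (Step 3 in Fig. 14) amounts to edge-coloring a bipartite multigraph. The sketch below is one textbook way to obtain such a coloring and is not claimed to be the procedure used by the authors: the matrix is padded until every line sum equals the maximum line sum $d$, and then $d$ perfect matchings are extracted one by one (König's edge-coloring theorem). The function name `decompose_quotient` and the sample matrix are hypothetical.

```python
def decompose_quotient(Q):
    """Split a non-negative integer matrix into at most d 0/1 configurations,
    where d is the maximum line sum of Q and every configuration has at most
    one 1 per row and per column."""
    n = len(Q)
    rows = [sum(r) for r in Q]
    cols = [sum(Q[i][j] for i in range(n)) for j in range(n)]
    d = max(rows + cols)
    if d == 0:
        return []

    # Pad with dummy traffic so that every row and column sums to d.
    pad = [[0] * n for _ in range(n)]
    i = j = 0
    while i < n and j < n:
        add = min(d - rows[i], d - cols[j])
        pad[i][j] += add
        rows[i] += add
        cols[j] += add
        if rows[i] == d:
            i += 1
        if cols[j] == d:
            j += 1

    real = [row[:] for row in Q]
    configs = []
    for _ in range(d):
        match_col = [-1] * n            # match_col[j] = row matched to column j

        def augment(r, seen):
            # Kuhn's augmenting-path search on the support of real + pad.
            for c in range(n):
                if real[r][c] + pad[r][c] > 0 and not seen[c]:
                    seen[c] = True
                    if match_col[c] == -1 or augment(match_col[c], seen):
                        match_col[c] = r
                        return True
            return False

        for r in range(n):
            if not augment(r, [False] * n):
                raise RuntimeError("no perfect matching; padding is inconsistent")

        P = [[0] * n for _ in range(n)]
        for c, r in enumerate(match_col):
            if real[r][c] > 0:          # consume real traffic first
                real[r][c] -= 1
                P[r][c] = 1
            else:                       # dummy edge, not part of the schedule
                pad[r][c] -= 1
        if any(any(row) for row in P):
            configs.append(P)
    return configs


Q_example = [[2, 1, 0],
             [0, 1, 2],
             [1, 1, 0]]
configs_example = decompose_quotient(Q_example)
print(len(configs_example))
```

Each returned configuration corresponds to one color. When the maximum line sum of the input is at most $N_S-N$, this yields at most $N_S-N$ configurations, as required by Step 3a.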
It should be noted that N S >N is ensured in ADAPT and SRF because λ>1. As mentioned in Section I, N S =N corresponds to minimum-delay scheduling, and it can be handled by QLEF algorithm [8] [9] . Also note that in this paper we focused on performance guaranteed switching with worst-case analysis. Average performance analysis is out of the scope of this paper, but can be handled by two existing greedy algorithms, GOPAL [22] and LIST [6, 23] .
VI. CONCLUSION
The progress of optical switching technologies has enabled the implementation of high-speed packet switches with optical fabrics. Compared with conventional electronic switches, the reconfiguration overhead issue of optical switches must be properly addressed.
In this paper, we focused on designing scheduling algorithms for optical switches that provide performance guaranteed switching (100% throughput with bounded packet delay). We first designed a generic approach to decompose a traffic matrix into the sum of N S weighted switch configurations (for N 2 -2N+2>N S >N where N is the switch size). We then took the reconfiguration overhead constraint of optical switches into account, and formulated a speedup function to capture the relationship between the speedup and the number of configurations N S in a schedule. By minimizing the speedup function, an efficient scheduling algorithm ADAPT was designed to minimize the overall speedup for a given packet delay bound. Based on the observation that some packets can be moved from the residue matrix to the quotient matrix and thus the bandwidth utilization of the configurations can be improved, another algorithm SRF (Scheduling Residue First) was designed to achieve an even lower speedup. Both ADAPT and SRF algorithms can automatically adjust the schedule according to different switch parameters, and find a proper N S value to minimize speedup. We also showed that ADAPT and SRF can be used to minimize packet delay bound under a given speedup requirement. 
Without loss of generality, we only consider row $i$ of $R$ and assume that it contains $k$ LERs. Because

The $N$ non-overlapping configurations devoted to $R$ can be constructed with the following two steps.
Step 1: We first consider the entries in the middle row and the middle column of $\Omega$, and the entries in the two matrix diagonals, as shown in Fig. 17b. An extra 4-tuple (denoted by x-tuple) is defined as $\{\omega_{4t}, \omega_{t4}, \omega_{4(8-t)}, \omega_{(8-t)4}\}$ for $3 \ge t \ge 0$. For example, the solid-circled and dash-circled entries in Fig. 17b form two x-tuples, with $t=0$ and $t=2$ respectively.
In each x-tuple, one pair of entries resides in the middle row (row 4) and the other pair resides in the middle column (column 4). For each pair, the entry that is no less than the other is defined as the dominant entry, and the other entry is non-dominant. For example, we can take entries (4, 8) and (8, 4) as the two dominant entries in x-tuple $\{\omega_{40}, \omega_{04}, \omega_{48}, \omega_{84}\}$, whereas entries (4, 0) and (0, 4) are non-dominant. Then, based on the two dominant entries (4, 8) and (8, 4), we draw two solid lines as in Fig. 17c and find the entry at the cross-point of the two lines, which is defined as the cross-point entry. The diagonal image of the cross-point entry, shown by the solid triangle in Fig. 17c, is defined as the partner of the two dominant entries (4, 8) and (8, 4). Similarly, for dominant entries (4, 2) and (2, 4) in x-tuple $\{\omega_{42}, \omega_{24}, \omega_{46}, \omega_{64}\}$, entry (2, 2) is the cross-point entry and (6, 6) is the partner of the two dominant entries, as shown by the dashed part in Fig. 17c.
For each possible x-tuple $\{\omega_{4t}, \omega_{t4}, \omega_{4(8-t)}, \omega_{(8-t)4}\}$ ($3 \ge t \ge 0$), the two dominant entries and their partner are set to 1 in configuration $P_{t+1}$. At the same time, the two non-dominant entries and the cross-point entry are set to 1 in $P_{t+\lceil N/2 \rceil}=P_{t+5}$. Fig. 17d gives the result of this operation, where we use a set for each configuration to record the three entries that are set to 1. After removing from Fig. 17b all the entries recorded in these sets, the remaining 9 entries are set to 1 in $P_0$, as shown in Fig. 17d.
∎
At the end of Step 1, all the entries in Fig. 17b have been covered by the partial configurations $P_0$～$P_8$ in Fig. 17d. For each matrix line (row or column), let $a$ be the number of 1s covered by $P_0$～$P_4$, and $b$ be the number of 1s covered by $P_5$～$P_8$. In Fig. 17e, we use circles and triangles to indicate the entries covered by $P_0$～$P_4$ and $P_5$～$P_8$, respectively. The number next to each matrix line in Fig. 17e gives the value of $a-b$ for this line. In the example, $a-b \ge 0$ for every line. In general, it is easy to prove that $a-b \ge -1$ is always true. This means that the partial configurations $P_0$～$P_4$ in Fig. 17d usually cover more 1s than $P_5$～$P_8$ for each line in Figs. 17b & 17e. If this is not the case, then $P_5$～$P_8$ can cover at most one more 1 for some lines. It is also not difficult to prove that, for a particular line pair (i.e., row pair $\{i, N-i-1\}$ or column pair $\{j, N-j-1\}$), at most one line (but not both) can have $a-b=-1$.
Step 2: In Fig. 17f, we shade all the entries covered so far using a filled rectangle. Then, we pick a partial configuration $P_{t+1}$ ($3 \ge t \ge 0$) from Fig. 17d and use dashed lines to shadow the rows and columns of its three entries. Assume $P_1$ is chosen (i.e., $t=0$). Fig. 17f shows the result after the shadowing operation. We can use maximum-size matching [25] to determine a PPC in zone A, such that it contains the maximum number of not-yet-shadowed and not-yet-covered entries, each in a distinct row and column in zone A. As shown in Fig. 17f, the PPC found contains three circled entries. For each circled entry, we find its line and diagonal images to form a 4-tuple $\{\omega_{ij}, \omega_{i(N-j-1)}, \omega_{(N-i-1)j}, \omega_{(N-i-1)(N-j-1)}\}$. If it is a DD-tuple, the two dominant entries are set to 1 in $P_{t+1}$, and the two non-dominant entries are set to 1 in $P_{t+5}$. Otherwise, we use the butterfly mechanism to set the entries in $P_{t+1}$ and $P_{t+5}$. If an entry is set to 1 in $P_{t+1}$ or $P_{t+5}$, we add it to the corresponding partial configuration in Fig. 17g (see the underlined entries), and shade this entry by a filled rectangle as we have done in Fig. 17f (because it has been covered).
It is important to note that we need to slightly modify the butterfly mechanism for odd switch size N. Recall that for even N, if an isomorphic tuple is the first one in a particular line pair, we can set the corresponding configurations according to either pair of its diagonal entries. However, for odd N, we have an initial state as shown in Fig. 17e , where a-b≠0 for some matrix lines. It means that the number of 1s covered by the partial configurations P 0 ～P 4 and P 5 ～P 8 in Fig. 17d may not be perfectly balanced for every line at the beginning of Step 2. If a-b=-1 for a particular line, we can treat it as if another preceding isomorphic tuple already exists in the corresponding line pair. Therefore, for the first isomorphic tuple in this line pair, we should take this initial state into account, and set the entries in the corresponding configurations to ensure a-b≥-1 for both lines. After that, we can use the same butterfly mechanism as in Fig. 12 for all subsequent isomorphic tuples in this line pair.
At the end of Step 2, we remove all the dashed lines in Fig. 17f. Then, Step 2 is repeated for another $P_{t+1}$ ($3 \ge t \ge 0$) in Fig. 17d, until all the configurations $P_0$～$P_8$ are obtained, as shown in Fig. 17g. ∎

The entries covered by $P_0$～$P_4$ (in Fig. 17g) are circled in Fig. 17h. The number next to each matrix line in Fig. 17h gives the difference in the number of 1s covered by $P_0$～$P_4$ and $P_5$～$P_8$ for that line. Note that $P_0$～$P_4$ cover more 1s than $P_5$～$P_8$ for each line. Therefore, for those 1s covered by $P_5$～$P_8$, we can move the corresponding LERs from $R$ to $Q$. Then, $P_5$～$P_8$ can be weighted by a reduced weight of $T/[2(N_S-N)]$. In this 9×9 example, the set of configurations $P_0$～$P_4$ contains one more configuration than $P_5$～$P_8$. For large switch size $N$, this difference is trivial.
It is not difficult to extend the above approach to other odd switch sizes $N$. Note that maximum-size matching [25] in Step 2 is only used to find a predefined PPC pattern. It is not really required for online execution.

Fig. 17d. Partial configurations after Step 1:
P1 = {(4, 8), (8, 4), (0, 0)}
P2 = {(4, 7), (7, 4), (1, 1)}
P3 = {(4, 2), (2, 4), (6, 6)}
P4 = {(3, 4), (4, 5), (5, 3)}
P0 = {(0, 8), (1, 7), (2, 6), (3, 3), (4, 4), (5, 5), (6, 2), (7, 1), (8, 0)}
P5 = {(4, 0), (0, 4), (8, 8)}
P6 = {(4, 1), (1, 4), (7, 7)}
P7 = {(4, 6), (6, 4), (2, 2)}
P8 = {(4, 3), (5, 4), (3, 5)}

Fig. 17g. Configurations after Step 2:
P1 = {(4, 8), (8, 4), (0, 0), (1, 2), (7, 6), (2, 5), (6, 3), (3, 1), (5, 7)}
P2 = {(4, 7), (7, 4), (1, 1), (0, 5), (8, 3), (2, 0), (6, 8), (3, 2), (5, 6)}
P3 = {(4, 2), (2, 4), (6, 6), (8, 1), (0, 7), (1, 5), (7, 3), (3, 8), (5, 0)}
P4 = {(3, 4), (4, 5), (5, 3), (0, 2), (8, 6), (1, 8), (7, 0), (2, 7), (6, 1)}
P0 = {(0, 8), (1, 7), (2, 6), (3, 3), (4, 4), (5, 5), (6, 2), (7, 1), (8, 0)}
P5 = {(4, 0), (0, 4), (8, 8), (7, 2), (1, 6), (2, 3), (6, 5), (3, 7), (5, 1)}
P6 = {(4, 1), (1, 4), (7, 7), (0, 3), (8, 5), (2, 8), (6, 0), (3, 6), (5, 2)}
P7 = {(4, 6), (6, 4), (2, 2), (8, 7), (0, 1), (1, 3), (7, 5), (3, 0), (5, 8)}
P8 = {(4, 3), (5, 4), (3, 5), (0, 6), (8, 2), (1, 0), (7, 8), …
