Abstract-We consider traffic scheduling in non-blocking electronic-buffered optical packet switches (OPS) with bounded packet delay. Due to the reconfiguration overhead of the switch fabric, the two commonly used optimization objectives, minimizing packet delay and minimizing switch speedup, conflict with each other. Intelligent scheduling algorithms have been designed to provide a tradeoff between these two objectives. In this paper, we propose a more efficient approach to scheduling OPS traffic, resulting in significantly reduced speedup and/or packet delay. However, our approach is based on a very interesting conjecture, which has not been rigorously proved so far. We would like to put forward this conjecture as an open question, and call for a proof or disproof.
I. INTRODUCTION
RECENT progress on optical switching technologies [1] [2] [3] [4] has enabled the implementation of electronic-buffered optical packet switches (OPS) as shown in Fig. 1. The core of this architecture is the optical switch fabric, which can efficiently provide the huge switching capacity demanded by the backbone routers in the Internet. Since optical connections (i.e. optical fibers) are used to interconnect the input/output line-cards with the central switch fabric, the input/output line-cards can be distributed over several racks, which may be located hundreds of meters away from each other. As a result, power consumption in each rack can be reduced, and the switch becomes more scalable.
On the other hand, the optical switch fabric usually needs some guard time to change its inter-connection pattern from one to another, and to synchronize the signals arriving at the input ports [5]. This guard time is called the reconfiguration overhead, denoted by $\delta$. During this period, no packet can be transmitted across the switch fabric. Accordingly, the packet transmission rate in the switch fabric must be faster than the external line-rate (i.e. a speedup is required in the switch fabric) in order to achieve performance guaranteed switching (i.e. 100% throughput with bounded packet delay) [6] [7] [8] [9] [10] [11] [12]. It is shown in [6, 8] that minimizing speedup and minimizing delay are two conflicting goals, where a higher speedup gives a shorter packet delay and vice versa.
Based on switch architectures similar to Fig. 1, several algorithms have been recently proposed [6] [7] [8] [9] [10] [11] [12] to schedule OPS traffic with guaranteed switching performance. Among them, MIN [6], i-SCALE [9] and QLEF [10] aim primarily at minimizing the packet delay, whereas reducing speedup is a secondary objective. DOUBLE [6] is the first algorithm that allows a tradeoff between speedup and delay. Let N denote the switch size and $N_S$ be the (maximum) number of switch configurations required for scheduling. DOUBLE needs no more than $N_S=2N$ configurations to schedule any legitimate traffic matrix, with a speedup of $S_{schedule}=2$ (detailed in Section II). However, this algorithm does not consider the amount of reconfiguration overhead $\delta$ in its scheduling decision, and thus it is not optimized for switches with different $\delta$. Besides, $N_S=2N$ only represents a single point in the solution space [6, 8], and the characteristics for other $N_S$ values are not studied in [6]. To address those issues, ADAPTIVE [8] is proposed. It is shown in [8] that DOUBLE can be regarded as a special case of ADAPTIVE at $N_S=2N$.
In this paper, we explore the possibility of beating the performance of DOUBLE and ADAPTIVE. We show that this can be achieved based on a very interesting conjecture. We put forward this conjecture as an open question, and hope that a proof or disproof can be found soon.
The rest of the paper is organized as follows. In Section II, we review the generic scheduling procedure [6] [7] [8] [9] [10] [11] [12]. In Section III, we propose an approach to improve the scheduling efficiency. Section IV gives some further discussion, and we conclude the paper in Section V.
II. TRAFFIC SCHEDULING AND SPEEDUP-DELAY TRADEOFF

A. Scheduling Procedure
The generic four-stage scheduling procedure shown in Fig. 2 is followed. In Stage 1, incoming packets are periodically accumulated in the input buffers over T time slots to construct an N×N traffic matrix $C(T)=\{c_{ij}\}$. Each entry $c_{ij}$ denotes the number of packets received at input i and destined to output j. C(T) is legitimate if each of its line sums (i.e. row sum or column sum) is no larger than T. Throughout the paper, we only consider legitimate C(T). The scheduling algorithm takes H time slots in Stage 2 to generate $N_S$ configurations $P_n=\{p^{(n)}_{ij}\}$, $n\in\{1,\dots,N_S\}$, each weighted by $\phi_n$, to cover C(T). "Cover" means that
$$\sum_{n=1}^{N_S}\phi_n\,p^{(n)}_{ij}\ge c_{ij}$$
for any $i,j\in\{1,\dots,N\}$. $P_n$ is an N×N permutation matrix with at most a single "1" in each line (row or column). $p^{(n)}_{ij}=1$ indicates that a packet can be sent from input i to output j in one slot; $p^{(n)}_{ij}=0$ otherwise. In Stage 3, the switch fabric is reconfigured according to the $N_S$ configurations obtained in Stage 2. An internal switch fabric speedup S is applied, resulting in compressed/shortened time slots, to ensure that this stage occupies only T (regular) slots. The fabric holds each configuration $P_n$ for $\phi_n$ compressed slots for packet transmission. Finally, in Stage 4, packets are sent onto the output lines from the output buffers (in T slots).
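The legitimacy and "cover" conditions above can be checked mechanically. The following is a minimal Python sketch; the function names `is_legitimate` and `covers` and the toy 2×2 matrix are our own illustrative choices, not part of any cited algorithm.

```python
def is_legitimate(C, T):
    """True if every row sum and column sum of C is at most T."""
    rows_ok = all(sum(row) <= T for row in C)
    cols_ok = all(sum(col) <= T for col in zip(*C))
    return rows_ok and cols_ok


def covers(configs, weights, C):
    """True if sum_n weights[n] * configs[n][i][j] >= C[i][j] for all i, j."""
    N = len(C)
    return all(
        sum(w * P[i][j] for P, w in zip(configs, weights)) >= C[i][j]
        for i in range(N)
        for j in range(N)
    )


# Toy 2x2 traffic matrix (values chosen for illustration only).
C = [[2, 1],
     [1, 2]]
P1 = [[1, 0], [0, 1]]   # input 0 -> output 0, input 1 -> output 1
P2 = [[0, 1], [1, 0]]   # input 0 -> output 1, input 1 -> output 0

print(is_legitimate(C, T=3))          # line sums are all 3
print(covers([P1, P2], [2, 1], C))    # weights 2 and 1 cover C
print(covers([P1, P2], [1, 1], C))    # weights 1 and 1 leave the diagonal short
```

Here the two configurations with weights 2 and 1 occupy exactly T = 3 slots, so this particular schedule contains no empty slots.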
From the tagged packet in Fig. 2, we can see that the delay of any packet is bounded by 2T+H slots. Because $\delta N_S$ slots are used to reconfigure the switch $N_S$ times in Stage 3 (where $\delta$ is the reconfiguration overhead), only $T-\delta N_S$ slots are left for transmitting C(T). Since there are at most T packets waiting at each input port for transmission, a speedup factor $S_{reconfigure}=T/(T-\delta N_S)$ is necessary to compensate for the idle time caused by reconfigurations. At the same time, the scheduling algorithm may produce some empty slots (i.e. underutilize the bandwidth provided by the configurations [6] [7] [8] [9] [10] [11] [12]). As a result, more than T compressed slots are usually needed in Stage 3 to transmit C(T). Therefore another speedup factor
$$S_{schedule}=\frac{1}{T}\sum_{n=1}^{N_S}\phi_n \tag{1}$$
is required to compensate for the inefficient scheduling, where $\phi_n$ is the weight of configuration $P_n$. In fact, $S_{schedule}$ denotes the efficiency of the scheduling algorithm adopted. A smaller $S_{schedule}$ indicates a more efficient scheduler (with fewer empty slots in the schedule). The overall internal speedup S is then given by
$$S=S_{reconfigure}\times S_{schedule}=\frac{1}{T-\delta N_S}\sum_{n=1}^{N_S}\phi_n. \tag{2}$$
B. Speedup-Delay Tradeoff and Scheduling Algorithms
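For a given C(T), we divide it by $T/(N_S-N)$ to get a quotient matrix $Q=\{q_{ij}\}$ and a residue matrix $R=\{r_{ij}\}$:
$$C(T)=\frac{T}{N_S-N}\times Q+R,\qquad q_{ij}=\left\lfloor\frac{c_{ij}}{T/(N_S-N)}\right\rfloor. \tag{3}$$
Since each line of C(T) sums to at most T, the maximum line sum of Q is at most $N_S-N$. So we can apply edge-coloring [13] to the bipartite multigraph of Q, and get $N_S-N$ configurations to cover Q [6] [7] [8]. Each entry of R is smaller than $T/(N_S-N)$, so R can be covered by N non-overlapping configurations. With all $N_S$ configurations equally weighted by $\phi_n=T/(N_S-N)$, the schedule requires
$$S_{schedule}=\frac{1}{T}\cdot N_S\cdot\frac{T}{N_S-N}=\frac{N_S}{N_S-N}. \tag{4}$$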
The above formula (4) is referred to as the speedup function in [8]. In essence, it depicts the tradeoff between the speedup ($S_{schedule}$, due to inefficient scheduling) and the delay (in terms of $N_S$). Recall that the delay of any packet is bounded by 2T+H slots and that $T>\delta N_S$, where $\delta$ is the reconfiguration overhead. Therefore the minimum achievable delay bound is $2\delta N_S+H$.
DOUBLE [6] requires $N_S=2N$ configurations to cover C(T). This is obtained by replacing $N_S$ in (3) by 2N to get $C(T)=(T/N)\times Q+R$. N configurations are required to cover each of Q and R, and every configuration is equally weighted by $\phi_n=T/N$. From (4), DOUBLE achieves $S_{schedule}=2$. Fig. 3 gives an example of DOUBLE execution.
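To make the DOUBLE decomposition concrete, the sketch below splits a small traffic matrix into Q and R and checks the line-sum bound on Q. It is an illustration only: the helper names and the 4×4 matrix are hypothetical, T is assumed to be a multiple of N so that the weight T/N is an integer, and the edge-coloring step that turns Q into configurations is omitted.

```python
def double_decompose(C, T):
    """Split C(T) into quotient Q and residue R using the unit weight T/N.

    Assumes N divides T (an assumption of this sketch).
    """
    N = len(C)
    assert T % N == 0, "this sketch assumes N divides T"
    w = T // N
    Q = [[c // w for c in row] for row in C]
    R = [[c % w for c in row] for row in C]
    return Q, R, w


def max_line_sum(M):
    """Largest row sum or column sum of M."""
    return max(max(sum(r) for r in M), max(sum(c) for c in zip(*M)))


# Hypothetical legitimate traffic matrix: every line sums to T = 16.
C = [[5, 3, 6, 2],
     [4, 7, 1, 4],
     [3, 2, 5, 6],
     [4, 4, 4, 4]]
T = 16
N = len(C)

Q, R, w = double_decompose(C, T)
# N configurations cover Q and N cover R, each held for w = T/N slots.
s_schedule = (2 * N * w) / T

print(Q)                 # quotient matrix
print(R)                 # residue matrix
print(max_line_sum(Q))   # never exceeds N
print(s_schedule)        # 2.0
```

Every entry of R is smaller than T/N, which is why N equally weighted non-overlapping configurations are enough to cover it.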
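Unlike DOUBLE, ADAPTIVE [8] substitutes (4) into (2), and minimizes the overall speedup S by solving
$$\frac{\partial S}{\partial N_S}=0,\qquad S=\frac{T}{T-\delta N_S}\cdot\frac{N_S}{N_S-N}. \tag{5}$$
Therefore the schedule generated by ADAPTIVE is optimized with respect to the value of the reconfiguration overhead $\delta$.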
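The effect of choosing $N_S$ according to the reconfiguration overhead can be illustrated numerically. The sketch below is not the ADAPTIVE algorithm of [8]; it simply searches all integer values of N_S, under two assumptions: the reconfiguration overhead is written `delta`, and the speedup function has the form N_S/(N_S − N), which agrees with S_schedule = 2 at N_S = 2N. The parameter values are hypothetical.

```python
from fractions import Fraction
from math import sqrt


def overall_speedup(T, N, delta, n_s):
    """S = S_reconfigure * S_schedule when n_s configurations are used."""
    s_reconfigure = Fraction(T) / (T - Fraction(delta) * n_s)
    s_schedule = Fraction(n_s, n_s - N)   # assumed form of the speedup function
    return s_reconfigure * s_schedule


def best_n_s(T, N, delta):
    """Exhaustive search over integers n_s with n_s > N and delta * n_s < T."""
    candidates = []
    n_s = N + 1
    while Fraction(delta) * n_s < T:
        candidates.append(n_s)
        n_s += 1
    return min(candidates, key=lambda n: overall_speedup(T, N, delta, n))


T, N, delta = 1600, 16, 1   # hypothetical parameters
n_opt = best_n_s(T, N, delta)

print(n_opt, sqrt(T * N / delta))                    # integer optimum vs. stationary point
print(float(overall_speedup(T, N, delta, n_opt)))    # speedup at the optimum
print(float(overall_speedup(T, N, delta, 2 * N)))    # speedup at N_S = 2N (DOUBLE)
```

Setting the derivative of S with respect to N_S to zero under the same assumptions gives N_S = sqrt(TN/δ), which matches the exhaustive search for these parameters.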
III. IMPROVING SCHEDULING EFFICIENCY
In this section, we aim at achieving a better scheduling efficiency than DOUBLE and ADAPTIVE. Since DOUBLE is a special case of ADAPTIVE at $N_S=2N$, for simplicity, we only focus on DOUBLE below. We further assume that the switch size N is an even number.
A. Observation and Motivation
In DOUBLE, the traffic matrix C(T) is decomposed as $C(T)=(T/N)\times Q+R$. For any $r_{ij}\in R$, if $r_{ij}>T/(2N)$, we call it an LER (large entry in R). Otherwise it is an SER (small entry in R). We have the following Lemma 1 (proved in Appendix A).
Lemma 1: In DOUBLE, if a line (row or column) of R contains k LERs, then the corresponding line sum of Q is at most $N-\lceil k/2\rceil$.
Based on Lemma 1, we can move some packets from R to Q, while keeping the maximum line sum of Q no more than N. Note that each configuration in DOUBLE is equally weighted by $\phi_n=T/N$. Without loss of generality, if row i of R contains k LERs, we can move at most $\lceil k/2\rceil$ of these LERs to Q by setting them to 0 in R and, at the same time, increasing the corresponding entries in Q by one. Fig. 4 shows an example based on the Q and R in Fig. 3. We use $Q'$ and $R'$ to denote the modified Q and R. We can see that $Q'$ can still be covered by N configurations with a weight $\phi_n=T/N$ each. So we can pack more packets than DOUBLE into the N configurations used to cover the quotient matrix. On the other hand, since some LERs are moved to Q, it may not be necessary to use another N equally weighted configurations (with $\phi_n=T/N$) to cover $R'$.
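The LER/SER classification and the per-line moving quota can be sketched as follows. The Q and R used here are a hypothetical 4×4 example with T = 16 (so T/N = 4 and T/(2N) = 2), not the matrices of Fig. 3, and the helper names are ours. The quota of ceil(k/2) LERs per line is our reading of Lemma 1.

```python
def ler_indicator(R, T):
    """1 where r_ij > T/(2N) (an LER), 0 otherwise (an SER)."""
    N = len(R)
    return [[1 if 2 * N * r > T else 0 for r in row] for row in R]


def line_sums(M):
    """Row sums followed by column sums."""
    return [sum(r) for r in M] + [sum(c) for c in zip(*M)]


# Hypothetical DOUBLE decomposition with T = 16 and N = 4: C = 4*Q + R.
T = 16
Q = [[1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 1], [1, 1, 1, 1]]
R = [[1, 3, 2, 2], [0, 3, 1, 0], [3, 2, 1, 2], [0, 0, 0, 0]]
N = len(Q)

L = ler_indicator(R, T)
lers_per_line = line_sums(L)                    # k for each row, then each column
quota = [(k + 1) // 2 for k in lers_per_line]   # ceil(k/2) LERs may leave each line
slack = [N - s for s in line_sums(Q)]           # room left in the same line of Q

print(lers_per_line)
print(quota)
print(slack)
```

In this example every quota is no larger than the corresponding slack, so moving up to the quota never pushes a line sum of Q above N.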
The above observation motivates us to explore a more efficient scheduling algorithm than DOUBLE. In fact, there may be at most N LERs in each line of the original R. From Lemma 1, "half" of these LERs in each line of R can be moved to Q, while keeping the maximum line sum of $Q'$ no more than N. Therefore, it is reasonable to expect that each line of $R'$ contains at most N/2 LERs after packet moving. As a result, we may be able to find N/2 non-overlapping configurations, each weighted by $\phi_n=T/N$, to cover all the remaining LERs in $R'$. At the same time, we may find another set of N/2 non-overlapping configurations, each with a reduced weight of $\phi_n=T/(2N)$, to cover the remaining SERs. If this can be done, then $S_{schedule}$ of DOUBLE can be reduced to
$$S_{schedule}=\frac{1}{T}\left(N\cdot\frac{T}{N}+\frac{N}{2}\cdot\frac{T}{N}+\frac{N}{2}\cdot\frac{T}{2N}\right)=1.75. \tag{6}$$
B. Issues
To achieve the goal mentioned above, we need to solve the following issues.
• Determine the set of LERs to be moved to Q, such that each line sum of $Q'$ does not exceed N, and $R'$ contains at most N/2 LERs in each line.
• Among the N non-overlapping configurations used to cover $R'$, N/2 of them should cover all the remaining LERs in $R'$, and the other N/2 configurations should cover all the SERs not yet covered by the first N/2 configurations.
Generally, it is not easy to determine the set of LERs that should be moved to Q. This can be seen from the example in Fig. 5. Assume all the non-zero entries (3s) in R are LERs. The number next to each line is the number of LERs that can be moved from this line to Q, which is obtained as $\lceil k/2\rceil$ based on Lemma 1. If the four circled entries are moved, then we cannot move any further LERs without violating the quota of the corresponding line. At this point, the last row still contains more than N/2 LERs. For a larger switch size N, it will be more difficult to figure out a proper set of LERs to move.
C. Methodology
We first define two important notions, PCs (Predefined Configurations) and DHH matrix. For an N×N matrix, we can use N predefined non-overlapping configurations (or PCs) to cover all of its entries. As an example, eight non-overlapping PCs (i.e. $PC_1$ to $PC_8$), as defined in Fig. 6, can be used to cover all the entries of an 8×8 matrix. Note that the number at each entry of this matrix denotes the particular PC that covers this entry. For easy reading, the entries covered by $PC_3$ are circled in Fig. 6.
[Figure labels: Step 1: Calculate Q; Step 2: Color Q; Step 3: Schedule Q; Step 4: Schedule R.]
Given an arbitrary 0/1 matrix, we can use two lines to partition it into four (N/2)×(N/2) zones/sub-matrices A, B, C and D as shown in Fig. 9 (and Fig. 10) in Appendix B. For each row/column, if the number of 1s in zone A or C is no less than that in zone B or D, or is less by at most one, then this 0/1 matrix is called a DHH matrix. In other words, the two diagonal zones (A and C) of a DHH matrix contain at least "half" of the 1s in each row and column. (Please refer to Appendix B for a more rigorous definition.)
Our approach to improving the scheduling efficiency (i.e. minimizing $S_{schedule}$) is based on the concepts of PC and DHH matrix. We first convert the residue matrix $R=\{r_{ij}\}$ into a 0/1 indicator matrix $\Lambda=\{\lambda_{ij}\}$ such that $\lambda_{ij}=1$ if $r_{ij}$ is an LER and $\lambda_{ij}=0$ otherwise. The DHH conjecture given in Appendix B says that we can always find two permutation matrices U and V, such that $\Lambda'=U\Lambda V$ is a DHH matrix. For the 8×8 matrix, assume that a DHH matrix $\Lambda'$ is obtained by $\Lambda'=U\Lambda V$. Then, $PC_5$ to $PC_8$ defined in Fig. 6 (which span the sub-matrices A and C as illustrated in Fig. 9) can cover "more than half" of the 1s in each row and column of $\Lambda'$. The remaining 1s, covered by $PC_1$ to $PC_4$, "correspond" to LERs that should be moved to Q. Since $\Lambda'$ is obtained from $\Lambda$ after some row/column permutations (i.e. $\Lambda'=U\Lambda V$), the 1s in $\Lambda'$ do not directly match the original LER entries. Therefore, we need to invoke an inverse transform to get our desired configurations.
Fig. 7 gives an example, where the execution steps are indexed by the numbers in the dashed circles. In Step 1, we construct the indicator matrix $\Lambda$ from R. In Step 2, we find two permutation matrices U and V such that $\Lambda'=U\Lambda V$ is a DHH matrix. In Step 3, the inverse transform $\Pi_n=U^{-1}\,PC_n\,V^{-1}$ maps $PC_1$ to $PC_8$ to the configurations $\Pi_1$ to $\Pi_8$. For $\Pi_1$ to $\Pi_4$, if they cover some LERs in the original R, then these LERs (circled entries in Step 4) are to be moved to Q. Based on Lemma 1, we can set these LERs to 0 in R, and increase the corresponding entries in Q by one. $Q'$ and $R'$ are thus obtained in Step 4. In Step 5, to cover $Q'$ we simply use the same edge-coloring algorithm [13] as in DOUBLE to determine configurations $P_1$ to $P_8$, each weighted by T/N=4. On the other hand, $R'$ can be covered by $\Pi_1$ to $\Pi_8$, with a weight of 4 for $\Pi_5$ to $\Pi_8$, and a reduced weight of T/(2N)=2 for $\Pi_1$ to $\Pi_4$. As a result, $S_{schedule}$ can be reduced to 1.75, as shown in Step 6. In fact, this 12.5% reduction in $S_{schedule}$ is independent of the switch size N, as stipulated by (6). Note that there is a one-to-one mapping from entry to entry between $\Lambda$ and $\Lambda'$, which is determined by the linear transform $\Lambda'=U\Lambda V$. Therefore, the resulting $\Pi_1$ to $\Pi_8$ are non-overlapping, and they can cover every entry of $R'$.
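The packet-moving step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of Fig. 7: instead of the PCs of Fig. 6 (not reproduced here) it works directly with the four zones, assuming A is top-left, B top-right, C bottom-right and D bottom-left; the permutations are given as index lists rather than matrices U and V; and the 4×4 matrices with T = 16 are hypothetical. For this particular example the indicator matrix is already a DHH matrix, so identity permutations are used.

```python
from fractions import Fraction


def move_lers(Q, R, T, row_perm, col_perm):
    """Move to Q the LERs that fall in zones B and D of the permuted indicator matrix.

    Position (a, b) of the permuted matrix holds entry
    (row_perm[a], col_perm[b]) of the original matrix.
    """
    N = len(R)
    h = N // 2
    Q2 = [row[:] for row in Q]
    R2 = [row[:] for row in R]
    for a in range(N):
        for b in range(N):
            i, j = row_perm[a], col_perm[b]
            off_diagonal = (a < h) != (b < h)          # zone B or zone D
            if off_diagonal and 2 * N * R[i][j] > T:   # entry is an LER
                Q2[i][j] += 1
                R2[i][j] = 0
    return Q2, R2


def improved_s_schedule(N, T):
    """N configurations at T/N for Q', N/2 at T/N and N/2 at T/(2N) for R'."""
    w = Fraction(T, N)
    return (N * w + (N // 2) * w + (N // 2) * w / 2) / T


T = 16
Q = [[1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 1], [1, 1, 1, 1]]
R = [[1, 3, 2, 2], [0, 3, 1, 0], [3, 2, 1, 2], [0, 0, 0, 0]]
identity = [0, 1, 2, 3]

Q2, R2 = move_lers(Q, R, T, identity, identity)
print(Q2)                           # the LER at row 2, column 0 is now counted in Q'
print(R2)                           # zones B and D of R' contain only SERs
print(improved_s_schedule(4, 16))   # 7/4
```

After the move, every line sum of Q' is still at most N, and the zones B and D of R' contain only entries no larger than T/(2N), which is what allows the reduced weight on half of the configurations.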
IV. DISCUSSION
For $N_S<2N$, we can achieve a greater gain (than 12.5%) over the original ADAPTIVE algorithm. Since speedup and packet delay can be traded for one another, this also means that the packet delay can be smaller than that given in [8] if the same speedup is applied to the OPS.
From Fig. 11 in Appendix B, we know that U and V are used to record the row/column permutations involved, and they can be constructed from a square unit matrix E (i.e. an N×N matrix with 1s at its N diagonal entries and 0s elsewhere). Therefore, it is not necessary to get $U^{-1}$ and $V^{-1}$ by algebraic calculations. Instead, we can start from a square unit matrix E, and permute its lines in the reverse order to generate $U^{-1}$ and $V^{-1}$.
Finally, it is important to note that our proposed approach is based on the DHH conjecture. We have tried to prove it for a very long time, without success. Nor have we found a single counterexample after checking extensive samples by computer. Many mathematicians, including the authors of [14], have reviewed our conjecture. As of now, the problem is still open.
V. CONCLUSION
Due to the reconfiguration overhead, speedup and packet delay are two main issues for traffic scheduling in optical packet switches (OPS). In this paper, we proposed a new approach to improve the scheduling efficiency in OPS with guaranteed switching performance. Compared with the existing scheduling algorithms, our approach can significantly reduce speedup and/or packet delay. However, the proposed approach is based on the DHH conjecture given in this paper. We call for a proof or disproof for this conjecture. 
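APPENDIX A PROOF OF LEMMA 1
Without loss of generality, we assume that row i of R contains k LERs. Because
$$\sum_{j=1}^{N}c_{ij}=\frac{T}{N}\sum_{j=1}^{N}q_{ij}+\sum_{j=1}^{N}r_{ij}\le T\quad\text{and}\quad\sum_{j=1}^{N}r_{ij}>k\cdot\frac{T}{2N},$$
we have
$$\sum_{j=1}^{N}q_{ij}<N-\frac{k}{2}.$$
Since $\sum_{j=1}^{N}q_{ij}$ is an integer, we then have
$$\sum_{j=1}^{N}q_{ij}\le N-\left\lceil\frac{k}{2}\right\rceil.$$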
APPENDIX B DHH CONJECTURE
In this appendix, we assume that the size N of the matrices/vectors is an even number.
Definition 1 (halve): Given an arbitrary 0/1 vector (row or column whose entries are either 0 or 1), use a line to separate it into two equal parts as shown in Fig. 8. Let x and y denote the number of 1s in each part. If $|x-y|\le 1$, we say that the 1s in the vector are halved by the line. Fig. 8 gives some examples, where the 1s are halved in (a) and (c), but not in (b) and (d).
Definition 2 (DHH matrix): Given an arbitrary N×N 0/1 matrix, use two lines to partition it into four (N/2)×(N/2) zones/sub-matrices A, B, C and D as in Fig. 9. For each row/column of the matrix, if the number of 1s in zone A or C is more than that in zone B or D, or the 1s in this row/column are halved by one of the two lines, then this matrix is called a DHH matrix (it means that the diagonal half-size sub-matrices A and C contain at least "half" of the 1s in each row and column).
The matrix in Fig. 10a is a DHH matrix, whereas the matrix in Fig. 10b is not, because its second column has two more 1s in zone D than in zone A.
DHH conjecture: Given an arbitrary N×N 0/1 matrix $\Lambda$, we can permute its rows or columns a finite number of times, such that $\Lambda$ is turned into a DHH matrix. In other words, there exist two permutation matrices U and V, such that $U\Lambda V$ is a DHH matrix.
For example, if we swap the first row and the last row in Fig. 10b, then the resulting matrix is a DHH matrix. Fig. 11 gives a more complex example for the indicator matrix $\Lambda$ in Fig. 7. U and V are used to record the row/column permutations. They can be constructed as follows: Initialize U and V as square unit matrices E. If we permute two rows in $\Lambda$, then we also permute the corresponding rows in U; if we permute two columns in $\Lambda$, then we permute the corresponding columns in V too. After $\Lambda$ is turned into a DHH matrix, the corresponding U and V are also obtained.
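For small N, the conjecture can be explored by exhaustive search. The sketch below checks Definition 2 and tries all row and column permutations. It assumes zone A is the top-left quadrant, B the top-right, C the bottom-right and D the bottom-left, and it represents the permutations as index lists playing the role of U and V. The function names and the test matrix are hypothetical. The search visits up to (N!)² candidates, so it is practical only for very small matrices and proves nothing in general.

```python
from itertools import permutations


def is_dhh(M):
    """Check Definition 2: in every row and column, the off-diagonal zone
    (B or D) holds at most one more 1 than the diagonal zone (A or C)."""
    N = len(M)
    h = N // 2
    for line in range(N):
        lo, hi = (0, h) if line < h else (h, N)   # extent of the diagonal zone
        row = M[line]
        col = [M[r][line] for r in range(N)]
        for vec in (row, col):
            diag = sum(vec[lo:hi])
            off = sum(vec) - diag
            if off - diag > 1:
                return False
    return True


def find_dhh_permutation(M):
    """Return (row_perm, col_perm) turning M into a DHH matrix, or None."""
    N = len(M)
    for rp in permutations(range(N)):
        rows = [M[i] for i in rp]
        for cp in permutations(range(N)):
            candidate = [[row[j] for j in cp] for row in rows]
            if is_dhh(candidate):
                return list(rp), list(cp)
    return None   # a None result would be a counterexample to the conjecture


# All 1s sit in zone B, so this matrix is not a DHH matrix as given.
M = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

print(is_dhh(M))
result = find_dhh_permutation(M)
print(result)
```

For this matrix, moving the two non-empty columns into the left half places all 1s in zone A, so a suitable permutation exists.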
