Abstract-This paper proposes an innovative Pipeline-based Maximal-sized Matching scheduling approach, called PMM, for input-buffered switches. It dramatically relaxes the timing constraint for arbitration with a maximal matching scheme. In the PMM approach, arbitration operates in a pipelined manner, where $K$ subschedulers are used. Each subscheduler is allowed to take more than one time slot for its matching. Every time slot, one of them provides the matching result. The subscheduler can adopt a pre-existing efficient maximal matching algorithm such as SLIP or DRRM. PMM maximizes the efficiency of the adopted arbitration scheme by allowing sufficient time for a number of iterations. We show that PMM preserves the 100% throughput under uniform traffic and the fairness for best-effort traffic of the pre-existing algorithm.
I. INTRODUCTION
The explosion of Internet traffic has led to a greater need for high-speed switches and routers that have over 1-Tbit/s throughput [1]. Most high-speed packet switching systems adopt a fixed-size cell in the switch fabric. Variable-length packets are segmented into several fixed-size cells when they arrive, are switched through the switch fabric, and are reassembled into packets before they depart.
There are various types of buffering strategies in switch architectures: input buffering, output buffering, and crosspoint buffering [2], [3], [4], [5]. Input buffering is a cost-effective approach for high-speed switches. This is because input-buffered switches do not require internal speedup and do not allocate buffers at each crosspoint, which relaxes the memory-bandwidth and memory-size constraints.
It is well known that head-of-line (HOL) blocking limits the maximum throughput of an input-buffered switch with a First-In-First-Out (FIFO) structure to 58.6% [6]. A Virtual-Output-Queue (VOQ) structure is used to overcome the HOL-blocking problem [9]. Consider an $N \times N$ input-buffered switch with VOQs at the inputs and a crossbar switch fabric, as shown in Figure 1. A fixed-size cell is sent from any input to any output, provided that no more than one cell is sent from the same input and no more than one cell is received by the same output. Each input $i$ has $N$ VOQs, each of which is denoted as VOQ$(i, j)$, where cells that are destined for output $j$ are stored. The HOL cell in each VOQ can be selected for transmission across the switch in each time slot. Therefore, in every time slot, a scheduler has to determine one matching between inputs and outputs.
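The VOQ structure and the matching constraint described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the names `VOQSwitch`, `enqueue`, `transfer`, and `is_valid_matching` are hypothetical and introduced only for this sketch.

```python
from collections import deque


def is_valid_matching(matching, n):
    """Return True if no input and no output is used more than once."""
    inputs = [i for i, _ in matching]
    outputs = [j for _, j in matching]
    in_range = all(0 <= i < n and 0 <= j < n for i, j in matching)
    return (in_range
            and len(set(inputs)) == len(inputs)
            and len(set(outputs)) == len(outputs))


class VOQSwitch:
    """N x N input-buffered switch with one FIFO per (input, output) pair."""

    def __init__(self, n):
        self.n = n
        # voq[i][j] holds cells that arrived at input i and are destined for output j.
        self.voq = [[deque() for _ in range(n)] for _ in range(n)]

    def enqueue(self, i, j, cell):
        self.voq[i][j].append(cell)

    def transfer(self, matching):
        """Send the HOL cell of every matched, non-empty VOQ in one time slot."""
        if not is_valid_matching(matching, self.n):
            raise ValueError("an input or output appears more than once")
        return [(i, j, self.voq[i][j].popleft())
                for i, j in matching if self.voq[i][j]]
```

Because each input keeps a separate queue per output, a cell waiting for a busy output does not block cells behind it that are destined for other outputs.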
For an input-buffered switch with VOQs, maximum-sized matching algorithms have been proposed to achieve 100% throughput [7], [8]. Although the maximum-sized matching algorithms provide a maximum match, they suffer from high computing-time complexity. Therefore, it is difficult to implement such algorithms in high-speed switching systems.
Maximal-sized matching algorithms, such as SLIP [9] and Dual Round-Robin Matching (DRRM) [10], [11], have been proposed as an alternative to maximum matching ones. Both algorithms have lower computing complexity than maximum matching ones, and provide 100% throughput under uniform traffic and complete fairness for best-effort traffic. However, they still have the strict constraint that the maximal matching has to be completed within one cell time slot. The constraint becomes a bottleneck when the switch size increases or the port speed increases, because the arbitration time becomes longer than one time slot or the time slot shrinks, respectively. Consider a 64-byte fixed-length cell at a port speed of 40 Gbit/s (OC-768). The computation time given for maximal-sized matching is only 12.8 ns.
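The 12.8 ns figure follows directly from the cell length and the port speed. In the worked equation below, $T_{slot}$ is a label introduced here for the duration of one cell time slot; it is not notation from the paper.

```latex
T_{slot} = \frac{64 \times 8\ \text{bit}}{40 \times 10^{9}\ \text{bit/s}}
         = \frac{512}{40}\ \text{ns}
         = 12.8\ \text{ns}
```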
To relax the strict scheduling timing constraint, a pipeline-based scheduling algorithm called Round-Robin Greedy Scheduling (RRGS) was proposed by Smiljanić et al. [12]. The RRGS switch has $N$ round-robin arbiters, each of which is associated with an input. Each input performs one round-robin arbitration to select one VOQ in one time slot. The $N$ round-robin operations that select their own candidates to be transmitted at time slot $T$ are performed during the previous $N$ time slots. According to the RRGS algorithm described in [12], the acceptable transmission rates for the input-output pairs $(0, 0)$ and $(1, 0)$ are obtained as $2/3$ and $1/3$, respectively. This is due to the simple cyclic allocation mechanism. The RRGS operation in this example is also described in Appendix A. Thus, when traffic is not balanced, some inputs can unfairly send more cells than others. Smiljanić also proposed weighted-RRGS (WRRGS), which guarantees pre-reserved bandwidth [13]. In WRRGS, however, the fairness problem is not yet solved for best-effort traffic. In addition, when $N$ is an even number, an idle time slot is produced in every cycle of $N$ time slots. This means that RRGS does not efficiently utilize the switching capacity. It is a challenge to find a maximal matching scheduling scheme that meets the following requirements.
The scheduling time should be relaxed into more than one time slot.
High throughput should be provided.
Fairness should be maintained for best-effort traffic.

This paper presents a solution to these requirements. We introduce an innovative Pipeline-based approach for Maximal-sized Matching scheduling in input-buffered switches, called PMM. Within the scheduler, more than one subscheduler operates in a pipelined manner. Each subscheduler is allowed to take more than one time slot. Every time slot, a subscheduler provides a matching result. The subschedulers can adopt a pre-existing maximal matching algorithm such as SLIP or DRRM, while preserving its properties in the same way as the original non-pipelined version. Therefore, PMM provides 100% throughput under uniform traffic, and maintains fairness for best-effort traffic.
The remainder of this paper is organized as follows. Section II describes the PMM scheme. Section III describes the performance of the PMM scheme. Section IV summarizes the
II. PIPELINE-BASED MAXIMAL-SIZED MATCHING (PMM)
The PMM scheme relaxes the computation time for maximal-sized matching to more than one time slot. A main scheduler consists of $N^2$ request counters and $K$ subschedulers. Each subscheduler has $N^2$ request flags, as shown in Figure 3. The notation used in Figure 3 is described below. Each subscheduler performs maximal-sized matching in a pipelined manner and takes $K$ time slots to complete the matching, as shown in Figure 4. To simplify the description below, we assume that each subscheduler adopts one of the pre-existing maximal matching algorithms, namely DRRM, unless otherwise stated. PMM is also able to adopt other max-min fair share algorithms such as SLIP [9].
We note that DRRM is logically equivalent to SLIP, although their implementations differ [11]. The logical equivalence between DRRM and SLIP can be easily derived.

Whenever a condition associated with any phase is satisfied, the phase is executed by the main scheduler.
To apply the DRRM algorithm as the matching algorithm in subscheduler $k$ in PMM, we use the request flags of subscheduler $k$ instead of the VOQ requests described in the DRRM scheme [10], [11]. Each subscheduler has its own round-robin pointers. The positions of the pointers in subscheduler $k$ are modified only by the results from subscheduler $k$. The operation of DRRM in subscheduler $k$ is the same as that of the non-pipelined DRRM scheme.
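The interaction between request counters, request flags, and the $K$ pipelined subschedulers can be sketched as follows. This is only a sketch under stated assumptions: the rules for moving a request from a counter to a flag and for clearing a flag on a grant are inferred from the discussion in Section III-A, the class and method names are hypothetical, and a simple greedy maximal matching stands in for DRRM or SLIP. The matching is computed at the start of its $K$-slot window for simplicity, but its result is only used $K$ slots later, which models the relaxed arbitration time.

```python
from collections import deque


def greedy_maximal_matching(flags, start):
    """Stand-in for DRRM/SLIP (not either algorithm): scan the inputs from
    `start` and give each input the first free output that it requests."""
    n = len(flags)
    used_outputs, matching = set(), []
    for step in range(n):
        i = (start + step) % n
        for j in range(n):
            if flags[i][j] and j not in used_outputs:
                used_outputs.add(j)
                matching.append((i, j))
                break
    return matching


class PMMScheduler:
    """Hypothetical sketch of a main scheduler with K pipelined subschedulers."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.voq = [[deque() for _ in range(n)] for _ in range(n)]
        # N^2 request counters in the main scheduler.
        self.counter = [[0] * n for _ in range(n)]
        # N^2 request flags in each of the K subschedulers.
        self.flag = [[[False] * n for _ in range(n)] for _ in range(k)]
        # Matching currently in progress in each subscheduler.
        self.pending = [[] for _ in range(k)]
        self.slot = 0

    def arrive(self, i, j, cell):
        self.voq[i][j].append(cell)
        self.counter[i][j] += 1

    def tick(self):
        """Advance one time slot and return the cells transmitted in it."""
        s = self.slot % self.k
        # The matching that subscheduler s started K slots ago is now complete.
        sent = []
        for i, j in self.pending[s]:
            self.flag[s][i][j] = False
            sent.append((i, j, self.voq[i][j].popleft()))
        # Hand at most one outstanding request per VOQ to subscheduler s.
        for i in range(self.n):
            for j in range(self.n):
                if self.counter[i][j] > 0 and not self.flag[s][i][j]:
                    self.counter[i][j] -= 1
                    self.flag[s][i][j] = True
        # Subscheduler s starts a new matching, delivered K slots from now.
        self.pending[s] = greedy_maximal_matching(
            self.flag[s], start=self.slot % self.n)
        self.slot += 1
        return sent
```

With `n=2` and `k=3`, two cells queued in VOQ(0, 0) are handled by two different subschedulers, yet they leave in arrival order because each grant removes the HOL cell of the VOQ.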
III. PERFORMANCE
This section describes throughput and delay performance of PMM under uniform traffic, and the fairness for best-effort traffic. In addition, the effect of the PMM scheduling relaxation is described. 
A. Throughput
PMM that adopts the DRRM algorithm in the subschedulers provides 100% throughput under uniform traffic.
The reason is as follows. Consider an input load of 1.0. If some inputs cannot send cells, outstanding requests are maintained in each subscheduler. In other words, the corresponding request flag remains 1. As a result, the request counter is not decremented in phase 2 but is incremented in phase 1. Since the request counter reaches a value large enough that it is always greater than 0, the request flag is kept at 1 in phase 2 (see footnote 2). In this situation, with the request flags equal to 1 at any $t$ in phase 3, subscheduler $k$, which adopts the DRRM algorithm, provides a complete matching result every $K$ time slots due to the desynchronization effect of the input and output arbiters, as described in [11] (see footnote 3). Thus, PMM preserves the throughput advantage of DRRM.
B. Delay
The scheduling time $K$ does not significantly degrade delay performance, as shown in Figure 5. Simulation results are obtained with a 95% confidence interval, not greater than 5% for the average cell delay. A Bernoulli arrival process is used for the input traffic. In this evaluation, we include the absolute delay caused by the scheduling time in the delay performance.
When $K$ increases, requests from each request counter are distributed among the request flags associated with each subscheduler $k$. Therefore, the desynchronization effect becomes less efficient as $K$ increases under a light traffic load. Under a heavy traffic load, the dependency of the delay on $K$ becomes negligible. Therefore, $K$ does not affect the delay performance in practical use.
The delay performance is improved with more iterations. Since PMM relaxes the scheduling timing constraint, a large number of iterations is not a bottleneck even when the switch size increases or the port speed becomes high, in contrast to the non-pipelined algorithm, as will be described in subsection III-D. Note that we show the delay performance for up to four iterations because there is no measurable improvement with more iterations.

Footnote 2: Although the request flag becomes 0 in phase 4 when a request is granted, it is always changed back to 1 in phase 2.
Footnote 3: We note that the pointer desynchronization is achieved within the same subscheduler. There is no relationship between the pointers of different subschedulers.
The above observations for $N = 32$ also apply to different switch sizes. Figure 6 shows the same delay tendency as that in Figure 5 for a switch with a different $N$. Figure 7 shows that $K$ does not affect delay performance in practical use even when a bursty arrival process is considered. We consider an ON-OFF source model in which the arrival process to an input port alternates between ON (active) and OFF (idle) periods. During the ON period, a traffic source sends a cell in every time slot; during the OFF period, it sends no cells. The durations of both the ON and OFF periods are assumed to be geometrically distributed. The average burst length used in this paper corresponds to the average ON period. In this evaluation, the average burst length is set to 10.
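The ON-OFF source model described above can be sketched as follows. This is a minimal sketch under assumptions that the text does not state: the function name `on_off_arrivals` and its parameters are hypothetical, ON periods are taken to be geometric on {1, 2, ...}, OFF periods are taken to be geometric on {0, 1, 2, ...} so that loads close to 1.0 remain reachable, and the mean OFF length is chosen so that the long-run offered load equals `load`.

```python
import random


def on_off_arrivals(load, mean_burst, slots, seed=0):
    """Return a list of booleans, one per time slot; True means a cell arrives.

    ON periods have a geometric length with mean `mean_burst` slots.
    OFF periods have a geometric length (possibly zero) whose mean is set
    so that mean_on / (mean_on + mean_off) equals `load`.
    """
    if not 0.0 < load <= 1.0 or mean_burst < 1.0:
        raise ValueError("need 0 < load <= 1 and mean_burst >= 1")
    rng = random.Random(seed)
    p_end_on = 1.0 / mean_burst                  # probability that a burst ends after a cell
    mean_off = mean_burst * (1.0 - load) / load  # mean idle length in slots
    p_end_off = 1.0 / (1.0 + mean_off)           # probability that the idle period ends

    arrivals = []
    while len(arrivals) < slots:
        # OFF period: zero or more idle slots.
        while rng.random() >= p_end_off:
            arrivals.append(False)
        # ON period: one cell per slot, at least one cell.
        arrivals.append(True)
        while rng.random() >= p_end_on:
            arrivals.append(True)
    return arrivals[:slots]
```

For example, `on_off_arrivals(0.5, 10, 200000)` produces a sequence whose fraction of busy slots is close to 0.5, with cells arriving in bursts averaging 10 slots. Because a zero-length OFF period merges two bursts, the runs observed in the output can be longer than `mean_burst` on average.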
C. Fairness
Since we adopt a round-robin-based algorithm in subscheduler $k$, subscheduler $k$ can maintain fair scheduling. Therefore, PMM provides a max-min fair share for best-effort traffic.
D. Scheduling Timing Relaxation
Figure 8 shows the effect of the PMM scheduling timing relaxation. We assume that the cell size, $L_{cell}$, is $64 \times 8$ bits. Let the allowable arbitration time per iteration, the port speed, and the number of iterations be $T_{arb}$, $C$, and $I$, respectively. $T_{arb}$ is given by

$$T_{arb} = \frac{K L_{cell}}{C I}. \qquad (1)$$

$T_{arb}$ decreases with $I$ and $C$, but increases with $K$. The non-pipelined DRRM scheme corresponds to the special case $K = 1$ of PMM in Eq. (1). In the non-pipelined DRRM, when $C = 40$ Gbit/s and $I = 4$, $T_{arb} = 3.2$ ns. Under this timing constraint, it is difficult to implement a round-robin arbiter that supports a large $N$ in hardware by using available CMOS technologies, in which, for example, the typical gate-delay time is about 100 ps [15]. On the other hand, PMM can expand $T_{arb}$ by increasing $K$. When $C = 40$ Gbit/s, $I = 4$, and $K = 3$, $T_{arb} = 9.6$ ns. Therefore, PMM achieves the desired number of iterations even when $N$ increases or $C$ becomes high.
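The timing figures quoted above can be checked with a few lines of code. The sketch assumes that Eq. (1) has the form (number of subschedulers × cell size) / (port speed × number of iterations), which is the form that reproduces the 3.2 ns and 9.6 ns values in the text; the function name and parameter names are hypothetical.

```python
def arbitration_time_ns(k, iterations, port_speed_gbps, cell_bits=64 * 8):
    """Allowable arbitration time per iteration, in nanoseconds.

    Assumed form of Eq. (1): K * L_cell / (C * I).
    Dividing bits by Gbit/s yields nanoseconds directly.
    """
    return k * cell_bits / (port_speed_gbps * iterations)


# Non-pipelined DRRM (K = 1) versus PMM with K = 3, both at 40 Gbit/s with 4 iterations.
non_pipelined = arbitration_time_ns(k=1, iterations=4, port_speed_gbps=40)
pipelined = arbitration_time_ns(k=3, iterations=4, port_speed_gbps=40)
```

With a gate delay of about 100 ps, the difference between the two values corresponds to roughly 32 versus 96 gate delays available per iteration.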
IV. CONCLUSIONS
This paper has proposed a Pipeline-based Maximal-sized Matching scheduling approach, called PMM, for input-buffered switches. It dramatically relaxes the scheduling timing constraint. Each subscheduler is allowed to take more than one time slot. Every time slot, one of them provides the matching result. The subscheduler can adopt a pre-existing efficient maximal matching algorithm such as SLIP or DRRM. PMM maximizes the efficiency of the adopted arbitration scheme by allowing sufficient time for a number of iterations. We showed that PMM provides 100% throughput under uniform traffic and maintains fairness for best-effort service as the pre-existing algorithm does, while ensuring that cells from the same VOQ are transmitted in sequence. Thus, PMM is a solution for a maximal-sized matching scheduler in input-buffered switches when the switch size increases or the port speed becomes high, as with OC-768.
