Abstract: iSLIP and parallel hierarchical matching (PHM) are distributed maximal size matching schedulers for input-buffered switches. Previous research has analyzed the hardware cost of these schedulers and their performance after a small number of iterations. In this letter, we formulate an upper bound for the number of iterations required by PHM to converge. Then, by means of simulation, we compare the number of iterations required by iSLIP and PHM to achieve maximal throughput under uniform Bernoulli traffic. Finally, we obtain the corresponding delay performances, which are similar. The results suggest that PHM combines the advantages of previous hierarchical matching algorithms (low hardware complexity) and of iSLIP (low number of iterations).
I. INTRODUCTION

INPUT-BUFFERED packet switches are more scalable than output-buffered ones (see [1] and references therein). It has been observed that they may suffer a lower packet-loss probability when subject to bursty traffic [2]. Unfortunately, throughput is worse in the input-buffered case, due to head-of-line (HOL) blocking [3]. HOL blocking limits throughput to a maximum of 58.6% under Bernoulli traffic.
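The 58.6% figure is the classical large-switch limit 2 - sqrt(2) for FIFO input queues. The following is a minimal simulation sketch of that saturation regime, not taken from the letter: it assumes every input is always backlogged, destinations are uniform, and each output serves one contending HOL packet chosen at random. The function name and parameters are hypothetical.

```python
import random


def hol_saturation_throughput(n_ports, n_slots, seed=0):
    """Estimate the saturation throughput of a FIFO input-queued switch.

    Assumption: every input always has a packet waiting, so only the
    destination of each head-of-line (HOL) packet needs to be tracked.
    """
    rng = random.Random(seed)
    # hol[i] is the output requested by the HOL packet of input i.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    delivered = 0
    for _ in range(n_slots):
        # Group inputs by the output their HOL packet is destined to.
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        # Each requested output serves exactly one contender per slot.
        for inputs in contenders.values():
            winner = rng.choice(inputs)
            # The winner's next packet moves to the head of its queue.
            hol[winner] = rng.randrange(n_ports)
            delivered += 1
    # Fraction of output capacity actually used.
    return delivered / (n_slots * n_ports)
```

For small switches the estimate stays well above the limit, and it should drift toward 2 - sqrt(2) as `n_ports` grows; losing inputs keep their blocked HOL packet, which is exactly the effect VOQs remove.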
Besides their scalability, input buffers are interesting for specific applications. For example, they have been proposed to implement multicast switches [4], [5].
For these reasons, the virtual output queue (VOQ) model has been developed to improve input-buffered switches. Nowadays, several state-of-the-art implementations follow this approach [6], [7].
A scheduler is a device that selects the largest possible number of packets without conflicts from the VOQ pool of an N × N switch, where N is the number of inputs/outputs. At each switching slot, the scheduler solves the following optimization problem:
[... (2) and (3)]. The model admits variants. If we drop constraint (3), it is possible to take more than one packet from input i. Constraint (2) can be modified so that output j admits more than one packet (for example, if output ports are faster than input ones).
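As a reading aid, the scheduling problem can be written in the usual bipartite-matching form. This is a sketch under assumptions, not the original typesetting: the symbols x_ij (1 if a packet is transferred from input i to output j) and d_ij (1 if the VOQ for that pair holds a packet), as well as the exact form of (4), are assumed here. The numbering is chosen to agree with the text, where (2) is the per-output constraint and (3) the per-input constraint.

```latex
\begin{align}
\max \quad & \sum_{i=1}^{N} \sum_{j=1}^{N} x_{ij} \tag{1} \\
\text{s.t.} \quad & \sum_{i=1}^{N} x_{ij} \le 1, \qquad j = 1, \dots, N \tag{2} \\
& \sum_{j=1}^{N} x_{ij} \le 1, \qquad i = 1, \dots, N \tag{3} \\
& x_{ij} \in \{0, 1\}, \quad x_{ij} \le d_{ij} \tag{4}
\end{align}
```

Under this reading, dropping (3) lets an input send several packets in a slot, and replacing the right-hand side of (2) by a larger integer models output ports that run faster than input ports.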
Among the different schedulers proposed [6]-[8], iSLIP [7] shows the following salient features.
• It has an excellent performance, in terms of throughput (close to 100% under Bernoulli traffic) and delay.
• It is a distributed algorithm. This means that each iteration has a fixed computational cost, which is independent of N. As a consequence, iSLIP is scalable.
• It admits a simple digital implementation.
• From a theoretical point of view, it is a particular case of the parallel iterative matching (PIM) algorithm [9]. It has been shown that PIM requires N iterations to converge in the worst case. However, in practice, PIM converges to a maximal matching in O(log N) iterations [9].
In [10], the authors proposed a new scheduler, parallel hierarchical matching (PHM), which, like iSLIP, is a maximal size matching algorithm. It is also a distributed scalable algorithm, and has a similar gate complexity. In our application-specific integrated circuit (ASIC) implementations of 16 × 16 and 32 × 32 switch schedulers, PHM was faster than iSLIP, and had a similar performance under uniform Bernoulli traffic.
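To make the notion of "iterations to converge" concrete, here is a minimal sketch of the request-grant-accept iteration of PIM as it is commonly described. It is neither iSLIP (which replaces the random choices with round-robin pointers) nor PHM, and it is not the authors' simulator; the function name, the 0/1 demand-matrix encoding, and the seed parameter are assumptions made for illustration.

```python
import random


def pim_match(demand, seed=0):
    """Run PIM on a square 0/1 demand matrix.

    Returns (in_match, iterations), where in_match[i] is the output
    matched to input i (or None) and iterations counts the
    request-grant-accept rounds that were actually needed.
    """
    rng = random.Random(seed)
    n = len(demand)
    in_match = [None] * n   # output assigned to each input
    out_match = [None] * n  # input assigned to each output
    iterations = 0
    while True:
        # Request: unmatched inputs request every unmatched output
        # for which they hold a packet.
        requests = {}
        for i in range(n):
            if in_match[i] is None:
                for j in range(n):
                    if demand[i][j] and out_match[j] is None:
                        requests.setdefault(j, []).append(i)
        if not requests:
            break  # no unmatched pair with demand left: matching is maximal
        iterations += 1
        # Grant: each requested output picks one requesting input at random.
        grants = {}
        for j, inputs in requests.items():
            grants.setdefault(rng.choice(inputs), []).append(j)
        # Accept: each granted input picks one granting output at random.
        for i, outputs in grants.items():
            j = rng.choice(outputs)
            in_match[i] = j
            out_match[j] = i
    return in_match, iterations
```

Every round with at least one request adds at least one matched pair, so the loop ends after at most N rounds, which is the worst-case figure quoted above; averaging `iterations` over many random demand matrices gives the kind of curve discussed in the simulation study.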
In this letter, we formulate an upper bound for the number of iterations required by PHM to converge. Then, we compare the number of iterations required by iSLIP and PHM to achieve a maximal throughput under uniform Bernoulli traffic, by means of simulation. Finally, we obtain the corresponding delay performances. The results suggest that PHM has both the advantages of previous sequential hierarchical matching algorithms [6] (low hardware complexity) and iSLIP (low number of iterations).
II. PHM SCHEDULER DESCRIPTION
For the sake of clarity, we reproduce here the description of PHM. Let us assume the existence of input controllers that generate matrix at each switching slot. Explanation: A packet is transferred from input to output if there are no conflicts, by considering only other units of higher hierarchical level.
In [11], we proposed to divide scheduler units into maximum-throughput groups. A similar approach was followed in [6]. Fig. 1(d) shows an example of group labels for a 4 × 4 switch. At each iteration, the scheduler calculates a mapping between group labels and unit hierarchies, which produces matrix . Like PIM, PHM admits many variants, depending on the way matrix is updated. In the most complex case, the mapping between group labels and unit hierarchies is completely random. In this letter, we use the following strategy: the initial group assignment is random. Then, at the beginning of each switching slot, , we set . Fig. 1 shows the evolution of PHM on a 4 × 4 switch for a given matrix , after two iterations.
III. WORST-CASE BOUND FOR PHM
In an scheduler, we say pair is solved at the th iteration if or or
In other words, a demand is solved if, after the th iteration, at least one logical unit in the same row or column (including ) was activated by previous iterations. Obviously, when , all demands in are unsolved, since is set to 0 at the beginning.
Proposition All remaining demands may be blocked (their variables are temporarily set to zero) by higher priority competing demands (whose variables are set to one during the current iteration) although or although We conclude that, in the worst case, at , only those demands such that and get solved. If we follow similar arguments, it is easy to verify that, in the worst case:
- all demands such that have been solved by previous iterations;
- iteration solves all demands such that and .
• If is even, .
• If is odd,
End of proof.
IV. SIMULATION STUDY
To check the practical performance of iSLIP and PHM, we simulated the number of iterations required to achieve maximal throughput. Queue length was set to avoid packet loss. Fig. 2 shows the average number of iterations versus traffic load, for uniform Bernoulli traffic. For each load, the number of packets simulated was determined with the Batch Means method [12], for a confidence level of 95% and a confidence interval of 5%.
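The stopping rule above can be illustrated with a short batch-means sketch. Several choices here are assumptions rather than details taken from the letter: the number of batches (20), the normal quantile 1.96 used for the 95% level (a Student t quantile would be slightly more conservative), and the reading of "confidence interval of 5%" as a half-width of at most 5% of the estimated mean. All function and parameter names are hypothetical.

```python
import math
import statistics


def batch_means_ci(samples, n_batches=20, z=1.96):
    """Return (mean, half_width) from non-overlapping batch means.

    Samples beyond n_batches * batch_size are ignored.
    """
    batch_size = len(samples) // n_batches
    if batch_size == 0:
        raise ValueError("need at least n_batches samples")
    # Average each consecutive batch to reduce serial correlation.
    means = [
        statistics.fmean(samples[b * batch_size:(b + 1) * batch_size])
        for b in range(n_batches)
    ]
    grand_mean = statistics.fmean(means)
    # Standard error of the grand mean, scaled by the normal quantile.
    half_width = z * statistics.stdev(means) / math.sqrt(n_batches)
    return grand_mean, half_width


def enough_samples(samples, rel_precision=0.05, n_batches=20, z=1.96):
    """True when the interval half-width is within rel_precision of the mean."""
    mean, half_width = batch_means_ci(samples, n_batches, z)
    return mean != 0 and half_width / abs(mean) <= rel_precision
```

In a simulation loop, one would keep appending per-slot measurements (for example, iterations used or packet delay) and stop once `enough_samples` returns true for the current load.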
Since the number of iterations required is not directly comparable, we implemented PHM and iSLIP using the Ambit ASIC library lca300k.alf and obtained worst-case response times using Ambit Physically Knowledgeable Synthesis 4.0 (Cadence Design Systems). For iSLIP, we followed the advanced pipelined design in [13]. We observed the following.
• The peak number of iterations for both schedulers, on 16 × 16 and 32 × 32 switches, is 2-3 and 3-4, respectively.
• The peak number of iterations is higher in the case of PHM.
• iSLIP requires more iterations for high loads (above 90% for 16 × 16 and above 95% for 32 × 32).
• PHM requires more iterations for lower loads.
• When considering our ASIC implementations, PHM was faster than iSLIP in the cases studied, for all loads.
These results are interesting for the following reasons: PHM shares some common features with previous sequential hierarchical matching algorithms, such as 2DRR [6], which need extremely simple steps per switching slot. PHM iterations are more complex, but still less computationally expensive than iSLIP ones. Nevertheless, simulation results suggest that PHM may only need iterations per switching slot under uniform Bernoulli traffic, like iSLIP.
Next, using the previous results, we estimated average delay versus load, for 16 × 16 and 32 × 32 switches subject to Bernoulli traffic. We set buffer sizes to avoid packet loss. The number of packets was again determined by the Batch Means method, for the confidence level and interval given above. The results were validated with [14]. Fig. 3 shows the result. We see that the delays obtained were similar.
V. CONCLUSIONS AND FUTURE WORK
In this letter, we have formulated a theoretical worst-case bound for the PHM algorithm, and we have analyzed the number of iterations required by PHM and iSLIP schedulers by means of simulation. PHM seems to deserve attention when considering iteration cost.
We have observed that the iteration plot peak of iSLIP is displaced toward high loads, while that of PHM is displaced toward lower loads. Since the timing results may vary depending on the implementation technology, this fact suggests the implementation of a hybrid scheduler, which would execute both algorithms, and take the largest transfer either after each iteration or at the end of the switching slot. Such a combination is perfectly feasible, since both algorithms satisfy constraints (2)-(4) at each iteration. We plan to study the hybrid scheduler in forthcoming work.
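The selection step of such a hybrid scheduler can be sketched as follows. The letter leaves the hybrid design to future work, so this is only an illustration under assumptions: a matching is encoded as a list mapping each input to an output (or None), which enforces the per-input constraint by construction, and the demand matrix is a 0/1 matrix; all names are hypothetical.

```python
def matching_size(matching):
    """Number of inputs that were granted a transfer."""
    return sum(out is not None for out in matching)


def is_valid(matching, demand):
    """Check that no output is used twice and every transfer has a packet."""
    used = [out for out in matching if out is not None]
    no_output_conflict = len(used) == len(set(used))
    backed_by_demand = all(
        out is None or demand[inp][out] for inp, out in enumerate(matching)
    )
    return no_output_conflict and backed_by_demand


def pick_larger(match_a, match_b, demand):
    """Return the larger of two valid matchings (ties go to match_a)."""
    for matching in (match_a, match_b):
        if not is_valid(matching, demand):
            raise ValueError("matching violates the scheduling constraints")
    if matching_size(match_a) >= matching_size(match_b):
        return match_a
    return match_b
```

Because both candidate matchings are conflict-free on their own, the comparison can be applied after any iteration or only once at the end of the switching slot, which are the two options mentioned above.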
Another issue to be analyzed is the choice of the best matrix for the PHM algorithm.
