Proposed is a buffered Clos-network switch with per-output flow queues in the middle-stage modules. These queues avoid the head-of-line blocking that occurs in the middle stage of Clos-network switches built from simple crosspoint-buffered switch modules. It is shown that the proposed switch achieves higher performance than a conventional buffered Clos-network packet switch.
Introduction: Current advances in chip fabrication allow high on-chip memory density. However, the pin count has remained largely the same, which limits the number of ports a switch chip can support. A three-stage Clos-network switch [1] uses small switch modules, called input modules (IMs), central modules (CMs), and output modules (OMs), to implement a large switch (i.e. one with a large number of ports) with an efficient amount of hardware (e.g. number of chips). For example, a 256 × 256 switch can be implemented with 48 16 × 16 switch modules. However, the configuration time of a Clos-network switch may be long, as exchanging configuration information may require inter-chip signalling with long delays. To reduce this configuration delay, buffers in the switch modules may be used. The placement of buffers at different stages defines the switch model, namely the memory-space-memory (MSM) Clos-network switch [2] (buffers in IMs and OMs), the space-memory-memory (SMM) Clos-network switch [3] (buffers in CMs and OMs), and the memory-memory-memory (MMM) Clos-network switch [4-6] (buffers in IMs, CMs, and OMs). The MSM and SMM switches require large configuration times, proportional to the switch size [2, 7]. The MMM switch performs separate selections at each stage. This reduces the configuration time and increases the scalability of the Clos-network switch.
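The module count in the 256 × 256 example can be checked with a short sketch. It assumes a symmetric three-stage Clos network in which every module has the same size, each IM has one output link to every CM, and each CM has one output link to every OM. The function name `clos_module_count` and its parameters are illustrative and do not come from the Letter.

```python
def clos_module_count(num_ports: int, module_ports: int) -> dict:
    """Count the modules of a symmetric three-stage Clos network.

    Assumption: every module is module_ports x module_ports, so each IM
    serves module_ports input ports and has one output link per CM.
    """
    if num_ports % module_ports != 0:
        raise ValueError("num_ports must be a multiple of module_ports")
    ims = num_ports // module_ports   # one IM per group of input ports
    oms = ims                         # symmetric switch: as many OMs as IMs
    cms = module_ports                # one CM per IM output link
    if ims > module_ports:
        # Each CM needs one port per IM, so it cannot serve more IMs
        # than it has ports.
        raise ValueError("a CM would need more ports than a module offers")
    return {"IM": ims, "CM": cms, "OM": oms, "total": ims + cms + oms}


counts = clos_module_count(256, 16)
print(counts)  # {'IM': 16, 'CM': 16, 'OM': 16, 'total': 48}
```

With 16 × 16 modules, the sketch gives 16 IMs, 16 CMs and 16 OMs, which matches the 48 modules quoted in the text.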
In this Letter, we follow the mainstream approach of switching fixed-length packets, called cells, in the switch. Incoming variable-length packets are segmented into cells and re-assembled before they leave the switch. Therefore, it takes a fixed amount of time, called a time slot, to forward a cell from the input of a switch module to the output of the switch module. For example, a time slot for 512-bit cells is 51.2 ns under a link rate of 10 Gbit/s.
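The time-slot figure follows from dividing the cell length by the link rate. A minimal sketch of that arithmetic is given here; the helper name `time_slot_ns` is hypothetical.

```python
def time_slot_ns(cell_bits: int, link_rate_gbps: float) -> float:
    """Return the time needed to transmit one cell, in nanoseconds.

    Dividing bits by Gbit/s yields nanoseconds directly, because
    1 Gbit/s equals 1 bit per nanosecond.
    """
    if link_rate_gbps <= 0:
        raise ValueError("link rate must be positive")
    return cell_bits / link_rate_gbps


slot = time_slot_ns(512, 10.0)
print(f"{slot:.1f} ns")  # 51.2 ns
```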
A conventional MMM Clos-network switch, or MMM switch for brevity, adopts queues in the CM, one per OM, in which cells destined to different output ports of a destination OM are stored together, as buffered crossbars can be used as off-the-shelf switch modules [4, 6]. A head-of-line (HoL) cell in such a queue may block the cells behind it that are destined to other output ports with available room in the destination OM [8]. Thus HoL blocking at the CM may occur, and this blocking degrades the switch throughput [9]. Therefore, avoidance of HoL blocking at the CMs in an MMM switch is needed.
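The blocking effect can be illustrated with a toy model of a single CM crosspoint. This is only a sketch under simplifying assumptions: cells are represented by their destination output port, `port_has_credit` marks whether the destination OM has room for that port, at most one cell is forwarded per time slot, and the per-port variant scans queues in fixed order rather than round-robin. None of the names below come from the Letter.

```python
from collections import deque


def serve_shared_fifo(queue, port_has_credit):
    """Forward the HoL cell of one shared FIFO if its port has room."""
    if queue and port_has_credit[queue[0]]:
        return queue.popleft()
    return None  # HoL cell is blocked, so every cell behind it waits too


def serve_per_port_queues(queues, port_has_credit):
    """Forward from the first non-empty per-port queue whose port has room."""
    for port, queue in enumerate(queues):
        if queue and port_has_credit[port]:
            return queue.popleft()
    return None


cells = [0, 1, 2, 1]            # destination output ports, in arrival order
credit = [False, True, True]    # output port 0 of the OM has no room

shared = deque(cells)
per_port = [deque() for _ in credit]
for port in cells:
    per_port[port].append(port)

# Three time slots with the credit state held fixed.
shared_sent = [serve_shared_fifo(shared, credit) for _ in range(3)]
per_port_sent = [serve_per_port_queues(per_port, credit) for _ in range(3)]

print(shared_sent)    # [None, None, None]
print(per_port_sent)  # [1, 1, 2]
```

In the shared queue the blocked cell for port 0 stalls all three slots, whereas separate queues let the cells for ports 1 and 2 proceed.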
In this Letter, we propose an MMM Clos-network switch with per-output flow queues in the CMs, called the MMeM switch. Here, a flow is the set of packets going from input i to output j. In the MMeM switch, separate queues, one dedicated queue per flow for an output port, are allocated at each crosspoint buffer in the CMs to avoid HoL blocking. We show that the MMeM switch outperforms an MMM switch in terms of throughput and delay.
Table 1.
[...]               Output link of CM(r) that is connected to OM(j)
VOQ(i, g, j, h)     Virtual output queue at IP(i, g) that stores cells destined to OP(j, h)
VCMQ(i, g, r)       Virtual central module queue at IM(i) that stores cells from IP(i, g) to go through CM(r) to its output port
VOMQ(i, r, j)       Virtual output module queue at CM(r) that stores cells from IM(i) and destined to OM(j) (MMM switch)
POFQ(i, r, j, h)    Per-output flow queue at CM(r) that stores cells from L_I(i, r) and destined to OP(j, h) (MMeM switch)
VOPQ(r, j, h)       Virtual output port queue at OM(j) that stores cells from CM(r) and destined to OP(j, h)
k_1                 Size of crosspoint queues in IMs, in cells

In this Letter, round-robin (RR) and longest queue first (LQF) selection schemes are considered as input arbitration schemes to observe the maximum switch performance under uniform and nonuniform traffic, respectively. Other schemes can also be adopted. The selection of CMs at the IMs, and the arbitration at CMs and OMs are RR-based. Credit-based flow control is used at each module to avoid queue overflow.
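The three mechanisms named above can be sketched generically as follows. The Letter does not specify pointer-update rules, tie-breaking, or credit signalling, so the choices here are assumptions: the RR pointer moves to the position after the granted requester, LQF breaks ties in favour of the lowest index, and one credit corresponds to one cell of downstream queue space. All class and function names are illustrative.

```python
class RoundRobinArbiter:
    """Grant one requester per time slot, starting from a rotating pointer."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.pointer = 0

    def select(self, requests):
        for offset in range(self.num_inputs):
            idx = (self.pointer + offset) % self.num_inputs
            if requests[idx]:
                # Assumed policy: pointer moves one past the granted index.
                self.pointer = (idx + 1) % self.num_inputs
                return idx
        return None  # no requester this time slot


def longest_queue_first(queue_lengths):
    """Return the index of the longest non-empty queue (lowest index on ties)."""
    best = None
    for idx, length in enumerate(queue_lengths):
        if length > 0 and (best is None or length > queue_lengths[best]):
            best = idx
    return best


class CreditCounter:
    """Track free space in a downstream queue so that it never overflows."""

    def __init__(self, capacity_cells):
        self.credits = capacity_cells

    def can_send(self):
        return self.credits > 0

    def on_send(self):
        if self.credits == 0:
            raise RuntimeError("no credit: downstream queue is full")
        self.credits -= 1

    def on_credit_return(self):
        self.credits += 1


arbiter = RoundRobinArbiter(4)
grants = [arbiter.select([True, False, True, True]) for _ in range(4)]
print(grants)                             # [0, 2, 3, 0]
print(longest_queue_first([0, 3, 5, 5]))  # 2
```

In a full model, a requester would only be marked as requesting when its queue is non-empty and the corresponding `CreditCounter` reports that the downstream queue has room.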
which is the minimum size of the crosspoint buffer. For the MMM switch, k_1 = k_3 = 1 cell and k_2 = {1, 16} cells. The latter value of k_2 is used to compare the performance of both switches with the same amount of memory. The switches were modelled in C for event-driven simulation. Simulation results were obtained with a 95% confidence interval, with standard error not greater than 5% for the average queuing delay.
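One way to read the equal-memory comparison is sketched below. It rests on assumptions that the surviving text does not state explicitly: that each CM crosspoint of the proposed switch holds one per-output flow queue for each of the n = 16 output ports of the destination OM, and that each of those queues holds one cell. The helper name is hypothetical.

```python
def crosspoint_memory_cells(queues_per_crosspoint: int, cells_per_queue: int) -> int:
    """Total buffer space of one CM crosspoint, in cells."""
    return queues_per_crosspoint * cells_per_queue


n = 16  # output ports per OM in a 256 x 256 switch built from 16 x 16 modules

# Assumption: n per-output flow queues of one cell each per crosspoint.
per_output_cells = crosspoint_memory_cells(queues_per_crosspoint=n, cells_per_queue=1)

# MMM switch with k_2 = 16: a single queue of 16 cells per crosspoint.
mmm_cells = crosspoint_memory_cells(queues_per_crosspoint=1, cells_per_queue=16)

print(per_output_cells, mmm_cells)  # 16 16
```

Under these assumptions both designs store 16 cells per CM crosspoint, which would explain the choice of k_2 = 16 for the MMM switch.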
Uniform traffic: The selection scheme at the input port, IM, CM and OM arbiters of the MMM and MMeM switches is RR. Traffic with Bernoulli and bursty arrivals is considered, where l is the average burst length and a cell burst is defined by an on-off Markov modulated process. Fig. 3 shows that the average cell delay of the 256 × 256 MMeM switch under uniform traffic with Bernoulli arrivals is smaller than that of the MMM switch.
Nonuniform traffic: The unbalanced traffic model [10] was adopted as nonuniform traffic. The unbalanced traffic model uses probability w as the fraction of input load directed to a single pre-determined output, while the rest of the input load is directed to all outputs with a uniform distribution. Fig. 4 [...] The MMeM switch achieves higher performance than an MMM switch under nonuniform traffic with one-cell queues and no speedup. The MMeM switch also achieves higher performance than the MMM switch when they have the same amount of memory.
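The unbalanced traffic model described above can be sketched as a destination sampler. The text only says that each input has a single pre-determined output; the sketch assumes that the pre-determined output of input s is output s. It also assumes that the uniformly distributed remainder may fall on that same output. Function names are illustrative.

```python
import random


def destination_probability(src: int, dst: int, num_ports: int, w: float) -> float:
    """Probability that a cell arriving at input src is destined to output dst."""
    uniform_share = (1.0 - w) / num_ports
    # Assumption: the pre-determined output of input src is output src.
    return w + uniform_share if dst == src else uniform_share


def unbalanced_destination(src: int, num_ports: int, w: float, rng: random.Random) -> int:
    """Draw one destination for a cell arriving at input src."""
    if rng.random() < w:
        return src                     # fraction w goes to the fixed output
    return rng.randrange(num_ports)    # the rest is spread over all outputs


N = 256
row_sum = sum(destination_probability(3, d, N, 0.5) for d in range(N))
print(round(row_sum, 9))  # 1.0

rng = random.Random(1)
print(all(unbalanced_destination(3, N, 1.0, rng) == 3 for _ in range(100)))  # True
```

Setting w = 0 reduces the model to uniform traffic, and w = 1 sends every cell from an input to its fixed output.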
