We propose a configuration scheme for a load-balancing Clos-network packet switch that has split central modules and buffers in between the split modules. Our split-central-buffered Load-Balancing Clos-network (LBC) switch is cell based. The switch has four stages, namely the input, central-input, central-output, and output stages. The proposed configuration scheme uses a pre-determined and periodic interconnection pattern in the input and split central modules to load-balance and route traffic. The LBC switch has low configuration complexity. The operation of the switch includes a mechanism applied at input and split-central modules to forward cells in sequence. The switch achieves 100% throughput under uniform and nonuniform admissible traffic with independent and identical distributions (i.i.d.). This high switching performance and low complexity are achieved while performing in-sequence forwarding and without resorting to memory speedup or central-stage expansion. Our discussion includes throughput analysis, where we describe the operations that the configuration mechanism performs on the traffic traversing the switch, and proof of in-sequence forwarding. A simulation study is presented as a practical demonstration of the switch performance on uniform and nonuniform i.i.d. traffic.
I. INTRODUCTION
Clos-network switches are attractive for building large-size switches [1]. These switches mostly employ three stages, where each stage uses switch modules as building blocks. Each module is a small- or medium-size switch. Modules of the first, second, and third stages are often called input, central, and output modules, and they are denoted as IM, CM, and OM, respectively. Overall, Clos-network switches require fewer crosspoint elements, each of which is the atomic switching unit of a packet switch, than a single-stage switch of equivalent size, and thus they may require less building hardware. This trait of a Clos network often comes at the cost of an increased configuration complexity. The term configuration here means the local interconnection between inputs and outputs of a module. In general, a Clos-network switch requires the configuration of the modules in every stage before packets are forwarded through. Moreover, owing to the multi-stage architecture of such a switch, the time for switch reconfiguration increases as the number of stages holding dependences increases. In a multi-stage switch, there is a dependence when the configuration of a module is affected by the configuration of another. The required configuration time dictates the internal data transmission time, which in turn defines the minimum size of the internal data unit. For example, switches that require a long configuration time may need to use a long internal segment and a long time to transmit data, while switches with fast configuration times may use a smaller segment size. Therefore, the configuration time of a switch must be kept as short as possible for fast and efficient reconfiguration [2].
In the remainder of this paper, we consider the proposed packet switch to be cell based; that is, upon arrival at an input port of a switch, packets of variable size are segmented into fixed-size cells. Cells are forwarded through the switch to their destination outputs. Packets are re-assembled at the outputs of the switch. The selection of the cell length is left for the implementation of the LBC switch. However, as in any other switch, the cell length is decided by the time required to reconfigure IMs and CIMs and memory speed (of central queues or CBs). Cell length may be selected such that cell transmission time is equal to or greater than the largest of the switch configuration or memory response times. Additionally, the cell length can be increased if the average Internet packet is longer than the configuration time to reduce segmentation/reassembly processing [2] .
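As an illustration of the cell-length rule above, the following minimal sketch computes the smallest cell whose transmission time covers the slower of module reconfiguration and memory access. The function name, its parameters, and the numeric values in the usage note are hypothetical and are not taken from the LBC design.

```python
import math


def min_cell_length_bytes(config_time_ns, memory_time_ns, line_rate_gbps):
    """Smallest cell length (bytes) whose transmission time is at least
    the larger of the configuration time and the memory response time.

    All three parameters are illustrative assumptions, not LBC values.
    """
    # The time slot must cover the slower of the two operations.
    slot_ns = max(config_time_ns, memory_time_ns)
    # At 1 Gbps a port transmits 1 bit per nanosecond.
    bits = slot_ns * line_rate_gbps
    # Round up to a whole number of bytes.
    return math.ceil(bits / 8)
```

For example, with an assumed 40 ns configuration time, 50 ns memory response time, and 10 Gbps ports, the slot must carry 500 bits, so the sketch returns a 63-byte minimum cell.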
Based on the design of its switching modules, each stage of a Clos-network switch can be categorized as either space-based (S) or memory-based (M), where space switching modules are bufferless while memory switching modules are buffered. Space switching refers to the use of a level of parallelism where multiple cells can be switched at the same time slot by using multiple connections. Memory switching refers to the use of memory to store cells when they cannot be forwarded to the outputs (or next stage). Some of these categories are SSS (or S^3) [3], [4], MSM [5]-[8], MMM [9]-[12], SMM [13], and SSM [14], [15], among the most popular ones. Out of those, S^3 switches require small amounts of hardware but their configuration has proven challenging, as input-to-output path setup must be resolved before cells are transmitted. On the other hand, inclusion of memory in modules may relax the configuration complexity. However, configuration complexity has remained high despite using memory in every switch module because of internal blocking and the multiplicity of input-output paths associated with diverse queuing delays [9], [16]. Specifically, switches with buffered central or output stages are prone to forwarding packets out of sequence, making re-sequencing or in-sequence transmission mechanisms an added feature. Moreover, the number and size of queues in a module are restricted to the available on-chip real estate. This restriction plus the adopted in-sequence measures may exacerbate internal blocking that, in turn, may lead to performance degradation [11].

arXiv:1812.11650v1 [cs.NI] 13 Dec 2018
Minimizing the complexity of the central module of a Clos-network switch has been of research interest in recent years. Hassen et al. proposed a Clos-network switch that combines different switching stages [17]. In this work, central modules are replaced with multi-directional networks-on-chip (MDN) modules. The switch uses a static dispatching scheme from the input/output modules, for which every input constantly delivers packets to the same MDN module, and adopts inter-central-module routing to enable forwarding of the cells to the final destination.
Load balancing traffic prior to routing it towards the destination output is a technique that not only improves switching performance but also reduces the configuration complexity of a packet switch when the load-balancing and routing follow a deterministic schedule [18] . Such a schedule may be obtained as an application of matrix decomposition [19] , [20] . This technique enables high performance not only on switches but also on a large number of network applications [21] .
A switch that load-balances traffic may need at least two stages to operate; one for load balancing and the other for routing cells to their destination outputs [18] . A switch with such a deterministic and periodic schedule may require the use of queues between the load-balancing and routing stages. However, placing such queues and enabling multiple interconnection paths between an input and an output make load-balancing switches susceptible to forwarding cells out of sequence [18] . This issue has been addressed by introducing either re-sequencing buffers at the output ports [22] or mechanisms that prevent out-of-sequence forwarding [23] , [24] . However, these approaches are either complex or degrade switching performance.
Load balancing has been applied to Clos-network switches [9], [25]. For example, Zhang et al. [25] proposed an SMM switch which adopts the two-stage load-balanced Birkhoff-von Neumann switch in each central module but has no input port buffers. Here, a central module consists of two k × k bufferless crossbar switches and k buffers in between the crossbars. The switch performs load balancing at the input module and the first stage of the load-balanced Birkhoff-von Neumann switch. Each of these queues accommodates up to one cell to guarantee the transmission of cells in sequence. However, the distance between modules in a large switch requires larger queue sizes, for which this switch would suffer from out-of-sequence forwarding.
The switches discussed above suffer from either limited switching performance, high complexity, or out-of-sequence forwarding. These drawbacks then raise the question, can a load-balancing Clos-network switch achieve high switching performance, low configuration complexity, and in-sequence cell forwarding without resorting to memory speedup?
In this paper, we aim at answering this question by proposing a split-central-buffered Load-Balancing Clos-network (LBC) switch. The switch has a split central module and queues in between. The switch employs predetermined and periodic interconnection patterns to interconnect the inputs and outputs of the switch modules. The switch load balances the incoming traffic and switches the cells towards the destination outputs, both with minimum configuration complexity. The result is a switch that attains high throughput under admissible traffic with independent and identical distribution (i.i.d.) and uses a configuration scheme with O(1) complexity. The switch also adopts an in-sequence forwarding mechanism at the input queues to keep cells in sequence despite the presence of buffers between the split CMs.
Different from the existing switching architectures discussed above, the LBC switch achieves high performance, configuration simplicity, and in-sequence service, all attained without memory speedup or central-module expansion.
We analyze the performance of the proposed switch by modeling the effect of each stage on the traffic passing through the switch. In addition, we study the performance of the switch through traffic analysis and computer simulation. We show that the throughput of the switch approaches 100% under several admissible traffic models, including traffic with nonuniform distributions, and demonstrate that the switch forwards cells to the output ports in sequence. The high performance and the in-sequence forwarding of packets of the switch are both achieved without resorting to speedup throughout the switch.
In summary, the contributions of this paper are as follows: 1) the proposal of a configuration scheme for a split-central-buffered load-balancing switch such that the attained throughput is 100% under admissible traffic while having O(1) scheduling complexity, 2) the proposal of an in-sequence mechanism for forwarding of cells in sequence throughout the switch, 3) the presentation of throughput analysis of the LBC switch for each of the stages that shows that the switch achieves 100% throughput under i.i.d. admissible traffic, and 4) proof of the in-sequence capability of the proposed in-sequence forwarding mechanism.
The remainder of this paper is organized as follows: Section II introduces the LBC switch. Section III analyzes the throughput performance of the proposed switch. Section IV analyzes the in-sequence forwarding property of the LBC switch. Section V presents a simulation study on the performance of the proposed switch. Section VI presents our conclusions.
II. SWITCH ARCHITECTURE
The LBC switch has N inputs and N outputs, each denoted as IP (i, s) and OP (j, d), respectively, where 0 ≤ i, j ≤ k − 1, 0 ≤ s, d ≤ n − 1, and N = nk. Figure 1 shows the architecture of the LBC switch. This switch has k n × m IMs and k m × n OMs. Each central module is split into two modules called central-input and -output modules, denoted as CIMs and COMs, respectively. The switch has m CIMs and the same number of COMs. Each CIM and COM is a k × k switch. In the remainder of this paper, we set n = k = m for symmetry and cost-effectiveness. The IMs, CIMs, and COMs are bufferless crossbars while the OMs are buffered ones.
The use of a split central module on this switch enables preserving staggered symmetry and in-order delivery [26] by using a pre-determined configuration in the IMs, CIMs and COMs with a mirror sequence between CIMs and COMs. The staggered symmetry and in-order delivery refers to the fact that at time slot t, IP (i, s) connects to COM (r) which connects to OM (j). Then at the next time slot (t+1), IP (i, s) connects to COM ((r + 1) mod m), which also connects to OM (j). This property enables the configuration of IMs/CIMs and COMs to be easily represented with a pre-determined compound permutation that repeats every k time slots. This property also ensures that cells experience the same amount of delay for uniform traffic and the incorporation of a simple in-sequence mechanism. A switch with queues between IMs and CMs but without a split central module may require more complex load balancing and routing configurations to achieve the same objective.
Each input port has N virtual output queues (VOQs), denoted as VOQ(i, s, j, d), to store cells destined to output port d at OM(j). The combination of IMs and CIMs forms a compound stage, called the IM-CIM stage. The COMs and OMs operate as single stages. There are queues placed between CIMs and COMs to store cells coming from an IM and destined to OMs. These central queues may be implemented as virtual output port queues (VOPQs), as shown in Figure 2(a). Each VOPQ, denoted as VOPQ(r, p, j, d), stores cells coming for OP(j, d) through L_CIM(r, p). As an alternative, to reduce the number of VOPQs for a large switch, we consider the use of virtual output module queues (VOMQs) instead, as shown in Figure 2(b). A VOMQ, denoted as VOMQ(r, p, j), stores cells for all OPs at OM(j). Each of these queues stores cells coming from L_CIM(r, p) and destined to OM(j). Compared to VOPQs, VOMQs introduce the possibility of head-of-line (HoL) blocking. However, as we show in Section II-F, such HoL effect is not a concern when the switch is loaded with admissible traffic. The remainder of this paper considers VOMQs, as this option stresses the load-balancing feature of LBC.
Every CIM has k L CIM ports. Every L CIM (r, p) of a CIM is connected to one input I C (r, p) of the corresponding COM. The LCIM includes a set of k VOMQs, one per OM. Each OP has m crosspoint buffers, each denoted as CB(r, j, d). A flow control mechanism operates between VOMQs and VOQs, and between CBs and VOMQs to avoid buffer overflow and this is described in Subsection II-E. The VOMQs are off-chip. The switch has N LCIMs, and therefore N sets of k VOMQs each. Table I lists the notations used in the description of the LBC switch.
The following is a walk-through description of how the switch operates: After arriving at the IP, a cell is placed at the VOQ corresponding to its destination OP. The IP arbiter selects a VOQ to be served in a round-robin manner. When a VOQ is selected, the HoL cell is forwarded to a VOMQ at the LCIM identified by the current configuration of the IM and CIM. The VOMQ is the one associated with the OM that includes the destination OP of the cell. When the configuration of the COM permits forwarding to the destination OM, the cell is forwarded to the OM and stored at the crosspoint buffer (CB) allocated for cells from the source COM. The OP arbiter selects CBs in a round-robin manner. Upon selection of a CB, the HoL cell is forwarded from the CB to the OP.
TABLE I. Notation used in the description of the LBC switch.
VOMQ(r, p, j): VOMQ at output of CIMs that stores cells destined to OM(j).
VOPQ(r, p, j, d): VOPQ at output of CIMs that stores cells destined to OP(j, d).
CB(r, j, d): Crosspoint buffer at OM(j) that stores cells going through COM(r) and destined to OP(j, d).
OP(j, d): Output port d at OM(j).
A. Module Configuration
The IMs and CIMs in the IM-CIM stage are configured based on a pre-determined sequence of disjoint permutations, applying one permutation every time slot. We call a permutation disjoint from the set of permutations if an input-output pair is interconnected in one and only one of the permutations. This pre-determined sequence of permutations repeats every k time slots. Cells at the inputs of IMs are forwarded to the outputs of the CIMs determined by the configuration of that time slot. A cell is then stored in the VOMQ corresponding to its destination OM.
The COMs follow a configuration similar to that of the CIMs, but in a mirror (i.e., reverse order) sequence. The HoL cell at the VOMQ destined to OM (j) is forwarded to its destination when the input of the COM is connected to the input of the destination OM (j). Else, the HoL cell waits until the required configuration takes place. The forwarded cell is queued at the CB of its destination OP once it arrives in the OM. At the OP, a CB (i.e., HoL cell of that queue) is selected from all non-empty CBs by an output arbitration scheme.
The specific configurations of the bufferless modules (IMs, CIMs, and COMs) are as follows. At time slot t, IM(i) is configured to interconnect input IP(i, s) to L_IM(i, r), with:
Similarly, CIM input L IM (i, r) is interconnected to CIM output L CIM (r, p) at time slot t with:
The configuration of COMs is similar to that of IMs, but in a reverse sequence. At time slot t, COM input I C (r, p) is interconnected to output L COM (r, j) with:
Round-robin could also be used to select VOMQs and configure COMs. OM buffers allow forwarding a cell from a VOMQ to the destination output without requiring port matching [14]. Figure 3 shows an example of the configuration of a 9 × 9 LBC switch. As k = 3, the example shows the configuration of three consecutive time slots, after which the configuration pattern repeats. Because similar connections are set for all the IMs and CIMs and a different connection pattern is set for all COMs at each time slot, Table II describes the configuration in the figure for IM(0), CIM(0), and COM(0) at each time slot. In this example, we use → to denote an interconnection.

Footnote 1: a mod k = a + (multiples of k) ≥ 0 when a < 0 (e.g., −2 mod 5 = 3).
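The periodic configuration described in this subsection can be sketched in a few lines. The modular mappings used below, r = (s + t) mod m, p = (i + t) mod k, and j = (p − t) mod k, are assumptions chosen to match the textual description (a cyclic IM-CIM pattern and a mirrored COM pattern, with m = k); they are stand-ins and should not be read as the expressions of the configuration equations themselves. The function names are hypothetical.

```python
def im_cim_connection(i, s, t, k):
    """L_CIM(r, p) reached by IP(i, s) at time slot t (assumed mapping)."""
    r = (s + t) % k  # assumed IM configuration, with m = k
    p = (i + t) % k  # assumed CIM configuration
    return r, p


def com_output(p, t, k):
    """OM index j reached from I_C(r, p) at time slot t (assumed mirror mapping)."""
    # Python's % is non-negative for k > 0, matching the footnote: -2 % 5 == 3.
    return (p - t) % k


def check_disjoint_permutations(k):
    """Check that each slot is a one-to-one mapping of IPs to L_CIMs and that
    no IP reaches the same L_CIM twice within a period of k slots."""
    visited = {}  # (i, s) -> set of L_CIM ports reached over one period
    for t in range(k):
        targets = set()
        for i in range(k):
            for s in range(k):
                rp = im_cim_connection(i, s, t, k)
                targets.add(rp)
                visited.setdefault((i, s), set()).add(rp)
        if len(targets) != k * k:  # two IPs collided on one L_CIM
            return False
    return all(len(ports) == k for ports in visited.values())
```

Under these assumed mappings, an IP visits a different CIM in each of the k slots of a period, and each VOMQ(r, p, j) is connected to its OM exactly once per period.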
B. Arbitration at Output Ports
An output port arbiter selects a HoL cell from the crosspoint buffers in a round-robin fashion. Because there is one cell from each flow at these buffers, out-of-sequence forwarding is not a concern at this stage. We discuss this case in Section IV. Here, a flow is the set of cells from IP (i, s) destined to OP (j, d). The round-robin schedule ensures fair service for different flows.
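A round-robin arbiter of this kind can be sketched as follows. The class name and the pointer-update rule (the pointer moves to the queue after the one just served) are assumptions made for illustration; the text only states that HoL cells are selected in a round-robin fashion.

```python
from collections import deque


class RoundRobinArbiter:
    """Selects the HoL cell of the next non-empty queue in cyclic order."""

    def __init__(self, num_queues):
        self.num_queues = num_queues
        self.pointer = 0  # queue with the highest priority in the next slot

    def select(self, queues):
        """Return (queue index, cell) or None when all queues are empty."""
        for offset in range(self.num_queues):
            idx = (self.pointer + offset) % self.num_queues
            if queues[idx]:
                # Assumed update rule: the served queue gets lowest priority next.
                self.pointer = (idx + 1) % self.num_queues
                return idx, queues[idx].popleft()
        return None
```

With three CBs holding cells ["a1", "a2"], [], and ["c1"], successive calls serve a1, c1, and a2, and then report that no cell is available, so an empty CB never stalls the output port.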
C. In-sequence Cell Forwarding Mechanism
The proposed in-sequence forwarding mechanism for the LBC switch is based on holding cells of a flow at the VOQs so that no younger cell is forwarded from VOMQs to OPs before any given cell of the same flow. The policy used for holding cells at an IP is as follows: No cell of flow y at the IP is forwarded to a VOMQ for δk time slots after cell τ of the same flow has been forwarded to a VOMQ, whose occupancy is δ cells at the time of arrival in the VOMQ. For a cell that arrives at an empty VOMQ, δ = 0. The flow control mechanism keeps IPs informed about VOMQ occupancy as discussed in Section II-E. Figure 4 shows an example of this forwarding mechanism for flow A. Cells from flow A are denoted as A_t, where t is the cell arrival time. In this example, cells arrive at time slots 1, 2, 4, and 5, and they are denoted as A_1, A_2, A_4, and A_5, respectively. VOMQ(k) denotes the kth VOMQ to where cells are forwarded. Here, the "X" mark indicates that the buffer at VOMQ(k) is occupied by cells from other flows. Assuming k = 3 and no other cell arrival or departure during this time period, A_1 is the first cell of the flow with arrival time t = 1 and is sent to VOMQ(1) at time slot t = 2. Because VOMQ(1) has no backlogged cells before A_1, there is no waiting time for A_2. Therefore, A_2 is sent to VOMQ(2) at t = 3. A_2 finds three cells already queued, so no cell from this flow is forwarded in 3 × 3 = 9 time slots, or from time slots t = 4 to t = 12. After that, A_4 is sent to VOMQ(3) at t = 13. This cell finds no other cell, so A_5 is sent to VOMQ(1) at t = 14.

[Figure 3. Example of the configuration of a 9 × 9 LBC switch. Panel (c): time slot 2.]
D. Implementation of In-sequence Mechanism
Each IP has an input port counter (IPC) for each VOMQ to which it connects. IPCs keep track of the number of cells at these VOMQs. Each IP also has a hold-down timer for each VOQ. The timer is used by the in-sequence forwarding mechanism. The timer is triggered by the IPC count of the VOMQ where the last cell was forwarded. When a cell is forwarded from a VOQ to VOMQ, and the IPC is updated to σ, this update sets the hold-down timer for that VOQ for (σ − 1)k time slots, where δ = σ − 1. 
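The hold-down rule can be sketched as below for a single flow. This is a simplified model under stated assumptions: a cell becomes eligible in the slot after it arrives, the VOQ always wins the IP arbitration, and the IPC value σ observed after each forwarding is supplied as input rather than derived from a full switch simulation. The function name is hypothetical.

```python
def forwarding_slots(arrivals, ipc_after_forwarding, k):
    """Time slots at which the cells of one flow leave their VOQ.

    arrivals: arrival slots of the flow's cells, in ascending order.
    ipc_after_forwarding: IPC value sigma right after each cell is
        forwarded (the cell itself included), so delta = sigma - 1.
    k: period of the configuration pattern, in time slots.
    """
    slots = []
    hold_until = 0  # first slot in which the VOQ may forward again
    for arrival, sigma in zip(arrivals, ipc_after_forwarding):
        # Assumption: a cell is eligible one slot after its arrival.
        slot = max(arrival + 1, hold_until)
        slots.append(slot)
        # Hold-down timer: (sigma - 1) * k slots after this forwarding.
        hold_until = slot + 1 + (sigma - 1) * k
    return slots
```

Calling `forwarding_slots([1, 2, 4, 5], [1, 4, 1, 1], 3)` reproduces the example of flow A: the cells are forwarded at time slots 2, 3, 13, and 14.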
E. Flow Control
There is a flow control mechanism between VOMQs and IPs and another between CBs and VOMQs that extends to IPs. There are fixed connections between each VOMQ and its k corresponding IPs and between each CB and its corresponding k I_Cs. Each IP has mk = N occupancy counters, IPCs, one per VOMQ. Each VOMQ updates the corresponding k IPCs about its occupancy. A VOMQ uses two thresholds for flow control: pause (T_pv) and resume (T_rv), where T_pv > T_rv, in number of cells. When the occupancy of a VOMQ, |VOMQ|, is larger than T_pv, the VOMQ signals the corresponding IPs to pause sending cells to it. When |VOMQ| < T_rv, the VOMQ signals the corresponding IPs to resume sending cells to it. Here, T_pv is such that
Similar to VOMQs, CBs use two thresholds: pause (T_pc) and resume (T_rc), where T_pc > T_rc, and T_pc is such that C_CB − T_pc ≥ D_c, for a CB size, C_CB, and flow-control information delay between a CB and corresponding IPs, D_c. These CB thresholds work in a similar way as for VOMQs. Different from IPs, VOMQs have a binary flag to pause/resume forwarding of cells to CBs. When the occupancy of a CB, |CB|, becomes larger than T_pc, the CB informs the corresponding VOMQs, and in turn VOMQs inform corresponding IPs to pause forwarding cells to the VOMQ for the congested OP. With IPs paused for traffic to a CB, traffic already at VOMQs can still be forwarded to CBs as long as |CB| is such that T_pc < |CB| < C_CB. When |CB| < T_rc, the CB signals the corresponding VOMQs to resume forwarding, and in turn, VOMQs signal source IPs to resume forwarding cells for that destination OP.
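The two-threshold behavior described above amounts to a hysteresis on queue occupancy, which the following minimal sketch illustrates for a single queue. The class name, constructor parameters, and the per-slot update interface are assumptions; the headroom check mirrors the stated CB condition C_CB − T_pc ≥ D_c and is applied generically here.

```python
class PauseResumeQueue:
    """Occupancy tracker with pause/resume thresholds (in cells)."""

    def __init__(self, capacity, pause_threshold, resume_threshold, feedback_delay):
        if not resume_threshold < pause_threshold:
            raise ValueError("pause threshold must exceed resume threshold")
        # Headroom above the pause threshold must absorb cells that are
        # still in flight while the pause signal propagates.
        if capacity - pause_threshold < feedback_delay:
            raise ValueError("not enough headroom for the feedback delay")
        self.capacity = capacity
        self.pause_threshold = pause_threshold
        self.resume_threshold = resume_threshold
        self.occupancy = 0
        self.paused = False

    def update(self, arrivals, departures):
        """Apply one time slot of arrivals/departures; return the pause flag."""
        self.occupancy = max(0, self.occupancy + arrivals - departures)
        if self.occupancy > self.pause_threshold:
            self.paused = True
        elif self.occupancy < self.resume_threshold:
            self.paused = False
        # Between the two thresholds the previous state is kept.
        return self.paused
```

Keeping the previous state between the two thresholds avoids toggling the pause signal on every cell when the occupancy hovers around a single threshold.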
F. Avoiding HoL Blocking in LBC with VOMQs
Concerns of HoL blocking, owing to the aggregation of traffic going to different OPs at the same OM at a VOMQ, may arise. However, one must note that this HoL blocking may occur if and only if a CB gets congested. Here, we argue that the efficient load-balancing mechanism and the use of one CB for each COM at an OP avoid congestion of CBs even in the presence of heavy (but admissible) traffic. We also show that CB occupancy does not build up. Let us consider the input traffic matrix, R_1, with input load, λ_{i,s,j,d}, which gets load-balanced to CIMs at a rate of 1/m. The aggregate traffic arrival rate at an L_CIM from all IMs, R_LCIM, is:
Therefore, the traffic arrival rate to a CB from COMs, R CB , is:
To test the growth of CBs, we consider three stressing traffic scenarios: a) All IPs in the switch have traffic only for OPs in an OM; b) all IPs in an IM forward traffic to all OPs in an OM; and c) a single flow, with a large rate, going from an IP to a single OP. Then, for a) the largest arrival rate at IPs while being admissible is:
Substituting (6) into (5) and m = n = k yields:
Because round-robin is used as selection policy at an OP, the service rate, S CB , of a CB would be:
Yet, while considering the worst case scenario, or:
Therefore, CB occupancy does not grow because
For b), the arrival rate at IPs for admissibility is:
Substituting (9) into (5) yields:
The service rate would be the same as in (8) . Therefore, the CB would not become congested as R CB = S CB .
For c), the arrival rate at the IP:
The traffic arrival rate to an L CIM is:
The traffic arrival rate to a CB from COMs is:
Therefore, the CB would not become congested since R CB ≤ S CB for admissible traffic.
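The three scenarios can be checked numerically with the sketch below. It rests on two assumptions drawn from the text rather than from the numbered expressions: each flow is split evenly over the m COMs (load-balancing rate 1/m), and the worst-case round-robin service rate of a CB is 1/m, which occurs when all m CBs of an OP are backlogged. The function names and the 16 × 16 example values are hypothetical.

```python
from fractions import Fraction


def cb_arrival_rate(flow_rates, m):
    """Arrival rate at one CB, given the rates of all flows destined to
    its OP and an even split of every flow over m COMs (assumption)."""
    return sum(flow_rates) / m


def cb_service_rate_worst_case(m):
    """Round-robin service share of one CB when all m CBs are backlogged."""
    return Fraction(1, m)


k = 4          # n = k = m, as in the rest of the paper
N = k * k      # number of ports

# a) all N IPs send only to the k OPs of one OM: each flow carries 1/N.
rate_a = cb_arrival_rate([Fraction(1, N)] * N, k)
# b) the k IPs of one IM send to all k OPs of one OM: each flow carries 1/k.
rate_b = cb_arrival_rate([Fraction(1, k)] * k, k)
# c) a single flow of (admissible) rate 9/10 from one IP to one OP.
rate_c = cb_arrival_rate([Fraction(9, 10)], k)
```

In scenarios a) and b) the arrival rate equals the worst-case service rate of 1/m, and in c) it stays below it, so under these assumptions CB occupancy does not grow.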
III. THROUGHPUT ANALYSIS
In this section, we analyze the performance of the proposed LBC switch. Let us denote the traffic coming to the IM-CIM stage, the COM stage, the OMs, OPs, and the traffic leaving LBC as R 1 , R 2 , R 3 , R 4 , and R 5 , respectively. Figure 1 shows these analysis points. Here, R 1 , R 2 , and R 3 are N ×N matrices, R 4 comprises N m × 1 column vectors, and R 5 comprises N scalars.
The traffic from input ports to the IM-CIM stage, R 1 , is defined as:
Here, λ u,v is the arrival rate of traffic from input u to output v, where
and 0 ≤ u, v ≤ N − 1.
In the following analysis, we consider admissible traffic, which is defined as:
under i.i.d. traffic conditions. The IM-CIM stage of the LBC switch balances the traffic load coming from the input ports to the VOMQs. Specifically, the permutations used to configure the IMs and CIMs interconnect the traffic from an input to k different CIMs, and then to the VOMQs connected to these CIMs. R_2 is the traffic directed towards COMs and it is derived from R_1 and the permutations of IMs and CIMs. The configuration of the combined IM-CIM stage at time slot t that connects IP(i, s) to L_CIM(r, p) is represented as an N × N permutation matrix, Π(t) = [π_{u,v}], where r and p are determined from (1) and (2) and the matrix element:
The configuration of the compound IM-CIM stage can be represented as a compound permutation matrix, P 1 , which is the sum of the IM-CIM permutations over k time slots as follows,
Because the configuration is repeated every k time slots, the traffic load from the same input going to each VOMQ is 1/k of the traffic load of R_1. Therefore, a row of R_2 is the sum of the row elements of R_1 at the non-zero positions of P_1, normalized by k. That is:
where 1 denotes an N × N unit matrix and • denotes element/position wise multiplication. There are k non-zero elements in each row of R 2 . Here, R 2 is the aggregate traffic in all the VOMQs destined to all OPs. This matrix can be further decomposed into k N × N submatrices, R 2 (j), each of which is the aggregate traffic at VOMQs designated for OM (j).
where j is obtained from ( 
Similarly, the switching at the COM stage is represented by a compound permutation matrix P 2 , which is the sum of k permutations of the COM stage over k time slots. Here
The output traffic of COMs going to different OMs is described by matrix R 3 (j), which is defined as
where j is obtained from (16) ∀ d. The traffic destined to OP (j, d) at OM (j), R 3 (j, d), is obtained by extracting the traffic elements from R 3 (j), or:
where d is obtained from (16) for the different j. D s is an m × N matrix, built by concatenating N k × 1 vector of all ones, 1, as:
A is a 1 × k row vector, built by setting the first element to 1 and every other element to 0, or:
A s is an N × 1 column vector, built by concatenating k A and taking the transpose, or:
where
The traffic queued at the CB of an OP, R 4 (v), is the multiplication of D s , R 3 (j, d), and A s , or:
The traffic leaving an OP, R 5 (v), is:
Therefore, R 5 (v) is the sum of the traffic leaving OP (v).
Equations (19), (29), and (30) show that the admissibility conditions in (17) are satisfied by the traffic at the VOMQs, CBs, and OPs. Since R_2, R_4(v), and R_5(v) meet the admissibility conditions in (17), the sum of the traffic load at each VOMQ, CB, and OP does not exceed their respective capacities. From (29), we can deduce that R_4 is equal to the input traffic R_1, or:
From the admissibility of R 2 , R 4 (v), R 5 (v) and (31), we can infer that the input traffic is successfully forwarded to the output ports.
As discussed in Section II-B, the output arbiter selects a flow in a round-robin fashion and if no cell of a flow is selected, the OP arbiter moves to the next flow. This implies the queues are work conserving which ensures fairness and that cells forwarded to OPs are successfully forwarded out of OPs. Hence, from (30), we can infer that R 5 (v) is equal to R 4 (v), or:
From (31) and (32), we can conclude that LBC successfully forwards all traffic at IPs out of OPs.
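The load-balancing step of this analysis can be illustrated with a short sketch that builds a compound permutation matrix and spreads each row sum of R_1 over its k non-zero positions, normalized by k, as described in the text. Several elements are assumptions: the IM-CIM mapping r = (s + t) mod k, p = (i + t) mod k, the flat indices u = ik + s and v = rk + p, and the usual admissibility test (no row or column sum above 1). The function names are hypothetical.

```python
def is_admissible(R, tol=1e-12):
    """True when no row sum and no column sum of R exceeds 1 (assumed test)."""
    n = len(R)
    rows_ok = all(sum(row) <= 1 + tol for row in R)
    cols_ok = all(sum(R[u][v] for u in range(n)) <= 1 + tol for v in range(n))
    return rows_ok and cols_ok


def compound_permutation(k):
    """P_1: sum of the IM-CIM permutations over k slots (assumed mapping)."""
    N = k * k
    P = [[0] * N for _ in range(N)]
    for t in range(k):
        for i in range(k):
            for s in range(k):
                r, p = (s + t) % k, (i + t) % k
                P[i * k + s][r * k + p] += 1
    return P


def load_balanced_traffic(R1, k):
    """R_2: each row sum of R_1 placed at the non-zero positions of P_1, divided by k."""
    P1 = compound_permutation(k)
    N = k * k
    return [[sum(R1[u]) * P1[u][v] / k for v in range(N)] for u in range(N)]
```

For a 4 × 4 switch (k = 2) with uniform input rates of 0.25, every row of the resulting R_2 holds two entries of 0.5, so the per-input load is preserved and the traffic offered to the VOMQs remains admissible.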
The following example shows the different traffic matrices for a 4×4 (k = 2) LBC switch. Let the input traffic matrix be
From (18) , the compound permutation matrix for the IM-CIM stage for this switch is:
Using (19), we get:
From (20), the traffic matrix at VOMQs destined for the different OMs are:
The rows of R_2(v) represent the traffic from IPs, and the columns represent VOMQ(r, p, j) at I_C(r, p). From (22), the compound permutation matrix for the COM stage for this switch is:
From (23) and (24), the traffic forwarded to an OP is:
The rows of R 3 (j, d) represent the traffic from V OM Q(r, p, j) at I C (r, p) and the columns represent L COM (r, j). D S and A s are obtained from (25) and (28), respectively, as:
The traffic forwarded from CBs to the corresponding OP is obtained from (29):
The rows of R 4 (v) represent the traffic from COM (r). Using (30), we obtain the sum of the traffic leaving the OP, or:
We use the traffic analysis of the previous section to demonstrate that the LBC switch achieves 100% throughput under admissible traffic. This demonstration is provided in Appendix B.
IV. ANALYSIS OF IN-SEQUENCE SERVICE
In this section, we demonstrate that the LBC switch forwards cells in sequence through the proposed in-sequence forwarding mechanism. Table III lists the definition of terms used in the discussion of the properties of the proposed LBC switch. Here, c_{y,τ}(i, s, j, d) denotes the τth cell of traffic flow y, which comprises cells going from IP(i, s) to OP(j, d) with arrival time t_x. In addition, t_{a_{y,τ}} denotes the arrival time of c_{y,τ}, and q_{1_{y,τ}}, q_{2_{y,τ}}, and q_{3_{y,τ}} denote the queuing delays experienced by c_{y,τ} at VOQ(i, s, j, d), VOMQ(r, p, j), and CB(r, j, d), respectively. The departure times of c_{y,τ} from these queues are denoted as d_{1_{y,τ}}, d_{2_{y,τ}}, and d_{3_{y,τ}}, respectively. In this paper, we consider admissible traffic as defined in (17).
Here, we claim that the LBC switch forwards cells in sequence to the output ports, through the following theorem.
Theorem 1. For any two cells c_{y,τ}(i, s, j, d) and c_{y,τ'}(i, s, j, d), where τ < τ', c_{y,τ}(i, s, j, d) departs the destination output port before c_{y,τ'}(i, s, j, d).
This theorem is sectioned into the following lemmas.
Lemma 1. For a single flow traversing the LBC switch, any cell of the flow experiences the same delay. That is, let t_d be the delay experienced by a cell. Then, t_{d_{y,τ}} = γ ∀ τ, where γ is a positive constant.
A constant delay for each cell implies that cells depart the switch in the same order they arrived under the conditions of this lemma. Appendix A presents the proofs of these lemmas.
V. PERFORMANCE ANALYSIS
We evaluated the performance of the LBC switch through computer simulation under both uniform and nonuniform traffic models. We also compared the performance of the proposed switch with that of an output-queued (OQ) switch, a high-performing Memory-Memory-Memory Clos-network (MMM) switch, and an MMM switch with extended memory (MM^eM). The MMM switch uses forwarding arbitration schemes to select cells from the buffers in the previous stage modules and is agnostic to cell sequence, therefore delivering high switching performance. We considered switches with sizes N = {64, 256}.
A. Uniform Traffic
We evaluated the LBC, OQ, MMM, and MM^eM switches under uniform traffic with Bernoulli and bursty arrivals. The LBC switch achieves 100% throughput under uniform traffic with Bernoulli arrivals, as indicated by the finite and moderate average queuing delay. The high throughput performance of the proposed switch is the result of using an efficient load-balancing process in the IM-CIM stage. However, this high performance is expected under this traffic pattern as the incoming traffic is already uniformly distributed. Figure 5(a) shows that the LBC switch experiences a similar delay as the MM^eM switch at high input load. Figure 5(b) shows that the LBC switch experiences a slightly higher average delay than the OQ switch. This additional delay in the LBC switch is caused by having cells wait in the VOMQs until a configuration that allows forwarding the cells to their destination output modules takes place. Because MM^eM requires an excessive amount of memory to implement the extended set of queues, its average cell delay cannot be measured for N = 256 by our simulators. This figure also shows that the LBC switch achieves a lower average delay than the MMM switch with an input load of 0.95 and larger.
Uniform bursty traffic is modeled as an ON-OFF Markov modulated process, with the average duration of the ON period set as the average burst length, l, with l = {10, 30} cells. Figures 5(c) and 5(d) show the average delay under uniform traffic with bursty arrivals for average burst lengths of 10 and 30 cells, respectively, for switches with N = 256. The results show that the LBC switch achieves 100% throughput under bursty uniform traffic. In contrast, the MMM switch has a throughput of 0.8 and 0.75 for an average burst length of 10 and 30 cells, respectively. Therefore, the LBC switch achieves a performance closer to that of the OQ switch, with only a very small difference in delay. The uniform distribution of the traffic and the load-balancing stages help to achieve this high throughput and low queueing delay. The slightly larger average queueing delay of the LBC switch at very small input loads is caused by the predetermined and cyclic configuration of the bufferless modules: some cells wait a few time slots to be forwarded, irrespective of the switch size. Nevertheless, these two figures show that the queueing delay difference between the LBC and OQ switches is not significant at large input loads.
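As an illustration of the ON-OFF arrival model described above, the following Python sketch generates the arrival indicator of a single input. It is a minimal sketch under stated assumptions, not the simulator used for the reported results: ON periods are assumed to be geometrically distributed with mean l, OFF periods are assumed to be geometric (possibly empty) with mean l(1 − ρ)/ρ so that the long-run load is ρ, and the function and parameter names are hypothetical. Destination selection is omitted.

```python
import random


def mean_off_length(load, burst_len):
    """Mean OFF duration (in slots) giving long-run load `load`
    when the mean ON duration is `burst_len` (assumed model)."""
    return burst_len * (1.0 - load) / load


def on_off_arrivals(load, burst_len, num_slots, rng):
    """Return a list of 0/1 arrival indicators, one per time slot.

    ON length  ~ geometric on {1, 2, ...} with mean burst_len.
    OFF length ~ geometric on {0, 1, ...} with mean mean_off_length().
    `rng` is a random.Random instance, so runs can be reproduced.
    """
    p_on_end = 1.0 / burst_len
    p_off_end = 1.0 / (1.0 + mean_off_length(load, burst_len))
    slots = []
    while len(slots) < num_slots:
        # ON period: at least one cell, then stop with probability p_on_end.
        while True:
            slots.append(1)
            if rng.random() < p_on_end:
                break
        # OFF period: may be empty; each idle slot occurs with prob. 1 - p_off_end.
        while rng.random() >= p_off_end:
            slots.append(0)
    return slots[:num_slots]


if __name__ == "__main__":
    trace = on_off_arrivals(load=0.8, burst_len=10, num_slots=100000,
                            rng=random.Random(1))
    print("measured load:", sum(trace) / len(trace))
```

Allowing empty OFF periods is a design choice of this sketch: it lets the generator reach loads close to 1 even with short bursts, which a strict two-state chain with OFF periods of at least one slot cannot do.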
B. Nonuniform Traffic
We also compared the performance of the proposed LBC switch with the MMM, MM e M, and OQ switches under unbalanced [27], [28] and hot-spot patterns as nonuniform traffic. The unbalanced traffic can be modeled using an unbalanced probability ω to indicate the load variances for different flows. For input port IP (i, s) and output port OP (j, d) of the LBC switch, the traffic load is determined by
where ρ is the traffic load for input IP (i, s) and ω is the unbalanced probability. When ω=0, the input traffic is uniformly distributed and when ω=1, the input traffic is completely directional; traffic from IP (i, s) is destined for OP (j, d) .
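The load expression referred to above is not reproduced in this text. As a hedged illustration, the sketch below builds an unbalanced traffic matrix in the form commonly used in the switching literature, where each input sends a fraction ω of its load to one paired output and spreads the remainder uniformly over all N outputs. The flattened port index (a single integer in place of the (i, s) and (j, d) pairs) and the function name are assumptions made for this example.

```python
def unbalanced_matrix(n_ports, load, w):
    """Return an n_ports x n_ports matrix of per-flow loads.

    Entry [s][d] is the load from input s to output d:
      load * (w + (1 - w) / n_ports)  if s == d (paired output),
      load * (1 - w) / n_ports        otherwise.
    This form is an assumption; it matches the two limiting cases
    described in the text (w = 0 uniform, w = 1 fully directional).
    """
    uniform_share = load * (1.0 - w) / n_ports
    return [
        [uniform_share + (load * w if s == d else 0.0) for d in range(n_ports)]
        for s in range(n_ports)
    ]


if __name__ == "__main__":
    matrix = unbalanced_matrix(n_ports=4, load=1.0, w=0.6)
    for row in matrix:
        print(["%.3f" % x for x in row])
    # Every row and column sums to the input load, so the pattern is admissible.
    print("row sums:", [round(sum(row), 6) for row in matrix])
```

Because every row and every column of this matrix sums to ρ, no input or output is oversubscribed for ρ ≤ 1, regardless of ω.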
The simulation results show that the throughput of the LBC switch is 100% under this traffic pattern for all values of ω, matching that of the MMM and MM e M switches, which are also known to achieve high throughput but neglect in-sequence forwarding. It has been shown that many switches do not achieve high throughput when ω is around 0.6 [28]. Therefore, we measured the average delay of the LBC switch under this traffic pattern for ω = 0.6, as shown in Figure 5(e), and compared it with that of the OQ switch, which is well known to achieve 100% throughput. As the figure shows, the average delay of the LBC switch is comparable to that of an OQ switch. The load-balancing stage of the LBC switch distributes the traffic uniformly throughout the switch.
We compared the performance of the proposed LBC switch with the MMM, MM e M, and OQ switches under hot-spot traffic [24]. Hot-spot traffic occurs when all IPs send most or all traffic to one OP. For input port IP (i, s) and output port OP (j, d) of the LBC switch, the traffic load is determined by
where h is the hot-spot OP and 1 ≤ h ≤ N. Our simulation shows that the LBC switch, as well as the MMM and MM e M switches, achieves 100% throughput under admissible hot-spot traffic. Figure 5(f) shows the measured average delay of the LBC switch under this traffic pattern and that of an OQ switch. The figure shows that the average delay of the LBC switch is comparable to that of an OQ switch. This is a result of effective load balancing, at the IMs, CIMs, and COMs, of the multiple flows coming from different inputs.
In addition to the analysis presented in Section 2.F, we also simulated the LBC switch under two new traffic patterns, which we believe may stress the occupancy of CBs and therefore increase the likelihood of HoL blocking conditions. The traffic patterns are:
a) k flows from IPs at different IMs, each arriving at a rate of 1/k for admissibility, are forwarded to all OPs at one OM. The source IPs of the flows are selected such that they share VOMQs, that is, i = s, or IP (0, 0), IP (1, 1), ..., IP (k − 1, n − 1).
b) Each IP at an IM forwards cells at rate 1/k to each OP at an OM (e.g., i = j). Each OP in the destination OM receives traffic from all IPs of one IM. VOMQs are also shared by different flows.
Figures 6(a) and 6(b) show the average delay under the first and second traffic patterns presented above, respectively. The results in the figures show that LBC experiences a finite and moderate average queuing delay, which implies that LBC achieves 100% throughput under both traffic patterns. We also measured the average CB length, and this length does not grow beyond one cell, indicating that no CB gets congested. This result is obtained because the load-balancing mechanism spreads a flow over different VOMQs.
VI. CONCLUSIONS
We have introduced a configuration scheme for a split-central-buffered load-balancing Clos-network switch and a mechanism that forwards cells in sequence for this switch. With the split central module, the switch comprises four stages, named IM, CIM, COM, and OM. To effectively perform load balancing, the switch has virtual output module queues between the two central stages. The IM, CIM, and COM stages are bufferless crossbars, while the OMs are buffered. All the bufferless modules follow a predetermined configuration, while the OM follows a round-robin sequence to forward cells from the CBs to the output ports. Therefore, the switch does not have to perform matching in any stage despite having bufferless modules, and the configuration complexity of the switch is minimal, making it comparable to that of MMM switches. We introduced an in-sequence mechanism that operates at the inputs of the LBC switch to avoid out-of-sequence forwarding caused by the central buffers. We modeled and analyzed the operations that each of the stages performs on the incoming traffic to obtain the loads seen by the output ports. We showed that for admissible independent and identically distributed traffic, the switch achieves 100% throughput. Unlike the existing switching architectures discussed in Section I, LBC achieves high performance, configuration simplicity, and in-sequence service without memory speedup or central module expansion. In addition, we analyzed the operation of the forwarding mechanism and demonstrated that cells are forwarded in sequence. We showed, through computer simulation, that the switch achieves 100% throughput under all tested uniform and nonuniform traffic distributions.
APPENDIX A ANALYSIS OF IN-SEQUENCE SERVICE
In this section, we prove the lemmas that support the theorem stating that the LBC switch forwards cells in sequence through the proposed in-sequence forwarding mechanism.
Lemma 1. For a single flow traversing the LBC switch, every cell of the flow experiences the same delay. That is, let $t_d$ be the delay experienced by a cell. Then, for any cell traversing the LBC switch, $t_{d_{y,\tau}} = \gamma$, where $\gamma$ is a positive constant.
We analyze first the scenario of a single flow, i.e., y, traversing the switch, whose cells arrive back to back, one each time slot. For simplicity but without losing generality, let us also consider empty queues as an initial condition.
Proof:
For any c y,τ , the total delay time is defined as:
in number of time slots. Here we consider a fixed arbitration time at each queue, and this delay is included in the queuing delay. We are then interested in finding $q_{1_{y,\tau}}$, $q_{2_{y,\tau}}$, and $q_{3_{y,\tau}}$. For $q_{1_{y,\tau}}$, under a single-flow scenario, let us consider any two cells of flow y with arrival times k time slots apart, $c_{y,\tau-2k}$ and $c_{y,\tau-k}$; they are forwarded to the same VOMQ. Then, $c_{y,\tau}$ is held at the VOQ (owing to the mechanism to keep cells in sequence at the VOQ) if $c_{y,\tau-k}$ finds one or more cells in the VOMQ, in which case $q_{1_{y,\tau}}$ increases. Here, the empty-queue initial condition makes the waiting factor $\delta = 0$.
On the other hand, an OM is connected to a VOMQ every k time slots as per the configuration scheme of COM. Therefore,
This queuing delay is smaller than the arrival gap between these two cells, which is:
$a_{y,\tau-k} - a_{y,\tau-2k} = k$ time slots.
Therefore, $c_{y,\tau}$ is not backlogged further in the VOMQ and there is no impact on the time the cell is held in a VOQ, such that:
$q_{1_{y,\tau}} = 0 \;\; \forall\, y, \tau$
For q 2y,τ , let us now assume that c y,τ −k arrives at a time that it has to wait γ time slots, where 1 ≤ γ ≤ k, to be forwarded to the destination OM, or
Then, when $c_{y,\tau}$ arrives, k time slots later, it finds exactly the same configuration in the COM as found by $c_{y,\tau-k}$. Because cells arrive consecutively,
$q_{2_{y,\tau}} = \gamma \;\; \forall\, \tau$
For $q_{3_{y,\tau}}$, because there is a single flow traversing the switch and because of the configuration scheme followed by the COM, one cell arrives in the CB each time slot and one cell departs the OP in the same time slot. Therefore, no cell is backlogged in this case and
$q_{3_{y,\tau}} = 0$
From (35):
$t_{d_{y,\tau}} = \gamma \;\; \forall\, \tau$
for empty queues as the initial condition. It is then easy to see that for any queued cells, $q_{1_{y,\tau}}$ would be increased by $\delta k$ time slots, while $q_{2_{y,\tau}}$ and $q_{3_{y,\tau}}$ remain unchanged.
Therefore, all cells of the flow experience the same delay and are forwarded in sequence.
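The core of this argument, that cells spaced k time slots apart meet the same periodic configuration and therefore wait the same number of slots, can be checked with a small simulation. The following Python sketch models one VOMQ as a FIFO queue that is connected to its OM in every time slot congruent to `phase` modulo k. The `phase` parameter, the function name, and the rule that a cell cannot depart in its own arrival slot are assumptions of this toy model rather than details taken from the switch.

```python
from collections import deque


def simulate_vomq(k, phase, first_arrival, num_cells):
    """Return the waiting time (in slots) of each cell of one flow.

    Cells arrive at slots first_arrival, first_arrival + k, ...
    The queue is served (one cell) at every slot t with t % k == phase % k.
    Service is checked before the arrival of the same slot, so a cell
    waits between 1 and k slots.
    """
    arrivals = {first_arrival + i * k for i in range(num_cells)}
    queue = deque()   # holds the arrival slot of each queued cell
    delays = []
    t = first_arrival
    while len(delays) < num_cells:
        if t % k == phase % k and queue:
            delays.append(t - queue.popleft())
        if t in arrivals:
            queue.append(t)
        t += 1
    return delays


if __name__ == "__main__":
    # Service at slots 1, 5, 9, ...; arrivals at slots 2, 6, 10, ...
    print(simulate_vomq(k=4, phase=1, first_arrival=2, num_cells=5))
```

In this model every cell of the flow sees the same delay, and the queue never holds more than one cell, which mirrors the constant delay γ with 1 ≤ γ ≤ k used in the proof.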
Lemma 2. For any number of flows traversing the LBC switch, cells from the same flow arrive at the OM in sequence.
Proof: Here, we consider the following traffic scenario: There are k flows coming from different IPs, each from a different IM. In each of the flows, cells arrive back to back and are destined to the same OP. Furthermore, the flows have a one-time-slot difference in their arrival times such that the cells with the same sequence number of each different flow are stored in the same VOMQs. Here, each flow consists of k cells. Table A shows an example of the arrival pattern of this traffic scenario for three flows. The table shows the arrival of k cells from k flows at different IPs and IMs that arrive one time slot apart so that these flows are forwarded to the same VOMQ; otherwise, the flows would be forwarded to different VOMQs.
Table A shows that cells $c_{1,1}$, $c_{1,2}$, $c_{1,3}$, $c_{2,1}$, and $c_{3,1}$ were successfully forwarded to the VOMQ without any blocking. Meanwhile, the in-sequence mechanism holds back cells $c_{2,2}$, $c_{2,3}$, $c_{3,2}$, and $c_{3,3}$ to prevent out-of-sequence forwarding because cells $c_{2,1}$ and $c_{3,1}$ were forwarded to a non-empty VOMQ.
The configuration pattern used in the IMs and CIMs and the in-sequence mechanism determine the order in which cells arrive at the VOMQs. Table V shows this order in our example.
Under such an arrival pattern, the departures from VOMQs follow the deterministic configuration of the COMs. Table VI shows the corresponding departures of the cells from VOMQs of these three flows.
Table VI shows that all the cells were forwarded out of the VOMQ in the same order they arrived, one cell every k time slots, because the COM connects to the OM once every k time slots.
Also, let us assume that the first cell of a flow at the L CIM arrives at least one time slot before the configuration of the COM allows forwarding the cell to its destination OM. Thus, a cell may depart in the time slot following its arrival or a few time slots later. A cell may then wait up to k − 1 time slots for the designated interconnection to take place before being forwarded to the OM.
Given k flows, with their τ th cells being c 1,τ to c k,τ , the arrival time of the first arriving cell c 1,τ is:
The number of cells at the VOQ, N 1 (c y,τ ), upon the arrival of c 1,τ is:
This condition holds because there is no cell at the VOQ when c 1,τ arrives. Because of (38), the queuing delay at the VOQ of c 1,τ is:
The departure time of a cell c y,τ from the VOQ is:
Using (37) to (40), the departure time of c 1,τ from the VOQ is:
Upon arriving at the VOMQ, c 1,τ finds no cell ahead of it. Thus, the number of cells at the VOMQ, N 2 (c 1,τ ), upon the arrival of c 1,τ is:
Based on the considered traffic pattern, c 1,τ is stored in the VOMQ for additional k − 1 time slots. Therefore,
The departure time of a cell c y,τ from the VOMQ is:
Using (41), (43), and (44), the departure time of c 1,τ from the VOMQ is:
Let us consider now another cell from the same flow,
Upon the arrival of c 1,τ +θ , there is no cell at the VOQ, or:
Because of (42) and (47), the queuing delay at the VOQ for c 1,τ +θ is:
Using (40), (46), and (48), the departure time of c 1,τ +θ from the VOQ is:
Upon arriving at the VOMQ, c 1,τ +θ finds no cell ahead of it, or:
Because of the considered traffic, c 1,τ +θ is queued for an extra k − 1 time slots at the VOMQ, hence:
Using (44), and (49) to (51),
Using (45), therefore,
In general, for c z,τ , where 1 < z ≤ k, the arrival time is
and upon the arrival of $c_{z,\tau}$ in the VOQ, there is no cell:
With (55),
$q_{1_{z,\tau}} = 0$ (56)
Using (40), (54), and (56),
[Table: time slots ($t_x$ to $t_{x+11}$) at which cells $c_{1,1}$ to $c_{3,3}$ arrive at the VOMQs.]
However, upon arriving in the VOMQ, c z,τ finds δ cells ahead of it, or:
where $0 < \delta < k$, and
$q_{2_{z,\tau}} = q_{H_{z,\tau}} + (\delta - 1)k + k$ (60)
Here, $q_{H_{z,\tau}}$ is the delay from the HoL cell in the VOMQ on $c_{z,\tau}$, and $(\delta - 1)k$ is the delay generated by the other $(\delta - 1)$ cells ahead of $c_{z,\tau}$ in the VOMQ. The extra k time slots are the delay $c_{z,\tau}$ experiences as it waits for the configuration pattern to repeat after the last cell ahead of it is forwarded to the OM, where
Using (44), (60), and (61), the departure time of c z,τ from the VOMQ is:
Using (45) and (59), then:
Let us now consider any other cell from flow z, c z,τ +θ , where 0 < θ < k. The time of arrival of the cell c z,τ +θ is:
Upon the arrival of c z,τ +θ , there could be zero or more cells at the VOQ, hence:
where γ is the number of cells at the VOQ upon the arrival of c z,τ +θ and 0 ≤ γ < k. Using (58) and (65), then:
is the delay generated from the γ cells ahead of c z,τ +θ at the VOQ. Let
The difference between the departure times of any two cells of a flow from VOMQ is a function of θ, which is the arrival time difference of the two cells. Therefore, cells of a flow are forwarded to the OM in the same order they arrived.
Lemma 3. For any number of flows traversing the LBC switch, the cells of each flow arrive and are cleared at the output port (OP) in the same order the cells arrived at the input port (IP).
In our discussion of this lemma, let us consider the following traffic scenario: The switch has cells from only two flows, each arriving at a different IM (and therefore IP), and both are destined to the same OP. In each flow, cells arrive back to back, one at each time slot, and the first cells of both flows arrive at a time slot such that the configuration pattern of the IM-CIM stage would not enable forwarding them to the COM immediately. With this condition, we analyze how these two flows are kept from affecting each other and, therefore, the sequence in which cells may depart the OP. This traffic scenario may present the greatest opportunity for out-of-sequence forwarding of any two cells of a flow, as cells from these two flows interact at the CBs of the destination OP. Let us also consider empty queues as an initial condition.
Consider flows y and z, where the first cells of y and z, c y,τ and c z,τ , respectively, arrive at their respective VOQs at time slot t x , and the θth cells, c y,τ +θ and c z,τ +θ ∀ θ ≥ 1, arrive at time slot t x + θ. According to this lemma, c y,τ and c z,τ must be forwarded and cleared from the output port OP (j, d) before c y,τ +θ and c z,τ +θ , respectively.
We analyze the departure time of the cells c y,τ and c z,τ from the CBs. The arrival times of cells c y,τ and c z,τ are:
Upon arriving in the VOQ, c y,τ and c z,τ are placed as HoL cells because there are no backlogged cells. Hence:
and
Using (75) and (76), the queuing delays of c y,τ and c z,τ at the VOQ are:
q 1 c y,τ = 0 (77)
and
q 1 c z,τ = 0 (78)
Using (40), (74), and (77) the departure time for c y,τ from the VOQ is:
Using (40), (74), and (78) the departure time for c z,τ from the VOQ is:
Thus, c y,τ and c z,τ are forwarded to the same CIM (so that these two cells would share the same CB) and stored in their respective VOMQs. Because the VOMQs are empty at the time the two cells arrive:
Based on the adopted traffic scenario, c y,τ and c z,τ are held at the VOMQ for β 1 and β 2 time slots, respectively, before the configuration pattern enables forwarding them to their destination OM. Here, 1 ≤ β 1 < k and 1 ≤ β 2 < k. Hence, the queuing delay of c y,τ at the VOMQ is:
The queuing delay of c z,τ at the VOMQ is:
Assuming β 1 < β 2 , c y,τ is forwarded to the destination OM before c z,τ . From (44), (79), and (83), the departure time of c y,τ from the VOMQs is:
From (44), (80), and (84), the departure time of c z,τ from the VOMQs is:
When c y,τ and c z,τ arrive at the OM, they are stored at CBs before being forwarded to the output port.
Let us now consider c y,τ +1 and c z,τ +1 , which arrive at time slot t x + 1, hence:
Because there are no cells at the VOQ upon the arrival of c y,τ +1 and c z,τ +1 , then:
With (81) 
Next, we analyze the departure time of the cells from the output port. Because $d_{2_{y,\tau+1}} > d_{2_{y,\tau}}$ and $d_{2_{z,\tau+1}} > d_{2_{z,\tau}}$, $c_{y,\tau}$ and $c_{z,\tau}$ arrive at the output module before $c_{y,\tau+1}$ and $c_{z,\tau+1}$, respectively. With the CB initially empty based on the initial condition, then:
Therefore, with $d_{3_{y,\tau+1}} > d_{3_{y,\tau}}$ and $d_{3_{z,\tau+1}} > d_{3_{z,\tau}}$, $c_{y,\tau}$ and $c_{z,\tau}$ depart the output port before $c_{y,\tau+1}$ and $c_{z,\tau+1}$, respectively. Note that for $N_1(c_{y,\tau}) > 0$, $\delta > 0$, such that the cells from the same flow are forwarded with a larger time separation from each other, and there are fewer chances that they will be at the CBs in the same time slot. Therefore, this property, as described by this lemma, applies to any two cells of a flow.
This completes the proof of Theorem 1.
APPENDIX B 100% THROUGHPUT
In this section, we prove that LBC achieves 100% throughput by using the analysis presented in Section III.A and the concept of queue stability. A switch is defined as stable for a traffic pattern if the queue length is bounded, and a switch achieves 100% throughput if it is stable for admissible i.i.d. traffic [29]. With this, we state the following theorem:
Theorem 2. LBC achieves 100% throughput under admissible i.i.d. traffic.
Proof: Here, we consider the queue to be weakly stable if the drift of the queue occupancy from the initial state is a finite integer for all t as t → ∞. Using the definition above, we show that the queue lengths of the VOQs, VOMQs, and CBs are weakly stable under i.i.d. traffic, and hence, that the switch achieves 100% throughput under that traffic pattern.
Let us represent the queue occupancy of VOQs at time slot t, N 1 (t) as:
where A 1 (t) is the packet arrival matrix at time slot t to VOQs and D 1 (t) is the service rate matrix of VOQs at time slot t. Solving (101) with an initial condition N 1 (0), recursively yields:
Let us consider s 1u,v (t) as the service rate received by the VOQ at IP (u) for OP (v) at time slot t or: 
Another way to express D 1 (t) is:
and recalling R 1 as the aggregate traffic arrival to VOQs or:
Substituting (103) into (104), and (104) and (105) into (102), yields:
From (106), we obtain:
From the admissibility condition of R 1 , it is easy to see that for any value of t, (107) is finite. Hence, from the admissibility of R 1 , (106), and (107), we conclude that the occupancy of the VOQs is weakly stable.
Now we prove the stability of the VOMQs. As before, the queue occupancy matrix of VOMQs at time slot t can be represented as:
where A 2 (t) is the arrival matrix at time slot t to VOMQs and D 2 (t) is the service rate matrix of VOMQs at time slot t. Solving (108) recursively with consideration of an initial condition for N 2 (t), yields:
Because a VOMQ is serviced at least once every k time slots, the service rate of the VOMQ at I C (r, p) for OP (v) at time slot t, d 2µ,v (t) is:
Then, the service matrix of VOMQs is:
and representing R 2 as the aggregate traffic arrival to VOMQs or:
Substituting (110) and (111) into (109) gives:
From (19) and (112), we get:
Recalling that R 2 is admissible, per the discussion in Section III.A, and by substituting P 1 and R 2 into (114), it is easy to see that the result is finite. Hence, from (112), (113), and (114), we conclude that the occupancy of the VOMQs is weakly stable.
Now we prove the stability of the CBs. The queue occupancy matrix of CBs at time slot t can be represented as:
where A 3 (t) is the packet arrival matrix at time slot t to CBs, and D 3 (t) is the service rate matrix of CBs at time slot t. Solving (115) recursively as before yields:
Because a CB is serviced at least once every k time slots, the service rate of the CB at OP (v) at time slot t, d 3v (t), is:
and the service matrix of CBs is:
Similarly, the aggregate traffic arrival to the CBs is:
Let us assume d 3v (t) = 1/k ∀ v in (117), which is the worst-case scenario, in which a CB gets served once every k time slots. Substituting (117) and (118) into (116) gives:
With R 4 being admissible, as discussed in Section III.A, and by substituting R 4 into (120), it is easy to see that the result is finite. Hence, from (119) and (120), we conclude that the occupancy of the CBs is also weakly stable.
This completes the proof of Theorem 2.
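The stability argument above can be illustrated numerically. The sketch below is a toy model under stated assumptions, not the analysis of this appendix: it tracks the occupancy of a single queue that is offered service once every k time slots (the worst case considered above), using the usual recursion in which the next occupancy equals the current occupancy plus arrivals minus departures. The function name, the `phase` parameter, and the 0/1 arrival sequences are hypothetical.

```python
def occupancy_trace(arrivals, k, phase=0):
    """Return the queue occupancy after each time slot.

    arrivals: sequence of 0/1 values, one per time slot.
    The queue is offered one service opportunity at every slot t with
    t % k == phase; a cell arriving in slot t cannot depart in slot t.
    """
    occupancy = 0
    trace = []
    for t, arrived in enumerate(arrivals):
        served = 1 if (t % k == phase and occupancy > 0) else 0
        occupancy = occupancy - served + arrived
        trace.append(occupancy)
    return trace


if __name__ == "__main__":
    k = 4
    # Admissible load for this queue: one arrival every k slots (rate 1/k).
    admissible = [1 if t % k == 0 else 0 for t in range(40)]
    # Inadmissible load: one arrival every slot (rate 1 > 1/k).
    overloaded = [1] * 40
    print("max occupancy, rate 1/k:", max(occupancy_trace(admissible, k, phase=2)))
    print("final occupancy, rate 1:", occupancy_trace(overloaded, k)[-1])
```

With arrivals at rate 1/k the occupancy stays bounded, whereas an arrival rate above the service rate makes it grow without limit, which is the distinction the weak-stability definition captures.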
