We propose a three-stage load-balancing packet switch and its configuration scheme. The input- and central-stage switches are bufferless crossbars, and the output-stage switches are buffered crossbars. We call this switch ThReestage Clos-network swItch with queues at the middle stage and DEtermiNisTic scheduling (TRIDENT), and the switch is cell based. The proposed configuration scheme uses predetermined and periodic interconnection patterns in the input and central modules to load-balance and route traffic; therefore, it has low configuration complexity. The operation of the switch includes a mechanism applied at input and output modules to forward cells in sequence. TRIDENT achieves 100% throughput under uniform and nonuniform admissible traffic with independent and identical distributions (i.i.d.). The switch achieves this high performance using a low-complexity architecture while performing in-sequence forwarding, without central-stage expansion or memory speedup. We analyze the operations that the configuration mechanisms perform on the traffic traversing the switch, and we use this analysis to prove that the switch achieves 100% throughput under i.i.d. traffic. We also show that the switch forwards cells in sequence. We present a simulation analysis as a practical demonstration of the switch performance under uniform and nonuniform i.i.d. traffic.
I. INTRODUCTION
Clos networks are very attractive for building large-size switches [1]. Most Clos-network switches adopt three stages, where each stage uses switch modules as building blocks. The modules of the first, second, and third stages are called input, central, and output modules, and they are denoted as IM, CM, and OM, respectively. Overall, Clos-network switches require fewer switching units (crosspoint elements) than a single-stage switch of equivalent size, and thus may require less hardware. However, the hardware reduction of a Clos-network switch often increases its configuration complexity. In general, a Clos-network switch requires configuring its modules before forwarding packets through them.
We consider for the remainder of this paper that the proposed packet switch is cell-based; that is, upon arrival at an input port of the switch, packets of variable size are segmented into fixed-size cells, and they are re-assembled at the output port after being switched through the fabric. The smallest size of a cell depends on the response time of the fabric and the reconfiguration time.
Clos-network switches can be categorized based on whether a stage performs space-based (S) or memory-based (M) switching into SSS (or S^3) [2], [3], MSM [4]-[8], MMM [9]-[13], SMM [14], and SSM [15], [16], among the most popular ones. Compared to the other categories, S^3 switches require the smallest amount of hardware, but their configuration complexity is high. Despite having a reduced configuration time, MMM switches must deal with internal blocking and with the multiplicity of input-output paths associated with diverse queuing delays [9], [17]. In general, switches with buffers in either the central or output stage are prone to forwarding packets out of sequence because of variable queue lengths, making in-sequence transmission mechanisms or re-sequencing a required feature.
Traffic load balancing is a technique that improves the performance of switching and reduces the configuration complexity [18]. Such a technique is especially attractive for Clos-network switches, as these suffer from high configuration complexity or large amounts of hardware. A large number of network applications, such as those used in network virtualization and data center networks, adopt load-balancing techniques to obtain high performance [19]-[21]. Load balancing also finds application in wireless networks [22]-[24].
A scheduling mechanism based on predetermined and periodic permutations may be used for load balancing and routing to achieve high switching performance [9], [25], [26]. A switch using a deterministic and periodic schedule may require queues between the load-balancing and routing stages. These queues store the cells while they wait for forwarding. They enable multiple interconnection paths between the load-balancing stage and the other stages of the switch, but they also make these switches prone to forwarding cells out of sequence [18]. Re-sequencing [27] and out-of-sequence prevention mechanisms [28], [29], as they become switch components, may affect the switching performance and increase complexity.
The issues above raise the question: can a load-balancing Clos-network switch attain high switching performance, low configuration complexity, and in-sequence cell forwarding without resorting to memory speedup or switch expansion?
We answer this question affirmatively in this paper by proposing a load-balancing Clos-network switch that has buffers placed between the IMs and CMs. Furthermore, we use OMs implemented with buffered crossbars with per-flow queues. The switch is called ThRee-stage Clos swItch with queues at the middle stage and DEtermiNisTic scheduling (TRIDENT). This switch uses predetermined and periodic interconnection patterns for the configuration of IMs and CMs. The incoming traffic is load-balanced by IMs and routed by CMs and OMs. The result is a switch that attains high throughput under admissible traffic with independent and identical distribution (i.i.d.) and uses a configuration scheme with O(1) complexity. The switch also adopts an in-sequence forwarding mechanism at the input ports and output modules to keep cells in sequence.
The motivation for adopting this configuration method is its simplicity and low complexity. For instance, TRIDENT reduces the amount of hardware needed by another load-balancing switch [26], and it also reduces the complexity of the in-sequence forwarding mechanism. The configuration approach used by TRIDENT also provides full utilization of the switch fabric and requires a small configuration time because of its deterministic and periodic pattern. Our solution avoids the module or port matching required by other schemes, which is complex and time consuming.
We analyze the performance of the proposed switch by modeling the effect of each stage on the traffic passing through the switch. In addition, we study the performance of the switch through traffic analysis and by computer simulation. We show that the switch attains 100% throughput under several admissible traffic models, including traffic with uniform and nonuniform distributions, and demonstrate that the switch forwards cells to the output ports in sequence. This high switching performance is achieved without resorting to speedup or switch expansion.
The remainder of this paper is organized as follows: Section II introduces the TRIDENT switch. Section III presents the throughput analysis of the proposed switch. Section IV proves that TRIDENT achieves 100% throughput. Section V presents a proof of the in-sequence forwarding property of TRIDENT. Section VI presents a simulation study on the performance of the proposed switch. Section VII presents our conclusions.
II. SWITCH ARCHITECTURE
TRIDENT has N inputs and N outputs, each denoted as IP(i, s) and OP(j, d), respectively, where 0 ≤ i, j ≤ k − 1, 0 ≤ s, d ≤ n − 1, and N = nk. Figure 1 shows the architecture of TRIDENT. This switch has k IMs of size n × m, m CMs of size k × k, and k OMs of size m × n. Table I lists the notations used in the description of TRIDENT. In the remainder of this paper, we set n = k = m for symmetry and cost-effectiveness. The IMs and CMs are bufferless crossbars, while the OMs are buffered ones. In order to preserve the staggered symmetry and in-order delivery [30], this switch uses a fixed and predetermined configuration sequence, and a reverse desynchronized configuration scheme in CMs. The staggered symmetry and in-order delivery refer to the fact that at time slot t, IP(i, s) connects to CM(r), which connects to OM(j). Then, at the next time slot (t + 1), IP(i, s) connects to CM((r + 1) mod m), which also connects to OM(j). This property enables us to represent the configuration of IMs and CMs as a predetermined compound permutation that repeats every k time slots. This property also ensures that cells experience similar delay under uniform traffic, and the incorporation of the in-sequence mechanism enables preserving this delay under nonuniform traffic, as Section V shows.
The switch has virtual input-module output port queues (VIMOQs) between the IMs and CMs to store cells coming from IM(i) and destined to OP(j, d), and each queue is denoted as VIMOQ(r, i, j, d). Each output of an IM is denoted as L_I(i, r). Each output of a VIMOQ is connected to a CM. Each input and output of a CM are denoted as I_C(r, p) and L_C(r, j), respectively. Each OP has Nk crosspoint buffers, each denoted as CB(r, j, d, i, s) and designated for the traffic from each IP traversing different CMs to an OP. A flow control mechanism operates between a CB and VIMOQs to avoid buffer overflow and underflow [31]. Cells are sent from IPs through the IMs for load balancing and then queued at VIMOQs before they are forwarded to their destined OMs through the CMs.
A. Module Configuration
The IMs are configured based on a predetermined sequence of k disjoint permutations, where one permutation is applied each time slot. We call a permutation disjoint from the set of permutations if the input-output pair interconnection is unique in one and only one of the k permutations. Cells at the inputs of IMs are forwarded to the outputs of the IMs determined by the configuration at that time slot. A cell is then stored in the VIMOQ corresponding to its destination OP.
Similar to the IMs, CMs are configured based on a predetermined sequence of k disjoint permutations. Unlike IMs, CMs follow a desynchronized configuration; a different permutation is used each time slot, and the configuration follows a cycle that runs counterclockwise with respect to that of the IMs. The Head-of-Line (HoL) cell at the VIMOQ destined to OP(j, d) is forwarded to its destination when the input of the CM is connected to the input of the destined OM(j). Otherwise, the HoL cell waits until the required configuration takes place. The forwarded cell is queued at the CB of its destination OP once it arrives in the OM.
The configurations of the bufferless IMs and CMs are as follows. At time slot t, IM input IP (i, s) is interconnected to IM output L I (i, r), as follows:
and each CM input I C (r, p) is interconnected to output L C (r, j) as follows:
The use of CBs at an OP allows forwarding a cell from a VIMOQ to its destined output without requiring port matching [15]. Table II shows an example of the configuration of the IMs and CMs of a 9×9 TRIDENT switch. Because k = 3, the example shows the configuration of three consecutive time slots. In this table, we use w → x to denote an interconnection between w and x. Figure 2 shows the configuration of the modules.
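The properties required of the module configuration (one permutation per time slot, k disjoint permutations per cycle, and the staggered symmetry described above) can be checked mechanically. The following is a minimal sketch, not the configuration rule of this paper: the two mapping functions, r = (s + t) mod m for the IMs and j = (p + r − t) mod k for the CMs, are assumptions chosen only because they satisfy the stated properties, and the function names are hypothetical.

```python
# Sketch only: the two mapping rules below are ASSUMED for illustration.
# They are not taken from the paper's configuration equations.

def im_output(s: int, t: int, m: int) -> int:
    """Assumed IM rule: input s of an IM reaches IM output r at time slot t."""
    return (s + t) % m


def cm_output(r: int, p: int, t: int, k: int) -> int:
    """Assumed CM rule: input p of CM r reaches CM output j at time slot t.
    The '- t' term makes the CM cycle run opposite to the IM cycle."""
    return (p + r - t) % k


def check_configuration(k: int) -> bool:
    """Return True if the assumed rules behave as the text requires."""
    n = m = k
    full = list(range(k))
    for t in range(k):
        # Every time slot must be a permutation (no output used twice).
        if sorted(im_output(s, t, m) for s in range(n)) != full:
            return False
        for r in range(m):
            if sorted(cm_output(r, p, t, k) for p in range(k)) != full:
                return False
    # Disjoint permutations: over k slots, each input meets each output once.
    for s in range(n):
        if sorted(im_output(s, t, m) for t in range(k)) != full:
            return False
    for r in range(m):
        for p in range(k):
            if sorted(cm_output(r, p, t, k) for t in range(k)) != full:
                return False
    # Staggered symmetry: IP(i, s) moves from CM(r) to CM((r + 1) mod m)
    # while the OM reached through that CM stays the same.
    for i in range(k):
        for s in range(n):
            for t in range(2 * k):
                r0 = im_output(s, t, m)
                r1 = im_output(s, t + 1, m)
                if r1 != (r0 + 1) % m:
                    return False
                if cm_output(r0, i, t, k) != cm_output(r1, i, t + 1, k):
                    return False
    return True


if __name__ == "__main__":
    k = 3  # 9 x 9 switch, as in the Table II example
    for t in range(k):
        im_cfg = ", ".join(f"{s} -> {im_output(s, t, k)}" for s in range(k))
        print(f"t={t}  IM: {im_cfg}")
    print("properties hold:", check_configuration(k))
```

Because both rules are modular shifts, the whole cycle is described by a single offset per time slot, which is consistent with the O(1) configuration complexity claimed for the scheme.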
B. Arbitration at Output Ports
Each output port has a round-robin arbiter to keep track of the next flow to serve, and N flow pointers to keep track of the next cell to serve for each flow. Here, a flow is the set of cells from IP (i, s) destined to OP (j, d). An output port arbiter selects the flow to serve in a round-robin fashion. For this selection, the output arbiter selects the HoL cell of a CB if the cell's order matches the expected cell order for that flow. Because the output port arbiter selects the older cell based on the order of arrival to the switch, this selection prevents out-ofsequence forwarding. We discuss this property in Section V. Furthermore, the round-robin schedule ensures fair service for different flows. If there is no HoL cell with the expected value for a particular flow, the arbiter moves to the next flow.
C. Analysis of Crosspoint Buffer Size
In this section, we show that no CB queue in the switch receives more than one cell in a time slot and that those CBs that receive cells at a rate of 1/(kN) are served at a rate of 1/(kN). Let us consider a scenario where all the IPs in the switch only have traffic for one OP. The largest admissible arrival rate at an IP is:
The input load, λ_{i,s,j,d}, gets load-balanced to VIMOQs at a rate of 1/m. The aggregate traffic arrival rate at a VIMOQ from an IM, R_V, is:
because m = n = k, therefore,
The aggregate traffic rate at a CM for an OP is:
The traffic arrival rate to a CB, R C , is the aggregate traffic from an IP through a CM or:
Therefore, R C ≤ S C for admissible traffic, which implies that the crosspoint buffer size at OMs does not impact the performance of the switch because the queue size does not grow with the input load.
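The rates used in this subsection can be reproduced with exact arithmetic. The sketch below is a reconstruction under stated assumptions rather than the derivation of this paper: it assumes that the largest admissible per-IP rate toward a single OP is 1/N when all N IPs target that OP, that each IM splits an input's load evenly over m CMs, and that an OP serves its Nk CBs in round-robin order, so S_C = 1/(Nk). The function name and dictionary keys are hypothetical.

```python
from fractions import Fraction


def cb_rates(k: int) -> dict:
    """Exact rates for the single-hot-OP scenario with n = m = k."""
    n = m = k
    N = n * k
    lam = Fraction(1, N)        # assumed per-IP rate when all IPs target one OP
    per_ip_per_cm = lam / m     # share of one IP sent through one CM
    r_v = n * per_ip_per_cm     # VIMOQ: aggregate of the n IPs of one IM
    r_cm = k * r_v              # CM: aggregate of the k IMs toward that OP
    r_c = per_ip_per_cm         # CB: one IP through one CM
    s_c = Fraction(1, N * k)    # assumed round-robin service over N*k CBs
    return {"lambda": lam, "R_V": r_v, "R_CM": r_cm, "R_C": r_c, "S_C": s_c}


if __name__ == "__main__":
    rates = cb_rates(3)
    for name, value in rates.items():
        print(f"{name} = {value}")
    print("R_C <= S_C:", rates["R_C"] <= rates["S_C"])
```

Under these assumptions the CB arrival rate never exceeds its service rate for any k, which is the condition R_C ≤ S_C stated above.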
D. In-Sequence Cell Forwarding Mechanism
The proposed in-sequence forwarding mechanism of TRIDENT is based on tagging the cells of a flow at the inputs with their arrival sequence number, and forwarding cells from the crosspoint buffers to the output port in the same sequence in which they arrived at the input. The policy used for keeping cells in sequence is as follows: When a cell of a flow arrives at the input port, the input port arbiter appends the arrival order to the cell (for the corresponding flow). After being forwarded through L_I(i, r), the cell is stored at the VIMOQ for the destination OP. When the CM configuration permits, the cell is forwarded to the destined OM and stored at the queue for traffic from the IP to the destined OP traversing that CM. An OP arbiter selects cells of a flow in the order they arrived at the switch by using the arrival order carried by each cell.
As an example of this operation, Table III shows the arrival times of cells c_{1,1}, c_{2,1}, and c_{2,2}, where c_{y,t_x} denotes a cell of flow y that arrives at the VIMOQs at time t_x. Cell c_{2,1} is queued behind c_{1,1}, and c_{2,2} is placed in an empty VIMOQ. Table IV shows the time slots when the cells are forwarded from the VIMOQ. For example, c_{2,2} leaves the VIMOQ before c_{2,1}. Table V shows the time slots when the cells are forwarded to the destination OP after the output-port arbitration is performed.
Figure 3 shows a single flow A with two cells, A_3 and A_4, arriving at time slots t_3 and t_4, respectively. Let us assume that no cell of this flow has transited the switch. The cell that arrives at t_3 is appended a tag of 1 (i.e., the order of arrival), and the cell that arrives at t_4 is appended a tag of 2; we denote the tagged cells as A_31 and A_42. Both cells are load balanced and forwarded to different VIMOQs. As shown in Step 2 of Figure 3, A_31 is forwarded to a queue with cells from other flows, while A_42, the younger cell, is forwarded to an empty queue. Therefore, A_42 arrives at the output port (OP) before A_31 (Step 3).
Because this OP has not received any cell of flow A, the pointer of flow A currently points to tag 1. Hence, A_42 remains at the CB until A_31 arrives and is forwarded out of the OP. Thereafter, the pointer of flow A at this OP is updated to 2, and A_42 is forwarded out of the OP.
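The output-port behavior in this example can be sketched in a few lines. The class below is a simplified, hypothetical model (one OP, one CB per flow and CM, at most one departure per call), not the implementation used in this paper; the class and method names are illustrative. It combines the round-robin flow selection of Section II-B with the per-flow pointer that holds back a younger cell until the older one has left.

```python
from collections import deque


class OutputPortArbiter:
    """Hypothetical model of one OP: round-robin over flows, in-order per flow."""

    def __init__(self, flows, num_cms):
        self.flows = list(flows)
        self.num_cms = num_cms
        # Flow pointer: next expected arrival-order tag of each flow.
        self.expected = {f: 1 for f in self.flows}
        # One crosspoint buffer per (flow, CM) pair, holding cell tags.
        self.cbs = {(f, r): deque() for f in self.flows for r in range(num_cms)}
        self.rr = 0  # round-robin position over flows

    def enqueue(self, flow, cm, tag):
        """A cell of `flow` with arrival-order `tag` arrives through CM `cm`."""
        self.cbs[(flow, cm)].append(tag)

    def serve(self):
        """Forward at most one cell; return (flow, tag) or None."""
        for step in range(len(self.flows)):
            idx = (self.rr + step) % len(self.flows)
            flow = self.flows[idx]
            for r in range(self.num_cms):
                cb = self.cbs[(flow, r)]
                # Only a HoL cell carrying the expected tag is eligible.
                if cb and cb[0] == self.expected[flow]:
                    tag = cb.popleft()
                    self.expected[flow] += 1
                    self.rr = (idx + 1) % len(self.flows)
                    return (flow, tag)
        return None


# Replay of the Figure 3 scenario: the younger cell (tag 2) arrives first.
arb = OutputPortArbiter(flows=["A", "B"], num_cms=2)
arb.enqueue("A", cm=1, tag=2)
first = arb.serve()            # nothing eligible: flow A still expects tag 1
arb.enqueue("A", cm=0, tag=1)
second = arb.serve()           # the older cell leaves first
third = arb.serve()            # then the younger cell
print(first, second, third)
```

The younger cell is never dropped or reordered in the buffer; it simply waits at the head of its CB until the flow pointer reaches its tag.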
III. THROUGHPUT ANALYSIS
In this section, we analyze the performance of the proposed TRIDENT switch. Let us denote the traffic coming to the IMs, CMs, OMs, OPs, and the traffic leaving TRIDENT as R_1, R_2, R_3, R_4, and R_5, respectively. Here, R_1, R_2, and R_3 are N × N matrices, R_4 comprises N column vectors of size N × 1, and R_5 comprises N scalars. Figure 1 shows these traffic points set at each stage of TRIDENT with the corresponding labels at the bottom of the figure.
The traffic from input ports to the IM stage, R 1 , is defined as:
where, λ u,v is the arrival rate of traffic from input u to output v, and
In the following analysis, we consider admissible traffic, which is defined as:
and as i.i.d. traffic.
The IM stage of TRIDENT balances the traffic load coming from the input ports to the VIMOQs. Specifically, the permutations used to configure the IMs forward the traffic from an input to k different CMs, and then to the VIMOQs connected to these CMs, in k consecutive time slots. R_2 is the traffic directed towards CMs, and it is derived from R_1 and the permutations of IMs. The configuration of the IM stage at time slot t that connects
where r is determined from (1) and the matrix element:
The configuration of the IM stage can be represented as a compound permutation matrix, P 1 , which is the sum of the IM permutations over k time slots as follows,
Because the configuration is repeated every k time slots, the traffic load from the same input going to each VIMOQ is 1/k of the traffic load of R_1. Therefore, a row of R_2 is the sum of the row elements of R_1 at the nonzero positions of P_1, normalized by k. This is:
where j is obtained from (10) ∀ d and d is also obtained from (10) but for the different j. The configuration of the CM stage at time slot t that connects I c (r, p) to L C(r,j) may be represented as an N × N permutation matrix,
where j is determined from (2) and the matrix element:
Similarly, the switching process at the CM stage is represented by a compound permutation matrix P 2 , which is the sum of k permutations used at the CM stage over k time slots. Here,
The traffic destined to OP (j, d) at OM (j), R 3 (j, d), is:
The aggregate traffic at CBs of an OP for the different IPs, R 4 (v), is obtained from the multiplication of R 3 (j, d) with a vector of all ones, 1, or:
Each row of R 4 (v) is the aggregate traffic at the CBs from each IP. The traffic leaving an OP, R 5 (v), is:
Therefore, R 5 (v) is the sum of the traffic leaving OP (v).
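The load-balancing step at the IM stage can be illustrated numerically. The sketch below is one reading of the description above, under assumptions: it builds the compound permutation matrix P_1 from an assumed IM rule, r = (s + t) mod m, and computes R_2 as the sum of the rows of R_1 at the nonzero positions of P_1, divided by k. The input matrix is hypothetical and is not the example matrix of this paper; the function names are illustrative.

```python
from fractions import Fraction as F


def im_compound_permutation(k):
    """P_1: sum over k time slots of the N x N IM-stage permutation matrices.
    Rows are IM inputs (i, s); columns are IM outputs (i, r)."""
    n = m = k
    N = n * k
    P1 = [[0] * N for _ in range(N)]
    for t in range(k):
        for i in range(k):
            for s in range(n):
                r = (s + t) % m  # ASSUMED IM rule, for illustration only
                P1[i * n + s][i * m + r] += 1
    return P1


def load_balance(R1, k):
    """R_2[(i, r)][v] = (1/k) * sum of R_1[(i, s)][v] where P_1 is nonzero."""
    N = len(R1)
    P1 = im_compound_permutation(k)
    return [
        [sum(P1[u][x] * R1[u][v] for u in range(N)) / k for v in range(N)]
        for x in range(N)
    ]


# Hypothetical admissible traffic for a 4 x 4 switch (k = 2).
R1 = [
    [F(1, 2), F(1, 4), F(1, 4), F(0)],
    [F(1, 4), F(1, 2), F(0), F(1, 4)],
    [F(1, 4), F(0), F(1, 2), F(1, 4)],
    [F(0), F(1, 4), F(1, 4), F(1, 2)],
]
P1 = im_compound_permutation(2)
R2 = load_balance(R1, 2)
col_sums_in = [sum(row[v] for row in R1) for v in range(4)]
col_sums_out = [sum(row[v] for row in R2) for v in range(4)]
print(P1)
print(R2[0], R2[2])
print(col_sums_in == col_sums_out)
```

With n = m = k, every block of P_1 is an all-ones matrix, so the traffic offered to each output is preserved by the load-balancing stage while being spread evenly over the CMs.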
The following example shows the operations performed on traffic coming to a 4 × 4 (k = 2) TRIDENT switch. Let the input traffic matrix be
Then, R 2 is generated from the arriving traffic and the configuration of IM. The compound permutation matrix for the IM stage for this switch is:
Using (12), we get
From (13), the traffic matrices at VIMOQs destined for the different OMs are R_2(0) and R_2(1). The rows of R_2(v) represent the traffic from IPs, and the columns represent VIMOQ(r, i, j, d) at I_C(r, p). The compound permutation matrix for the CM stage for this switch is:
From (14), the traffic forwarded to an OP is:
The rows of R_3(j, d) represent the traffic from VIMOQ(r, i, j) at I_C(r, p), and the columns represent L_C(r, j). The traffic forwarded from CBs allocated for the different IPs to the corresponding OP is obtained from (15):
The rows of R 4 (v) represent the traffic from IP (i, s). Using (16) , we obtain the sum of the traffic leaving the OP, or:
The example raises the question of whether TRIDENT achieves 100% throughput. We discuss this property as follows:
From R 4 (0) to R 4 (3) above, we can deduce that R 4 is equal to the input traffic R 1 , or, in general:
Also, because R 2 and R 4 (v) meet the admissibility condition in (11) , and R 5 (v) does not exceed the traffic rate for any OP (v), the aggregated traffic loads at each VIMOQ, CB, and OP do not exceed the capacity of each output link. From the admissibility of R 2 and R 4 (v), and (17), we can infer that the input traffic is fully forwarded to the output ports. As discussed in Section II-B, an output arbiter selects a flow in a round-robin fashion and a cell of that flow based on the arrival order. If a cell of a flow is not selected, the OP arbiter moves to the next flow. This arbitration scheme ensures fairness and that the cells forwarded to the OP are also forwarded out of the OP. Hence, from R 5 (0) to R 5 (3), we can infer that R 5 (v) is equal to R 4 (v), or:
From (17) and (18), we conclude that TRIDENT achieves 100% throughput under admissible i.i.d. traffic. We present the proof of this claim in the following section.
IV. 100% THROUGHPUT
In this section, we use the analysis in Section III to prove that TRIDENT achieves 100% throughput under admissible i.i.d. traffic.
Theorem 1: TRIDENT achieves 100% throughput under admissible i.i.d. traffic.
Proof: Here, we prove that TRIDENT achieves 100% throughput by showing that the VIMOQs and CBs are weakly stable under i.i.d. traffic, because a stable switch achieves 100% throughput under admissible i.i.d. traffic [32]. A switch is considered stable under a traffic distribution if the queue length is bounded.
The queues are considered to be weakly stable if the drift of the queue occupancy from its initial state is finite for all t as t → ∞.
Let us represent the queue occupancy of VIMOQs at time slot t, N μ (t) as:
where A μ (t) is the aggregate traffic arrival matrix at time slot t to VIMOQs and D μ (t) is the service rate matrix of VIMOQs at time slot t. Solving (19) with an initial condition N μ (0), recursively yields:
Because a VIMOQ is served at least once every N time slots, the service rate of a VIMOQ at a CM for OP (v) at time slot t, d μv (t) is:
Then, the service matrix of VIMOQs is:
and representing R 2 as the aggregate traffic arrival to VIMOQs or:
Substituting (21) and (22) into (20) gives:
We recall from Section III.A that R_2 is admissible, and substituting P_1 and R_2 into (24) shows that the result is finite. We conclude from (23) and (24) that the occupancy of the VIMOQs is weakly stable. We now prove the stability of CBs. The queue occupancy matrix of CBs at time slot t can be represented as:
where A c (t) is the aggregate traffic arrival matrix at time slot t to CBs, and D c (t) is the service rate matrix of CBs at time slot t. Solving (25) recursively as before yields:
Because a CB is served at least once every Nk time slots, the service rate of the CB at OP(v) at time slot t, d_{cv}(t), is:
and the service matrix of CBs is:
The aggregate traffic arrival to CBs is R_4, or:
Let us assume the worst-case scenario, where the CB is served only once in Nk time slots, or d_{cv}(t) = 1/(Nk) ∀ v in (27). Substituting (27) and (28) into (26) gives:
where
Because R_4 is admissible, as discussed in Section III.A, substituting R_4 into (30) shows that the result is finite. We can conclude from (29) and (30) that the occupancy of the CBs is also weakly stable.
This completes the proof of Theorem 1. 
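The queue-occupancy recursion that underlies this proof can be illustrated on a single queue. The sketch below is an illustration under assumptions, not the matrix recursion of the proof: it tracks one queue whose occupancy evolves as N(t + 1) = N(t) + A(t) − D(t), where D(t) counts an actual departure (a service opportunity that finds a cell to send). The arrival and service patterns are hypothetical: one arrival every 5 slots and one service opportunity every 4 slots.

```python
def queue_trajectory(n0, arrivals, service_slots):
    """Occupancy of one queue under N(t+1) = N(t) + A(t) - D(t).

    arrivals[t] is the number of cells arriving in slot t;
    service_slots[t] is 1 if the queue has a service opportunity in slot t.
    Returns (occupancy list, departures list).
    """
    occupancy = [n0]
    departures = []
    for a, s in zip(arrivals, service_slots):
        backlog = occupancy[-1] + a
        d = 1 if (s and backlog > 0) else 0  # depart only if a cell is present
        departures.append(d)
        occupancy.append(backlog - d)
    return occupancy, departures


# Hypothetical pattern: arrival rate 1/5, service rate 1/4.
T = 40
arrivals = [1 if t % 5 == 0 else 0 for t in range(T)]
service_slots = [1 if t % 4 == 3 else 0 for t in range(T)]
occ, dep = queue_trajectory(0, arrivals, service_slots)

# Unrolling the recursion gives N(T) = N(0) + sum of A(t) - sum of D(t).
closed_form = 0 + sum(arrivals) - sum(dep)
print(max(occ), occ[-1], closed_form)
```

When the arrival rate stays below the service rate, as admissibility guarantees for the VIMOQs and CBs in the argument above, the occupancy does not drift away from its initial value, which is the notion of weak stability used in the proof.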
V. ANALYSIS OF IN-SEQUENCE SERVICE
In this section, we demonstrate that TRIDENT forwards cells in sequence to the OPs through the proposed in-sequence forwarding mechanism. Table VI lists the terms used in the in-sequence analysis of the proposed TRIDENT switch. Here, c y,τ (i, s, j, d) denotes the τ th cell of traffic flow y, which comprises cells going from IP (i, s) to OP (j, d) . In addition, t ay,τ denotes the arrival time of c y,τ , and q Vy,τ and q Cy,τ denote the queuing delays experienced by c y,τ at V IMOQ (r, i, j, d) and CB(r, j, d, i, s) , respectively. The departure times of c y,τ from the corresponding VIMOQ and CB are denoted as d Vy,τ and d Cy,τ , respectively. We consider admissible traffic in this analysis.
Here, we claim that TRIDENT forwards cells in sequence to the output ports, through the following theorem.
Theorem 2: For any two cells c_{y,τ}(i, s, j, d) and c_{y,τ'}(i, s, j, d), where τ < τ', c_{y,τ}(i, s, j, d) departs the destined output port before c_{y,τ'}(i, s, j, d).
Lemma 1: For any flow traversing TRIDENT, an older cell is always placed ahead of a younger cell from the same flow in the same crosspoint buffer.
Proof: From the architecture and configuration of the switch, an IP connects to a CM once every k time slots. If a younger cell arrives at the OM before an older cell, then the younger cell was forwarded through a CM different from the one through which the older cell was forwarded. Also, two cells of the same flow may be queued in the same CB if and only if the younger cell arrived at the VIMOQ k time slots later than the older cell, and therefore, the younger cell is lined up in a queue position behind that of the older cell.
Lemma 2: For any number of flows traversing TRIDENT, cells from the same flow are cleared from the OP in the same order they arrived at the IP.
Proof: Let us consider a traffic scenario where multiple flows are traversing the switch. We focus on one flow with cells arriving back to back. Let us also consider as an initial condition that all CBs are empty, and that the VIMOQ to which the first cell of the flow is sent has backlogged cells (from other flows), while the other VIMOQs to which the subsequent cells of the same flow are sent are empty. This scenario has the largest probability of delaying the first cell of the flow and, therefore, of forwarding the subsequent cells of the flow out of sequence. Also, let us consider that the flow pointer at the output ports initially points to the cell arrival order L_{yθ}, where y is the flow id and θ is the cell's order of arrival.
Also, let us assume that the cells arrive at L I (i, r) one or more time slots before the configuration of the CM allows forwarding a cell to its destined OM. Thus, a cell may depart in the following or a few time slots after its arrival. This cell then may wait up to k − 1 time slots for a favorable interconnection to take place at the CM before being forwarded to the destined OM. In the remainder of the discussion, we show that the arriving cells are forwarded to the destination OP in the same order they arrive in the IP.
Given flow y, the arrival time of the first cell c y,τ is:
Upon arriving in the IP, c y,τ is tagged with L y0 and forwarded to the VIMOQ. Based on the backlog condition, c y,τ is placed behind γ cells from other flows upon arriving at the VIMOQ. Therefore, the VIMOQ occupancy, N Vy,τ , is:
Using (32) the queuing delay of c y,τ at the VIMOQ is:
where q Hy,τ is the time it takes the HoL cell to depart the VIMOQ and (γ−1)k is the delay generated by the other (γ−1) cells ahead of c y,τ in the VIMOQ. The extra k time slots are the delay c y,τ experiences as it waits for the configuration pattern to repeat after the last cell ahead of it is forwarded to the OM. Using (31) and (33), the departure time of c y,τ from the VIMOQ is:
When c y,τ arrives at the output module it is stored at the corresponding output buffer before being forwarded to the output port.
Let us now consider the next arriving cell from flow y, c y,τ +θ , where 0 < θ < k. The time of arrival of c y,τ +θ is:
Upon arrival, c_{y,τ+θ} would have L_{yθ} appended to it and would be forwarded to the VIMOQ. Based on the traffic scenario, c_{y,τ+θ} would be forwarded to an empty VIMOQ. The queuing delay at the VIMOQ for c_{y,τ+θ} is:
where β is the number of time slots before the configuration pattern enables forwarding c y,τ +θ to the destined OM. Using (34), (35), and (36), the departure time of c y,τ +θ from the VIMOQ is:
At the output port, the pointers all initially pointed to L y0 based on the initial condition. Therefore, irrespective of d V y,τ +θ < d Vy,τ , for θ + β < q Hy,τ + γk, c y,τ +θ remains stored at the output buffer until c y,τ is cleared from the output port, because the pointer points to L y0 . Because CBs are empty as initial condition, the CB occupancy, N Cy,τ , upon c y,τ arrival is:
and the occupancy of the CB, N C y,τ +θ , upon c y,τ +θ arrival is
Using (38), the queuing delay, q Cy,τ , at the CB for c y,τ is:
From (34), (37), and (39), the queuing delay, q C y,τ +θ , at the CB for c y,τ +θ is:
From (31), (34), and (40), the departure time of c y,τ from the OP, d Cy,τ , is:
From (35), (37), and (41), the departure time of c y,τ +θ from the OP, d C y,τ +θ , is:
Using (42) and (43),
The difference between the departure times of any two cells of a flow from the CB is a function of θ, which is the arrival time difference between any two cells. Therefore, cells of a flow are forwarded to the OP in the same order they arrived.
This completes the proof of Theorem 2.
VI. PERFORMANCE ANALYSIS
We evaluated the performance of TRIDENT through computer simulation under uniform and nonuniform traffic models. Under uniform traffic, we compared it with an output-queued (OQ) switch, a Space-Memory-Memory (SMM) switch, and a Memory-Memory-Memory (MMM) Clos-network switch. Under nonuniform traffic, we compared it with the OQ, SMM, and MMM switches and with an MMM switch with extended memory (MM^eM). The SMM switch uses desynchronized static round robin at IMs and selects cells from the buffers at CMs and OMs. The MMM switch selects cells from the buffers in the previous stage modules using forwarding arbitration schemes and is prone to serving cells out of sequence. Considering that most load-balancing switches based on Clos networks deliver low performance, we selected these switches for comparison because they achieve the highest performance among Clos-network switches, despite being categorized as different architectures. We considered switches with size N = {64, 256}. For performance analysis, queues are assumed to be long enough to avoid cell losses and to identify the average cell delay. Table VII shows a comparison of the OQ, SMM, MMM, MM^eM, and TRIDENT architectures.
A. Uniform Traffic
Uniform traffic is generally considered to be benign, and the average rate toward each output port is λ_{i,s,j,d} = 1/N, where IP(i, s) is the source IP and OP(j, d) is the destination OP. Hence, a packet arriving at the IP has an equal probability of being destined to any OP. Figures 4 and 5 show the average delay under uniform traffic with Bernoulli arrivals for N = 64 and N = 256, respectively. The finite and moderate average queuing delay indicated by the results shows that TRIDENT achieves 100% throughput under this traffic pattern. This throughput is the result of the efficient load-balancing process in the IM stage. However, such high performance is expected for uniformly distributed input traffic.
The TRIDENT switch experiences a slightly higher average delay than the OQ switch. This delay is the result of cells being queued in the VIMOQs until a configuration occurs that enables forwarding the cells to their destined output modules. Due to the amount of memory required by MM^eM to implement the extended set of queues, our simulator can only simulate small MM^eM switches for queueing analysis.
Uniform bursty traffic is modeled as an ON-OFF Markov modulated process, with the average duration of the ON period set as the average burst length, l, with l = {10, 30} cells. Figures 6 and 7 show the average delay under uniform traffic with bursty arrivals for average burst lengths of 10 and 30 cells, respectively. The results show that TRIDENT achieves 100% throughput under bursty uniform traffic and that it is not affected by the burst length, while the MMM switch has a throughput of 0.8 and 0.75 for average burst lengths of 10 and 30 cells, respectively. Therefore, TRIDENT achieves a performance closer to that of the OQ switch. This result is, again, the product of using load balancing at the IMs of TRIDENT. The benefits of this feature can be appreciated better at high loads (approaching 0.99 input load).
The uniform distribution of the traffic and the load-balancing stage help to attain this low queueing delay and high throughput. Figures 4, 5, 6, and 7 show that the queueing delay difference between TRIDENT and the OQ switch is not significant. They also show that TRIDENT outperforms the SMM switch for uniform traffic at high input loads. Because it uses load balancing at the bufferless IMs, the SMM switch matches the high performance of TRIDENT at low input loads. However, the configuration complexity at CMs and OMs of the SMM switch affects its performance at high input loads. In addition, SMM also forwards cells out of sequence. These figures also show that the effective load balancing of TRIDENT reduces the average delay and also eliminates the offset in delay for a light load.
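The bursty arrival process used in these experiments can be sketched as a two-state source. The generator below is a minimal sketch under assumptions that the text does not state: ON periods are geometrically distributed with mean l and carry one cell per time slot, OFF periods are geometrically distributed (possibly of zero length) with mean l(1 − ρ)/ρ so that the long-run load is ρ, and destinations are not modeled. The function name and parameters are hypothetical.

```python
import random


def on_off_arrivals(load, burst_len, slots, seed=0):
    """Return a list of 0/1 cell arrivals for one input port.

    load      : target offered load rho, with 0 < rho <= 1
    burst_len : average ON duration l, in cells
    slots     : number of time slots to generate
    """
    rng = random.Random(seed)
    p_end_on = 1.0 / burst_len                    # ends a burst after each cell
    mean_off = burst_len * (1.0 - load) / load    # assumed mean OFF duration
    p_end_off = 1.0 / (1.0 + mean_off)            # geometric on {0, 1, 2, ...}
    arrivals = []
    while len(arrivals) < slots:
        # ON period: at least one cell, geometric with mean burst_len.
        arrivals.append(1)
        while rng.random() >= p_end_on:
            arrivals.append(1)
        # OFF period: idle slots, geometric with mean mean_off.
        while rng.random() >= p_end_off:
            arrivals.append(0)
    return arrivals[:slots]


if __name__ == "__main__":
    for l in (10, 30):
        cells = on_off_arrivals(load=0.5, burst_len=l, slots=200_000, seed=1)
        print(f"l={l}: measured load {sum(cells) / len(cells):.3f}")
```

Allowing zero-length OFF periods keeps the model usable at loads close to 1, where the mean idle time between bursts is shorter than one time slot.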
B. Nonuniform Traffic
We also evaluated the performance of TRIDENT, MMM, MM e M, and OQ switches under nonuniform traffic. We adopted the unbalanced traffic model [31] , [33] as a nonuniform traffic pattern. The nonuniform traffic can be modeled using an unbalanced probability ω to indicate the load variances for different flows. Consider input port IP (i, s) and output port OP (j, d) of TRIDENT, the traffic load is determined by
where ρ is the input load for input IP(i, s) and ω is the unbalanced probability. When ω = 0, the input traffic is uniformly distributed, and when ω = 1, the input traffic is completely directional: all traffic from IP(i, s) is destined for OP(j, d). Figure 8 shows the throughput of the TRIDENT, SMM, MMM, and MM e M switches. The figure shows that the TRIDENT switch attains 100% throughput under this traffic pattern for all values of ω, matching the performance of SMM and MM e M and outperforming that of MMM. These three buffered switches are known to achieve high throughput at the expense of out-of-sequence forwarding.
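As an illustration of how ω shapes the offered load, the following sketch builds an N × N load matrix. It assumes the common form of the unbalanced model, in which a fraction ω of each input's load goes to one dedicated output and the remainder is spread evenly over all N outputs. The flat port index, which stands in for the two-level labels IP(i, s) and OP(j, d), and all function names are hypothetical.

```python
def unbalanced_load_matrix(rho, omega, n_ports):
    """Return matrix[p][q]: the load offered by input p to output q.

    Assumed model: input p sends rho * omega to its dedicated output q = p
    and spreads the remaining rho * (1 - omega) evenly over all outputs.
    """
    uniform_share = rho * (1.0 - omega) / n_ports
    matrix = [[uniform_share] * n_ports for _ in range(n_ports)]
    for port in range(n_ports):
        matrix[port][port] += rho * omega
    return matrix


def is_admissible(matrix, tol=1e-9):
    """Check that no input (row) and no output (column) is loaded above 1."""
    n = len(matrix)
    rows_ok = all(sum(row) <= 1.0 + tol for row in matrix)
    cols_ok = all(sum(matrix[r][c] for r in range(n)) <= 1.0 + tol for c in range(n))
    return rows_ok and cols_ok


if __name__ == "__main__":
    loads = unbalanced_load_matrix(rho=0.9, omega=0.6, n_ports=4)
    for row in loads:
        print(" ".join(f"{x:.3f}" for x in row))
    print("admissible:", is_admissible(loads))
```

Under this assumed form, every row and column sums to ρ, so the pattern is admissible for any ω whenever ρ ≤ 1.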
We also tested the average queueing delay of TRIDENT under this nonuniform traffic. It has been shown that many switches do not achieve high throughput when ω is around 0.6 [33]. Therefore, we measured the average delay of TRIDENT under this unbalanced probability, as Figures 9 and 10 show for N = 64 and N = 256, respectively, and compared it with that of the MMM, SMM, MM e M, and OQ switches. Note that, due to the limited scalability of MMM and MM e M, the comparison for N = 256 under these traffic conditions includes only the SMM and OQ switches. Figure 10 shows that the delay of TRIDENT is lower than the delay of SMM under high input loads.
As Figure 9 shows for N = 64, the average delay of TRIDENT is lower than that of SMM, MMM, and MM e M under high input loads, and it is comparable to that of an OQ switch. The performance difference between TRIDENT and OQ is similarly small for N = 256, as Figure 10 shows. These results are achieved because the load-balancing stage of TRIDENT distributes the traffic uniformly throughout the switch. Therefore, the queueing delay is similar to that observed under uniform traffic. These results also show that the high switching performance of TRIDENT is not affected by the in-sequence mechanism of the switch and that the load-balancing effect is more noticeable under nonuniform traffic.

In addition to the analysis in Section II-C, we tested the impact of the CB size through computer simulations. We measured the average delay under unbalanced traffic and the throughput under port-based hot-spot traffic for three TRIDENT switches with CB sizes of k 2 (short queue), N 2 (long queue), and ∞. Figure 11 shows that the size of the crosspoint buffer does not affect the switch performance, as the delay is equivalent for all three CB sizes. The TRIDENT switches, each with a different crosspoint buffer size, attain 100% throughput under the port-based hot-spot traffic model. These results corroborate our CB-size analysis.
VII. CONCLUSIONS
We have introduced TRIDENT, a three-stage load-balancing packet switch, and a low-complexity scheme for configuring the switch and forwarding cells in sequence. To perform load balancing effectively, TRIDENT has virtual output module queues between the IM and CM stages. The IMs and CMs are bufferless modules, while the OMs are buffered. All the bufferless modules of TRIDENT follow a predetermined configuration, while each OM uses round-robin scheduling to select the flow to be served and selects the cell of that flow to forward to an output port based on the cell's arrival order. Because of the buffers at the crosspoints of the OMs, the switch needs no port matching, and its configuration complexity is minimal, making it comparable to that of MMM switches. To avoid the out-of-sequence forwarding that the central buffers could cause, we introduced an in-sequence mechanism that operates at the outputs and is based on arrival-order information inserted at the inputs of TRIDENT. We modeled and analyzed the operations at each stage and how they affect the incoming traffic to obtain the loads seen by the output ports. We showed that, for admissible independent and identically distributed traffic, the switch achieves 100% throughput. This high performance is achieved without resorting to speedup or switch expansion. In addition, we analyzed the operation of the forwarding mechanism and demonstrated that it forwards cells in sequence. We showed, through computer simulation, that the switch achieves 100% throughput under all tested uniform and nonuniform traffic patterns.
