Abstract-Recently, an innovate switch architecture named Contention-Tolerant Crossbar switch, CTC(N), was proposed. Without resolving output contentions, the controllers are able to fully distributed in CTC(N). It largely reduces the scheduling complexity. However, It has been proved that the saturated switch throughput is bounded by 63% without any scheduling algorithms. In this paper, we present an implementation scheme named Two-Stage Contention-Tolerant Crossbar, denoted as TCTC(N, k). TCTC(N, k) uses Contention-Tolerant Crossbar as its basic switch component. And we will theoretically prove that TCTC(N, k) achieves high throughput with small size CTC components and without complex hardware and internal speedup.
I. INTRODUCTION
The switch fabric, as the core part of switches and routers, delivers the packets (cells) arriving at input ports to their targeted output ports. Because of their simplicity, crossbars have been widely used in commercial switches. In crossbar switches, input ports and output ports are connected by setting each crosspoint of the crossbar to the cross state or the bar state. Multiple input ports may have cells destined for the same output port; however, without speedup an output port can receive only one cell in a time slot, and all other cells have to remain in the input port buffers. How to resolve output contentions and optimize performance using scheduling algorithms, with or without speedup, has been a hot topic for decades. In order to overcome Head-of-Line (HoL) blocking, the buffers in each input port are arranged as Virtual Output Queues (VOQs). Maximum matching algorithms can guarantee optimal performance by computing maximum size or maximum weight matchings, e.g. [1], [2]. However, a centralized scheduler needs at least O(N^{2.5}) time, which is hard to sustain in high-speed, large-scale networks. Iterative heuristics for finding maximal matchings were considered instead, which are usually implemented as 2N arbiters cooperating through iterative RGA (Request-Grant-Accept) signal exchange. Representative works are PIM [3] and iSLIP [4]. It has been proved that O(log N) iterations are required to obtain a maximal matching. Although implemented in hardware, these schedulers are considered too slow and too costly for high-speed networks. In addition, to resolve output contentions and achieve high performance in a conventional crossbar, all input and output ports are involved in the scheduling process, which limits the scale of a single-chip switch even with VLSI techniques. In order to reduce scheduler complexity, a small buffer was introduced at each crosspoint of the crossbar. Such a switch is called a buffered crossbar switch.
The scheduling process of a buffered crossbar operates in two phases. In the first phase, each input port selects a cell to place into a crosspoint buffer in its corresponding row, and in the second phase, each output port selects a crosspoint in its corresponding column to take a cell from. Input (resp. output) ports operate independently and in parallel in the first (resp. second) phase, eliminating a single centralized scheduler. Crosspoint buffers serve as a decoupling mechanism that separates distributed, parallel input scheduling (first phase) from output scheduling (second phase). Works on buffered crossbar switches with or without internal speedup include, for example, [5]-[9]. The cost of the crosspoint buffers, which require at least O(c · N^2) memory space, where c is the number of bits in a cell, is traded for reduced control complexity. Moreover, the crosspoint buffers and the scheduler circuits take a large chip area, which severely restricts the scalability of buffered crossbar switches.
Recently, we proposed an innovative switch architecture called the Contention-Tolerant Crossbar Switch, CTC(N) [10]. CTC(N) tolerates output conflicts automatically, so the schedulers are fully distributed over the inputs, avoiding central control and signal exchange. This greatly reduces the scheduling complexity. In this paper, we present a two-stage CTC architecture called TCTC(N, k). TCTC(N, k) is implemented with small CTCs as its basic switch components, and it significantly reduces crosspoint complexity. By analyzing the queueing model of TCTC(N, k), we prove that it achieves high switch throughput with k = 2 
II. RELATED WORK
The fabric of CTC(N) comprises N^2 crosspoints arranged as an N-by-N array, as shown in Fig. 1(a). Each crosspoint is a Switch Element (SE). The SE at row i and column j is denoted as SE_{i,j} and has two states, the Cross state (CR state) and the Receive-and-Transmit state (RT state), as shown in Fig. 1(b).
The SEs are initialized to the CR state and are controlled by the input ports in a synchronized fashion. Each input port is equipped with a scheduler. At the beginning of each time slot, if input port i wants to transmit a cell to output port j, it sets SE_{i,j} to the RT state while all other SEs in row i remain in the CR state; otherwise, all SEs in row i remain in the CR state. In this way, each output column is dynamically partitioned into several segments so that parallel cell transmissions are performed on these segments concurrently, as shown in Fig. 1(c). It is possible that multiple input ports send cells to the same output port j during a time slot. In such a case, all cells but one are intercepted and buffered by downstream input ports, and the cell from the lowest input port is transmitted to output port j. Output conflicts in column j are thus avoided automatically without losing cells. Because output conflicts need not be resolved, the N schedulers are fully distributed over the input ports and operate independently with zero knowledge of other input ports. These attractive properties make CTC(N) more scalable than conventional crossbars.
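The column-partitioning behavior described above can be illustrated with a minimal Python sketch of a single time slot. It is not taken from [10]. It assumes that rows are indexed from upstream (0) to downstream (N-1), that the "lowest" input is the one with the largest row index, and that an intercepted cell is buffered by the nearest downstream input that set an SE to the RT state in the same column. The function name `ctc_slot` and its return structure are hypothetical.

```python
from typing import Optional


def ctc_slot(requests: list[Optional[int]]) -> tuple[dict[int, int], dict[int, int]]:
    """Resolve one time slot of a CTC(N) fabric (illustrative sketch).

    requests[i] is the output column whose SE input i sets to the RT state,
    or None if input i leaves its whole row in the CR state.

    Returns (delivered, intercepted):
      delivered[j]   = input whose cell reaches output j in this slot
      intercepted[i] = downstream input that buffers the cell sent by input i
    """
    intercepted: dict[int, int] = {}
    last_sender: dict[int, int] = {}  # column -> closest upstream sender so far

    # Walk the rows from upstream to downstream (assumed index order).
    for i, j in enumerate(requests):
        if j is None:
            continue
        if j in last_sender:
            # The RT-state SE of input i ends the segment above it, so the
            # cell sent by the previous sender in column j is received here.
            intercepted[last_sender[j]] = i
        last_sender[j] = i

    # The last (most downstream) sender in each column reaches the output.
    delivered = dict(last_sender)
    return delivered, intercepted


if __name__ == "__main__":
    delivered, intercepted = ctc_slot([2, None, 2, 0, 2])
    print("delivered:", delivered)
    print("intercepted:", intercepted)
```

In this example, inputs 0, 2, and 4 all target output 2. Only the cell of input 4 reaches the output, while the cells of inputs 0 and 2 are buffered by inputs 2 and 4, respectively, so no cell is lost.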
In [10], we also developed a mathematical model of CTC(N) using queueing theory and analyzed the existing issues of the CTC architecture: 1. Without internal speedup, the saturated throughput, i.e. the throughput under full offered load, decreases as the switch size increases. The saturated throughput of CTC(N) with a single FIFO queue is about 63% in the worst case. 2. Inputs further downstream suffer from heavier overload, which reduces throughput. 3. Cells from upstream inputs may be intercepted by downstream inputs. A larger number of downstream inputs causes a longer worst-case travel path for cells, and cells might then suffer from the out-of-sequence problem. Our subsequent work addressed these issues by presenting improved architectures and scheduling algorithms.
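A figure close to 63% can be reproduced with a short calculation. The sketch below is not the queueing model of [10]. It assumes that every saturated input independently targets a uniformly random output in each slot, so that an output receives a cell whenever at least one input targets it. Under this simplifying assumption the throughput is 1 - (1 - 1/N)^N, which decreases with N and tends to 1 - 1/e. The function name `saturated_throughput` is hypothetical.

```python
import math


def saturated_throughput(n: int) -> float:
    """Probability that a given output receives a cell in a slot when each of
    n saturated inputs picks an output uniformly at random (assumption)."""
    return 1.0 - (1.0 - 1.0 / n) ** n


if __name__ == "__main__":
    for n in (2, 4, 16, 64, 1024):
        print(f"N = {n:5d}  throughput = {saturated_throughput(n):.4f}")
    print(f"limit 1 - 1/e      = {1.0 - math.exp(-1.0):.4f}")
```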
High throughput can be achieved by using a sophisticated scheduling algorithm. Reference [11] proposed a fully distributed scheduling algorithm called Staggered Polling (SP). With this algorithm, the queues in each input port are arranged as N FIFO queues, one for each output port, called Virtual Output Queues (VOQs). Each scheduler is composed of two subschedulers, a primary scheduler and a secondary scheduler. The primary scheduler in each input port chooses a specific VOQ to serve in a round-robin pattern, such that different primary schedulers serve different outputs in the same time slot. In this way, interceptions on the output lines can be avoided. When the VOQ that should be served by a primary scheduler is empty, the corresponding secondary scheduler chooses a non-empty VOQ to serve under some preset scheduling strategy. With this scheduling algorithm, high performance is achieved under Bernoulli i.i.d. uniform traffic. Under bursty traffic, however, it does not perform well.
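The staggered selection can be sketched in a few lines of Python. This is an illustration, not the algorithm of [11]: the staggering rule `(i + t) mod N` and the secondary strategy (scan the remaining VOQs in round-robin order starting after the primary one) are assumptions chosen for simplicity, and the names `primary_voq` and `sp_select` are hypothetical.

```python
from typing import Optional


def primary_voq(i: int, t: int, n: int) -> int:
    """VOQ polled by the primary scheduler of input i in slot t.

    Assumed staggering rule: in any slot, the n inputs poll n distinct outputs.
    """
    return (i + t) % n


def sp_select(i: int, t: int, voq_lengths: list[int]) -> Optional[int]:
    """Return the VOQ that input i serves in slot t, or None if all are empty."""
    n = len(voq_lengths)
    j = primary_voq(i, t, n)
    if voq_lengths[j] > 0:
        return j  # primary scheduler: no interception among primary choices
    # Secondary scheduler (assumed strategy): first non-empty VOQ after j.
    for step in range(1, n):
        candidate = (j + step) % n
        if voq_lengths[candidate] > 0:
            return candidate
    return None


if __name__ == "__main__":
    n = 4
    for t in range(n):
        print(t, [primary_voq(i, t, n) for i in range(n)])
```

Each printed row is a permutation of the outputs, which is why cells chosen by primary schedulers never contend for the same column. Contention can still arise from secondary choices.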
In order to increase the performance, we discussed several improved architectures [10], [12]-[14]. In [10], [12], we proved that 100% throughput is achieved with two planes of CTC(N) or with a speedup of two. In addition, a queueing model was developed to analyze the cell delay in CTC(N), and the mathematical results were validated by simulation [12]. In [13], we presented a refined version of CTC(N), named DiaCTC. By rearranging the crosspoints only, it achieves high performance with the SP scheduling algorithm unchanged. Article [14] proposed a parallelized version of CTC(N) named PCTC(N). In PCTC(N), the entire fabric is divided into several regions, which operate independently and in parallel and thereby greatly improve the performance.
Since cells from upstream input ports might be intercepted and buffered by downstream input ports, the CTC architecture suffers from the cell out-of-sequence problem. We discussed this issue in [15], [16]. In [15], a fully distributed scheduling algorithm called the SELF-ADJUSTED scheduling algorithm was proposed. With this algorithm, each input port has an independent scheduler, and the queues are arranged as N VOQs and an Upstream Queue (UQ). If the UQ is non-empty, which means that cells from upstream input ports still exist, the scheduler serves the UQ; otherwise, the VOQs are served in round-robin order. In this way, cells arrive at the output port in their original order. In [16], we developed an analytical model named Multilevel Contention-Tolerant Crossbar, denoted as MLCTC(N). It simplifies the queueing behavior and can be described mathematically as an open queueing network. Simulation results confirm the correctness of MLCTC(N). Using MLCTC(N), we also discussed the speedup that CTC(N) needs to match an OQ switch.
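The queue selection rule of the SELF-ADJUSTED algorithm described above (serve the UQ first, otherwise serve the VOQs in round-robin order) can be sketched as follows. This is a minimal illustration, not the implementation of [15]. The class name `InputPort`, the round-robin pointer update, and the use of plain Python objects as cells are assumptions.

```python
from collections import deque
from typing import Any, Optional


class InputPort:
    """One input port with N VOQs and one Upstream Queue (UQ)."""

    def __init__(self, n: int) -> None:
        self.voqs: list[deque] = [deque() for _ in range(n)]
        self.uq: deque = deque()  # cells intercepted from upstream inputs
        self.rr = 0               # round-robin pointer over the VOQs

    def next_cell(self) -> Optional[Any]:
        """Pick the cell to transmit in the current slot, or None if idle."""
        # Cells from upstream inputs take priority so that they are not
        # overtaken by cells that arrived at this port directly.
        if self.uq:
            return self.uq.popleft()
        n = len(self.voqs)
        for step in range(n):
            j = (self.rr + step) % n
            if self.voqs[j]:
                # Assumed pointer update: continue after the served VOQ.
                self.rr = (j + 1) % n
                return self.voqs[j].popleft()
        return None


if __name__ == "__main__":
    port = InputPort(3)
    port.voqs[1].append("local cell for output 1")
    port.voqs[2].append("local cell for output 2")
    port.uq.append("intercepted upstream cell")
    while (cell := port.next_cell()) is not None:
        print(cell)
```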
The Contention-Tolerant Crossbar switch architecture opens a new design space for switches and leaves many challenges to overcome as well. Although we have discussed several issues and improvements, these challenges can be addressed in different directions. In this paper, we consider implementing a high-throughput CTC in an innovative way.
III. TWO-STAGE CONTENTION-TOLERANT CROSSBAR
In this paper, we introduce an implementation scheme of CTC(N) called the Two-Stage Contention-Tolerant Crossbar Switch, denoted as TCTC(N, k), where N is the number of input/output ports of the switch and k is the number of inputs/outputs of each switch module. We investigate the performance of TCTC(N, k) in terms of switching throughput, which is defined as the ratio of the average number of cells arriving at output ports to the average number of cells arriving at input ports. In order to simplify the analysis, we assume that each input of an IM or OM has a buffer arranged as a FIFO queue for arriving cells. IMs have no output buffers, i.e. a cell switched through the fabric of an IM is forwarded to its corresponding OM and is buffered in that OM's input buffer. The arriving traffic follows a Bernoulli i.i.d. uniform pattern.
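For readers who want to check the analysis by simulation, the Bernoulli i.i.d. uniform traffic assumption can be generated as sketched below. The function name `bernoulli_uniform_arrivals` and the parameter `load` (the per-input arrival probability per slot) are hypothetical and do not appear in the analysis.

```python
import random
from typing import Optional


def bernoulli_uniform_arrivals(
    n: int, load: float, rng: random.Random
) -> list[Optional[int]]:
    """Arrivals for one time slot at n inputs.

    Each input independently receives a cell with probability `load`
    (Bernoulli i.i.d.); the destination of each cell is drawn uniformly
    from the n outputs. An entry of None means no arrival at that input.
    """
    return [rng.randrange(n) if rng.random() < load else None for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed for repeatable runs
    slots = 10_000
    n = 8
    cells = sum(
        dest is not None
        for _ in range(slots)
        for dest in bernoulli_uniform_arrivals(n, 0.6, rng)
    )
    print("measured arrival rate per input:", cells / (slots * n))
```

Counting the cells that leave the output ports over the same run and dividing by the number of arrivals gives an empirical estimate of the switching throughput as defined above.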
From the property of CTC(N), a cell leaves Q_k for its downstream Q_i if and only if both transmit their cells to the same output column in the same time slot. Here p_{i,j} is the probability that a cell is chosen for transmission from Q_i to O_j.
Let j  be the average rate of cells arriving at O_j.
Combining the two cases above, we have equations (4).
CTC(N) can be seen as CTC(N,N [7] prove the correctness of theoretical results.
B. Throughput Analysis of TCTC(N, k)
In order to identify the different parameter of IM and 
According to the definition of switch throughput, we obtain:
Combining (4)-(7) and (8), the switch throughput of TCTC(N, k) can be computed.
V. CONCLUSION
As proved in [7], the throughput of CTC(N) with FIFO input queues is bounded by 63%. In this paper, we presented a new architecture built from small CTC components, called the Two-Stage Contention-Tolerant Crossbar and denoted as TCTC(N, k). We proved that it achieves high throughput by developing its queueing model.
