We propose a novel architecture, a Combined Input-Crosspoint-Output Buffered (CIXOB-k, where k is the size of the crosspoint buffer) switch. The CIXOB-k architecture provides 100% throughput under uniform and unbalanced traffic. It also provides timing relaxation and scalability. CIXOB-k is based on a switch with Combined Input-Crosspoint Buffering (CIXB-k) and round-robin arbitration. CIXB-k performs better than a non-buffered crossbar that uses the iSLIP arbitration scheme. CIXOB-k uses a small speedup to provide 100% throughput under unbalanced traffic. We analyze the effect of the crosspoint buffer size and the switch size under uniform and unbalanced traffic for CIXB-k. We also describe solutions for relaxing the crosspoint memory requirement and for scaling a CIXOB-k switch to a large number of ports.
I. INTRODUCTION
The explosion of Internet traffic has led to a greater need for high-speed switches and routers that have over 1-Tbit/s throughput [1]. Crossbar switching fabrics are very popular for switch implementation because of their non-blocking capability, simplicity, and market availability. A switch with a crossbar fabric and queues at the output ports to store those cells that could not be sent to the output lines is called Output Buffered (OB). In an OB switch, all cells coming to an input are forwarded to the destined outputs as they arrive. This architecture is not scalable because the required internal bandwidth or speedup (S), defined as the number of times that the switch core works faster than the input line rate, is equal to the number of ports (S = N); the working speed of the switch core and the output memory makes an OB switch implementation infeasible for even a medium-sized switch. However, this architecture has been used as a comparison reference for any switch model because of its desirable characteristics, such as high throughput and low delay.
Crossbars could have input queues to store those cells (or packets) that could not go through because of contention for an output; this architecture is known as Input Buffered (IB). An IB architecture is scalable and its implementation does not have the restrictions of the OB model because the core fabric works at the input line rate (S = 1). However, IB switches need to resolve input and output contention by means of arbiters at the inputs and outputs. The requirements for such arbiters are (a) low complexity, (b) fast contention resolution, and (c) high efficiency to provide high performance. Low complexity is needed to make implementation feasible. For a high-capacity switch, fast resolution is necessary so that the arbiter can select a cell among those eligible in the allotted time.
This research is supported by NSF Grant 9906673. Roberto Rojas-Cessa is with Polytechnic University, 6 Metrotech Center, Brooklyn, NY 11201 USA. Email: rrojas@kings.poly.edu. Eiji Oki is with NTT Network Innovation Laboratories, 3-9-11 Midoricho Musashino-shi, Tokyo 180-8585 Japan. This work was done while he was a Visiting Scholar at Polytechnic University, Brooklyn, NY. Email: oki.eiji@lab.ntt.co.jp. H. Jonathan Chao is with Polytechnic University, Brooklyn, NY 11201 USA. Email: chao@poly.edu.
Head of Line (HOL) blocking is a well-known problem for a crossbar with FIFOs at the inputs [2]. This problem is overcome by using separate queues at the inputs, one for each output. This queuing system is called Virtual Output Queuing (VOQ). For a crossbar with VOQs, maximum matching algorithms have been proposed to achieve 100% throughput. Maximum matching algorithms are efficient, but their complexity is so high [3] that implementation is infeasible for high-speed systems. Schemes based on maximum size or weight matching, like Longest Port Queuing (LPQ), Oldest Cell First (OCF), and Longest Port First (LPF) [4], have been proposed [3].
Maximal matching schemes have been considered as an alternative to maximum matching schemes; iSLIP [5], Dual Round-Robin Matching (DRRM) [6], [7], and Longest Output Occupancy First Algorithm (LOOFA) [8] are examples. To make up for the lower efficiency of a maximal scheme (compared to a maximum one), speedup, a number of iterations (the number of times that an algorithm is performed in a single scheduling cycle to obtain a cumulative result), or a combination of both is used, as in LOOFA. iSLIP is a good example of an iterative matching scheme. Although iSLIP provides 100% throughput for uniform independent traffic, its arbitration time, the amount of connection state it keeps, and its centralized implementation have limited it to a small number of ports (e.g., 32) [9]. The request, grant, and acknowledge phases are transmitted between input and output arbiters within a cell slot. This exchange reduces the time available for arbitration because the phases are performed in series with input and output arbitration during the cell slot, even when the transmissions stay within a single chip (so that the off-chip delay is avoided). Another drawback of the proposed single-chip centralized implementation is that the pin count limits the number of ports.
A switch architecture using speedup, such that 1 < S < N, is called Combined Input and Output Buffered (CIOB), where queues are placed at the inputs and outputs. As the demand for high switching rates increases, this speedup becomes a bottleneck since the time available for arbitration shrinks to the cell slot duration divided by S. The DRRM scheme uses speedup instead of a number of iterations to improve the matching performance. Although the overhead information exchanged between input and output arbitration is reduced in this scheme, the arbitration time becomes insufficient for a switch with a large number of ports and a high port speed. For a long time, buffered crossbars have been considered as an alternative to non-buffered crossbars for improving switching throughput. However, it is known that the number of buffers grows in the same order as the number of crosspoints, $O(N^2)$ (where N is the number of ports), making implementation infeasible because of the memory required by a large buffer or a large N.
In pure buffered crossbars (a pure buffered crossbar has buffering only at the crosspoints and nowhere else), a large crosspoint buffer has been utilized to minimize cell loss. The number of ports is limited by the amount of memory that can be implemented in a module chip. An example of this architecture was proposed in [11], where a 2 × 2 switch module with a crosspoint memory of 16 kbytes each was implemented. In this architecture, a large crosspoint buffer is needed to store all those cells that could not be switched to the output port to comply with the required cell loss rate.
To reduce the memory in the crosspoint buffer, input queues are used. FIFO input queues have been proposed, but HOL blocking remains in this architecture, as in a non-buffered crossbar. Examples of these architectures were presented in [12], [13], and [14]. A buffered crossbar with a single-cell buffer was proposed in [12] and [13], together with a FIFO input buffer at the input ports. This architecture provides an improvement over non-buffered crossbars with FIFO input buffers. The well-known limited throughput of a FIFO input-buffered architecture of about 58% was improved to 91% with a priority scheme (also called HOL blocking scheme by the same author). However, the FIFO buffers at the inputs limit the maximum throughput of this architecture because HOL blocking cannot be completely eliminated. In [14] a similar architecture with a 4-cell crosspoint buffer is considered. This buffered crossbar, used with 32-cell input FIFOs, achieves an acceptable cell loss rate ($10^{-8}$). In this architecture, a flow control mechanism is also used to avoid cell loss at the core. All cell loss occurs at the input FIFO for a very congested output. This study shows that with input FIFOs, a small crosspoint buffer, and a flow control mechanism, the cell loss rate can be kept small and HOL blocking diminished to a certain degree.
As with maximal matching schemes in non-buffered crossbars, the HOL blocking problem of FIFO input buffers can be overcome in a buffered crossbar by using VOQs.
100% throughput is obviously achieved for a buffered crossbar with infinite crosspoint buffer sizes [15], [16], [17]. To our knowledge, no minimum finite memory size has been specifically proven to provide 100% throughput for a buffered crossbar.
A Combined Input-Crosspoint Buffered (CIXB-1) switch model was proposed in [18], where the crosspoint buffer has a size of one cell, with VOQs at the inputs and simple round-robin for input and output arbitration. That work showed that the combination of input buffers, single-cell crosspoint buffers, and a round-robin arbitration scheme provides 100% throughput under uniform traffic. A VOQ structure is provided in the input buffers. In this architecture, input and output arbitration are more independent than in a non-buffered crossbar model using maximal matching arbitration, which reduces the arbitration time complexity. Also, in a CIXB-k switch, where k is the crosspoint buffer size, the arbitration can be performed during a complete cell slot. The arbitration time can be separated from the transmission time, allowing synchronization flexibility and consideration of a large number of ports. The properties of a buffered crossbar allow a simplification of the arbiter design and the adoption of distributed arbiters. In this switch, fixed-size cells are transmitted through the switching fabric to ease implementation.
It has been shown that CIXB-1 has a better delay performance for uniform and unbalanced traffic compared to a non-buffered crossbar with iSLIP arbitration.
However, CIXB-1 does not provide 100% throughput for unbalanced traffic, and its timing relaxation is limited, since the transmission distance between the port cards and the crossbar must correspond to less than one time slot. It is desirable to provide 100% throughput under unbalanced traffic and to relax the timing to a larger degree. It is important to show the trade-off between the value of k and the timing relaxation. When adopting a CIXB-k architecture, a small module has to be implemented because of the pin count and memory limitations. To implement a switch with a large N, several CIXB-k modules have to be placed in a two-dimensional array. However, since two or more CIXB-k modules then feed the same output and since the distributed arbiters within a module are independent, timing relaxation at the outputs of the modules and synchronization may be lost.
In this paper we present a Combined Input-Crosspoint-Output buffered (CIXOB-k, where k is the crosspoint buffer size) switch. This model keeps the advantages of CIXB-k and provides 100% throughput under uniform and unbalanced traffic. We show the effect of CIXB-k when increasing the crosspoint buffer size under unbalanced traffic and observe that with a minimum speedup, 100% throughput can be achieved. CIXOB-k uses a smaller speedup than that of a non-buffered crossbar with a round-robin arbitration scheme to provide 100% throughput under unbalanced traffic.
Since the amount of memory is limited by the available VLSI technology, we discuss a solution that makes a large amount of memory feasible to implement. Also, we show that CIXOB-k scales to a large number of ports while maintaining the properties of CIXB-k and keeping the memory amount low. Our solution for scalability can also be used for a CIXB-k switch model. CIXOB-k also provides a solution for timing relaxation.
This paper is organized as follows. Section II describes the CIXOB-k switch model and its properties. Section III shows the performance of the described architecture. Section IV shows a solution for scalability. Section V presents our conclusions.
II. SWITCH MODEL
In this section, we describe the CIXOB-k switch model. We introduce the proposed architecture and the terminology used in the rest of the paper. We consider fixed-size packets, called cells. A variable-length packet can be segmented into cells for internal switching and reassembled before it leaves the switch. The transmission time of a cell has a fixed length, called a cell slot or time slot.
Our switch model has a structure as shown in Figure 1 and described below:
Input Queue. There are N VOQs at each input port. A VOQ at input i that stores cells for output j is denoted as $VOQ_{i,j}$.
Crosspoint Buffer (XPB). Each crosspoint has a one-cell buffer. Only those inputs with a cell in the crosspoint buffer are considered for output arbitration.
Output Queue. There is an output queue at each output port to receive a transmitted cell before the cell leaves the switch.
Flow Control. A flow control mechanism tells input port i which $XPB_{i,j}$ is occupied, so that $VOQ_{i,j}$ is inhibited (a non-empty VOQ is considered eligible for input arbitration if this VOQ is not inhibited). In this paper, we consider a credit-based flow control mechanism, unless otherwise specified.
Round Trip (RT). We define RT as seen from the input port. RT is the sum of the cell transmission delay from a port card to the crossbar ($d_1$), the output arbitration time (OA), and the transmission delay of the flow control information back to the port card ($d_2$), as shown in Figure 2. Cell and data alignments are included in the transmission times. In the general case, as presented in [18]:
$RT = d_1 + OA + d_2 \le k - IA$ (1)
where IA is the input arbitration time and k is the crosspoint buffer size (e.g., CIXOB-4 means k = 4), with all times measured in time slots. The constraints for IA and OA are:
$IA \le 1$ and $OA \le 1$.
Arbitration. Round-robin arbitration is used at the input and output ports. An input arbiter selects a VOQ, among the eligible VOQs, to send a cell to the buffered crossbar (BX). Request eligibility for VOQs is determined by the flow control mechanism. An output arbiter selects a buffered cell among the non-empty crosspoint buffers to forward to the output.
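The round-trip condition can be checked numerically. The sketch below is illustrative only: it assumes that Eq. (1) reads RT = d1 + OA + d2 ≤ k − IA, with IA ≤ 1 and OA ≤ 1 and all times in time slots. The function names are hypothetical and do not come from [18].

```python
import math


def round_trip(d1: float, oa: float, d2: float) -> float:
    """RT in time slots: port card to crossbar, output arbitration, credit return."""
    return d1 + oa + d2


def timing_ok(k: int, d1: float, oa: float, d2: float, ia: float) -> bool:
    """Check the assumed form of Eq. (1) together with IA <= 1 and OA <= 1."""
    if ia > 1 or oa > 1:
        return False
    return round_trip(d1, oa, d2) <= k - ia


def min_buffer_size(d1: float, oa: float, d2: float, ia: float) -> int:
    """Smallest integer k (in cells) that satisfies RT + IA <= k."""
    return max(1, math.ceil(round_trip(d1, oa, d2) + ia))


if __name__ == "__main__":
    # Half a slot each way, one slot per arbitration: RT = 2, so k >= 3.
    print(min_buffer_size(d1=0.5, oa=1.0, d2=0.5, ia=1.0))
```

Under these assumptions, a longer distance between the port cards and the crossbar (larger d1 and d2) translates directly into a larger required k.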
A. Properties of CIXOB-k
In this section, we list the properties of CIXOB-k.
Property 1: As originally described in [18], CIXB-1 provides 100% throughput under uniform traffic. CIXOB-k inherits this property since it behaves as a CIXB-k when S = 1 (i.e., no speedup).
1) is satisfied.
III. PERFORMANCE STUDY
To observe the performance of CIXOB-k, we simulated CIXB-k. In this section, we present the simulation results of a 32 × 32 CIXB-k under Bernoulli and bursty (on-off model) traffic with a uniform distribution, and under unbalanced traffic. We compare the performance of CIXB-k with an OB switch and a non-buffered crossbar with iSLIP arbitration under uniform traffic. Also, we compare CIXB-k with IB switches that use a non-buffered crossbar with the iSLIP and Parallel Iterative Matching (PIM) [19] arbitration schemes under unbalanced traffic. We also show the effect of different buffer sizes of CIXB-k under unbalanced traffic.
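A slot-level simulation of this kind can be sketched as follows. This is a minimal illustration, not the simulator used for the reported results. It assumes Bernoulli arrivals with uniformly distributed destinations, zero-delay credit return (a crosspoint freed by output arbitration is usable by the input in the same slot), queues modeled as occupancy counters, and a small default switch size so that it runs quickly. The name simulate_cixb and its parameters are hypothetical.

```python
import random


def simulate_cixb(n=8, k=1, load=0.9, slots=5000, seed=1):
    """Return (offered, carried) load per port for an n x n CIXB-k sketch."""
    rng = random.Random(seed)
    voq = [[0] * n for _ in range(n)]  # voq[i][j]: cells at input i for output j
    xpb = [[0] * n for _ in range(n)]  # xpb[i][j]: cells held at crosspoint (i, j)
    in_ptr = [0] * n                   # round-robin pointer of each input arbiter
    out_ptr = [0] * n                  # round-robin pointer of each output arbiter
    arrived = departed = 0
    for _ in range(slots):
        # Bernoulli arrivals, destination chosen uniformly at random.
        for i in range(n):
            if rng.random() < load:
                voq[i][rng.randrange(n)] += 1
                arrived += 1
        # Output arbitration: each output serves one non-empty crosspoint buffer.
        for j in range(n):
            for step in range(n):
                i = (out_ptr[j] + step) % n
                if xpb[i][j] > 0:
                    xpb[i][j] -= 1
                    departed += 1
                    out_ptr[j] = (i + 1) % n
                    break
        # Input arbitration: each input sends one cell from an eligible VOQ,
        # i.e. a non-empty VOQ whose crosspoint buffer still has room.
        for i in range(n):
            for step in range(n):
                j = (in_ptr[i] + step) % n
                if voq[i][j] > 0 and xpb[i][j] < k:
                    voq[i][j] -= 1
                    xpb[i][j] += 1
                    in_ptr[i] = (j + 1) % n
                    break
    return arrived / (n * slots), departed / (n * slots)


if __name__ == "__main__":
    offered, carried = simulate_cixb()
    print(f"offered={offered:.3f} carried={carried:.3f}")
```

Calling simulate_cixb(k=4) repeats the run with a four-cell crosspoint buffer. Because input and output arbiters only read local state, each arbiter could run independently in hardware; the nested loops here are just the simplest sequential rendering.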
A. Uniform traffic
We show the delay performance of CIXB-1 since CIXOB-k inherits its properties. The simulation results for CIXB-1, iSLIP with 1 and 4 iterations (1-SLIP and 4-SLIP, respectively), and OB for traffic uniformly distributed to all output ports with Bernoulli arrivals are shown in Figure 3. Under independent uniform traffic, CIXB-1 has a smaller average delay than 4-SLIP. We can also see that the average delay of CIXB-1 is comparable to that of OB. Figure 5 shows the tail delay probability of CIXB-k for different k values at different traffic loads. For a given traffic load, the tail delays are almost the same for any crosspoint buffer size under Bernoulli uniform traffic. The buffer size can therefore be kept as small as possible to minimize the amount of memory without significant performance degradation. As CIXOB-k uses the same switch model plus speedup, it also offers 100% throughput under this traffic model. The output buffers in CIXOB-k can be used to relax the transmission time between the BX and the port card, even without speedup (S = 1).
B. Unbalanced Traffic
In this section, we show the speedup that CIXOB-k needs to provide 100% throughput under unbalanced traffic by observing the throughput performance of CIXB-k.
B.1 Traffic Model
We use the same model as in [18] . We define non-uniform traffic by using an unbalanced probability w. 
When w = 0 , the offered traffic is uniform. On the other hand, when w = 1 , the traffic is completely unbalanced. This means that all the traffic of input port s is destined for output port d only, where s = d.
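To make the two limiting cases concrete, the sketch below builds a per-pair rate matrix. The specific formula is an assumption, not taken from this text: it uses one common formulation in which input s sends a fraction w of its load to output d = s and spreads the remaining fraction (1 − w) uniformly over all N outputs. The exact definition is the one given in [18]. The function name unbalanced_rates is hypothetical.

```python
def unbalanced_rates(n: int, load: float, w: float) -> list[list[float]]:
    """Rate from input s to output d under an assumed unbalanced model.

    Assumed form: rate[s][d] = load * (w + (1 - w) / n) if s == d,
                  rate[s][d] = load * (1 - w) / n       otherwise.
    """
    base = load * (1.0 - w) / n
    return [
        [base + (load * w if s == d else 0.0) for d in range(n)]
        for s in range(n)
    ]


if __name__ == "__main__":
    for row in unbalanced_rates(n=4, load=1.0, w=0.5):
        print(row)
```

With w = 0 every entry equals load/N (uniform traffic), and with w = 1 only the diagonal entries are non-zero, matching the two cases described above. Each row sums to the input load for any w.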
B.2 Speedup in CIXOB-k
As shown in [18], CIXB-1 with round-robin arbitration offers a better throughput than iSLIP under unbalanced traffic. Here, we compare CIXB-k with IB switches with non-buffered crossbars using iSLIP and PIM with one and four iterations. Figure 6 shows the throughput under this traffic model for all these schemes. The achievable throughput of CIXB-1 is always better than that of a non-buffered crossbar with iSLIP or PIM.
This is, however, still below 100% throughput. Increasing the buffer size raises the throughput, but the implementation cost also rises. Figure 8 shows the minimum throughput obtained for different values of k and for different switch sizes (i.e., values of N). These values are obtained from Figure 7.
As seen, CIXB-k with a large k provides a throughput close to 100%. However, a large memory is considered impractical to implement in a single 32 × 32 switch module. A solution is to use a small k and a small speedup. Figure 9 shows the speedup, obtained from Figure 8, needed to achieve this objective. The speedup is very small and its range is narrow across the different k values, so it is practical to choose the smallest k, which is determined by the transmission time between the port cards and the BX.
In this section we discuss how to relax the memory constraint in CIXOB-k and how to scale this switch model up to a large N.
A. Relaxing the Memory Amount
In CIXOB-k, a major concern is allocating the memory in the crossbar, since for a buffered crossbar the amount of memory is of order $O(N^2)$. For k > 1, the memory amount is $M = kN^2$. The value of k is mainly selected according to the needed time relaxation. In a switch module, the maximum permissible k is set by the limitations of current VLSI technology. However, if a larger k value is needed, a multiple-plane implementation can be used [6]. Figure 10 shows this implementation. The total amount of memory is divided by the number of planes:
$M/m = kN^2/m$
where m is the number of planes. In this architecture, a cell is divided into m segments, one per plane, and the segments are forwarded simultaneously or in the same order. At the inputs and outputs, there is a VOQ for each plane. In this way the memory constraint is relaxed. This technique can also be used by a switch such as CIXB-k, where there is no speedup. The major implementation constraint for a CIXOB-k is still the pin count, which constrains the number of ports. A solution for extending the number of ports N is presented below.
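The memory figures in this subsection can be tabulated with a few lines of code. This is a sketch under stated assumptions: memory is counted in cell units because the text does not give a cell size, the per-plane amount is taken to be the total divided by the number of planes, and the function names are hypothetical.

```python
def crosspoint_memory_cells(n: int, k: int) -> int:
    """Total crosspoint memory M = k * N^2, in cells."""
    return k * n * n


def memory_per_plane(n: int, k: int, m: int) -> float:
    """Memory held by each of m planes, in cell units (assumed M / m)."""
    if m < 1:
        raise ValueError("the number of planes m must be at least 1")
    return crosspoint_memory_cells(n, k) / m


if __name__ == "__main__":
    # A 32 x 32 module with k = 4 split over m = 4 planes.
    print(crosspoint_memory_cells(32, 4), memory_per_plane(32, 4, 4))
```

For N = 32 and k = 4 the total is 4096 cells, and four planes would each hold the equivalent of 1024 cells, the same amount as a single-plane module with k = 1.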
B. CIXOB-k with a large N
We can scale up the switch size by using n × n CIXOB-k switch modules, where n < N, and connecting them as a two-dimensional matrix, as would be done with non-buffered crossbar modules. However, since two or more CIXB-k modules are connected to the same output port card and since the distributed arbiters within a module are independent, timing relaxation can be lost at the outputs of the modules, and synchronization among all modules has to be resolved.
To keep the timing relaxed at the outputs when two or more modules feed the same output, $h = N/n$ output queues OQ(l), where $0 \le l \le h - 1$, are allocated at each output port card. A round-robin arbiter selects the next OQ(l) that forwards a cell to the output line in each time slot. Figure 11 shows this arrangement. Since all OQs are independent, they can each receive a cell simultaneously. With this scalable implementation, the throughput of the switch is not affected because the OQs do not change the behavior of the BX. This solution can also be used by CIXB-k.
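The output port card described here can be sketched as a small data structure. This is an illustration only, assuming h = N/n queues per output port card, one queue per module feeding that output, and a round-robin arbiter that skips empty queues (the text does not say how empty queues are handled). The class OutputPortCard and its methods are hypothetical names.

```python
from collections import deque


class OutputPortCard:
    """h = N / n output queues served by one round-robin arbiter."""

    def __init__(self, total_ports: int, module_ports: int):
        if total_ports % module_ports != 0:
            raise ValueError("N must be a multiple of n")
        self.h = total_ports // module_ports
        self.queues = [deque() for _ in range(self.h)]
        self.pointer = 0  # next queue index the arbiter will try first

    def receive(self, l: int, cell) -> None:
        """Module l delivers a cell; queues are independent, so every
        module may deliver a cell in the same time slot."""
        self.queues[l].append(cell)

    def transmit(self):
        """Forward at most one cell to the output line in this time slot."""
        for step in range(self.h):
            l = (self.pointer + step) % self.h
            if self.queues[l]:
                self.pointer = (l + 1) % self.h
                return self.queues[l].popleft()
        return None  # all output queues are empty


if __name__ == "__main__":
    card = OutputPortCard(total_ports=4, module_ports=2)
    card.receive(0, "a")
    card.receive(0, "b")
    card.receive(1, "c")
    print([card.transmit() for _ in range(4)])
```

Because each module writes only to its own queue, no coordination between the modules' output arbiters is needed; the only shared decision is the per-slot choice made by the card's arbiter.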
V. CONCLUSIONS
We have presented a CIXOB-k switch model that relaxes the timing for arbitration and cell transmission and provides 100% throughput for uniform and unbalanced traffic. We presented the behavior of CIXB-k under unbalanced traffic and compared it with a non-buffered switch model with iSLIP and PIM. We showed that CIXB-k offers better performance than IB switches with iSLIP and PIM. By studying the performance of CIXB-k for different values of k, the speedup sufficient to provide 100% throughput under unbalanced traffic can be determined for the selected crosspoint buffer size. The sufficient speedup for CIXOB-1 is smaller than the one needed by a non-buffered crossbar with a maximal-matching arbitration scheme such as iSLIP.
With the timing relaxation that this switch model provides, the necessary value of k can be selected. The resulting memory requirement can be relaxed by using a multiple-plane implementation. We also presented a solution to scale CIXOB-k to a large number of ports N. The solutions for scaling up memory and switch size can also be applied to CIXB-k.
