Abstract-The performance of Internet routers is largely determined by the adopted switch architecture. Combined input-crosspoint buffered (CICB) packet switches are of research interest because of their high switching performance. One of the main requirements in these switches is that the amount of memory needed to achieve 100% throughput under flows with high data rates must be proportional to the number of ports and to the crosspoint buffer size, which is set by the distance between the line cards and the buffered crossbar. Therefore, long distances between the line cards and the buffered crossbar can make a CICB switch costly or infeasible to implement. In this paper, we propose and discuss two CICB packet switches with flexible access to crosspoint buffers. The proposed switches allow an input to use any available crosspoint buffer at a given output, instead of having rigid access where an input can only access a dedicated crosspoint buffer at a given output, as is the case in existing architectures. The proposed switches provide high switching performance and support long distances between the buffered crossbar and the line cards, while using crosspoint buffers of small size. Our switches reduce the required crosspoint buffer size by a factor of N, where N is the number of ports, keep the service of cells in sequence, and use no speedup.
1-4244-0041-4/06/$20.00 ©2006 IEEE.
crosspoint buffer size in number of cells, and L is the cell size in bytes. The value of k is defined by the duration of the round-trip time. For example, a CICB switch with dedicated allocation of crosspoint buffers (i.e., a set of crosspoint buffers that can only be accessed by a given input) requires k to be equal to or larger than the round-trip time to avoid throughput degradation or crosspoint-buffer underflow for flows with high data rates. Here, a flow is defined as the data arriving at input i and destined to output j, where 0 ≤ i, j ≤ N-1. The round-trip time RTT, as defined in [5], is the sum of the delays of 1) the input arbitration, IA, 2) the transmission of a cell from an input to the crossbar, d1, 3) the output arbitration, OA, and 4) the transmission of the flow-control information back from the crossbar to the input, d2.
In a CICB switch, the required crosspoint-buffer size to avoid underflow by flows of data rate C b/s, where C is the port speed, is:

    RTT = d1 + OA + d2 + IA ≤ k,    (1)

such that cells are transmitted continuously every time slot [5].
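As a minimal sketch of how the sizing condition can be applied, the following Python snippet sums the four delay components of the round-trip time and returns the smallest integer buffer size k (in cells) that is at least as large as the RTT. The function names and the delay values in the usage line are hypothetical and chosen only for illustration; all delays are assumed to be expressed in time slots.

```python
import math


def round_trip_time(d1, oa, d2, ia):
    """Return RTT (in time slots) as the sum of the four delay components."""
    return d1 + oa + d2 + ia


def min_crosspoint_buffer_cells(d1, oa, d2, ia):
    """Smallest integer k (in cells) that is not smaller than RTT."""
    return math.ceil(round_trip_time(d1, oa, d2, ia))


# Hypothetical delays, in time slots, used only as an illustration.
rtt = round_trip_time(d1=14, oa=1, d2=14, ia=1)
k_min = min_crosspoint_buffer_cells(d1=14, oa=1, d2=14, ia=1)
print(rtt, k_min)
```

With rigid access, this k is needed at every crosspoint, which is why long distances translate directly into memory cost.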
Furthermore, as the buffered crossbar can be physically located far from the input ports in a real implementation, actual RTTs can be long. To support long RTTs in a buffered-crossbar switch, the crosspoint-buffer size needs to be increased [8], such that up to RTT cells can be buffered. However, the amount of memory that can be allocated in a chip is limited, especially because advanced high-speed interconnection technology has large area requirements. This can make a high-throughput implementation costly or infeasible when the distance between the line cards and the buffered crossbar is long. A solution that keeps the crosspoint buffer small while supporting long RTTs is needed.
In this paper, we propose two CICB switches with flexible access to crosspoint buffers. In these switches, an input can send a cell to any crosspoint buffer at a given output, contrary to CICB switches where an input can only access its dedicated crosspoint buffer per output. Herein, the switches without flexible access are called CICB switches with rigid access (CICB-RA). Note that most CICB switches in the literature [1]-[7] belong to this category, except those switches with shared memory [9]. We start our discussion by showing the throughput degradation of a CICB switch with rigid access as a function of the round-trip time and the crosspoint buffer size. We introduce a general architecture of a switch with flexible access, where an input can send a cell to any crosspoint buffer independently of other inputs. We call this switch CICB with full access (CICB-FA) to crosspoint buffers. To avoid speedup in the crosspoint buffer, the queues at the inputs are matched to the crosspoint buffers for different outputs. In addition, we introduce a simplified switch, where the interconnecting stage follows a predetermined connectivity similar to that of the Birkhoff-von Neumann switch [10], allowing one set of crosspoints of different outputs to be accessed by an input at a given time slot. We call this switch CICB switch with single access (CICB-SA) to crosspoint buffers.¹ These two switches support flows with high data rates while using k < RTT. We discuss the pros and cons of these two switches. As a result, we show that a CICB switch with flexible crosspoint-buffer access requires 1/N of the buffer amount of a switch with rigid crosspoint-buffer access to achieve similar or better performance, without using speedup.

This paper is organized as follows. Section II shows the throughput degradation of a CICB switch with rigid access to crosspoint buffers. Section III introduces the CICB switch with full access to crosspoint buffers.
Section IV introduces the CICB switch with single access to crosspoint buffers. Section V discusses the service of cells in sequence by first-come first-serve output arbitration. Section VI presents the throughput performance of the proposed switches. Section VII presents the conclusions.
II. THE EFFECT OF LONG ROUND-TRIP TIME IN A CICB SWITCH WITH RIGID ACCESS TO CROSSPOINT BUFFERS
To keep up with high data rates, switch ports must be able to handle flows of up to C b/s,² where C is the data-rate capacity of a port in a switch or router. In a CICB switch with rigid access to the crosspoint buffers of each VOQ (also referred to as CICB with rigid crosspoint-buffer access), the maximum flow rate that can be handled is C·k/RTT. Note that when r_f(i,j) = C, where r_f(i,j) is the rate of flow f(i,j), the maximum flow rate is equivalent to the achievable throughput.
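The rate limit for rigid access, a maximum flow rate of C·k/RTT, can be illustrated with a short Python sketch. The function name is hypothetical, and capping the result at C (a flow cannot exceed the port rate when k exceeds RTT) is an assumption added here for completeness rather than something stated in the text.

```python
def max_flow_rate(port_rate_bps, k, rtt):
    """Largest single-flow rate a rigid-access CICB switch can sustain.

    port_rate_bps: port capacity C in b/s.
    k: crosspoint buffer size in cells.
    rtt: round-trip time in time slots.
    The cap at C for k >= rtt is an assumption of this sketch.
    """
    if k <= 0 or rtt <= 0:
        raise ValueError("k and rtt must be positive")
    return port_rate_bps * min(1.0, k / rtt)


# Illustration with a hypothetical 10 Gb/s port, k = 1 cell, RTT = 31 slots.
print(max_flow_rate(10e9, k=1, rtt=31))
```

For a port-speed flow, dividing the result by C gives the achievable throughput, so k = 1 and RTT = 31 leave the flow with roughly 3% of the port rate.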
We simulated a CICB switch that uses longest queue first (LQF) as input arbitration and first-come first-serve (FCFS) as output arbitration to observe the throughput obtained under different k and RTT values by a 32 x 32 switch, and to validate the traffic model used to simulate flows with high data rates. We consider RTT > 0 in this paper. We also assume that the distances between input ports and the buffered crossbar are identical. To model flows with different rates, we use the unbalanced traffic model [5], which uses w + (1-w)/N as the fraction of the input load directed from input i to output j = i, where w is the unbalanced probability. The remainder of the input load (i.e., (1-w)/N per output) is directed from input i to the outputs j ≠ i, with a uniform distribution. Therefore, the fraction of C that f(i,j) uses is r_f(i,j) = w + (1-w)/N. The maximum data rate of f(i,j) is represented by making w = 1.0, or r_f(i,j) = C, and the minimum data rate is represented when w = 0.0, or r_f(i,j) = C/N. We focus our observations on these two w values of the unbalanced traffic model.

Figure 1 shows that the throughput degrades when r_f(i,j) is at its minimum (i.e., w = 0.0) in curve 2, where RTT = 31 and k = 1, and in curve 5, where RTT = 61 and k = 2. In these two cases the throughput is below 99%. A preliminary conclusion is that the throughput falls under 99% when RTT > kN. However, the arbitration schemes may be the factor that causes the throughput loss. The figure also shows that the throughput remains close to 100% when RTT < kN, as shown by curve 1, where k = RTT = 1, curve 3, where k = 2 and RTT = 3, and curve 4, where k = 2 and RTT = 33, all at w = 0.0.

¹ The study of the CICB-SA switch is motivated by the high performance of a round-robin based switch [11].
² In contrast, switches unable to support such flows can only handle aggregated data rates of C b/s, where each flow might have a data rate r_single such that r_single < C.
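To make the traffic model concrete, the sketch below builds an N x N load matrix for the unbalanced model, assuming its usual form: input i sends a fraction w + (1-w)/N of its load to output j = i and (1-w)/N to every other output. The function name and the `load` parameter are hypothetical; the snippet is an illustration, not the simulator used for the figures.

```python
def unbalanced_matrix(n, w, load=1.0):
    """Return an n x n matrix of loads under the unbalanced traffic model.

    Entry [i][j] is the load from input i to output j, assuming
    load * (w + (1 - w) / n) for j == i and load * (1 - w) / n otherwise.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    matrix = []
    for i in range(n):
        row = []
        for j in range(n):
            share = (1.0 - w) / n      # uniform part, sent to every output
            if i == j:
                share += w             # extra share for the favored output
            row.append(load * share)
        matrix.append(row)
    return matrix


# w = 0.0 gives uniform traffic; w = 1.0 gives one port-speed flow per input.
for row in unbalanced_matrix(4, 0.5):
    print(row)
```

Under this form, every row and column sums to the input load, so the pattern stays admissible for any w.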
As the flow data rate increases (i.e., as w increases), the throughput degradation increases. The worst-case scenario is observed when r_f(i,j) = C b/s (i.e., w = 1.0), as the achieved throughput is k/RTT, as shown by the curves with RTT > k. The case of port-speed flows, although mostly ignored, occurs when a flow at input i with a rate equal to the port capacity is being sent to output j.
III. CICB SWITCH WITH FULL ACCESS (CICB-FA) TO CROSSPOINT BUFFERS
The N x N CICB-FA switch has virtual output queues (VOQs) in the input ports, a fully interconnected stage that provides connectivity for input i to any of the N^2 crosspoint buffers, and a buffered crossbar. To allow a cell from any input to be written into a crosspoint buffer, each crosspoint has an N-to-1 multiplexer, denoted as MUX(h, j). Furthermore, each input can send one cell to the crosspoint buffer, and each crosspoint buffer (XPB) can receive up to one cell at each time slot. There is an output arbiter for each output port. There is an output access scheduler (OAS) per output port and an input access scheduler (IAS) per input port, both located at the buffered crossbar. IAS and OAS perform a parallel matching to determine which XPB can be accessed by a cell (or input). There are N VOQ counters at the buffered crossbar, denoted as VC(i, j), each of which counts the number of cells at VOQ(i, j). In this paper, we consider crosspoint buffers with k = 1 and with no speedup. The switch works as follows. When a cell destined to output j arrives at input i, it is stored in VOQ(i, j).
The input sends a request for this cell to the buffered crossbar and the corresponding VOQ counter VC(i, j) is increased by one. In the time slot after the increment of VC, a request is sent to the OAS for output j. The OAS for output j selects up to N cells for crosspoints at output j after considering all requests from non-empty VOQs and the availability of XPBs. The OAS notifies the IAS which requests were selected. Since an input may be granted access to XPBs at different outputs (i.e., the IAS receives several grants), the IAS accepts one grant and notifies the OAS. The selection scheme used by IAS and OAS is LQF. After being notified by a forward signal, an input sends the cell to the crosspoint buffer one time slot after receiving the forward-signal information. After a cell arrives in the XPB, the corresponding VC decreases by one.
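The request, grant, and accept steps described above can be sketched as a single matching round in Python. This is a simplified illustration under stated assumptions, not the scheduler evaluated in the paper: the function name `access_match`, the representation of XPB availability as a per-output count of free buffers, and the lowest-index tie-break are all hypothetical choices.

```python
def access_match(vc, free_xpbs):
    """One request-grant-accept round with LQF selection.

    vc[i][j]: VOQ counter VC(i, j), the number of cells waiting at VOQ(i, j).
    free_xpbs[j]: number of free crosspoint buffers at output j.
    Returns a dict mapping each matched input to the output it may send to.
    Ties are broken by lowest index (an assumption of this sketch).
    """
    n = len(vc)
    grants = {i: [] for i in range(n)}

    # OAS step: each output grants its longest requesting VOQs,
    # limited by how many of its crosspoint buffers are free.
    for j in range(n):
        requesters = [i for i in range(n) if vc[i][j] > 0]
        requesters.sort(key=lambda i: (-vc[i][j], i))
        for i in requesters[:free_xpbs[j]]:
            grants[i].append(j)

    # IAS step: each input accepts one grant, the one for its longest VOQ.
    accepted = {}
    for i, outputs in grants.items():
        if outputs:
            accepted[i] = min(outputs, key=lambda j: (-vc[i][j], j))
    return accepted


# Input 0 is granted by both outputs and accepts output 1 (its longest VOQ).
print(access_match([[3, 5], [1, 0]], free_xpbs=[2, 1]))
```

Because each input accepts at most one grant, an input forwards at most one cell per time slot, which matches the no-speedup operation described in the text.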
The output arbiter at output j (note that this is not part of the crosspoint access process) selects an occupied crosspoint buffer to forward a cell to the output in a first-come first-serve (FCFS) fashion. FCFS is used for output arbitration to keep cells in sequence, as discussed in Section V. This switch uses no speedup, as inputs and crosspoints process one cell per time slot.

IV. CICB SWITCH WITH SINGLE ACCESS (CICB-SA) TO CROSSPOINT BUFFERS
The switch with full access has N^2 N-to-1 multiplexers. In addition, the crosspoint access scheduler needs to perform matching between inputs and outputs. To minimize the complexity and the amount of hardware, we present a simpler CICB switch with flexible access to crosspoint buffers, the CICB switch with single access (CICB-SA).
This switch has VOQs in the input ports, an interconnecting stage that uses predetermined and cyclic configurations, similar to those used in a Birkhoff-von Neumann switch [10], and a buffered crossbar. Figure 3 shows this switch architecture. In this switch, the input ports are also called external inputs, each of which is denoted as EI_i. [12], as these are based on a load-balanced CICB switch.
VI. SWITCHING PERFORMANCE

CICB-FA and CICB-SA were tested by computer simulation, with a confidence interval of 95% for the average cell delay. We consider several admissible traffic patterns and flow data rates in the performance study of the two proposed switches. We consider Bernoulli arrivals under uniform and nonuniform distributions. We extend the traffic with uniform distributions to bursty arrivals (i.e., Markov-modulated on-off traffic). We show that the performance under traffic with uniform distributions remains as high as that delivered by CICB-RA switches. We also show that the proposed switches deliver higher throughput than CICB switches with rigid access under nonuniform traffic patterns, such as the unbalanced, diagonal, and power-of-two traffic patterns. We show that the throughput is 100% for RTT ≤ k. Furthermore, we show that these switches, using weight-based arbitration, can deliver close to 100% throughput under admissible traffic patterns for RTT > k. This is a unique feature of these switches, as CICB switches with rigid access cannot support such long RTT values.
A. Uniform Traffic
We tested all switches under uniform traffic to study the effect of using matching processes for access to the XPBs. Therefore, we set k = 1 and RTT = 1. Figure 4 shows the average cell delay of a CICB switch (or CICB-RA) using LQF as input arbitration and FCFS as output arbitration, CICB-SA, and CICB-FA, all under uniform traffic. CICB-FA and CICB-SA also use LQF, but for scheduling access to crosspoint buffers, which is analogous to using LQF as input arbitration in CICB. The average cell delay only considers the queuing delay. For low input loads, CICB shows a smaller average cell delay than the proposed switches. This is because in CICB-SA and CICB-FA, cells spend an extra time slot at the VOQs, as their requests are sent to the crosspoint access scheduler and have to be granted (RTT = 1) before the actual cells are forwarded. Under larger input loads, when the average cell delay is larger than one time slot, the average delays of all switches have similar magnitude, and the added delay is small in any case. This indicates that the access scheduling has no measurable effect on the switching performance. The figure also shows the average delay of all switches under bursty traffic with average burst lengths l ∈ {10, 100}. These results show that the delay increases in proportion to the burst length.
B. Nonuniform Traffic: Unbalanced
The effect of long RTTs in the proposed switch model can be studied by measuring the switch throughput under the unbalanced traffic model, as in Section II, in addition to studying the switching performance under this traffic pattern when RTT ≤ k. The main feature of this traffic model is the nonuniform distribution of the input traffic toward one output port. Figure 5 shows the throughput performance of CICB, CICB-SA, and CICB-FA when k = 1 for different RTTs. When RTT is not long, say RTT ≤ 1, all switches deliver close to 100% throughput under this traffic pattern. This follows the known performance of CICB switches with rigid access under weight-based arbitration schemes.
When RTT is large, say RTT > k, we can observe the following. The throughput of CICB degrades as w increases.
For example, power-of-two traffic of a 4 x 4 switch is represented as: This traffic model presents a large nonuniformity in the traffic distribution among the N possible destinations. Although the sums of its rows and columns are less than one, this traffic model has been shown to make it difficult for switches to achieve high throughput. Figure 8 shows that both switches deliver 100% throughput under this traffic pattern for RTT = 1 and k = 1.
D. Nonuniform Traffic: Diagonal
The diagonal traffic can be represented as ρ(i, j) = dρ_i for i = j, ( This traffic model distributes the load of each input between two outputs. The distributions are given by the diagonal degree probability, d. Figure 9 shows the switching performance of CICB-FA and CICB-SA under diagonal traffic for 0 < d < 1. This figure shows that these two switches can support RTT < 31 and achieve close to 100% throughput.

VII. CONCLUSIONS

We presented the effect of long round-trip times (RTTs) in CICB switches with rigid access to crosspoint buffers, where the supported crosspoint buffer size is k < RTT. CICB switches with rigid access to crosspoint buffers have their maximum throughput limited to the ratio k/RTT when input ports handle a single flow with a data rate equal to the port capacity. To overcome this, we proposed two novel CICB switch architectures where inputs can access any crosspoint buffer of a given output. We call these CICB switches with flexible access to crosspoint buffers. We studied the case of CICB switches with flexible access and with crosspoint buffers of one-cell size. Our proposed switches with k = 1 can support an RTT close to N time slots long, and provide high throughput for high and low data-rate flows under a great variety of admissible traffic patterns. As a comparison, for a given RTT, a CICB switch with flexible access requires a minimum memory amount of RTT x N cells, while a CICB switch with rigid access requires a minimum of RTT x N^2 cells. Therefore, the proposed switches relax the memory requirement by a factor of O(N). In addition, we showed that these switches use the buffered crossbar effectively to assign timestamps to cells arriving in crosspoint buffers. This simplifies the handling of cells to provide in-sequence transmission. All these features are achieved by CICB switches with flexible access without using speedup.
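The memory comparison in the conclusions, RTT x N cells for flexible access against RTT x N^2 cells for rigid access, can be checked with a few lines of Python. The function name is hypothetical, and the 32-port, RTT = 31 values in the usage lines are only an illustration drawn from the switch size used in the simulations.

```python
def min_memory_cells(rtt, n, flexible):
    """Minimum total crosspoint memory, in cells, for an n x n CICB switch.

    Flexible access needs rtt * n cells; rigid access needs rtt * n**2 cells,
    following the counts stated in the conclusions.
    """
    return rtt * n if flexible else rtt * n * n


rigid = min_memory_cells(rtt=31, n=32, flexible=False)
flex = min_memory_cells(rtt=31, n=32, flexible=True)
print(rigid, flex, rigid // flex)
```

The ratio between the two counts equals N regardless of the RTT, which is the factor-of-N saving claimed for the proposed switches.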
