Abstract-This paper presents a modified architecture for an input queued switch that reduces external speedup. Maximal size scheduling algorithms for input-buffered crossbars require a speedup between port card and switch card. The speedup is typically in the range of 2, to compensate for the scheduler performance degradation. This implies that the required bandwidth between port card and switch card is 2 times the actual port speed, adding to cost and complexity. To reduce this bandwidth, a modified architecture is proposed that introduces a small amount of input and output memory on the switch card chip. This architecture allows for internal speedup in the switch card, so that the external speedup between port card and switch card can be reduced significantly. A simulation study is used for buffer dimensioning and demonstrates the feasibility of the proposed architecture.
I. INTRODUCTION
Crossbar switch fabrics have been studied extensively in the literature. In combination with Virtual Output Queuing (VOQ), the architecture provides a scalable solution with respect to memory access bandwidth. An input queued bufferless crossbar requires a complex scheduling mechanism that matches inputs with outputs. The scheduling algorithm is typically classified as either a maximum weight match or a maximal match type. A maximum weight match algorithm assigns a weight to each pair of inputs and outputs, and the maximum weight match pairs the inputs and outputs that result in the highest total weight. The weight could indicate the age of a cell or the occupancy of a VOQ. In the simplest case, the weight just indicates, by a one or a zero, whether there is a packet available or not. In this case, the scheduler calculates a maximum size match because it pairs the maximum number of inputs and outputs. On the other hand, a maximal match algorithm is characterized by the property that no unmatched input has a cell in any queue destined to an unmatched output. This implies that no further matches can be added unless the existing matches are rearranged.
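The difference between a maximal match and a maximum size match can be illustrated with a minimal sketch. The function names and the greedy scan order below are assumptions made for illustration; they are not the scheduling algorithms evaluated in this paper. The maximum size match is computed with a standard augmenting-path search on the bipartite request graph.

```python
from typing import Dict, List, Set


def maximal_match(requests: List[Set[int]]) -> Dict[int, int]:
    """Greedy maximal match: scan inputs in order, take the first free output.

    requests[i] is the set of outputs for which input i has a queued cell.
    The result cannot be extended without rearranging existing pairs.
    """
    match: Dict[int, int] = {}
    used_outputs: Set[int] = set()
    for inp, outs in enumerate(requests):
        for out in sorted(outs):
            if out not in used_outputs:
                match[inp] = out
                used_outputs.add(out)
                break
    return match


def maximum_size_match(requests: List[Set[int]]) -> Dict[int, int]:
    """Maximum size match via augmenting paths on the request graph."""
    owner: Dict[int, int] = {}  # output -> input currently matched to it

    def try_assign(inp: int, seen: Set[int]) -> bool:
        for out in sorted(requests[inp]):
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or re-route the input that currently holds it.
            if out not in owner or try_assign(owner[out], seen):
                owner[out] = inp
                return True
        return False

    for inp in range(len(requests)):
        try_assign(inp, set())
    return {inp: out for out, inp in owner.items()}


# Input 0 has cells for outputs 0 and 1; input 1 only for output 0.
requests = [{0, 1}, {0}]
greedy = maximal_match(requests)
best = maximum_size_match(requests)
print(len(greedy), len(best))  # prints: 1 2
```

In this example the greedy scan pairs input 0 with output 0, which leaves input 1 without any eligible output. The match is maximal, since nothing can be added, but it is not maximum: rearranging input 0 to output 1 allows both inputs to be served.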
A number of maximum weight matching algorithms have [4]. One approach is to implement small VOQs close to the switch core and put large buffers on the remote line cards. This approach is in fact utilized in the Tiny Tera concept (LCS protocol) [5]. However, additional chips are required to implement the additional switch card VOQs, and this adds to complexity and power consumption. Another solution proposed in [4] is to have the VOQs on the line cards communicate arrivals instead of state information to the central scheduler. In this case, the scheduler calculates the state information for all VOQs. The performance of this approach for a specific scheduling algorithm, iSLIP, has been evaluated in [6]. The scheme, denoted AiSLIP, was shown to have good performance, close to that of iSLIP and much better than in the case where delayed state information is communicated from the line cards. AiSLIP is, however, susceptible to loss of arrival information, and the system still requires a speedup in the order of two between port cards and switch card. Furthermore, it is proposed in [4] to integrate VOQs and switch fabric on the same chip; however, this will only work for very small systems.
Another way of improving performance by adding buffer capacity to the switch chip is the buffered crossbar with VOQ, first introduced in [7]. Buffered crossbars have several advantages compared to non-buffered crossbars, including simpler arbitration, synchronization relaxation and better performance. The main drawback, however, is the total amount of crossbar memory, which is proportional to the square of the number of input/output ports and to the RTT.
This paper presents a modification to the basic input queued switch architecture. The goal of the proposed architecture is to support reasonably large RTT values and at the same time reduce the required speedup between port cards and switch card. Speedup is expensive, especially given that switch chips are currently most limited by the IO bandwidth across chip and card boundaries. In fact, high-speed serial link communication adds significantly to the overall power consumption. In the proposed architecture, small VOQ input buffers and output buffers are added to the switch chip, and this allows for decoupling of port speed and scheduling speed. Internally, a speedup of two between input and output buffers can be realized, but externally, the port speed can be reduced compared to the basic architecture. The size of the added buffer capacity in the switch chip determines the tradeoff between performance and implementation complexity. Too small buffers would lead to poor performance, and too large buffers would not be feasible to implement. In this paper, a simulation study has been performed to quantitatively assess the tradeoff between performance and buffer size. It will be shown that a significant reduction in the required speedup can be obtained with a reasonable and feasible amount of switch chip buffer capacity.
A detailed description of the switch model is given in Sec. II. The simulation study presented in Sec. III compares the performance of this switch architecture to a basic input queued system. The simulation study is furthermore used as a guideline for system dimensioning, and the memory requirements will be compared to a buffered crossbar. Finally, concluding remarks are given in Sec. IV.
II. SWITCH MODEL
The basic bufferless crossbar architecture is shown in Fig. 1. The main motivation for the proposed architecture (Fig. 2) is to reduce external speedup between port cards and switch card, and this is achieved by introducing an internal speedup between switch card input and output buffers. Each new input queue system has a dedicated VOQ for each output. The VOQs are implemented in a shared memory following, e.g., a linked list approach. A speedup of 2 can now be performed internally between switch card input and output buffers, and the switch card output buffers are therefore required to perform rate adaptation between internal and external speed. Since the switch card VOQ buffers have limited capacity, backpressure signals towards the port card VOQs are required.
The round trip time for backpressure, $RT_{BP}$, is defined as the number of timeslots it takes to stop the cell flow to a specific switch card VOQ, measured from the time when backpressure was asserted by that VOQ. The round trip time is composed of the propagation delay for the backpressure signal, the time it takes before the port card scheduler is blocked, and the data path delay from the port card scheduler to the switch card VOQ. The port card scheduler in this study performs a simple Round Robin (RR) arbitration.
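The round robin arbitration on the port card can be sketched as follows. This is a minimal illustration under assumed details: the class name, the pointer update rule (advance to the VOQ after the one just served) and the way backpressure is passed in as a list of flags are all hypothetical and not taken from the simulation model used in this paper.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Round robin selection among port card VOQs (hypothetical helper)."""

    def __init__(self, num_voqs: int) -> None:
        self.num_voqs = num_voqs
        self.pointer = 0  # VOQ that is considered first in the next timeslot

    def select(self, occupancy: List[int], blocked: List[bool]) -> Optional[int]:
        """Return the VOQ to serve in this timeslot, or None if none is eligible.

        A VOQ is eligible if it holds at least one cell and the matching
        switch card VOQ has not asserted backpressure.
        """
        for offset in range(self.num_voqs):
            voq = (self.pointer + offset) % self.num_voqs
            if occupancy[voq] > 0 and not blocked[voq]:
                self.pointer = (voq + 1) % self.num_voqs
                return voq
        return None


arbiter = RoundRobinArbiter(num_voqs=4)
occupancy = [2, 0, 1, 3]               # cells waiting per VOQ (not updated here)
blocked = [False, False, True, False]  # VOQ 2 is under backpressure
served = [arbiter.select(occupancy, blocked) for _ in range(3)]
print(served)  # prints: [0, 3, 0]
```

VOQ 1 is skipped because it is empty and VOQ 2 because it is blocked, so service alternates between VOQ 0 and VOQ 3 until the backpressure is released.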
In the following, the number of cells in VOQ number $i$ in the shared memory is denoted $Q_i$. The backpressure threshold for queue $i$ is $B_i$, that is, a backpressure signal is generated if $Q_i \ge B_i$. Due to the round trip time for backpressure signals, the size of queue $i$ can grow to $Q_{i,\max} = B_i + RT_{BP}$. The total number of cells in the shared buffer is $Q = \sum_i Q_i$. The total capacity of the shared memory, $S$, is typically much smaller than $\sum_i Q_{i,\max}$; therefore, a global backpressure threshold $B$ is introduced to avoid queue overflow. The global backpressure signal is then asserted if $Q \ge B$. The global threshold must be selected such that $B + RT_{BP} \le S$ in order to avoid overflow in the shared buffer.
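The threshold rules above can be summarized in a short sketch of the occupancy bookkeeping for one switch card input buffer. The class and method names are hypothetical, a single per-queue threshold is assumed for all VOQs, and the global threshold of 76 in the usage example is an assumed value chosen only to satisfy $B + RT_{BP} \le S$ for $S = 80$ and $RT_{BP} = 4$.

```python
from typing import List


class SharedVoqBuffer:
    """Occupancy and backpressure state of one shared VOQ memory (sketch)."""

    def __init__(self, num_voqs: int, queue_threshold: int,
                 global_threshold: int, rt_bp: int, capacity: int) -> None:
        # The global threshold must leave room for cells already in flight.
        if global_threshold + rt_bp > capacity:
            raise ValueError("B + RT_BP must not exceed S")
        self.q: List[int] = [0] * num_voqs  # Q_i for each VOQ
        self.b_i = queue_threshold          # B_i, assumed equal for all queues
        self.b = global_threshold           # B
        self.capacity = capacity            # S

    def total(self) -> int:
        """Q, the total number of cells in the shared memory."""
        return sum(self.q)

    def enqueue(self, voq: int) -> None:
        if self.total() >= self.capacity:
            raise OverflowError("shared memory is full")
        self.q[voq] += 1

    def dequeue(self, voq: int) -> None:
        if self.q[voq] > 0:
            self.q[voq] -= 1

    def queue_backpressure(self, voq: int) -> bool:
        """Per-queue signal, asserted when Q_i >= B_i."""
        return self.q[voq] >= self.b_i

    def global_backpressure(self) -> bool:
        """Global signal, asserted when Q >= B."""
        return self.total() >= self.b


buf = SharedVoqBuffer(num_voqs=32, queue_threshold=4,
                      global_threshold=76, rt_bp=4, capacity=80)
for _ in range(4):
    buf.enqueue(0)
print(buf.queue_backpressure(0), buf.global_backpressure())  # prints: True False
```

After four cells have arrived for VOQ 0, that queue asserts its own backpressure signal, while the shared memory as a whole is still far below the global threshold.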
Backpressure is asserted if the VOQ buffer level equals or exceeds $RT_{BP}$ (that is, $B_i = RT_{BP}$). With this threshold each VOQ can grow to $2 \cdot RT_{BP}$ cells, so the total occupancy of a switch card input buffer with $N$ VOQs could potentially reach $2 \cdot RT_{BP} \cdot N$. The buffer size is typically smaller than that, so a global backpressure signal that blocks all port card VOQs is required. The buffer should be large enough to reduce global backpressure to a minimum.
The switch card output buffer could potentially become congested as well. In this case, a backpressure signal is transmitted to the scheduler such that requests to this output are ignored.
In addition to the benefits of reduced external speedup, the architecture has other advantages, including synchronization relaxation between port cards and switch card. Furthermore, communication between port cards and switch card is simplified because the scheduler works on local information from the switch card VOQs. Only simple backpressure signals are required between port cards and switch card.
III. SIMULATION AND RESULTS
A simulation study has been carried out in order to compare the new architecture in Fig. 2 with the well-known bufferless crossbar in Fig. 1. Each port card receives cells from a source. In each timeslot, the source generates a cell with probability equal to the load p. The switch size is 32x32. The destination is selected randomly according to a uniform distribution. Bursty traffic is generated by assigning the same destination to a number of consecutive cells. Fig. 3 shows the average delay as a function of load for burst lengths of 0, 10 and 20, respectively.
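The traffic source described above can be sketched as a generator. This is an illustration under stated assumptions rather than the simulator used for the results: the function name is hypothetical, a burst length of 0 is interpreted as choosing a new destination for every cell, and a burst is allowed to continue across idle timeslots.

```python
import random
from itertools import islice
from typing import Iterator, Optional


def bursty_source(load: float, num_ports: int, burst_length: int,
                  rng: random.Random) -> Iterator[Optional[int]]:
    """Yield one destination per timeslot, or None when no cell is generated."""
    remaining = 0    # cells left in the current burst
    destination = 0
    while True:
        if rng.random() < load:
            if remaining == 0:
                # Start a new burst towards a uniformly chosen output.
                destination = rng.randrange(num_ports)
                remaining = max(burst_length, 1)
            remaining -= 1
            yield destination
        else:
            yield None  # idle timeslot


source = bursty_source(load=0.9, num_ports=32, burst_length=10,
                       rng=random.Random(1))
slots = list(islice(source, 1000))
cells = [d for d in slots if d is not None]
print(len(cells) / len(slots))  # measured load, close to 0.9 on average
```

Passing an explicit `random.Random` instance keeps each source reproducible and independent, which is convenient when 32 sources are run side by side.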
The round trip time for backpressure is set to four, $RT_{BP} = 4$. The sizes of the switch card input and output buffers are set to 100 and 20, respectively. It is concluded that the average delay for the modified switch architecture with an internal speedup of 2 is close to the delay of an output buffered switch. In order to determine the required switch card input buffer capacity, the input buffer occupancy has been examined for various load and burst values. The results are shown in Fig. 4 for load values of 85 %, 90 % and 95 %. The switch card output buffer size is 20. This value is explained and justified later. The occupancy increases rather slowly with the burst size. A detailed investigation shows less than logarithmic growth, and this result is used to dimension the buffer by taking only the system load into account. Assuming a load of 95 %, the average occupancy is below 40, and by allocating 80 buffer locations, global backpressure is practically eliminated. The total number of switch card buffer locations becomes (80+20)*32 = 3200, which is feasible to implement on a single chip.
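The dimensioning arithmetic can be checked with a few lines. The variable names are chosen for illustration; the numbers are the ones stated in the text (32 ports, a backpressure round trip time of 4, and 80 input plus 20 output buffer locations per port). The worst-case figure uses the bound of 2 times the round trip time per VOQ given in the switch model.

```python
N = 32              # switch size (ports)
RT_BP = 4           # round trip time for backpressure, in timeslots
INPUT_BUFFER = 80   # allocated locations per switch card input buffer
OUTPUT_BUFFER = 20  # locations per switch card output buffer

# Worst case per input buffer: every one of the N VOQs holds 2 * RT_BP cells.
worst_case_input = 2 * RT_BP * N

# Total buffer locations on the switch card for the chosen allocation.
total_locations = (INPUT_BUFFER + OUTPUT_BUFFER) * N

print(worst_case_input, total_locations)  # prints: 256 3200
```

The allocated 80 locations per input buffer are well below the worst case of 256, which is why the global backpressure threshold remains necessary even though it is rarely triggered at the loads studied.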
