Reducing external speedup requirements for input-queued crossbars by Berger, Michael Stübert
Reducing external speedup requirements for input-queued crossbars
Berger, Michael Stübert
Published in:
Workshop on High Performance Switching and Routing, 2005. HPSR.
Link to article, DOI:
10.1109/HPSR.2005.1503227
Publication date:
2005
Document Version
Publisher's PDF, also known as Version of record
Link back to DTU Orbit
Citation (APA):
Berger, M. S. (2005). Reducing external speedup requirements for input-queued crossbars. In Workshop on
High Performance Switching and Routing, 2005. HPSR. IEEE. DOI: 10.1109/HPSR.2005.1503227
Reducing External Speedup Requirements for 
Input-queued Crossbars 
M. S. Berger, Member, IEEE
Abstract-This paper presents a modified architecture for an
input queued switch that reduces external speedup. Maximal size
scheduling algorithms for input-buffered crossbars require a
speedup between port card and switch card. The speedup is
typically in the range of 2, to compensate for the scheduler
performance degradation. This implies that the required
bandwidth between port card and switch card is 2 times the
actual port speed, adding to cost and complexity. To reduce this
bandwidth, a modified architecture is proposed that introduces a
small amount of input and output memory on the switch card
chip. This architecture allows for internal speedup in the switch
card, and the external speedup between port card and switch card
can be reduced significantly. A simulation study is used for buffer
dimensioning and demonstrates the feasibility of the proposed
architecture.
Index Terms-Input queued switch, Scheduling, Speedup.
I. INTRODUCTION 
CROSSBAR switch fabrics have been studied extensively in the
literature. In combination with Virtual Output
Queuing (VOQ) the architecture provides a scalable solution
with respect to memory access bandwidth. An input queued
bufferless crossbar requires a complex scheduling mechanism
that matches inputs with outputs. The scheduling algorithm is
typically classified as either a maximum weight match or a
maximal match type. A maximum weight match algorithm
assigns a weight to each pair of inputs and outputs, and the
maximum weight match pairs the inputs and outputs that result
in the highest total weight. The weight could indicate the age
of a cell or occupation of a VOQ. In the simplest case, the
weight just indicates, by a one or a zero, whether there is a
packet available or not. In this case, the scheduler calculates a
maximum size match because it pairs the maximum number of
inputs and outputs. On the other hand, a maximal match
algorithm is characterized by the property that all unmatched
inputs have no cell in any queue destined to an unmatched
output. This implies that no further matches can be added
unless the existing matches are rearranged.
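The difference between a maximal match and a maximum size match can be illustrated with a small sketch. The function names below are hypothetical and not taken from the paper; the greedy pass is only one possible way to obtain a maximal match, and the maximum size match is computed with a standard augmenting-path search.

```python
def greedy_maximal_match(requests):
    """Return a maximal match as {input: output}.

    requests[i][j] is True if input i has a cell queued for output j.
    Inputs are served in index order (an arbitrary choice for this sketch).
    """
    n = len(requests)
    used_outputs = set()
    match = {}
    for i in range(n):
        for j in range(n):
            if requests[i][j] and j not in used_outputs:
                match[i] = j
                used_outputs.add(j)
                break
    return match


def is_maximal(requests, match):
    """Check the maximal-match property: no unmatched input has a cell
    destined to an unmatched output."""
    n = len(requests)
    used_outputs = set(match.values())
    for i in range(n):
        if i in match:
            continue
        if any(requests[i][j] and j not in used_outputs for j in range(n)):
            return False
    return True


def maximum_size_match(requests):
    """Return a maximum size match using augmenting paths."""
    n = len(requests)
    owner = [-1] * n  # owner[j] is the input currently matched to output j

    def try_assign(i, seen):
        for j in range(n):
            if requests[i][j] and j not in seen:
                seen.add(j)
                # Take a free output, or rearrange the existing match.
                if owner[j] == -1 or try_assign(owner[j], seen):
                    owner[j] = i
                    return True
        return False

    for i in range(n):
        try_assign(i, set())
    return {owner[j]: j for j in range(n) if owner[j] != -1}


# Input 0 has cells for outputs 0 and 1, input 1 only for output 0.
requests = [
    [True, True, False],
    [True, False, False],
    [False, False, False],
]
greedy = greedy_maximal_match(requests)
best = maximum_size_match(requests)
```

In this example the greedy pass pairs input 0 with output 0, which is maximal but leaves input 1 unserved; the maximum size match rearranges the pairing so that both inputs are served.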
Manuscript received March 18, 2005. This work was supported in part by
the European Union (EU) project Ethernet Switching at Ten gigabit and
Above (ESTA), IST-2001-33182.
M. S. Berger is with Research Center COM, Technical University of
Denmark, 2800 Kgs. Lyngby, Denmark. Phone: +45 45 25 38 53, e-mail:
msb@com.dtu.dk.
A number of maximum weight matching algorithms have
been presented in [1]. Their main disadvantage is timing
complexity, leading to an interest in maximal matching
algorithms such as PIM and iSLIP [2]. SLIP matches input
with output by having a round robin scheduler for each input
and output. The input schedulers independently select an
output, and the output scheduler selects among contending
inputs. The iterative SLIP, iSLIP, performs a number of
iterations of SLIP. To compensate for the lower performance
of a maximal matching algorithm, speedup is introduced
between the VOQs and the crossbar. A speedup of 2 is
sufficient to obtain 100 % throughput [3].
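As a rough illustration of the two-step round robin matching described above, the following sketch performs one matching round: every input selects an output, and every output then selects one of the contending inputs. It mirrors only the description in this section; the algorithm published in [2] is more detailed. The function name `slip_round` and the pointer update rule (advance one position past a matched partner) are assumptions made for this example.

```python
def slip_round(voq_nonempty, in_ptr, out_ptr):
    """Perform one round robin matching round and return {input: output}.

    voq_nonempty[i][j] is True if input i holds a cell for output j.
    in_ptr[i] and out_ptr[j] are round robin pointers; they are updated
    in place for matched pairs only (an assumption of this sketch).
    """
    n = len(voq_nonempty)

    # Step 1: each input independently selects the first non-empty VOQ
    # at or after its round robin pointer.
    choice = {}
    for i in range(n):
        for k in range(n):
            j = (in_ptr[i] + k) % n
            if voq_nonempty[i][j]:
                choice[i] = j
                break

    # Step 2: each output selects the first contending input at or
    # after its round robin pointer.
    match = {}
    for j in range(n):
        contenders = {i for i, c in choice.items() if c == j}
        for k in range(n):
            i = (out_ptr[j] + k) % n
            if i in contenders:
                match[i] = j
                in_ptr[i] = (j + 1) % n
                out_ptr[j] = (i + 1) % n
                break
    return match


# Saturated 3x3 switch: every VOQ always holds a cell.
voq = [[True] * 3 for _ in range(3)]
in_ptr = [0, 0, 0]
out_ptr = [0, 0, 0]
m1 = slip_round(voq, in_ptr, out_ptr)
m2 = slip_round(voq, in_ptr, out_ptr)
m3 = slip_round(voq, in_ptr, out_ptr)
```

Under these assumptions the pointers drift apart over successive timeslots: the match grows from one pair to two pairs to a full match of three pairs.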
The basic input queued switch is composed of port cards
containing VOQ buffers and a switch card containing the
bufferless switch chip(s) and the scheduler chip. Typically, the
large physical size of a packet switch system implies that port
cards and the central switch cards are separated and located in
different shelves and racks. This leads to a high Round Trip
Time (RTT) between port cards and switch cards. Potentially,
this might affect the performance of the switch system,
because the central scheduler calculates a match based on
delayed VOQ state information. Solutions to this problem are
discussed in [4]. One approach is to implement small VOQs
close to the switch core and put large buffers on the remote
line cards. This approach is in fact utilized in the tiny-tera
concept (LCS protocol) [5]. However, additional chips are
required to implement the additional switch card VOQs, and
this will add to complexity and power consumption. Another
solution proposed in [4] is to have the VOQs on the line cards
communicate arrivals instead of state information to the central
scheduler. In this case, the scheduler will calculate the state
information for all VOQs. The performance of this approach
for a specific scheduling algorithm, iSLIP, has been evaluated
in [6]. The scheme, denoted AiSLIP, was shown to have good
performance close to that of iSLIP, and much better
performance than in the case where delayed state information
is communicated from the line cards. AiSLIP is, however,
susceptible to loss of arrival information, and the
system still requires a speedup in the order of two between
port cards and switch card. Furthermore, it is proposed in [4]
to integrate VOQs and switch fabric on the same chip;
however, this will only work for very small systems.
Another way of improving performance by adding buffer
capacity to the switch chip is the buffered crossbar with VOQ,
first introduced in [7]. Buffered crossbars have several
advantages compared to non-buffered crossbars, including
simpler arbitration, synchronization relaxation and better
performance. The main drawback, however, is the total
amount of crossbar memory, which is proportional to the square
of the number of input/output ports and to the RTT.
0-7803-8924-7/05/$20.00 ©2005 IEEE.
This paper presents a modification to the basic input queued 
switch architecture. The goal of the proposed architecture is to 
be able to support reasonably large RTT values and at the
same time reduce the required speedup between port cards and
switch card. Speedup is expensive, especially given that switch 
chips are currently most limited by the IO bandwidth across 
chip and card boundaries. In fact, high-speed serial link 
communication adds significantly to the overall power 
consumption. In the proposed architecture, small VOQ input 
buffers and output buffers are added to the switch chip, and 
this allows for decoupling of port speed and scheduling speed. 
Internally, a speedup of two between input and output buffers 
can be realized, but externally, the port speed can be reduced 
compared to the basic architecture. The size of the added 
buffer capacity in the switch chip will impact the tradeoff 
between performance and implementation complexity. Too
small buffers would lead to poor performance, and too large
buffers would not be feasible to implement. In this paper, a
simulation study has been performed to quantitatively assess 
the tradeoff between performance and buffer size. Actually, it 
will be shown that a significant reduction in the required 
speedup can be obtained with a reasonable and feasible 
amount of switch chip buffer capacity. 
A detailed description of the switch model is given in Sec.
II. The simulation study presented in Sec. III compares the
performance of this switch architecture to a basic input queued
system. The simulation study is furthermore used as a
guideline for system dimensioning, and the memory
requirements will be compared to a buffered crossbar. Finally,
concluding remarks are given in Sec. IV.
II. SWITCH MODEL
The basic bufferless crossbar architecture is shown in Fig. 1.
The system is usually denoted a Combined Input and Output
Buffered (CIOB) switch. A switch system of size $N \times N$
consists of N Input/Output port cards and a switch card
implementing the $N^2$ crosspoint matrix. Each input port card
contains VOQs with one buffer for each of the N outputs. The
output port card contains a buffer to store cells in case of
speedup. Queue status information is sent to the central
scheduler in the switch card. In this paper, the SLIP scheduler
is assumed, even though improved algorithms have been
proposed, e.g. [8]. Typically, there is a large round trip time of
several cells between port card and switch card due to transmission
time and synchronization. One of the main drawbacks of the
CIOB architecture is the required speedup of 2 between port
cards and switch card. Increasing the number of high-speed
serial links significantly increases power consumption and chip
pin count. Fig. 2 presents the modified switch card architecture
with a small amount of buffer memory introduced on the
switch card chip.
Fig. 1. Bufferless crossbar with combined input and output queuing. This
architecture typically requires an external speedup of 2 between port cards
and switch card to obtain 100 % throughput.
The main motivation for this architecture is to reduce 
external speedup between port cards and switch card, and this 
is achieved by introducing an internal speedup between switch 
card input and output buffers. Each new input queue system 
has a dedicated VOQ for each output. The VOQs are 
implemented in a shared memory following e.g. a linked list
approach. A speedup of 2 can now be performed internally 
between switch card input and output buffers, and the switch 
card output buffers are therefore required to perform rate 
adaptation between internal and external speed. Since the 
switch card VOQ buffers have limited capacity, backpressure 
signals towards the port card VOQs are required. 
Fig. 2. Modified architecture with internal buffering on the switch card.
There is internal speedup between switch card input and output buffers,
which reduces external speedup requirements significantly.
The Round Trip time for backpressure, RT-BP, is defined as
the number of timeslots it takes to stop the cell flow to a
specific switch card VOQ, measured from the time when
backpressure was asserted by that VOQ. The round trip time
is composed of a propagation delay for the backpressure
signal, the time it takes before the port card scheduler is
blocked and the data path delay from the port card scheduler to
the switch card VOQ. The port card scheduler in this study
performs a simple Round Robin (RR) arbitration.
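A minimal sketch of such a round robin port card scheduler, assuming that a VOQ is eligible only if it holds a cell and is not blocked by backpressure, could look as follows. The function name, its arguments and the pointer handling are illustrative assumptions, not details given in the paper.

```python
def round_robin_select(nonempty, blocked, pointer):
    """Select the next VOQ to serve on a port card.

    nonempty[j]: True if the port card VOQ for output j holds a cell.
    blocked[j]:  True if backpressure is asserted for that VOQ.
    pointer:     position where the round robin search starts.
    Returns (selected VOQ or None, updated pointer).
    """
    n = len(nonempty)
    for k in range(n):
        j = (pointer + k) % n
        if nonempty[j] and not blocked[j]:
            # Advance past the served VOQ so that it gets lowest priority.
            return j, (j + 1) % n
    # Nothing eligible in this timeslot; the pointer stays where it is.
    return None, pointer


nonempty = [True, True, False, True]
blocked = [False, True, False, False]
first, ptr = round_robin_select(nonempty, blocked, pointer=1)
second, ptr = round_robin_select(nonempty, blocked, ptr)
idle, _ = round_robin_select(nonempty, [True] * 4, ptr)
```

Starting at position 1, the blocked VOQ 1 and the empty VOQ 2 are skipped and VOQ 3 is served; the search then wraps around to VOQ 0. If every VOQ is blocked, as with a global backpressure signal, no cell is sent.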
In the following, the number of cells in VOQ number i in
the shared memory is denoted $Q_i$. The backpressure threshold
for queue i is $B_i$; that is, a backpressure signal is generated if
$Q_i \ge B_i$. Due to the round trip time for backpressure signals,
the size of queue i can grow to $Q_{i,\max} = B_i + \text{RT-BP}$. The
total number of cells in the shared buffer is $Q = \sum_i Q_i$. The
total capacity of the shared memory, $S$, is typically much
smaller than $\sum_i Q_{i,\max}$; therefore a global backpressure
threshold $B$ is introduced to avoid queue overflow. The global
backpressure signal is then asserted if $Q \ge B$. The global
threshold must be selected such that $B + \text{RT-BP} \le S$ in order
to avoid overflow in the shared buffer.
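The threshold rules above can be sketched as a small occupancy model. The class and function names are hypothetical, a single threshold is assumed for all queues for simplicity, and the numbers in the usage example are chosen only for illustration.

```python
class SwitchCardInputBuffer:
    """Occupancy model of a shared-memory VOQ buffer with per-queue and
    global backpressure thresholds (illustrative sketch only)."""

    def __init__(self, num_outputs, queue_threshold, global_threshold,
                 capacity, rt_bp):
        # The global threshold B must satisfy B + RT-BP <= S.
        if global_threshold + rt_bp > capacity:
            raise ValueError("need B + RT-BP <= S to avoid overflow")
        self.queues = [0] * num_outputs           # Q_i for every VOQ
        self.queue_threshold = queue_threshold    # B_i (same for all i here)
        self.global_threshold = global_threshold  # B
        self.capacity = capacity                  # S
        self.rt_bp = rt_bp                        # RT-BP in timeslots

    def total(self):
        """Q, the total number of cells in the shared memory."""
        return sum(self.queues)

    def queue_backpressure(self, i):
        """Per-queue backpressure is asserted if Q_i >= B_i."""
        return self.queues[i] >= self.queue_threshold

    def global_backpressure(self):
        """Global backpressure is asserted if Q >= B."""
        return self.total() >= self.global_threshold

    def enqueue(self, i):
        if self.total() >= self.capacity:
            raise OverflowError("shared memory full")
        self.queues[i] += 1


def worst_case_fill(buf, i):
    """Fill queue i with no departures. After backpressure is asserted,
    RT-BP further cells still arrive before the flow stops."""
    while not buf.queue_backpressure(i):
        buf.enqueue(i)
    for _ in range(buf.rt_bp):
        buf.enqueue(i)
    return buf.queues[i]


buf = SwitchCardInputBuffer(num_outputs=4, queue_threshold=4,
                            global_threshold=20, capacity=24, rt_bp=4)
peak = worst_case_fill(buf, 0)
```

With a per-queue threshold of 4 and RT-BP = 4, the queue peaks at 8 cells, matching $Q_{i,\max} = B_i + \text{RT-BP}$. The constructor rejects any global threshold that violates $B + \text{RT-BP} \le S$.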
Backpressure is asserted if the VOQ buffer level equals or
exceeds RT-BP ($B_i = \text{RT-BP}$). The total occupancy of a
switch card input buffer could potentially reach
$2 \cdot \text{RT-BP} \cdot N$, but the buffer size is typically
smaller than that, so a global backpressure signal that blocks
all port card VOQs is required. The buffer should be large enough
to reduce the occurrence of global backpressure to a minimum.
The switch card output buffer could potentially become 
congested as well. In this case, a backpressure signal is 
transmitted to the scheduler such that requests to this output
are ignored. 
In addition to benefits from reduced external speedup, the 
architecture has other advantages, including synchronization 
relaxation between port cards and switch card. Furthermore, 
communication between port cards and switch card is 
simplified because the scheduler works on local information 
from the switch card VOQs. Only simple backpressure signals
are required between port cards and switch card. 
III. SIMULATION AND RESULTS
A simulation study has been carried out in order to compare
the new architecture in Fig. 2 with the well-known bufferless
crossbar in Fig. 1. Each port card receives cells from a source.
In each timeslot, the source generates a cell with probability
equal to the load p. The switch size is 32x32. The destination
is selected randomly according to a uniform distribution.
Assigning the same destination to a number of consecutive
cells generates bursty traffic. Fig. 3 shows the average delay as
a function of load for a burst length of 0, 10 and 20 respectively.
The round trip time for backpressure is set to four, RT-BP =
4. The size of switch card input and output buffer is set to 100
and 20 respectively. It is concluded that the average delay for
the modified switch architecture with internal speedup of 2 is
close to the delay of an output buffered switch.
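The traffic source described above can be sketched as follows. The paper does not state whether bursts have fixed or random length, so this example assumes fixed-length bursts and treats a burst length of 0 as non-bursty traffic with an independent destination per cell. The function name `generate_cells` is hypothetical.

```python
import random


def generate_cells(num_slots, load, num_ports, burst_length, rng):
    """Generate the cells of one source as a list of (timeslot, destination).

    In each timeslot a cell is generated with probability `load`.
    The destination is drawn uniformly and then reused for the following
    cells of the same burst (fixed burst length, an assumption).
    """
    cells = []
    remaining = 0  # cells left in the current burst
    dest = None
    for t in range(num_slots):
        if rng.random() < load:
            if remaining == 0:
                dest = rng.randrange(num_ports)
                remaining = max(1, burst_length)
            cells.append((t, dest))
            remaining -= 1
    return cells


rng = random.Random(1)  # fixed seed so that the run is reproducible
cells = generate_cells(num_slots=20000, load=0.9, num_ports=32,
                       burst_length=10, rng=rng)
measured_load = len(cells) / 20000
```

The measured load stays close to the configured value of 0.9, and each group of ten consecutive cells shares one destination.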
Fig. 3. Average delay vs. load for different values of burstiness. Delay is
measured in number of timeslots. With an internal speedup of 2, the modified
architecture has an average delay close to that of an output buffered switch.
In order to determine the required switch card input buffer 
capacity, the input buffer occupancy has been examined for 
various load and burst values. The results are shown in Fig. 4, 
for load values of 85 %, 90 % and 95 %. The switch card 
output buffer size is 20. This value is explained and justified 
later. The occupancy increases rather slowly with the burst 
size. A detailed investigation shows less than logarithmic 
growth, and this result is used to dimension the buffer by 
taking only the system load into account. Assuming a load of 
95 %, the average occupancy is below 40, and by allocating 80 
buffer locations, global backpressure is practically eliminated. 
The total number of switch card buffer locations becomes
(80+20)*32 = 3200, which is feasible to implement on a single chip.
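The dimensioning arithmetic can be reproduced with a few lines, using the figures quoted in this paper (N = 32, RT-BP = 4, 80 input and 20 output buffer locations per port). The function name and the dictionary keys are hypothetical; the largest admissible global threshold is derived here from the constraint $B + \text{RT-BP} \le S$ and is not a value stated in the paper.

```python
def dimension_switch_card(num_ports, rt_bp, input_locations, output_locations):
    """Collect the buffer dimensioning figures for one switch card."""
    return {
        # Each of the N VOQs can reach B_i + RT-BP = 2 * RT-BP cells.
        "worst_case_input_occupancy": 2 * rt_bp * num_ports,
        # Largest global threshold B that satisfies B + RT-BP <= S.
        "max_global_threshold": input_locations - rt_bp,
        # Input plus output buffer locations, summed over all ports.
        "total_locations": (input_locations + output_locations) * num_ports,
    }


sizing = dimension_switch_card(num_ports=32, rt_bp=4,
                               input_locations=80, output_locations=20)
```

The worst-case occupancy of 256 cells per input buffer is well above the 80 allocated locations, which is why the global backpressure signal is still needed as a safeguard even though it is rarely triggered.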
Fig. 4. Average input buffer occupancy vs. burst size. The occupancy grows
slowly with burst size, and the buffer can therefore be dimensioned based on
load values only.
