Cell-level path allocation in a three-stage ATM switch by Collier, Martin & Curran, Thomas
Cell-Level Path Allocation in a Three-Stage ATM Switch 
Martin Collier and Tommy Curran. 
School of Electronic Engineering, Dublin City University 
Glasnevin, Dublin 9, Ireland. 
collierm@dcu.ie, currant@dcu.ie 
ABSTRACT A method of cell-level path allocation for three- 
stage ATM switches has recently been proposed by the authors. 
The performance of ATM switches using this path allocation 
algorithm has been evaluated by simulation, and is described 
here. Both uniform and non-uniform models of output loading 
are considered. The algorithm requires knowledge of the 
number of cells requesting each output module from a given 
input module. A fast method for counting the number of 
requests is described. 
1. Introduction 
A new method of allocating paths through a three-stage ATM 
switch has recently been proposed by the authors [1]. The 
method applies to a switch featuring intermediate channel 
grouping, such as that in Fig. 1. The algorithm hunts 
sequentially for available paths through the intermediate stage 
of the switch, but multiple searches are conducted in parallel, 
so that comparatively few iterations are required to search 
through all possible paths. Hence cell-level path allocation can 
be performed, i.e., the updating of paths after every time slot. A 
full description of the algorithm, and its implementation, may 
be found in [1]. It uses an array of L1·L2 processors called 
atomic() processors to achieve the necessary parallelism 
(where L1 and L2 are as defined in Fig. 1). Each processor must 
be initialised with the value of $K_{ij}$, the number of cells from 
input module i requesting output module j. 
The method for counting cell requests described in [1] has an 
execution time which increases in proportion to the switch size. 
This can limit the maximum achievable switch size. 
A faster method of obtaining the value of $K_{ij}$ is described in 
Section 2. 
The performance of switches using this path allocation 
algorithm is considered in Section 3 (for uniform loading of 
outputs) and Section 4 (for non-uniform loading). 
2. A faster method of request counting 
The method of request counting described in [1] has the 
disadvantage that it sequentially determines the number of cells 
requesting output modules L2-1, L2-2, etc. The total number of 
clock cycles required is n1 + L2, i.e., the sum of the number of 
input ports per input module and the number of output 
modules. Hence, the required clock rate may be excessive in a 
large switch, given that request counting, path allocation, and 
routing tag assignment must (ideally) all take place within the 
duration of one time slot. 
A faster, parallel implementation would simultaneously 
calculate $K_{ij}$ (the number of requests from input module i for 
output module j) for all values of j ($0 \le j < L_2$). Suitable 
hardware will now be described. The execution time for this 
hardware is $2\lceil \log_2(n_1 + L_2) \rceil + 1$ clock cycles, which is 
substantially less than that for the sequential hardware 
described in [1], at the price of an increase in hardware 
complexity. 
Fig. 1 A three-stage switch with intermediate channel grouping 
S1 (S2): the number of channels in the channel group connecting each input (output) module to each intermediate switch module. 
V: channel rate (155 Mb/s). 
n1 (n2): the number of input (output) ports per input (output) module. 
m: the number of intermediate switch modules. 
L1 (L2): the number of input (output) modules. 
The hardware required is shown in Fig. 2. Data cells from the 
n1 input ports associated with input module i are merged with 
L2 control packets (one per output module) by a Batcher sorting 
network. The merge operation is performed in such a way that 
idle cells (i.e., empty cells from inactive input ports) are sorted 
to the highest-numbered output ports of the Batcher network. If 
the control packet for output module j appears at output $D_j$ of 
the Batcher network, then the data cells (if any) requesting that 
output module appear at lower-numbered output ports of the 
sorter (ports $D_j - 1$, $D_j - 2$, etc.), as shown in Fig. 2. 
Under these circumstances, it may readily be shown that 

$$D_j = \sum_{u=0}^{j} K_{iu} + j$$

where $K_{iu}$ is the number of data cells requesting output module 
u, and i is fixed, since the Batcher network processes only 
requests from input module i. 
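
The relation between the control packet positions and the request counts can be illustrated with a minimal software sketch. It assumes 0-indexed sorter outputs, uses Python's built-in sort as a stand-in for the Batcher network, and the function names are hypothetical, not taken from the paper.

```python
def control_packet_positions(requests, num_output_modules):
    """Return the sorter output D_j at which each control packet j appears.

    requests: one entry per input port, either an output-module index
    or None for an idle cell.
    """
    idle = num_output_modules  # assumed key: idle cells sort after every module
    # Key (module, kind): data cells (kind 0) sort ahead of the control
    # packet (kind 1) for the same output module.
    items = [(idle if r is None else r, 0) for r in requests]
    items += [(j, 1) for j in range(num_output_modules)]
    items.sort()  # stand-in for the Batcher sorting network
    return [pos for pos, (_, kind) in enumerate(items) if kind == 1]


def closed_form_positions(counts):
    """D_j = (sum of K_iu for u = 0..j) + j, for request counts K_i0, K_i1, ..."""
    total, positions = 0, []
    for j, k in enumerate(counts):
        total += k
        positions.append(total + j)
    return positions


# Three requests for module 0, two for module 1, none for module 2, one idle port.
example = [0, 1, None, 0, 1, 0]
print(control_packet_positions(example, 3))
print(closed_form_positions([3, 2, 0]))
```

Both calls describe the same situation, so the positions found by sorting should agree with the closed-form sum.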
The key to the new method of request counting is the 
observation that 
0-7803-1825-0/94 $4.00 © 1994 IEEE 
Authorized licensed use limited to: DUBLIN CITY UNIVERSITY. Downloaded on July 19,2010 at 09:43:48 UTC from IEEE Xplore.  Restrictions apply. 
[Fig. 2 not reproduced. Labels recovered from the figure: data cells in; control packets in; K_i0 = 3, K_i1 = 2, K_i2 = 0; to atomic() processors; serial adders (upper input inverted).] 

$$K_{ij} = D_0 \quad (j = 0), \qquad K_{ij} = D_j - D_{j-1} - 1 \quad (j > 0).$$

The necessary subtraction can be performed very simply, 
since 

$$D_j - D_{j-1} - 1 = D_j + \bar{D}_{j-1}$$

where $\bar{D}$ is the 1's complement of $D$, obtained by bitwise 
inversion of $D$. It follows that the value of $K_{ij}$ can be generated 
using a serial adder, and can then be stored in the K register of 
the appropriate atomic() processor (labelled $X_{ij}$ in [1]). 
It is necessary to generate a concentrated list of the values 
$D_0, D_1, D_2, \ldots, D_{L_2-1}$ as input data for the serial adders. These 
values are available at the address generators which have 
received control packets from the sorter outputs, but are not 
concentrated onto contiguous outputs. Hence a concentrator is 
required. This is the purpose of the binary self-routing network 
shown in Fig. 2, which is variously known as the indirect 
binary n-cube [2] and the 'reverse banyan' [3]. 
The address generators forward only control packets to this 
network. Address generators which have received a data cell or 
an idle cell through the Batcher network submit an inactive 
packet to the concentrator. The address generator which 
receives control packet j from output $D_j$ of the Batcher network 
appends a data field to the packet containing the value of $D_j$. 
This packet is then routed to output j of the concentrator. A 
total of L2 control packets is thus simultaneously launched into 
the concentrator, and these are routed to the serial adders at 
outputs zero through L2-1, without blocking. The absence of 
blocking within the concentrator will be verified in Appendix 
A. 
The concentrated list of $D_j$ values is then read by these serial 
adders, the upper input (as shown in Fig. 2) being inverted. 
Hence the $K_{ij}$ values are generated, and passed to the atomic() 
processors. The example considered shows three requests for 
output module zero, two for output module one, and none for 
output module two. It can be seen that the correct values (i.e., 
3, 2 and 0) are returned to processors $X_{i0}$, $X_{i1}$ and $X_{i2}$ 
respectively. 
The submitted packets take two cycles to propagate through 
each stage of the concentrator (one cycle to identify if the 
packet is active, and another to determine where to route it) and 
an additional clock cycle is required before the serial adder 
generates the least significant bit of the appropriate $K_{ij}$ value. 
Hence the number of clock cycles required by the request count 
hardware before path allocation can commence is 

$$2\lceil \log_2(n_1 + L_2) \rceil + 1.$$

This is much less than the n1 + L2 cycles required by the 
hardware described in [1]. Taking as an example the sample 
switch considered in that paper, for which n1 = 96 and L2 = 32, 
the number of clock cycles is reduced from 128 to 15. 
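
The subtraction by bitwise inversion and the two cycle counts can be checked with a short sketch. It models the serial adder as a word-level addition modulo 2^width, which is an assumption made for illustration; the `width` parameter and the function names are hypothetical.

```python
def counts_from_positions(positions, width):
    """Recover K_ij from the concentrated list D_0, D_1, ...

    Uses D_j + (1's complement of D_{j-1}) modulo 2**width, which equals
    D_j - D_{j-1} - 1 whenever 2**width exceeds every D value.
    """
    mask = (1 << width) - 1
    counts = [positions[0]]  # K_i0 = D_0
    for j in range(1, len(positions)):
        inverted = ~positions[j - 1] & mask  # bitwise inversion of D_{j-1}
        # The carry out of the top bit is discarded by the mask.
        counts.append((positions[j] + inverted) & mask)
    return counts


def sequential_cycles(n1, l2):
    """Clock cycles for the sequential request counter: n1 + L2."""
    return n1 + l2


def parallel_cycles(n1, l2):
    """Clock cycles for the parallel hardware: 2 * ceil(log2(n1 + L2)) + 1."""
    stages = (n1 + l2 - 1).bit_length()  # integer form of ceil(log2(n1 + L2))
    return 2 * stages + 1


print(counts_from_positions([3, 6, 7], width=3))
print(sequential_cycles(96, 32), parallel_cycles(96, 32))
```

For the sample switch (n1 = 96, L2 = 32) the sorter has 128 outputs, so a 7-bit word would be wide enough to hold every position.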
3. Performance with uniformly loaded output 
modules 
The performance of a three-stage switch using the cell-level 
path allocation algorithm described above will now be 
evaluated. The cell loss probability must be determined by 
simulation since no analytical method is currently available. 
The simulation model is based on the following assumptions: 
(i) There is no input queueing; cells which are not allocated 
a path on the first attempt are discarded. 
(ii) The switch is offered a maximum load: each input port 
of the switch submits a cell in every time slot. 
(iii) The destination of each cell is drawn from a uniform 
distribution: all output modules receive the same load. 
(iv) The switch modelled is that shown in Fig. 1, for various 
choices of the parameters L1, L2, m, S1, S2 and n1. 
(v) The maximum number of cells generated is $10^9$. If zero 
cell loss is recorded during the simulation, the cell loss 
probability is assigned the value $5 \times 10^{-11}$. The 
probability of losing contention is assumed to be 
independent for each cell, at low levels of loss. With this 
assumption, the probability that the cell loss probability 
(CLP) is below $5 \times 10^{-11}$, given that no losses were 
recorded, is above 95%, i.e., 

$$\Pr[\mathrm{CLP} < 5 \times 10^{-11}] = (1 - 5 \times 10^{-11})^{10^9} > 0.95.$$

(vi) The cell loss probability is assumed to be equal to the 
expected number of cells lost per time slot, as a 
proportion of the offered load. 
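
The numerical claim in assumption (v) can be checked directly. The sketch below takes the values there as 10^9 generated cells and a threshold of 5 × 10^-11, and evaluates the expression through `math.log1p` because the per-cell loss probability is far below double-precision rounding relative to 1. The function name is hypothetical.

```python
import math


def prob_no_loss(clp, cells):
    """(1 - clp) ** cells, assuming independent losses for each cell."""
    # log1p(-clp) keeps precision when clp is tiny compared with 1.
    return math.exp(cells * math.log1p(-clp))


value = prob_no_loss(5e-11, 10**9)
print(value, value > 0.95)
```

The exponent is close to -0.05, so the result sits just above the 0.95 level quoted in the assumption.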
Fig. 3 Performance with uniform traffic 
(L1 = m = L2, S1 = S2 = 4) 
[Plot not reproduced. Vertical axis: logarithmic, 0.1 down to 1E-010; horizontal axis: total inputs (n1 x L1).] 
Fig. 4 Variation of cell loss with n1 
(L1 = m = L2, S1 = S2 = 4). 
[Plot not reproduced. Vertical axis: logarithmic, 0.1 down to 1E-010; horizontal axis: inputs per input module (n1).] 
The influence of the choice of channel group size in Fig. 1 on 
the cell loss probability is considered in Figs. 3-6. The results 
where S1 = S2 = 4 are shown in Fig. 3. Note that L1 = L2 = m 
for the three switches simulated. The curves obtained show 
that, with the number of switch inputs fixed, the cell loss 
probability falls as the number of switch modules is increased, 
as expected. A more interesting result is observed if the results 
are plotted for a fixed size of input module, as shown in Fig. 4. 
These loss curves indicate that, if modular growth is achieved 
by increasing L1, L2 and m in the same proportion, the cell loss 
probability will decrease. Hence, the additional loss due to the 
higher switch load is offset by the increasing diversity of paths 
available. 
Fig. 5 Performance with uniform traffic 
(L1 = m = L2, S1 = 8, S2 = 4). 
[Plot not reproduced. Vertical axis: logarithmic, 0.1 down to 1E-010; horizontal axis: total inputs (n1 x L1).] 
Fig. 5 shows the corresponding results for three similar 
switches, where S1 equals 8. Doubling the channel group size at 
the input side of the intermediate stage gives rise to only a 
marginal decrease in the cell loss probability. Doubling the 
channel group size at the output side of the intermediate stage 
reduces the loss considerably, as shown in Fig. 6. 
Fig. 6 Performance with uniform traffic 
(L1 = m = L2, S1 = 4, S2 = 8). 
[Plot not reproduced. Vertical axis: logarithmic, 0.01 down to 1E-010; horizontal axis: total inputs (n1 x L1).] 
These simulations indicate that the intermediate modules 
should be designed as expansion modules, with more outputs 
than inputs, to obtain the best performance. This seems 
intuitively reasonable: a cell may be routed to any intermediate 
module, but can be routed to only one output module. An 
alternative to increasing S2 is to change the value of m. 
However, this has the disadvantage that the number of 
iterations required by the path allocation algorithm will 
increase, so that higher speed hardware may be required. The 
effect of varying m in the range 30 to 34, for a switch with L1 = 
L2 = 32, S1 = 4 and S2 = 8, is shown in Fig. 7. 
The efficiency of the path allocation algorithm may be 
demonstrated by comparing its performance with the lower 
bound on achievable performance. This bound corresponds to 
the knockout loss in [4]; at most S2·m cells can be delivered to 
each output module in any one time slot. This bound, for the 
case where L1 = L2 = m = 32 and S1 = S2 = 4, is compared to the 
simulation results in Fig. 8. It can be seen that the performance 
of the algorithm is close to the lower bound. 
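
A bound of this kind can be sketched as a binomial tail calculation. The version below is an illustrative assumption and not necessarily the expression used in [4] or for Fig. 8: under assumptions (ii) and (iii) every one of the n1·L1 inputs sends a cell to a uniformly chosen output module, and any cells beyond S2·m arriving for one module in a slot are counted as lost. The function and parameter names are hypothetical.

```python
import math


def knockout_loss_bound(n_inputs, n_modules, capacity):
    """Expected fraction of cells in excess of `capacity` at one output module.

    The number of cells addressed to a module in one slot is taken to be
    Binomial(n_inputs, 1 / n_modules); requires n_modules >= 2.
    """
    p = 1.0 / n_modules
    log_p, log_q = math.log(p), math.log1p(-p)
    lost = 0.0
    for x in range(capacity + 1, n_inputs + 1):
        # Binomial pmf evaluated in the log domain to avoid overflow.
        log_pmf = (math.lgamma(n_inputs + 1) - math.lgamma(x + 1)
                   - math.lgamma(n_inputs - x + 1)
                   + x * log_p + (n_inputs - x) * log_q)
        lost += (x - capacity) * math.exp(log_pmf)
    return lost / (n_inputs * p)  # normalise by the mean offered load


# L1 = L2 = m = 32, S1 = S2 = 4, n1 = 96: capacity S2 * m = 128 cells per slot.
bound = knockout_loss_bound(n_inputs=96 * 32, n_modules=32, capacity=4 * 32)
print(bound)
```

Doubling `capacity` in this call (as when S2 is raised from 4 to 8) makes the tail far smaller, which is in line with the benefit reported for a larger channel group on the output side of the intermediate stage.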
Fig. 7 Effect of changing m (L1 = L2 = 32, S1 = 4, S2 = 8). 
[Plot not reproduced. Vertical axis: logarithmic, 0.01 down to 1E-010; horizontal axis: total inputs (n1 x L1).] 
Fig. 8 Lower bound compared to simulation results 
(L1 = m = L2 = 32, S1 = S2 = 4). 
[Plot not reproduced. Vertical axis: logarithmic, 0.1 down to 1E-010; horizontal axis: inputs per input module (n1); one curve is labelled "Simulation".] 
4. Performance with non-uniformly loaded 
output modules 
The above results all apply to a situation where there is a 
uniform load across all the output modules. Modifying 
assumption (iii) of the simulation model allows non-uniform 
loads to be assessed. 
Fig. 9 shows the results obtained for a switch with L1 = L2 = m 
= 32, for various channel group sizes, in the case where 75% of 
arriving cells request one of sixteen (contiguous) output 
modules, each output module of the sixteen being selected with 
equal probability. The remaining 25% of cells select (with 
equal probability) one of the other sixteen output modules. 
Hence 75% of the load is offered to only 50% of the available 
outputs. 
It can be seen that the cell loss probability increases 
dramatically, compared with the case of uniform loading (i.e., 
Fig. 3 and Fig. 5), when S1 = S2 = 4, or when S1 = 8 and S2 = 4. 
Furthermore, the additional bandwidth available from the input 
stage of the switch to the intermediate stage in the latter case 
results in a negligible decrease in the cell loss probability. In 
contrast, the cell loss probability increases by a relatively small 
amount (compared with Fig. 6) for the case where S1 = 4 and S2 
= 8. This further underlines the benefit of having a large 
bandwidth at the output side of the intermediate stage. 
Fig. 9 Performance with nonuniform traffic for various values 
of intermediate stage bandwidth (L1 = m = L2 = 32). 
[Plot not reproduced. Vertical axis: logarithmic, 0.1 down to 1E-010; horizontal axis: total inputs (n1 x L1).] 
The effect of progressively increasing the asymmetry of the 
load on the switch outputs shall now be investigated in the case 
of a switch where L1 = L2 = m = 32, S1 = 4, S2 = 8 and n1 = 96. 
The output modules are divided into two groups. One group 
(termed the demand group) contains k output modules, with 0 < 
k < L2. The probability of a cell requesting an output module in 
the demand group is chosen so that the expected number of 
cells requesting each module in the demand group is r·S2·m, 
where 0 < r < 1. If a cell does not request an output module in 
the demand group, it requests one of the remaining L2 - k 
output modules, with uniform probability. 
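
The traffic model just described can be sketched as a destination sampler. The sketch assumes full load (every one of the n1·L1 inputs submits a cell in each slot), a uniform choice within the demand group, and a demand group placed at module indices 0 to k-1; these choices and all function names are illustrative assumptions.

```python
import random


def demand_probability(k, r, s2, m, n1, l1):
    """Probability that a cell requests some module in the demand group.

    Chosen so that each demand-group module expects r * S2 * m cells per slot
    when all n1 * L1 inputs are active.
    """
    return k * r * s2 * m / (n1 * l1)


def draw_destination(rng, k, l2, p_demand):
    """Pick an output module; indices 0..k-1 form the (contiguous) demand group."""
    if rng.random() < p_demand:
        return rng.randrange(k)        # uniform within the demand group
    return k + rng.randrange(l2 - k)   # uniform over the remaining modules


p = demand_probability(k=16, r=0.55, s2=8, m=32, n1=96, l1=32)
rng = random.Random(0)
sample = [draw_destination(rng, 16, 32, p) for _ in range(1000)]
print(p, sum(d < 16 for d in sample) / len(sample))
```

With k = 16 and r = 0.55 roughly 73% of cells are directed at the demand group, so the sampled fraction should fall near that value.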
The cell loss probability is shown in Fig. 10 for various values 
of k and r, in the case where the output modules in the demand 
group are contiguous. The probability of cell loss is shown only 
for cells requesting an output module in the demand group. No 
losses were recorded in the simulations for cells requesting the 
background group. 
The probability of loss can be seen to increase significantly as 
the size of the demand group (k) is increased. However, this 
probability stays below $10^{-10}$ if the load offered to each output 
module is below 55% of its input capacity (i.e., if r < 0.55). 
Since each output module has 256 inputs, each can deliver data 
at the full ATM rate to as many as 140 output ports, without 
excessive cell loss, even in the presence of a severe traffic 
imbalance. Setting n2 = 140 in Fig. 1, with the values of the 
other switch parameters as chosen above, results in a 3072 x 
4480 switch, with a cell loss probability below $10^{-10}$ in the 
presence of a 100% offered load and an asymmetric loading of 
the outputs. A square (3072 x 3072) switch should have a much 
lower figure for cell loss. 
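
The dimensioning arithmetic in this paragraph can be reproduced in a few lines. Rounding the number of output ports down to a whole number is an assumption about how the figure of 140 was reached, and the function name is hypothetical.

```python
import math


def switch_dimensions(l1, l2, m, s2, n1, r_max):
    """Return (total inputs, total outputs, n2) for a given load limit r_max."""
    inputs_per_output_module = s2 * m                    # 8 * 32 = 256
    n2 = math.floor(r_max * inputs_per_output_module)    # assumed rounding down
    return l1 * n1, l2 * n2, n2


print(switch_dimensions(l1=32, l2=32, m=32, s2=8, n1=96, r_max=0.55))
```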
Fig. 10 Variation of cell loss with traffic imbalance, where 
contiguous output modules have a high load 
(L1 = m = L2 = 32, S1 = 4, S2 = 8, n1 = 96). 
[Plot not reproduced. Vertical axis: logarithmic, 1E-005 down to 1E-010; horizontal axis: normalised load on demand group, r.] 
Fig. 11 Comparison of performance for contiguous and non-contiguous demand groups 
(L1 = m = L2 = 32, S1 = 4, S2 = 8, n1 = 96). 
[Plot not reproduced. Horizontal axis: normalised load on demand group, r.] 
The cell loss probability depends to some extent on the pattern 
of traffic imbalance present. Fig. 11 compares the loss for a 
switch with a contiguous demand group and a switch with a 
demand group whose output modules are interspersed with 
those carrying background traffic. The variation in cell loss 
probability is most pronounced in the case of the more 
unbalanced load (k = 16). Note that the cell loss probability is 
higher for the contiguous demand group. 
5. Concluding remarks 
The parallel request counting algorithm described here allows 
the necessary clock rate for the path allocation algorithm 
described in [1] to be reduced and/or a larger switch to be 
constructed, compared with the original method of request 
counting. 
The effect of varying the parameters of the switch in Fig. 1 on 
its performance has been determined by simulation. The use of 
expansion switch modules in the intermediate stage of the 
switch has been suggested as an alternative to increasing the 
number of intermediate modules, as a means of reducing cell 
loss. 
It has been demonstrated that the algorithm is not unduly 
sensitive to imbalances in the offered load. Only the case of a 
100% offered load has been considered here; the case where the 
switch inputs are operating well below saturation requires 
further investigation. 
Appendix A 
A sufficient condition for the reverse banyan in Fig. 2 to be 
non-blocking is that, for any pair of input ports $I_1$ and $I_2$, 
requesting output ports $O_1$ and $O_2$ respectively, with $I_2 > I_1$ and 
$O_2 > O_1$, the following is true: 

$$O_2 - O_1 \le I_2 - I_1.$$

This (well-known) result may be established, for example, by 
appropriately modifying the proof in Appendix A of [5], which 
pertains to a (forward) banyan. In the present application, the 
input ports are 

$$I_1 = D_j \quad \text{and} \quad I_2 = D_k$$

and the output ports are 

$$O_1 = j \quad \text{and} \quad O_2 = k,$$

with k > j. 
Hence the non-blocking condition becomes 

$$k - j \le D_k - D_j.$$

The number of requests for each output module must be non-negative, i.e., 

$$K_{ij} \ge 0.$$

Since 

$$D_k - D_j = \sum_{u=j+1}^{k} K_{iu} + k - j$$

it follows that the condition for the network to be non-blocking 
is always satisfied. 
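
The inequality can be spot-checked numerically. The sketch below only tests the sufficient condition for randomly drawn non-negative request counts; it does not simulate routing through a reverse banyan, and the function names are hypothetical.

```python
import random


def positions(counts):
    """D_j = (sum of K_iu for u = 0..j) + j for the given request counts."""
    total, d = 0, []
    for j, k in enumerate(counts):
        total += k
        d.append(total + j)
    return d


def satisfies_condition(counts):
    """Check k - j <= D_k - D_j for every pair of control packets with k > j."""
    d = positions(counts)
    return all(k - j <= d[k] - d[j]
               for j in range(len(d)) for k in range(j + 1, len(d)))


rng = random.Random(1)
trials = [[rng.randrange(0, 10) for _ in range(32)] for _ in range(200)]
print(all(satisfies_condition(c) for c in trials))
```

Because each term of the sum is non-negative, every trial should satisfy the condition regardless of how the requests are distributed.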
References 
[1] M. Collier and T. Curran, "Path allocation in a three-stage ATM switch with 
intermediate channel grouping", Proc. INFOCOM '93, Session 8b, pp. 927-934, 
April 1, 1993. 
[2] M.C. Pease, "The indirect binary n-cube microprocessor array", IEEE Trans. 
Comput., vol. 26, no. 5, pp. 250-265, May 1977. 
[3] H. Kim and A. Leon-Garcia, "A multistage ATM switch with interstage 
buffers", Proc. of the International Switching Symposium, Stockholm, 1990, 
vol. V, pp. 15-20. 
[4] K. Eng et al., "A modular broadband (ATM) switch architecture with 
optimum performance", Proc. of the International Switching Symposium, 
Stockholm, 1990, vol. IV, pp. 1-6. 
[5] T.T. Lee, "Nonblocking copy networks for multicast packet switching", 
Journal of Select. Areas Commun., vol. 6, no. 9, pp. 1455-1467, Dec. 
1988. 
