ABSTRACT A method of cell-level path allocation for three-stage ATM switches has recently been proposed by the authors. The performance of ATM switches using this path allocation algorithm has been evaluated by simulation, and is described here. Both uniform and non-uniform models of output loading are considered. The algorithm requires knowledge of the number of cells requesting each output module from a given input module. A fast method for counting the number of requests is described.
Introduction
A new method of allocating paths through a three-stage ATM switch has recently been proposed by the authors [1]. The method applies to a switch featuring intermediate channel grouping, such as that in Fig. 1. The algorithm hunts sequentially for available paths through the intermediate stage of the switch, but multiple searches are conducted in parallel, so that comparatively few iterations are required to search through all possible paths. Hence cell-level path allocation can be performed, i.e., the updating of paths after every time slot. A full description of the algorithm, and its implementation, may be found in [1]. It uses an array of L1 x L2 processors called atomic() processors to achieve the necessary parallelism (where L1 and L2 are as defined in Fig. 1). Each processor must be initialised with the value of K_ij, the number of cells from input module i requesting output module j. The method for counting cell requests described in [1] has an execution time which increases in proportion to the switch size. This can limit the maximum achievable switch size. A faster method of obtaining the value of K_ij is described in Section 2. The performance of switches using this path allocation algorithm is considered in Section 3 (for uniform loading of outputs) and Section 4 (for non-uniform loading).
A faster method of request counting
The method of request counting described in [1] has the disadvantage that it sequentially determines the number of cells requesting output modules L2-1, L2-2, etc. The total number of clock cycles required is n1 + L2, i.e., the sum of the number of input ports per input module and the number of output modules. Hence, the required clock rate may be excessive in a large switch, given that request counting, path allocation, and routing tag assignment must (ideally) all take place within the duration of one time slot. A faster, parallel implementation would simultaneously calculate K_ij (the number of requests from input module i for output module j) for all values of j (0 <= j < L2). Suitable hardware will now be described. The execution time for this hardware is $2\lceil \log_2(n_1 + L_2) \rceil + 1$ clock cycles, which is substantially less than that for the sequential hardware described in [1], at the price of an increase in hardware complexity.
The hardware required is shown in Fig. 2. Data cells from the n1 input ports associated with input module i are merged with L2 control packets (one per output module) by a Batcher sorting network. The merge operation is performed in such a way that idle cells (i.e., empty cells from inactive input ports) are sorted to the highest-numbered output ports of the Batcher network. If the control packet for output module j appears at output D_j of the Batcher network, then the data cells (if any) requesting that output module appear at lower-numbered output ports of the sorter (ports D_j - 1, D_j - 2, etc.), as shown in Fig. 2. The number of requests for output module j is therefore the number of sorter outputs lying between the control packets for output modules j - 1 and j, i.e., $K_{ij} = D_j - D_{j-1} - 1$ for j > 0, with $K_{i0} = D_0$. The necessary subtraction can be performed very simply, since $-D = \bar{D} + 1$, where $\bar{D}$ is the 1's complement of D, obtained by bitwise inversion of D. It follows that the value of K_ij can be generated using a serial adder, and can then be stored in the K register of the appropriate atomic() processor (labelled X_ij in [1]).
It is necessary to generate a concentrated list of the values D_0, D_1, ..., D_{L2-1} as input data for the serial adders. These values are available at the address generators which have received control packets from the sorter outputs, but are not concentrated onto contiguous outputs. Hence a concentrator is required. This is the purpose of the binary self-routing network shown in Fig. 2, which is variously known as the indirect binary n-cube [2] and the 'reverse banyan' [3]. The address generators forward only control packets to this network. Address generators which have received a data cell or an idle cell through the Batcher network submit an inactive packet to the concentrator. The address generator which receives control packet j from output D_j of the Batcher network appends a data field to the packet containing the value of D_j. This packet is then routed to output j of the concentrator. A total of L2 control packets is thus simultaneously launched into the concentrator, and these are routed to the serial adders at outputs zero through L2-1, without blocking. The absence of blocking within the concentrator will be verified in Appendix A. The concentrated list of D_j values is then read by these serial adders, the upper input (as shown in Fig. 2) being inverted. Hence the K_ij values are generated, and passed to the atomic() processors.
The example considered shows three requests for output module zero, two for output module one, and none for [...]. The submitted packets take two cycles to propagate through [...]. This is much less than the n1 + L2 cycles required by the hardware described in [1]. Taking as an example the sample switch considered in that paper, for which n1 = 96 and L2 = 32, the number of clock cycles is reduced from 128 to 15.
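The counting scheme of this section can be illustrated in software. The following is a minimal sketch, not the hardware of Fig. 2: Python's built-in sort stands in for the Batcher network, and the function names (count_requests, sequential_cycles, parallel_cycles) are illustrative assumptions. It assumes that the count for output module j is the number of sorter outputs between consecutive control packets, and obtains it by adding D_j to the bitwise inverse of D_{j-1} in a fixed word width, as a serial adder with one inverted input would.

```python
from math import ceil, log2


def count_requests(requests, num_output_modules):
    """Return (D, K) for one input module.

    requests: one entry per input port, either an output-module index
              or None for an idle port.
    D[j]: sorter output at which the control packet for module j appears.
    K[j]: number of data cells requesting module j.
    """
    idle_key = num_output_modules  # sorts after every real module index
    # Sort key (module, kind): data cells (kind 0) precede the control
    # packet (kind 1) of the same module; idle cells sort to the top.
    items = [(idle_key if r is None else r, 0) for r in requests]
    items += [(j, 1) for j in range(num_output_modules)]
    items.sort()  # software stand-in for the Batcher sorting network

    D = [pos for pos, (_, kind) in enumerate(items) if kind == 1]

    # Word width large enough to hold any sorter output number.
    width = max(1, ceil(log2(len(items))))
    mask = (1 << width) - 1

    K = []
    prev = -1  # all-ones word, so its inverse is zero and K[0] = D[0]
    for d in D:
        # D_j + inverse(D_{j-1}) equals D_j - D_{j-1} - 1 modulo 2**width.
        K.append((d + (~prev & mask)) & mask)
        prev = d
    return D, K


def sequential_cycles(n1, L2):
    """Clock cycles quoted in the text for the sequential method of [1]."""
    return n1 + L2


def parallel_cycles(n1, L2):
    """Clock cycles quoted in the text for the parallel method."""
    return 2 * ceil(log2(n1 + L2)) + 1


if __name__ == "__main__":
    # Three requests for module 0, two for module 1, none for module 2.
    print(count_requests([0, 1, 0, None, 1, 0], 3))
    print(sequential_cycles(96, 32), parallel_cycles(96, 32))
```

With six input ports and three output modules, the control packets land at sorter outputs 3, 6 and 7, giving counts of 3, 2 and 0. For n1 = 96 and L2 = 32 the two cycle counts evaluate to 128 and 15, in agreement with the figures quoted in the text.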
Performance with uniformly loaded output modules
The performance of a three-stage switch using the cell-level path allocation algorithm described above will now be evaluated. The cell loss probability must be determined by simulation since no analytical method is currently available. The simulation model is based on the following assumptions:
(i) There is no input queueing; cells which are not allocated a path on the first attempt are discarded.
(ii) The switch is offered a maximum load: each input port of the switch submits a cell in every time slot.
(iii) The destination of each cell is drawn from a uniform distribution: all output modules receive the same load.
(iv) The switch modelled is that shown in Fig. 1, for various choices of the parameters L1, L2, m, S1, S2 and n1.
(v) The maximum number of cells generated is 10^[...]. If zero cell loss is recorded during the simulation, the cell loss probability is assigned the value 5 x 10^-[...].
The probability of losing contention is assumed to be independent for each cell, at low levels of loss. With this assumption, the probability that the cell loss probability (CLP) is below 5 x 10^-[...], given that no losses were recorded, is above 95%.
The influence of the choice of channel group size in Fig. 1 on the cell loss probability is considered in Figs. 3-6. The results where S1 = S2 = 4 are shown in Fig. 3. Note that L1 = L2 = m for the three switches simulated. The curves obtained show that, with the number of switch inputs fixed, the cell loss probability falls as the number of switch modules is increased, as expected. A more interesting result is observed if the results are plotted for a fixed size of input module, as shown in Fig. 4. These loss curves indicate that, if modular growth is achieved by increasing L1, L2 and m in the same proportion, the cell loss probability will decrease. Hence, the additional loss due to the higher switch load is offset by the increasing diversity of paths available.
These simulations indicate that the intermediate modules should be designed as expansion modules, with more outputs than inputs, to obtain the best performance. This seems intuitively reasonable: a cell may be routed to any intermediate module, but can be routed to only one output module. An alternative to increasing S2 is to change the value of m. However, this has the disadvantage that the number of iterations required by the path allocation algorithm will increase, so that higher speed hardware may be required.
The effect of varying m in the range 30 to 34, for a switch with L1 = L2 = 32, S1 = 4 and S2 = 8, is shown in Fig. 7.
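The confidence statement attached to assumption (v) of the simulation model can be checked numerically. The sketch below assumes, as the text does, that losses are independent between cells, so that a run of N cells with true loss probability p records no loss with probability (1 - p)^N. The function names and the run length N = 10^9 are hypothetical values chosen for illustration only; they are not taken from the simulations reported here.

```python
import math


def confidence_below(p, n_cells):
    """Confidence that the loss probability is below p after n_cells
    cells with zero recorded loss, i.e. 1 - (1 - p)**n_cells."""
    return 1.0 - math.exp(n_cells * math.log1p(-p))


def upper_bound(confidence, n_cells):
    """Smallest p for which confidence_below(p, n_cells) reaches the
    requested confidence level."""
    return -math.expm1(math.log1p(-confidence) / n_cells)


if __name__ == "__main__":
    n_cells = 10**9  # hypothetical run length, for illustration only
    print(upper_bound(0.95, n_cells))       # close to 3 / n_cells
    print(confidence_below(5e-9, n_cells))  # bound of 5 / n_cells
```

Under these assumptions a zero-loss run of N cells supports a 95% bound of roughly 3/N, so quoting 5/N leaves some margin above the 95% level.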
Performance with non-uniformly loaded output modules
The above results all apply to a situation where there is a uniform load across all the output modules. Modifying assumption (iii) of the simulation model allows non-uniform loads to be assessed. Fig. 9 shows the results obtained for a switch with L1 = L2 = m = 32, for various channel group sizes, in the case where 75% of arriving cells request one of sixteen contiguous output modules, each output module of the sixteen being selected with equal probability. The remaining 25% of cells select (with equal probability) one of the other sixteen output modules.
Hence 75% of the load is offered to only 50% of the available outputs.
It can be seen that the cell loss probability increases dramatically, compared with the case of uniform loading (i.e., Fig. 3 and Fig. 5), when S1 = S2 = 4, or when S1 = 8 and S2 = 4.
Furthermore, the additional bandwidth available from the input stage of the switch to the intermediate stage in the latter case results in a negligible decrease in the cell loss probability. In contrast, the cell loss probability increases by a relatively small amount (compared with Fig. 6) for the case where S1 = 4 and S2 = 8. This further underlines the benefit of having a large bandwidth at the output side of the intermediate stage.
The output modules are divided into two groups. One group (termed the demand group) contains k output modules, with 0 < k < L2. The probability of a cell requesting an output module in the demand group is chosen so that the expected number of cells requesting each module in the demand group is r S2 m, where 0 < r < 1. If a cell does not request an output module in the demand group, it requests one of the remaining L2 - k output modules, with uniform probability. The cell loss probability is shown in Fig. 10 for various values of k and r, in the case where the output modules in the demand group are contiguous. The probability of cell loss is shown only for cells requesting an output module in the demand group. No losses were recorded in the simulations for cells requesting the background group.
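The demand-group traffic model can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulator: the function names are hypothetical, the switch is taken to be fully loaded with L1 x n1 cells per time slot (assumption (ii) of Section 3), cells requesting the demand group are assumed to pick one of its k modules uniformly, and the demand group is placed at modules 0 to k-1. The parameter values follow the caption of Fig. 11.

```python
import random


def demand_group_probability(k, r, L1, n1, S2, m):
    """Probability that a cell requests the demand group, chosen so that
    each of its k modules expects r * S2 * m cells per time slot when
    L1 * n1 cells are submitted per slot."""
    p = k * r * S2 * m / (L1 * n1)
    if not 0.0 <= p <= 1.0:
        raise ValueError("requested load cannot be met at full input load")
    return p


def draw_destination(rng, k, p_demand, L2):
    """Draw one destination output module: modules 0..k-1 form the
    (contiguous) demand group, modules k..L2-1 the background group."""
    if rng.random() < p_demand:
        return rng.randrange(k)
    return rng.randrange(k, L2)


if __name__ == "__main__":
    L1, L2, m, S2, n1 = 32, 32, 32, 8, 96  # values from the Fig. 11 caption
    k, r = 16, 0.55
    p_demand = demand_group_probability(k, r, L1, n1, S2, m)
    rng = random.Random(0)
    sample = [draw_destination(rng, k, p_demand, L2) for _ in range(10)]
    print(p_demand, sample)
```

With these values S2 m = 256 and L1 n1 = 3072, so the demand-group probability is k r / 12; for k = 16 and r = 0.55 roughly 73% of cells are directed at half of the output modules.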
The probability of loss can be seen to increase significantly as the size of the demand group (k) is increased. However, this probability stays below 10^-[...] if the load offered to each output module is below 55% of its input capacity (i.e., if r < 0.55).
Since each output module has 256 inputs, each can deliver data at the full ATM rate to as many as 140 output ports, without excessive cell loss, even in the presence of a severe traffic imbalance. Setting n2 = 140 in Fig. 1 [...]
Fig. 11 Comparison of performance for contiguous and non-contiguous demand groups (L1 = m = L2 = 32, S1 = 4, S2 = 8, n1 = 96). Horizontal axis: normalised load on demand group, r.
The cell loss probability depends to some extent on the pattern of traffic imbalance present. Fig. 11 compares the loss for a switch with a contiguous demand group and a switch with a demand group whose output modules are interspersed with those carrying background traffic. The variation in cell loss probability is most pronounced in the case of the more unbalanced load (k = 16). Note that the cell loss probability is higher for the contiguous demand group.
Concluding remarks
The parallel request counting algorithm described here allows the necessary clock rate for the path allocation algorithm described in [1] to be reduced and/or a larger switch to be constructed, compared with the original method of request counting.
The effect of varying the parameters of the switch in Fig. 1 
Appendix A
A sufficient condition for the reverse banyan in Fig. 2 to be non-blocking is that, for any pair of input ports I1 and I2, requesting output ports O1 and O2 respectively, with I2 > I1 and O2 > O1, the following is true:
[...] it follows that the condition for the network to be non-blocking is always satisfied.
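The non-blocking property of the concentrator can also be checked by brute force on a small network. The sketch below rests on an assumed model of the indirect binary n-cube, not on a description taken from this paper: stage s connects lines whose addresses differ in bit s (least significant bit first), and a packet leaving stage s occupies the line whose bit s has been set to the corresponding bit of its destination. Two packets block each other if they occupy the same line after any stage. All function names are illustrative.

```python
from itertools import combinations


def route_lines(src, dst, n_stages):
    """Lines occupied by one packet after each stage of the network."""
    line = src
    lines = []
    for s in range(n_stages):
        bit = 1 << s
        line = (line & ~bit) | (dst & bit)  # stage s fixes address bit s
        lines.append(line)
    return lines


def is_nonblocking(pairs, n_stages):
    """True if no two (src, dst) packets share a line after any stage."""
    paths = [route_lines(src, dst, n_stages) for src, dst in pairs]
    for s in range(n_stages):
        occupied = [path[s] for path in paths]
        if len(set(occupied)) != len(occupied):
            return False
    return True


def concentrator_always_nonblocking(n_stages):
    """Try every set of active inputs; the j-th active input (in
    increasing order) is routed to output j, as in the concentrator."""
    size = 1 << n_stages
    for count in range(1, size + 1):
        for inputs in combinations(range(size), count):
            pairs = [(src, j) for j, src in enumerate(inputs)]
            if not is_nonblocking(pairs, n_stages):
                return False
    return True


if __name__ == "__main__":
    print(concentrator_always_nonblocking(3))
    # A pattern that spreads packets apart instead of concentrating them.
    print(is_nonblocking([(0, 1), (1, 3)], 2))
```

Under this model every concentration pattern on an 8-line network routes without conflict, whereas the last pattern, which sends adjacent inputs to outputs two apart, collides at the first stage. The exhaustive search grows exponentially with network size and is only intended for small cases.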
