A path allocation algorithm for a three-stage ATM switch with intermediate channel grouping is considered. The results of the algorithm are forwarded to the appropriate cells by a routing tag assignment network. A fast method of routing tag assignment is described, which employs a non-blocking copy network. This reduces the clock rate required of the circuitry, for a given switch size.
Introduction.
The use of three stages of switching to allow a large ATM switch to be constructed using smaller switch modules has received considerable attention [1-7]. A key consideration in the design of such switches is the routing algorithm used. The process of determining a routing pattern through the second or intermediate stage of the switch which results in the avoidance of blocking in that stage is referred to as path allocation by the author. Cell-level path allocation algorithms have featured in a number of switch designs [3-7]. Such algorithms require special hardware, so that reconfiguration can be performed at the necessary rate (once per time slot). One such technique, proposed by Collier and Curran, supports the use of intermediate channel grouping, i.e., channel grouping at both the input and output sides of the intermediate stage modules.
This algorithm may be applied to the three-stage switch shown in Fig. 1. This is an $n_1 L_1 \times n_2 L_2$ switch, with $L_1$, $m$ and $L_2$ modules in the input, intermediate and output stages respectively. There are $S_1$ links in the channel group connecting input and intermediate stage modules, and $S_2$ links in the channel group connecting intermediate and output stage modules.
The algorithm requires the following hardware, as discussed in [5].
Request counting circuitry. A total of $L_1$ copies of this circuitry is required (one for each input module). This circuitry has $n_1$ inputs and $L_2$ outputs, and counts the number of cells requesting each output module. This data is required to initialise the algorithm in each time slot.
The atomic() processor array. This array contains $L_1 L_2$ processors, which implement the algorithm in parallel. Processor $X_{ij}$ deals with requests from the $i$-th input module for the $j$-th output module.
Circuitry for routing tag assignment. A total of $L_1$ copies of this circuit is also required. This circuitry forwards the results of the path allocation algorithm from the processors to the relevant cells.
The atomic() processor array was described in [5] for the special case where the same number of modules is used in each stage. This description shall be generalised in Section 2 of this paper, so that the motivation for using intermediate channel grouping shall become apparent.
The hardware described in [5] for performing request counting and routing tag assignment operates sequentially, and so is relatively slow. This limits the maximum size of switch which can be implemented for a given system clock rate. Details of a faster method of request counting were presented in [6]. A fast method of routing tag assignment will be described in Section 3 of this paper. The performance of a switch employing this algorithm for path allocation has been evaluated in [6].
The path allocation algorithm.
The algorithm requires $m'$ iterations, where $m' = \max(L_1, L_2, m)$, preceded by an initialisation step. It operates on the following quantities: [...]

It is implemented by an $L_1 \times L_2$ array of processors. The processor in row $i$ and column $j$ of the array is called $X_{ij}$. Processor $X_{ij}$ executes the procedure atomic($i$, $(i+j-k) \bmod m'$, $j$) during iteration $k$ of the algorithm. The procedure atomic($i$, $r$, $j$) is defined as [...] where $R_{irj}$ is the number of cells which will be routed from input module $i$ to output module $j$ via intermediate module $r$.

After each iteration, the new value of $B$ is forwarded to $X_{(i+1) \bmod m',\, j}$ and the new value of $A$ is forwarded to $X_{i,\,(j+1) \bmod m'}$. The $K$ register value is retained locally. If $L_1 < m'$, $X_{ij}$ is not an atomic() processor, for $i \ge L_1$, but simply a $B$ register. Similarly, if $L_2 < m'$, $X_{ij}$ is not an atomic() processor, for $j \ge L_2$, but simply an $A$ register. These extra registers introduce delays to ensure that processor $X_{ij}$ receives $A_{ir}$ and $B_{rj}$ simultaneously. An example of such an array is shown in Fig. 2.
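The iteration schedule can be illustrated in software. The following is a minimal sequential sketch, not the systolic hardware. Since the defining equation of atomic() is not reproduced in this text, it assumes that atomic(i, r, j) routes R = min(A, B, K) cells, where A is taken to be the number of free links from input module i to intermediate module r, B the number of free links from intermediate module r to output module j, and K the number of outstanding requests from i to j. The function name `allocate_paths` and the decision to skip indices r >= m are hypothetical choices made for illustration.

```python
def allocate_paths(requests, m, S1, S2):
    """Sequential simulation of the atomic() array (assumed semantics).

    requests[i][j] is the number of cells at input module i requesting
    output module j. Returns (R, K): R maps (i, r, j) to the number of
    cells routed via intermediate module r, K holds unserved requests.
    """
    L1, L2 = len(requests), len(requests[0])
    m_prime = max(L1, L2, m)
    A = [[S1] * m for _ in range(L1)]   # assumed: free links, input i -> intermediate r
    B = [[S2] * L2 for _ in range(m)]   # assumed: free links, intermediate r -> output j
    K = [row[:] for row in requests]    # outstanding requests, retained per processor
    R = {}
    for k in range(m_prime):
        for i in range(L1):
            for j in range(L2):
                r = (i + j - k) % m_prime
                if r >= m:
                    continue            # assumption: no such intermediate module
                n = min(A[i][r], B[r][j], K[i][j])  # assumed form of atomic()
                A[i][r] -= n
                B[r][j] -= n
                K[i][j] -= n
                if n:
                    R[(i, r, j)] = n
    return R, K
```

Within one iteration, no two processors in the same row or the same column use the same value of r, so each A and B register is touched by at most one processor per iteration. This is what allows the hardware to run all processors in parallel, and why the sequential loop gives the same result.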
A large value for $m$ will be required if intermediate channel grouping is not used. Hence, the execution time of the algorithm, in clock cycles, will be long, and so a high system clock rate will be required. This penalty can be avoided if the bandwidth of the intermediate stage is increased by increasing $S_1$ and $S_2$, rather than by increasing $m$.
3. Routing tag assignment using a copy network.
Principles of operation.
The functions to be performed during routing tag assignment resemble those carried out by the 'allocation network' in Pattavina's switch [8]. Hence, fast routing tag assignment could be performed using a running sum adder network, similar to that employed by Pattavina. An alternative method for fast routing tag assignment will now be described, which uses a modified version of Lee's copy network [9]. This network is used to broadcast a routing tag simultaneously to all the address generators which should receive it.
A Batcher network and a copy network are used, as shown in Fig. 3. The copy network has $n_1 + L_2$ inputs and outputs. The routing packet generators are connected to $L_2$ of the copy network inputs, and the remaining inputs are idle. Routing packet generator $\mathrm{RPG}_j$ receives the value of $K_{ij}$ from the appropriate atomic() processor. The atomic() processor $X_{ij}$ generates a sequence of $K_{ij}$ values, one after every iteration of the algorithm, commencing with $K_{ij}^0$ (the initial value of $K_{ij}$, determined by the request counting hardware) and decrementing, after every iteration, in accordance with the atomic() procedure, as paths are allocated to cells.

The Batcher network merges the data cells (arriving at the input ports of input module $i$) with a set of control packets (one for each output module) in such a way that the data cells requesting output module $j$ appear at higher-numbered output ports of the Batcher network than control packet $j$. Hence, the cells requesting output module $j$ appear at outputs $D_{ij} + 1$ through $D_{ij} + K_{ij}^0$ of the Batcher network, where the value of $D_{ij}$ is the address of the sorter output port at which control packet $j$ appears. Evidently [...]

The routing packet generator for output module $j$ ($\mathrm{RPG}_j$) must forward the relevant routing tags to the data cells at outputs $D_{ij} + 1$ through $D_{ij} + K_{ij}^0$ of the Batcher network. The request counting hardware described in [5, 6] can be (trivially) modified to generate $D_{ij}$.
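The relationship between the request counts and the sorter output ports can be sketched as follows. This is an illustration under stated assumptions, not the hardware of [5, 6]: ports are numbered from 0, the sorter output holds control packet 0, then the cells for output module 0, then control packet 1, and so on, with no other packets in between. The function names and the example counts are hypothetical.

```python
def control_packet_ports(counts):
    """Port of each control packet, given counts[j] cells requesting module j.

    Assumes control packet j sits immediately below the cells requesting
    output module j in the sorted sequence, with ports numbered from 0.
    """
    ports = []
    position = 0
    for k in counts:
        ports.append(position)
        position += 1 + k   # skip the control packet and its k data cells
    return ports


def broadcast_intervals(counts):
    """(lowest port, highest port) holding cells for each output module.

    An interval whose upper end is below its lower end denotes an empty
    group (no cells requested that module).
    """
    return [(d + 1, d + k)
            for d, k in zip(control_packet_ports(counts), counts)]
```

For example, with illustrative counts of 4, 6, 0 and 1 cells for four output modules, the control packets occupy ports 0, 5, 12 and 13, and the third module (which has no requests) gets the empty interval (13, 12).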
[...] through which routes were allocated. If [...] = 0, an inactive packet is submitted. Cells thus receive a number of routing tokens; only the most recently received token is used to generate a routing tag. A set of null tokens is broadcast to those cells which cannot be routed when the path allocation algorithm has terminated. This requires only a few clock cycles. The hardware in [5] introduced an additional delay, since the broadcast of tokens occurred sequentially rather than simultaneously.

The routing packets submitted to the copy network do not collide, since (as shown in [10]) they satisfy the condition for avoiding internal contention in Lee's copy network, presented in [...].
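The rule that only the most recently received token counts can be illustrated with a small simulation for a single output module. This is a sketch under assumptions: each broadcast is taken to reach ports D + 1 through D + K for the current value of K, one broadcast is made per iteration using the K value held before that iteration, and a final null broadcast uses the K value left after the last iteration. The names `assign_tokens` and `NULL_TOKEN` are hypothetical.

```python
NULL_TOKEN = None  # hypothetical marker for "no path allocated"


def assign_tokens(d, k_values, tokens):
    """Return {port: token} after all broadcasts for one output module.

    d         -- port of the control packet (cells start at d + 1)
    k_values  -- K before iteration 0, then K after each iteration
    tokens    -- one token per iteration (e.g. an intermediate module id)
    """
    if len(k_values) != len(tokens) + 1:
        raise ValueError("need one K value per iteration plus the initial one")
    received = {}
    broadcasts = list(tokens) + [NULL_TOKEN]   # final broadcast carries null tokens
    for k, token in zip(k_values, broadcasts):
        for port in range(d + 1, d + k + 1):   # empty when k == 0 (inactive packet)
            received[port] = token             # a later token overwrites an earlier one
    return received
```

Because K never increases, the cells at the highest ports stop receiving tokens first, so the last token each of them saw is the one for the iteration in which its path was allocated.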
A difficulty with this method of routing tag assignment is the amount of data which must be processed during each broadcast. In general, Lee's copy network must process two bits (one each from the upper and lower address) in addition to the activity bits, at each stage. Hence, the interval between successive passes of the algorithm, in bit times, will be quite large. The speed of the algorithm can be increased by observing that, in this application of the copy network, the lower address bit processed at each node never changes after the first (initialisation) step of the algorithm. Hence, on subsequent passes of the algorithm, there is no need to distribute the lower address, so that the header on the routing packet may be shortened, reducing the delay through the copy network.
The proof of the assertion that the lower address bit to be processed at each node of the copy network never changes after the first iteration is given in Appendix A.
An example of routing tag assignment
The success of this approach to routing tag assignment shall be demonstrated by an example described in Tables 1 through 3. The example considered features 4 modules in each stage of the switch. Table 1 indicates the number of cells from input module 0 (IM0) which have requested each of the four output modules and a possible outcome of the path allocation process. Table 2 indicates the contents of the $K$ register of each atomic() processor associated with IM0 after initialisation ($0^-$) and after each iteration ($0^+$, $1^+$, $2^+$ and $3^+$). The resulting values of the routing packet headers are shown in Table 3.
The broadcasts for each iteration of the algorithm are illustrated in Figs. 4(a) through 4(e). The value printed beside each link on the broadcast trees represents the lower address bit to be processed by the next stage of the copy network. It can be seen that this never changes after initialisation. After five broadcast operations, the correct number of cells has been assigned a path via each intermediate switch module.
Clock cycles required
The length of each routing packet (after the first) is

$$L_r = 1 + \lceil \log_2 (m + 1) \rceil + \lceil \log_2 (n_1 + L_2) \rceil$$

i.e., one activity bit, enough bits to represent the token address, and sufficient bits to represent the requested upper copy network output. Hence, routing packets may be submitted to the network at the rate of one every $L_r$ clock cycles. If this exceeds the number of clock cycles required for one iteration of the path allocation algorithm, an undesirable delay is introduced, whereby the path allocation can only proceed at the rate of one iteration per $L_r$ clock cycles.

A solution to this difficulty involves a reduction in the value of $L_r$. The token address is not broadcast, except during initialisation. The address generator stores the token address received then, and on subsequent iterations calculates the token address by decrementing the previous value. Therefore, the value of $L_r$ can be reduced by $\lceil \log_2 (m + 1) \rceil$ bits.

Once path allocation is complete, the tag assignment process requires only a further

$$L_r + 2 \lceil \log_2 (n_1 + L_2) \rceil$$

clock cycles to terminate (this is the time required to generate the final null routing packet, and to propagate it through the copy network). This compares favourably with the corresponding number of clock cycles for the method described in [5], which, in the worst case, requires $\lceil n_1 / \min(S_1, S_2) \rceil \Delta$ clock cycles, where $\Delta$ (typically equal to one) is the number of clock cycles required to propagate a routing packet through an address generator. Which of the two strategies is to be preferred depends on the required operating speed of the circuitry.

[Table 1: the labels "total", "cells routed" and "cells losing contention" survive; the entries are not recoverable.]

[Table 3 (partial): routing packet for OM2: 0,13,12 on each of the five broadcasts; routing packet for OM3: 1,14,14 on the first broadcast and 0,14,13 on the remaining four.]
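The packet-length and termination expressions can be evaluated with a few lines of code. This is a small sketch of the arithmetic only; the function names and the example dimensions (m = 4, n1 = 12, L2 = 4) are illustrative assumptions, not values taken from the paper's tables.

```python
def ceil_log2(x):
    """Smallest integer b with 2**b >= x, for integer x >= 1."""
    return (x - 1).bit_length()


def routing_packet_length(m, n1, L2, broadcast_token=True):
    """Bits per routing packet: activity bit, optional token, upper address."""
    length = 1 + ceil_log2(n1 + L2)
    if broadcast_token:
        length += ceil_log2(m + 1)   # token address, dropped after initialisation
    return length


def termination_cycles(packet_length, n1, L2):
    """Clock cycles to finish tag assignment once path allocation is complete."""
    return packet_length + 2 * ceil_log2(n1 + L2)


full = routing_packet_length(m=4, n1=12, L2=4)                            # 1 + 3 + 4
short = routing_packet_length(m=4, n1=12, L2=4, broadcast_token=False)    # 1 + 4
print(full, short, termination_cycles(short, n1=12, L2=4))
```

Integer bit-length arithmetic is used in place of floating-point logarithms so that exact powers of two are not affected by rounding.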
Conclusions.
The principal difficulty in implementing a cell-level path allocation algorithm for three-stage ATM switches is the high bit rate required of the circuitry. The fast method of routing tag assignment described here, together with the fast method of request counting presented in [6], allows the bit rate of the algorithm described in [5] to be reduced considerably, for a given switch size. It was estimated, in [5], that a switch with $L_1 = m = L_2 = 32$, [...] $= 96$, $S_1 = 4$ and $S_2 = 8$ would require a system clock rate of 290 MHz. The use of the faster hardware techniques described here and in [6] (in conjunction with a faster implementation of the atomic() processor described in [10], requiring only 9 clock cycles per iteration) allows this rate to be reduced to about 130 MHz. Hence, it should be possible to implement the algorithm in CMOS VLSI.
Appendix A
The lower address bit need not be transmitted through the copy network on the second and subsequent passes of the routing tag assignment algorithm. This may readily be demonstrated if the Boolean interval splitting algorithm proposed by Lee [9] is described in the following terms.

Let the lower and upper outputs of a switch element of the copy network be referred to as $\mathrm{OUT}_0$ and $\mathrm{OUT}_1$ respectively. Consider the switching which occurs at stage $k$ of the network. The incoming packet is described by three quantities, $A$ (the activity; $A = 1$ if a packet is present, $A = 0$ for an idle input), $L$ (the lower address) and $U$ (the upper address). The corresponding quantities for $\mathrm{OUT}_0$ are $A(\mathrm{OUT}_0)$, $L(\mathrm{OUT}_0)$ and $U(\mathrm{OUT}_0)$, and for $\mathrm{OUT}_1$ are $A(\mathrm{OUT}_1)$, $L(\mathrm{OUT}_1)$ and $U(\mathrm{OUT}_1)$. Let $l_k$ be the lower address bit inspected at stage $k$.
Using this notation, it follows that [...] where [...] takes the value 1 for $x \ge 0$ and 0 for $x < 0$.
Note that when $A = 0$, the values of $L$ and $U$ are 'don't cares'. There are three possible outcomes of the Boolean interval splitting algorithm (routing to $\mathrm{OUT}_0$, routing to $\mathrm{OUT}_1$, or routing to both outputs). It may be shown that the above description of the data on $\mathrm{OUT}_0$ and $\mathrm{OUT}_1$ gives results consistent with those described by Lee, in all three cases. This proves the validity of this description of Lee's algorithm.
It is apparent that the values of $L(\mathrm{OUT}_0)$ and $L(\mathrm{OUT}_1)$ are dependent only on $L$ and $k$. Hence it follows that the value of $l_k$ for each node in the tree consisting of copy network links which are carrying a packet which originated at a single input is dependent only on $L$.

It can be concluded that two broadcasts, both from the same input port, and with $(A, L, U)$ equal to $(1, L, U_1)$ and $(1, L, U_2)$ respectively, present the same bit as $l_k$ to switching elements which lie on the tree common to $(1, L, U_1)$ and $(1, L, U_2)$.
However, if $U_1 < U_2$, the tree associated with $(1, L, U_1)$ contains only links which are shared with the $(1, L, U_2)$ tree. Therefore, if the $(1, L, U_1)$ broadcast is preceded by that for $(1, L, U_2)$, the value of $L$ need not be transmitted, provided the values of $l_k$ are stored in the relevant switch elements. In the routing tag assignment [...], the broadcasts on successive iterations of the [...] involve a [...] fanout, with an unchanging lower address. [...]
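The property argued in this appendix can be checked numerically. The following sketch uses Boolean interval splitting as it is commonly described for Lee's copy network, not the equations of this appendix (which are not reproduced in this text). It assumes that stage k inspects the k-th most significant address bit and that a node is identified by the sequence of outputs taken to reach it. The function names are hypothetical.

```python
def split(lower, upper, k, n_bits):
    """One switch element at stage k: return {output: (lower, upper)}."""
    shift = n_bits - 1 - k
    l_bit = (lower >> shift) & 1
    u_bit = (upper >> shift) & 1
    if l_bit == u_bit:
        return {l_bit: (lower, upper)}          # whole interval goes one way
    # l_bit = 0 and u_bit = 1: copy the packet and split the interval
    boundary = (upper >> shift) << shift        # shared prefix, then 1, then zeros
    return {0: (lower, boundary - 1), 1: (boundary, upper)}


def inspected_bits(lower, upper, n_bits):
    """Broadcast [lower, upper]; return ({node: lower bit inspected}, destinations)."""
    bits = {}
    frontier = [((), lower, upper)]
    for k in range(n_bits):
        shift = n_bits - 1 - k
        next_frontier = []
        for path, lo, hi in frontier:
            bits[path] = (lo >> shift) & 1
            for out, (new_lo, new_hi) in split(lo, hi, k, n_bits).items():
                next_frontier.append((path + (out,), new_lo, new_hi))
        frontier = next_frontier
    return bits, sorted(lo for _, lo, _ in frontier)


small_bits, small_dest = inspected_bits(6, 8, 4)    # (1, L, U1) with U1 = 8
large_bits, large_dest = inspected_bits(6, 11, 4)   # (1, L, U2) with U2 = 11
shared = all(large_bits[node] == bit for node, bit in small_bits.items())
print(small_dest, large_dest, shared)
```

In this example every node visited by the smaller broadcast is also visited by the larger one, and both inspect the same lower address bit there, which is the condition under which a switch element could reuse a stored bit instead of receiving the lower address again.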
