A method for path allocation is described for use with three-stage ATM switches which feature multiple channels between the switch modules in adjacent stages. The method is suited to hardware implementation using parallelism to achieve a very short execution time. This allows path allocation to be performed anew in each time slot. A detailed description of the necessary hardware is presented. This hardware counts the number of cells requesting each output module, allocates a path through the intermediate stage of the switch to each cell, and generates a routing tag for each cell, indicating the path assigned to it.
necessary, to select among the available paths from source to destination.
We distinguish between two time scales over which paths may be allocated. In one approach, all cells belonging to a virtual circuit are allocated the same path. Thus path allocation is performed at call setup time, and this path is allocated for the duration of the call. In the second approach, path allocation is performed independently in each time slot, and so the path is allocated for the duration of one time slot only.
We refer to these two approaches as path allocation at call level, and path allocation at cell level, respectively.
We consider below the problem of implementing a cell-level algorithm for path allocation in the channel-grouped three-stage network of Fig. 1. The hardware implementation must be such that the resulting circuitry is not required to operate at a prohibitively high speed. In practice, this means that the parallelism in the hardware must be maximised. Our motive for adopting a channel-grouped architecture is that it reduces the execution speed required of the path allocation hardware. The use of channel grouping can also improve performance in ATM switches [7, 8]. The path allocation algorithm and the hardware necessary to implement it are described below.
2:
An algorithm for path allocation at cell level.
2.1:
The objectives of a path allocation algorithm.
There are S1 routes 
3:
A new algorithm for path allocation.
3.1:
Basic principles.
A new and efficient algorithm will now be described. It is suitable for use in a channel-grouped three-stage switch and requires only knowledge obtainable at the input side of the switch. The key to its high performance is the encoding of data concerning the availability of paths into binary words ( Kij : the number of requests from input module i for output module j.
Note that Air and Kij need only be local to the input module. The Brj's must be forwarded to each input module in turn. This is performed by a ring structure connecting each input module. Such an arrangement is shown in Fig. 2. The data required by a processor for the next iteration of the algorithm should be available locally, or from adjacent processors.
An implementation satisfying these two constraints will now be described.
3.2:
Implementation of the algorithm
The algorithm requires a total of L1.L2 processors.
The mechanism for passing information to an input cell concerning the path allocated to it will be described in section 3.5. The hardware layout for the case where m = 4 is shown in Fig. 2, which illustrates the array of sixteen processors required, and the contents of their registers during iteration zero of the algorithm. Each row in Fig. 2 contains four processors, which are co-located with the corresponding input module. Each column in Fig. 2 processes requests for a single output module. Thus, for example, the processor in row one and column two of the array handles requests for cells to be routed from input module one to output module two. A total of 64 paths is available through the switch (four for each input-output module pair). The sixteen processors attempt to allocate cells to sixteen of these paths during each iteration. After each iteration, the updated value of Air is passed to the adjacent processor in the same row, and the updated value of Brj is passed to the adjacent processor in the same column. The directions of data flow are indicated by arrows in Fig. 2. No two processors can allocate a path sharing a channel in the same iteration. Nevertheless, after four iterations, all possible paths have been allocated.
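The behaviour of the processor array can be illustrated with a short software sketch. It is not the hardware design itself: it assumes a square array (L1 = L2 = m), and the schedule r = (i + j - t) mod m is an assumed indexing chosen so that, in any one iteration, no two processors in the same row or column work on the same intermediate module. The names `allocate_paths`, `A`, `B` and `K` are illustrative.

```python
def allocate_paths(requests, s1, s2):
    """Sketch of cell-level path allocation on an m x m processor array."""
    m = len(requests)
    # A[i][r]: free channels from input module i to intermediate module r.
    A = [[s1] * m for _ in range(m)]
    # B[r][j]: free channels from intermediate module r to output module j.
    B = [[s2] * m for _ in range(m)]
    # K[i][j]: requests from input module i for output module j still unserved.
    K = [row[:] for row in requests]
    grants = {}  # (i, r, j) -> number of cells routed from i to j via r
    for t in range(m):  # one pass of this loop models one iteration
        for i in range(m):
            for j in range(m):
                # Assumed schedule: a different r in every row and column.
                r = (i + j - t) % m
                # Minimum of three numbers, then three subtractions.
                x = min(A[i][r], B[r][j], K[i][j])
                A[i][r] -= x
                B[r][j] -= x
                K[i][j] -= x
                if x:
                    grants[(i, r, j)] = x
    return grants, K
```

In hardware the m x m processors of one iteration operate simultaneously; the nested loops give the same result here because, under the assumed schedule, no two processors touch the same A or B entry within an iteration. The returned K holds the number of cells that lost contention for each input-output module pair.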
3.3:
The processing element.
The processor must execute the above procedure, and thus must perform two types of operation:
1. Find the minimum of three numbers.
2. Perform three subtractions.
Fig. 3 shows a possible implementation of the processing element, which uses bit-serial arithmetic.
Determination of the minimum requires values to be presented most significant bit first, while bit-serial subtraction requires values which are presented least significant bit first. Hence the processor must be able to perform bit reversal on the quantities processed. A bit-parallel implementation avoids this difficulty, at the cost of increased complexity.
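The opposite bit orderings can be seen in a minimal software model of the two bit-serial operations. This is a sketch under assumed conventions (fixed word width, unsigned values, minuend not smaller than subtrahend), not a description of the circuit in Fig. 3; all function names are hypothetical.

```python
def to_bits_msb_first(value, width):
    """Unsigned value as a list of bits, most significant bit first."""
    return [(value >> k) & 1 for k in range(width - 1, -1, -1)]


def from_bits_msb_first(bits):
    """Inverse of to_bits_msb_first."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def serial_min3(a_bits, b_bits, c_bits):
    """Minimum of three MSB-first bit streams, one bit per clock."""
    alive = [True, True, True]  # operands still tied for the minimum
    out = []
    for bits in zip(a_bits, b_bits, c_bits):
        # The minimum's bit is 0 if any operand still in contention has 0.
        bit = min(b for b, live in zip(bits, alive) if live)
        alive = [live and b == bit for b, live in zip(bits, alive)]
        out.append(bit)
    return out


def serial_sub(x_bits_lsb, y_bits_lsb):
    """x - y on LSB-first streams using a one-bit borrow (assumes x >= y)."""
    borrow = 0
    out = []
    for x, y in zip(x_bits_lsb, y_bits_lsb):
        d = x - y - borrow
        out.append(d & 1)
        borrow = 1 if d < 0 else 0
    return out


# The minimum emerges MSB first and must be reversed before subtraction.
a, b, c = (to_bits_msb_first(v, 3) for v in (5, 3, 6))
minimum = serial_min3(a, b, c)
difference_lsb = serial_sub(a[::-1], minimum[::-1])
```

The explicit `[::-1]` reversals correspond to the bit reversal that the bit-serial processor must perform between the two operations.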
The processor design will involve a trade-off of circuit complexity against operating speed, since there is an upper bound on the permissible execution time.
The time within which the algorithm is required to execute depends on whether cells losing contention are discarded or are queued until the next time slot.
Consider the case where cells losing contention join an input queue. The queue controller, when it submits a cell to the path allocation process, retains a copy in the input buffer. It then awaits an acknowledgement signal from the path allocation hardware, indicating whether the cell has been allocated a path through the switch, or has been discarded. It then submits the copy cell (if the original cell was discarded) or it purges the copy cell and submits the next cell in the input buffer (if the first cell was successfully routed). The acknowledgement must be returned within the duration of one time slot so that cells can be submitted to the switch in successive time slots. Hence the time taken for the path allocation process to execute should be less than the duration of one time slot. This stringent requirement could be relaxed if preservation of cell sequence was not mandatory, since a cell losing contention could then rejoin the queue in a later time slot.
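The retain-and-acknowledge behaviour of the queue controller can be sketched as follows. The class and method names are hypothetical, and the model ignores timing; it only shows that the head-of-line copy is purged on a positive acknowledgement and resubmitted otherwise.

```python
from collections import deque


class InputQueueController:
    """Sketch of an input queue that keeps a copy until acknowledged."""

    def __init__(self, cells):
        self.buffer = deque(cells)

    def submit(self):
        """Offer the head-of-line cell while keeping its copy buffered."""
        return self.buffer[0] if self.buffer else None

    def acknowledge(self, allocated):
        """Purge the copy if a path was allocated; otherwise keep it."""
        if allocated and self.buffer:
            self.buffer.popleft()


queue = InputQueueController(["cell-a", "cell-b"])
first = queue.submit()
queue.acknowledge(False)   # cell lost contention: copy is retained
retry = queue.submit()
queue.acknowledge(True)    # cell routed: copy is purged
following = queue.submit()
```

Because the same cell is offered again until it succeeds, cell sequence is preserved, which is why the acknowledgement has to arrive within one time slot.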
No acknowledgements are required if there is no input queueing. Hence the path allocation process need not execute within one time slot. However, it must still be possible to submit cells to the switch in successive time slots. Additional copies of the path allocation hardware are required to ensure this, equal in number to the execution time of the path allocation process in time slots. For example, if the path allocation process has an execution time of two time slots, cells arriving during even-numbered time slots will be processed by one copy of the hardware, and cells arriving during odd-numbered time slots will be processed by another. Hence a tradeoff may be performed during switch design between processor speed and the number of processors required. Omitting the input queues has the additional advantage that no hardware is required to generate the acknowledgements.
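The assignment of time slots to hardware copies can be written down directly. In this sketch the rounding up of a non-integer execution time is an assumption, and the function names are illustrative.

```python
import math


def copies_required(execution_time_slots):
    """Hardware copies needed so that a new slot can start every slot."""
    return math.ceil(execution_time_slots)


def copy_for_slot(slot, execution_time_slots):
    """Index of the hardware copy that serves a given time slot."""
    return slot % copies_required(execution_time_slots)
```

With an execution time of two time slots, `copy_for_slot` sends even-numbered slots to copy 0 and odd-numbered slots to copy 1, as in the example above.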
3.4: Counting requests
Hardware is also needed in each input module to perform the following tasks before and during the path allocation process: to count the number of requests for each output module, so as to obtain the initial values of the Kij's; and to forward a routing tag, based on the results of path allocation, to each input cell.
The counting of requests can be performed by the hardware of Fig. 4. This merges the input cells with a set of control packets, one for each output module, in a Batcher sorter (with n1+m inputs and outputs), in the manner described in [12]. Idle inputs submit an inactive packet to the sorter. The sorter output contains (starting at the lowest-numbered output in Fig. 4) the control packet for output module 0, followed by all the data cells intended for output module 0, followed by the control packet for output module 1, etc. The inactive packets are sorted to the highest-numbered outputs.
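The effect of merging and sorting can be illustrated in software: once control packets, data cells and inactive packets are in sorted order, the run of data cells behind each control packet gives the request count for that output module. The sketch below is an illustration under assumed encodings (packets as tuples, a library sort standing in for the Batcher network); `count_requests` is a hypothetical name.

```python
def count_requests(cell_destinations, num_output_modules, num_inputs):
    """Requests per output module for one input module (software sketch)."""
    # Assumed encoding: (module, 0) is a control packet, (module, 1) a data
    # cell; inactive packets use a module number beyond the valid range so
    # that they sort to the highest-numbered outputs.
    inactive = (num_output_modules, 0)
    packets = [(j, 0) for j in range(num_output_modules)]
    packets += [(d, 1) for d in cell_destinations]
    packets += [inactive] * (num_inputs - len(cell_destinations))
    packets.sort()  # stands in for the Batcher sorter

    counts = [0] * num_output_modules
    current = None
    for module, kind in packets:
        if kind == 0:
            # Control or inactive packet: start a new run (or none at all).
            current = module if module < num_output_modules else None
        elif current is not None:
            # Data cell: extend the run for the current output module.
            counts[current] += 1
    return counts
```

For example, `count_requests([1, 0, 1, 3], 4, 6)` models six inputs, two of them idle, and four output modules.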
The address generators in Fig. 4 serve different purposes during the counting of requests and in routing tag assignment.
When counting requests, they determine the type of packet which is present at the corresponding output of the Batcher network, and generate a bit (the identity bit) which is 1 for a control packet or an inactive packet and 0 for a data packet (cell). A copy of the identity bit is stored in a one-bit register which is connected to its neighbours in adjacent address generators in such a way as to form a shift register.
These bits are shifted n1+m times (in the direction shown in Fig. 4) from address generator to address generator, and hence into a counter of wordlength ⌈log2(m)⌉ which is reset upon receiving a 1 (a control packet or inactive packet) and incremented on receipt of a 0 (data cell).
This token passing algorithm is easily implemented in hardware. Each address generator stores data concerning tokens as a routing packet containing two fields, which are the token address (indicating the address of the intermediate switch module to which access is being granted) and the token count (indicating the number of such tokens).
Pass Five: One (i.e., 7-3-1-0-2) null token is received, indicating that one cell has lost contention.
At the end of Pass Five, a route through each intermediate switch module has been assigned to the correct number of cells, as shown in Fig. 5. In particular, no route is assigned via ISM3. The key to achieving the required result was the correct choice of token count for each pass of the algorithm. The sequence chosen was (7,4,3,3,1). This is identical to the sequence of K01 values generated by processor during the path allocation process, i.e., it is equal to the number of cells not yet allocated a path after each cycle of path allocation. Thus, the necessary sequence of token counts may be obtained from the TCout output of the processor, shown in Fig. 3.
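The relationship between the token-count sequence and the number of cells routed in each pass is a simple differencing, sketched below. The function name is hypothetical; the only input is the sequence of not-yet-allocated cell counts quoted above.

```python
def tokens_per_pass(remaining):
    """Cells granted a route in each pass, plus the final null tokens."""
    # Each pass allocates the drop in the number of unallocated cells.
    granted = [a - b for a, b in zip(remaining, remaining[1:])]
    # Whatever is still unallocated at the end has lost contention.
    null_tokens = remaining[-1]
    return granted, null_tokens


granted, null_tokens = tokens_per_pass([7, 4, 3, 3, 1])
```

For the sequence (7,4,3,3,1) this gives 3, 1, 0 and 2 cells in the first four passes and one null token, in agreement with the expression 7-3-1-0-2 given for Pass Five.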
Each pass of the algorithm can commence after the first iteration of the preceding pass. Hence the execution time increases only slowly with the number of passes. Routing tag assignment can thus be carried out as follows.
The routing packet generator associated with processor Xij generates a routing packet, concurrently with each iteration of the path allocation algorithm.
The token address is the value of r, the intermediate switch module through which the processor is attempting to route cells. This value can be easily generated by a counter decremented after every iteration of the path allocation algorithm. The token count is set to the value of Kij. This routing packet is forwarded to the address generator associated with control packet j through the Batcher network in Fig. 4.
Each address generator performs the actions illustrated in Fig. 6 after every iteration of the path allocation algorithm. Upon completion of the path allocation algorithm, the value of Kij is equal to the number of cells which have lost contention. This is forwarded to the relevant cells in a special routing packet, whose token address field indicates that these are null tokens, which flag the corresponding cells as having lost contention.
Upon completion of this process, the token address and token count values stored by each address generator comprise a unique routing tag, which is then prefixed to the associated data cell. The cell is submitted to the first stage of the switch, and is thereby routed to the appropriate intermediate switch module.
The operation of this algorithm for the example considered earlier is shown in Table II. The data stored in each address generator at the end of each iteration of the algorithm are shown. After nine iterations, the routing tags have been successfully assigned.
This routing assignment algorithm has the benefit of simplicity but its execution time is quite long. However, it operates in parallel with the path allocation process, although it takes longer to execute, because of the delay in propagating routing packets through the address generators.
4:
A design example.
A 3072 x 8192 switch can be constructed by choosing L1 = L2 = m = 32, n1 = 96, n2 = 256, S1 = 4 and S2 = 8 in the switch shown in Fig. 1. The probability of cell loss due to non-allocation of paths through this switch has been obtained by simulation. The simulation model assumes that all switch inputs have a 100% load, that traffic is uniformly distributed among the output modules, and that cells not allocated paths on the first attempt are discarded. The resulting figure for cell loss probability is below [...]. The [...] 290 MHz (neglecting any speed-up required to make good propagation delays). The required clock speed could be reduced using a more complex processor design (e.g., using bit-parallel arithmetic). An alternative is to construct two copies of the hardware, which process requests in alternate time slots. The clock rate required in this case should be below 150 MHz.
5:
Conclusions.
A new algorithm for path allocation in three-stage broadband networks has been described. A complete hardware implementation of this algorithm has been presented, including a method for generating the initial data required by the algorithm, and for forwarding the results to each cell at the input side of the switch, in the form of a routing tag.
