Path allocation in a three-stage broadband switch with intermediate channel grouping by Collier, Martin & Curran, Thomas
Path Allocation in a Three-Stage Broadband Switch with Intermediate 
Channel Grouping 
Martin Collier and Tommy Curran 
School of Electronic Engineering, 
Dublin City University, Glasnevin, Dublin 9, Ireland. 
Abstract 
A method for path allocation is described for use with 
three-stage ATM switches which feature multiple 
channels between the switch modules in adjacent 
stages. The method is suited to hardware 
implementation using parallelism to achieve a very 
short execution time. This allows path allocation to be 
performed anew in each time slot. A detailed 
description of the necessary hardware is presented. 
This hardware counts the number of cells requesting 
each output module, allocates a path through the 
intermediate stage of the switch to each cell, and 
generates a routing tag for each cell, indicating the 
path assigned to it. 
1: Introduction. 
A range of designs has been proposed for broadband 
switching (e.g., [1-4]). Many of these proposals are 
only practical in the design of small switches. For 
example, the number of switch elements in the 
Sunshine switch [3] becomes excessive as the switch 
size increases [5].  A different approach must be taken 
to the design of large switches. 
An obvious method of implementing a large switch, 
given these constraints on switch size, is to design the 
switch with multiple stages, where each stage consists 
of smaller switch modules. The Clos network [6] 
exemplifies a switch of this type. This solution 
typically introduces a new problem whereby multiple 
paths from source to destination become available. 
Thus, even if the individual switch modules possess the 
self-routing feature, this feature is not retained by the 
overall switch. Some method of path allocation is then 
necessary, to select among the available paths from 
source to destination. 
We distinguish between two time scales over which 
paths may be allocated. In one approach, all cells 
belonging to a virtual circuit are allocated the same 
path. Thus path allocation is performed at call setup 
time, and this path is allocated for the duration of the 
call. In the second approach, path allocation is 
performed independently in each time slot, and so the 
path is allocated for the duration of one time slot only. 
We refer to these two approaches as path allocation at 
call level, and path allocation at cell level, 
respectively. 
We consider below the problem of implementing a 
cell-level algorithm for path allocation in the channel- 
grouped three stage network of Fig. 1. The hardware 
implementation must be such that the resulting 
circuitry is not required to operate at a prohibitively 
high speed. In practice, this means that the parallelism 
in the hardware must be maximised. Our motive for 
adopting a channel-grouped architecture is that it 
reduces the execution speed required of the path 
allocation hardware. The use of channel grouping can 
also improve performance in ATM switches [7,8]. The 
path allocation algorithm and the hardware necessary to 
implement it are described below. 
2: An algorithm for path allocation at cell level. 
2.1: The objectives of a path allocation algorithm. 
There are S1 routes from each input module to each 
intermediate switch module. There are S2 routes from 
each intermediate switch module to each output 
module. We must choose, for every input cell (if 
possible) an intermediate switch module through which 
to pass on the way to the selected destination, such that 
0743-166W93 $03.00 0 1993 IEEE 927 
Authorized licensed use limited to: DUBLIN CITY UNIVERSITY. Downloaded on July 19,2010 at 09:41:33 UTC from IEEE Xplore.  Restrictions apply. 
no input module attempts to route more than S1 cells 
via any intermediate switch module, and no 
intermediate switch module attempts to route more 
than S2 cells to any output module, in any one time 
slot. 
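The two constraints above can be expressed as a check on a candidate allocation. The following is a minimal Python sketch, not part of the hardware described in this paper. The function name `is_valid_allocation` and the representation of an allocation as a list of (i, r, j) triples, one per routed cell, are assumptions made for illustration.

```python
from collections import Counter

def is_valid_allocation(paths, s1, s2):
    """Check one time slot's allocation against the channel-group limits.

    paths: list of (i, r, j) triples, one per routed cell, meaning
           input module i -> intermediate module r -> output module j.
    s1:    channels between each input module and each intermediate module.
    s2:    channels between each intermediate module and each output module.
    """
    # Cells carried on each input-to-intermediate channel group.
    first_hop = Counter((i, r) for i, r, _ in paths)
    # Cells carried on each intermediate-to-output channel group.
    second_hop = Counter((r, j) for _, r, j in paths)
    return (all(n <= s1 for n in first_hop.values())
            and all(n <= s2 for n in second_hop.values()))
```

An allocation that passes this check never requires queueing in the intermediate stage, which is the property the preferred strategy relies on.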
Note that, in this problem, an attempt is made to 
reserve bandwidth for each input cell, such that it can 
pass through the intermediate stage without blocking. 
An alternative, and simpler, strategy would not test for 
the availability of a path from the intermediate stage to 
the output stage, and would select intermediate stage 
modules based on some simpler criterion (e.g. random 
selection). The former strategy is preferred for a 
number of reasons: 
(i) no queueing occurs in the intermediate stage: 
thus the delay through the intermediate stage is 
uniform, regardless of the path taken; this makes 
it possible to preserve cell sequence on a virtual 
circuit; 
(ii) the intermediate stage can never be congested; 
(iii) intermediate stage modules can be of simple 
design, since contention cannot occur. 
2.2: Existing algorithms for cell-level path allocation. 
A number of solutions to this problem have been 
proposed, in the special case where S1 = S2 = 1 [9-11]. 
These algorithms typically use one bit to represent 
each channel in the switch. Thus the number of bits 
being processed by the path allocation algorithm is 
large for a large switch. The algorithm in [9] was 
adapted to handle architectures with intermediate 
channel grouping in [8]. This adaptation increases the 
complexity, and thus the execution time, of the 
algorithm. 
It is possible to reduce the execution time of the 
path allocation algorithm, for a given intermediate 
stage bandwidth, through the use of intermediate 
channel grouping, as demonstrated below. 
3: A new algorithm for path allocation. 
3.1: Basic principles. 
A new and efficient algorithm will now be 
described. It is suitable for use in a channel-grouped 
three-stage switch and requires only knowledge 
obtainable at the input side of the switch. The key to its 
high performance is the encoding of data concerning 
the availability of paths into binary words (thereby 
reducing the number of bits to be processed by the 
algorithm for a given switch size) and the extensive use 
of parallelism. It operates on the following quantities: 
Air : the number of channels available from input 
module i to intermediate switch module r; 
Brj : the number of channels available from 
intermediate switch module r to output module j; 
Kij : the number of requests from input module i for 
output module j. 
Note that Air and Kij need only be local to the input 
module. The Brj's must be forwarded to each input 
module in turn. This is performed by a ring structure 
connecting each input module. Such an arrangement is 
shown in Fig. 2, where each row is located at an input 
module. Let Rirj be the number of cells to be routed 
from input module i to output module j via 
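intermediate switch module r. The values of Air, Brj 
and Kij are updated using the procedure atomic(i,r,j) 
described below: 
$$
\begin{aligned}
R_{irj} &= \min(K_{ij}, B_{rj}, A_{ir}) \\
K_{ij} &\leftarrow K_{ij} - R_{irj} \\
B_{rj} &\leftarrow B_{rj} - R_{irj} \\
A_{ir} &\leftarrow A_{ir} - R_{irj}
\end{aligned}
$$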
This procedure is 'atomic' in the sense that it is the 
basic building block from which the path allocation 
algorithm is constructed. The procedure determines the 
capacity available from input module i to output 
module j via intermediate switch module r (i.e. the 
minimum of Air and Brj). The number of requests 
which can be satisfied is equal to the minimum of the 
number of requests outstanding (Kij) and the available 
capacity. 
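As an illustration, the atomic() procedure can be sketched in software. This is a minimal Python model, assuming that A, B and K are held as nested lists indexed as A[i][r], B[r][j] and K[i][j], a layout chosen here for convenience. The three decrements correspond to the three subtractions listed for the processing element in section 3.3.

```python
def atomic(A, B, K, i, r, j):
    """Route as many cells as possible from input module i to output
    module j via intermediate module r, updating the state in place.

    A[i][r]: channels still free from input module i to intermediate r.
    B[r][j]: channels still free from intermediate r to output module j.
    K[i][j]: requests from input module i for output module j not yet routed.
    Returns R_irj, the number of cells routed on this path.
    """
    routed = min(K[i][j], B[r][j], A[i][r])
    K[i][j] -= routed
    B[r][j] -= routed
    A[i][r] -= routed
    return routed
```

For example, with seven outstanding requests, eight free second-hop channels and four free first-hop channels, the call routes four cells and leaves three requests outstanding.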
A sequential implementation of the path allocation algorithm requires the repeated execution of atomic(i,r,j) on a single processor for all possible values of i, r and j. Initially Kij is set equal to the total number of requests from input module i for output module j (which number is obtained by examining the requests at the switch module inputs), Air is set equal to S1 and Brj is set equal to S2. 
A parallel implementation requires multiple processors, each executing the atomic() procedure for a different set of procedure parameters, subject to the following constraints: 
• No two processors shall simultaneously require access to the same quantity. For example, atomic(2,0,0) uses A20, B00 and K20, so that atomic(2,0,X), atomic(2,X,0) and atomic(X,0,0) cannot be executed concurrently with atomic(2,0,0) for any X. 
• The data required by a processor for the next iteration of the algorithm should be available locally, or from adjacent processors. 
An implementation satisfying these two constraints 
will now be described. 
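Before turning to the parallel implementation, the sequential version can serve as a reference model. The sketch below is a Python illustration only. The function name `allocate_sequential`, the nested-list layout, and the loop order (i, then r, then j) are assumptions, since the text does not prescribe an order for the sequential case.

```python
def allocate_sequential(requests, m, s1, s2):
    """Run atomic(i, r, j) for every (i, r, j) on a single processor.

    requests[i][j]: number of cells at input module i destined for
                    output module j (the initial K_ij).
    m:              number of intermediate switch modules.
    Returns (routed, K), where routed maps (i, r, j) to the number of
    cells assigned to that path and K holds the unserved requests.
    """
    num_in = len(requests)
    num_out = len(requests[0])
    A = [[s1] * m for _ in range(num_in)]    # A_ir initialised to S1
    B = [[s2] * num_out for _ in range(m)]   # B_rj initialised to S2
    K = [row[:] for row in requests]         # K_ij initialised to the request counts
    routed = {}
    for i in range(num_in):
        for r in range(m):
            for j in range(num_out):
                n = min(K[i][j], B[r][j], A[i][r])
                K[i][j] -= n
                B[r][j] -= n
                A[i][r] -= n
                if n:
                    routed[(i, r, j)] = n
    return routed, K
```

Any non-zero entry left in K after the run corresponds to cells that lose contention in that time slot.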
3.2: Implementation of the algorithm 
The algorithm requires a total of L1·L2 processors. The detailed operation of the algorithm depends on the number of switch modules in each stage. The simplest case, where L1 = L2 = m, is described here, but only minor modifications to this algorithm will be required if the values of L1, L2 and m are not equal. 
Processor Xij is initialised by loading the following three values: 
(i) the initial value of Kij; 
(ii) the initial value of Ai,(i+j) mod m (i.e., S1); 
(iii) the initial value of B(i+j) mod m, j (i.e., S2). 
The algorithm then requires m iterations (iterations zero through m-1). Processor Xij executes atomic(i, (i+j-k) mod m, j) during iteration k. After each iteration Xij forwards the updated value of Brj to X(i+1) mod m, j and of Air to Xi,(j+1) mod m, and retains Kij. 
The mechanism for passing information to an input 
cell concerning the path allocated to it will be 
described in section 3.5. The hardware layout for the 
case where m = 4 is shown in Fig. 2, which illustrates 
the array of sixteen processors required, and the 
contents of their registers during iteration zero of the 
algorithm. Each row in Fig. 2 contains four processors, 
which are co-located with the corresponding input 
module. Each column in Fig. 2 processes requests for a 
single output module. Thus, for example, the processor 
in row one and column two of the array handles 
requests for cells to be routed from input module one to 
output module two. A total of 64 paths is available 
through the switch (four for each input-output module 
pair). The sixteen processors attempt to allocate cells to 
sixteen of these paths during each iteration. After each 
iteration, the updated value of Air is passed to the 
adjacent processor in the same row, and the updated 
value of Brj is passed to the adjacent processor in the 
same column. The directions of data flow are indicated 
by arrows in Fig. 2. No two processors can allocate a 
path sharing a channel in the same iteration. 
Nevertheless, after four iterations, all possible paths 
have been allocated. 
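The two properties claimed for this schedule (no shared quantity within an iteration, and full coverage after m iterations) can be checked mechanically. The following Python sketch is an illustration for the L1 = L2 = m case. The function names `schedule` and `conflict_free` are hypothetical.

```python
def schedule(m):
    """For each iteration k, list the (i, r, j) triple handled by every
    processor X_ij, using r = (i + j - k) mod m."""
    return [[(i, (i + j - k) % m, j) for i in range(m) for j in range(m)]
            for k in range(m)]

def conflict_free(step):
    """True if no two processors in one iteration touch the same A_ir or
    the same B_rj. K_ij is private to processor X_ij by construction."""
    a_used = {(i, r) for i, r, _ in step}
    b_used = {(r, j) for _, r, j in step}
    return len(a_used) == len(step) and len(b_used) == len(step)

steps = schedule(4)
all_conflict_free = all(conflict_free(step) for step in steps)
paths_covered = {triple for step in steps for triple in step}
```

For m = 4 every iteration is conflict-free and the four iterations together visit all 64 (i, r, j) combinations, in agreement with the description of Fig. 2.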
3.3: The processing element. 
The processor must execute the atomic() procedure, and thus must perform two types of operation: 
1. Find the minimum of three numbers. 
2. Perform three subtractions. 
Fig. 3 shows a possible implementation of the processing element, which uses bit-serial arithmetic. Determination of the minimum requires values to be presented most significant bit first, while bit-serial subtraction requires values which are presented least significant bit first. Hence the processor must be able to perform bit reversal on the quantities processed. A bit-parallel implementation avoids this difficulty, at the cost of increased complexity. 
The processor design will involve a trade-off of circuit complexity against operating speed, since there is an upper bound on the permissible execution time. The time within which the algorithm is required to execute depends on whether cells losing contention are discarded or are queued until the next time slot. 
Consider the case where cells losing contention join an input queue. The queue controller, when it submits a cell to the path allocation process, retains a copy in the input buffer. It then awaits an acknowledgement signal from the path allocation hardware, indicating whether the cell has been allocated a path through the switch, or has been discarded. It then submits the copy cell (if the original cell was discarded) or it purges the copy cell and submits the next cell in the input buffer (if the first cell was successfully routed). The acknowledgement must be returned within the duration of one time slot so that cells can be submitted to the switch in successive time slots. Hence the time taken for the path allocation process to execute should be less than the duration of one time slot. This stringent requirement could be relaxed if preservation of cell sequence were not mandatory, since a cell losing contention could then rejoin the queue in a later time slot. 
No acknowledgements are required if there is no input queueing. Hence the path allocation process need not execute within one time slot. However, it must still be possible to submit cells to the switch in successive time slots. Additional copies of the path allocation hardware are required to ensure this, equal in number to the execution time of the path allocation process in time-slots. For example, if the path allocation process has an execution time of two time slots, cells arriving during even-numbered time slots will be processed by one copy of the hardware, and cells arriving during odd-numbered time slots will be processed by another. Hence a tradeoff may be performed during switch design between processor speed and the number of processors required. Omitting the input queues has the additional advantage that no hardware is required to generate the acknowledgements. 
3.4: Counting requests 
Hardware is also needed in each input module to perform the following tasks before and during the path allocation process: 
• to count the number of requests for each output module so as to obtain the initial values of the Kij's; 
• to forward a routing tag based on the results of path allocation to each input cell. 
The counting of requests can be performed by the 
hardware of Fig. 4. This merges the input cells with a 
set of control packets, one for each output module, in a 
Batcher sorter (with n1+m inputs and outputs), in the 
manner described in [12]. Idle inputs submit an inactive 
packet to the sorter. The sorter output contains 
(starting at the lowest-numbered output in Fig. 4) the 
control packet for output module 0, followed by all the 
data cells intended for output module 0, followed by 
the control packet for output module 1, etc. The 
inactive packets are sorted to the highest-numbered 
outputs. 
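The ordering produced by the sorter, and the request counts that follow from it, can be modelled functionally. The Python sketch below is an illustration only: Python's `sorted` stands in for the Batcher network, a plain loop stands in for the counting hardware, and the names `sorter_output` and `count_requests` and the tuple encoding of packets are assumptions.

```python
def sorter_output(cell_destinations, num_out, n1):
    """Model the sorted packet sequence at one input module.

    cell_destinations: output-module index for each active input (at most n1).
    num_out:           number of output modules (one control packet each).
    Packets are (destination, rank, kind) tuples so that, for each output
    module, the control packet sorts ahead of its data cells, and idle
    packets sort to the highest-numbered outputs.
    """
    packets = [(j, 0, "control") for j in range(num_out)]
    packets += [(j, 1, "cell") for j in cell_destinations]
    packets += [(num_out, 2, "idle")] * (n1 - len(cell_destinations))
    return sorted(packets)

def count_requests(sorted_packets, num_out):
    """Count the data cells that follow each control packet."""
    counts = [0] * num_out
    current = None
    for dest, _, kind in sorted_packets:
        if kind == "control":
            current = dest
        elif kind == "cell":
            counts[current] += 1
    return counts

# Four cells for output module 0 and seven for output module 1.
example = sorter_output([1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1], num_out=4, n1=12)
example_counts = count_requests(example, num_out=4)
```

In this example the control packet for output module 1 appears at sorter output five and its seven data cells at outputs six through twelve.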
The address generators in Fig. 4 serve different 
purposes during the counting of requests and in routing 
tag assignment. When counting requests, they 
determine the type of packet which is present at the 
corresponding output of the Batcher network, and 
generate a bit (the identity bit) which is 1 for a control 
packet or an inactive packet and 0 for a data packet 
(cell). A copy of the identity bit is stored in a one-bit 
register which is connected to its neighbours in 
adjacent address generators in such a way as to form a 
shift register. 
These bits are shifted n1+m times (in the direction shown in Fig. 4) from address generator to address generator, and hence into a counter of wordlength ⌈log2(m)⌉ which is reset upon receiving a 1 (a control packet or inactive packet) and incremented on receipt of a 0 (data cell). Each time a 1 is received the counter contents are rotated into Ki,0, Ki,0 is rotated into Ki,1, etc. After n1+m shifts the appropriate values are stored in the Kij registers. 
3.5: Routing tag assignment 
The algorithm used for routing tag assignment is best described by means of an example. We assume that there are four intermediate switch modules, labelled ISM0 to ISM3, and consider how to assign routing tags to the cells from input module zero (IM0) which have requested output module one (OM1). 
Let us assume that four cells from IM0 have requested output module zero (OM0), and that seven cells from IM0 have requested OM1. Hence, the first thirteen outputs of the Batcher sorter are as shown in Fig. 5, after the data cells have been merged with control packets, as discussed in section 3.4. 
The results of the path allocation process for cells 
from IM0 requesting OM1 are produced by processor 
X01. Some means must be found to forward these 
results to the relevant cells, which appear at sorter 
outputs six through twelve. We assume that paths are 
allocated in the order shown in Table I. Thus one cell 
loses contention. 
Note that a connection has already been established 
from processor X01 (via the routing packet generator 
for OM1) to sorter output five, as shown in Fig. 5. Thus 
the relevant routing information may be easily 
forwarded to sorter output five. The problem remains 
of how to relay this information to the data cells at 
sorter outputs six through twelve. 
The solution to this problem would be trivial if all 
seven cells were to be allocated a route via the same 
intermediate switch module. The address generator at 
sorter output five could generate seven tokens, each 
granting access to that intermediate switch module. 
These seven tokens would be forwarded to sorter output 
six, where one would be seized. The remaining six 
tokens would be forwarded to sorter output seven. After 
seven iterations of this procedure, all seven data cells 
would possess the necessary token, stored in the 
associated address generator. 
This token passing algorithm is easily implemented 
in hardware. Each address generator stores data 
concerning tokens as a routing packet containing two 
fields, which are the token address (indicating the 
address of the intermediate switch module to which 
access is being granted) and the token count (indicating 
the number of such tokens). Routing packets are 
received by the address generators associated with 
control packets (such as that at output five in our 
example), from the appropriate routing packet 
generator. Other address generators receive a routing 
packet from their neighbours at the start of each 
iteration of the algorithm. If the token count is zero, 
this routing packet is discarded. Otherwise the token 
count is decremented, and the routing packet is stored 
in the address generator, and is forwarded to the 
adjacent generator at the start of the next iteration. 
The hardware required is thus very simple, with a 
unidirectional flow of data from address generator to 
address generator. Once all the tokens have been 
distributed (after seven iterations), subsequent 
iterations of the algorithm cause no changes in the 
route assignments, so that the number of iterations 
performed is not critical, provided it is not too low. 
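A single pass of the token passing procedure can be modelled one iteration at a time. The sketch below is a simplified Python illustration in which each address generator stores only the token address. The function name `run_single_pass` and the list representation of the address generator chain are assumptions.

```python
def run_single_pass(num_cells, address, count, iterations):
    """Pass `count` tokens for one intermediate module down a chain of
    address generators, one hop per iteration.

    Returns the token address stored by each address generator
    (None where no token was received).
    """
    stored = [None] * num_cells
    packet = (address, count)          # routing packet leaving the control packet's generator
    for position in range(min(iterations, num_cells)):
        addr, cnt = packet
        if cnt == 0:                   # zero token count: the packet is discarded
            break
        stored[position] = addr        # this generator seizes one token
        packet = (addr, cnt - 1)       # and forwards the remainder
    return stored
```

With seven cells and seven tokens, seven iterations give every cell a token, and running further iterations changes nothing. Running too few iterations leaves the later cells without a token, which is why the iteration count must not be too low.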
The same hardware may be used to solve the 
problem of assigning routing tags to the data cells when 
paths through more than one intermediate switch 
module have been allocated. The technique is to 
execute multiple passes of the above algorithm, each 
for a different token address, and each with an 
appropriately chosen value for the token count. An 
address generator may then receive a succession of 
routing packets, each with a different token address. 
However, only the last such packet received will 
contain the correct token. 
Five passes of the algorithm suffice to supply all the 
data cells with the correct token in the example of Fig. 
5, as shown below. 
Pass One: Seven tokens are received for ISM1. At the end of this pass, all seven cells have been assigned a route through ISM1. 
Pass Two: Four (i.e., 7-3) tokens are received for ISM0. At the end of this pass, four cells have been assigned a route through ISM0. Three cells (i.e., those at sorter outputs ten through twelve) have retained their tokens for ISM1. Thus, at the end of Pass Two, the correct number of cells has been assigned a route through ISM1. 
Pass Three: Three (i.e., 7-3-1) tokens are received for ISM3. 
Pass Four: Three (i.e., 7-3-1-0) tokens are received for ISM2. 
Pass Five: One (i.e., 7-3-1-0-2) null token is received, indicating that one cell has lost contention. 
At the end of Pass Five, a route through each 
intermediate switch module has been assigned to the 
correct number of cells, as shown in Fig. 5. In 
particular, no route is assigned via ISM3. The key to 
achieving the required result was the correct choice of 
token count for each pass of the algorithm. The 
sequence chosen was (7,4,3,3,1). This is identical to 
the sequence of K01 values generated by processor X01 
during the path allocation process, i.e., it is equal to the 
number of cells not yet allocated a path after each cycle 
of path allocation. Thus, the necessary sequence of 
token counts may be obtained from the TCout output of 
the processor, shown in Fig. 3. 
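The five-pass example can be reproduced in a few lines. The Python sketch below is illustrative only. The per-module path counts (3, 1, 0, 2 for ISM1, ISM0, ISM3, ISM2) are inferred from the running subtractions quoted in the pass descriptions, the null token is written as 'x', and the function names are hypothetical. The passes are applied one after another here, whereas the hardware overlaps them.

```python
def routing_packets(k_initial, allocation):
    """Derive the (token address, token count) pair for each pass.

    allocation: list of (module, paths_granted) in the order the
                intermediate modules were tried by the processor.
    The token count for a pass is the number of requests still
    outstanding before that module is tried; a final null-token
    packet covers the cells that lost contention.
    """
    packets = []
    k = k_initial
    for module, granted in allocation:
        packets.append((module, k))
        k -= granted
    packets.append(("x", k))
    return packets

def apply_passes(num_cells, packets):
    """Each pass overwrites the token held by the first `count` cells."""
    tags = [None] * num_cells
    for address, count in packets:
        for position in range(min(count, num_cells)):
            tags[position] = address
    return tags

packets = routing_packets(7, [("ISM1", 3), ("ISM0", 1), ("ISM3", 0), ("ISM2", 2)])
final_tags = apply_passes(7, packets)   # cells at sorter outputs six through twelve
```

The token counts come out as 7, 4, 3, 3, 1, and the final tags assign three cells to ISM1, one to ISM0, none to ISM3 and two to ISM2, with one cell holding a null token.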
Each pass of the algorithm can commence after the 
first iteration of the preceding pass. Hence the 
execution time increases only slowly with the number 
of passes. Routing tag assignment can thus be carried 
out as follows. 
The routing packet generator associated with 
processor Xij generates a routing packet, concurrently 
with each iteration of the path allocation algorithm. 
The token address is the value of r, the intermediate 
switch module through which the processor is 
attempting to route cells. This value can be easily 
generated by a counter decremented after every 
iteration of the path allocation algorithm. The token 
count is set to the value of Kij. This routing packet is 
forwarded to the address generator associated with 
control packet j through the Batcher network in Fig. 4. 
Each address generator performs the actions 
illustrated in Fig. 6 after every iteration of the path 
allocation algorithm. Upon completion of the path 
allocation algorithm, the value of Kij is equal to the 
number of cells which have lost contention. This is 
forwarded to the relevant cells in a special routing 
packet, whose token address field indicates that these 
are null tokens, which flag the corresponding cells as 
having lost contention. 
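The overlapped operation, in which a new routing packet enters the chain on every iteration, can be sketched as follows. This Python model is one reading of the prose description of the address generator actions, since Fig. 6 is not reproduced here. The function name `pipelined_assignment`, the separate `outgoing` register and the use of 'x' for the null token are assumptions.

```python
def pipelined_assignment(num_cells, packets):
    """Feed one routing packet per iteration into a chain of address
    generators and return the (token address, token count) pair each
    generator ends up storing.
    """
    tag = [None] * num_cells        # packet stored by each address generator
    outgoing = [None] * num_cells   # packet each generator forwards next iteration
    pending = list(packets)
    while pending or any(p is not None for p in outgoing):
        # Each generator receives what its upstream neighbour stored last time.
        first = pending.pop(0) if pending else None
        incoming = [first] + outgoing[:-1]
        for g in range(num_cells):
            packet = incoming[g]
            if packet is None or packet[1] == 0:
                outgoing[g] = None                 # nothing useful to forward
            else:
                stored = (packet[0], packet[1] - 1)
                tag[g] = stored                    # overwrite any earlier token
                outgoing[g] = stored
    return tag

example_packets = [("ISM1", 7), ("ISM0", 4), ("ISM3", 3), ("ISM2", 3), ("x", 1)]
example_tags = pipelined_assignment(7, example_packets)
```

The token addresses agree with the result of executing the five passes one after another, and each stored (address, count) pair is distinct. The number of iterations taken by this model depends on its own conventions and is not claimed to match Table II.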
Upon completion of this process, the token address 
and token count values stored by each address 
generator comprise a unique routing tag, which is then 
prefixed to the associated data cell. The cell is 
submitted to the first stage of the switch, and is thereby 
routed to the appropriate intermediate switch module. 
The operation of this algorithm for the example 
considered earlier is shown in Table II. The data stored 
in each address generator at the end of each iteration of 
the algorithm are shown. After nine iterations, the 
routing tags have been successfully assigned. 
This routing assignment algorithm has the benefit of 
simplicity but its execution time is quite long. 
However, it operates in parallel with the path allocation 
process, although it takes longer to execute, because of 
the delay in propagating routing packets through the 
address generators. 
4: A design example. 
A 3072 x 8192 switch can be constructed by 
choosing L1 = L2 = m = 32, n1 = 96, n2 = 256, S1 = 4 
and S2 = 8 in the switch shown in Fig. 1. The 
probability of cell loss due to non-allocation of paths 
through this switch has been obtained by simulation. 
The simulation model assumes that all switch inputs 
have a 100% load, that traffic is uniformly distributed 
among the output modules, and that cells not allocated 
paths on the first attempt are discarded. The resulting 
figure for cell loss probability is below The 
performance of this type of switch will be considered in 
greater detail in a future paper. 
The input module dimensions are 128 x 128. At most 96 of the 128 inputs carry active data. The dimensions of the intermediate and output stage modules are 128 x 256, and 256 x 256 respectively. The input and intermediate switch modules can be of simple design, since they are contention-free. Thus the only stage of the switch which represents a major design challenge is the output stage, where the 256 x 256 switch modules should also introduce a low cell loss probability. 
We now estimate the speed required of the path allocation hardware. The counting of requests requires a Batcher network with 128 inputs, and requires 128 (i.e., n1 + m) clock cycles to execute. One execution of the atomic() procedure, if implemented using the technique of Fig. 3, requires approximately 15 clock cycles, depending on the implementation. Thus perhaps 480 (i.e., 15·32) cycles will be needed to test all possible paths. The number of processors required is 1024 (i.e., 32·32), but the IC count should be relatively low because of the simplicity of the processor design. The route allocation process will add at most (96-4)·2 = 184 cycles to the execution time, assuming that two cycles are required to propagate routing packets through the address generators (at least 4 cells will be allocated a path). This represents a total of 792 clock cycles. Thus the clock rate required for the complete algorithm to execute within one time slot is approx. 290 MHz (neglecting any speed-up required to make good propagation delays). The required clock speed could be reduced using a more complex processor design (e.g., using bit-parallel arithmetic). An alternative is to construct two copies of the hardware, which process requests in alternate time-slots. The clock rate required in this case should be below 150 MHz. 
5: Conclusions. 
A new algorithm for path allocation in three-stage broadband networks has been described. A complete hardware implementation of this algorithm has been presented, including a method for generating the initial data required by the algorithm, and for forwarding the results to each cell at the input side of the switch, in the form of a routing tag. The operating speed required of the design (≈ 150 MHz) appears within the capabilities of VLSI technology in the short term, allowing a switch with a throughput of 400 Gb/s to be constructed. 
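The clock-rate figures quoted for the design example of section 4 can be reproduced with a few lines of arithmetic. The Python sketch below is a back-of-envelope check, not a timing analysis. It assumes that one time slot is the time taken to transmit a standard 53-byte ATM cell at the channel rate V = 155 Mb/s, a definition the paper does not state explicitly, and the variable and function names are hypothetical.

```python
import math

CELL_BITS = 53 * 8          # assumption: standard 53-byte ATM cell
CHANNEL_RATE = 155e6        # V, in bits per second (Fig. 1 legend)
N1, M, S1 = 96, 32, 4       # design example parameters
ATOMIC_CYCLES = 15          # approximate cycles per atomic() execution
HOP_CYCLES = 2              # cycles to propagate a routing packet one hop

count_cycles = N1 + M                    # counting of requests
alloc_cycles = ATOMIC_CYCLES * M         # m iterations of atomic()
tag_cycles = (N1 - S1) * HOP_CYCLES      # routing tag propagation
total_cycles = count_cycles + alloc_cycles + tag_cycles

slot_seconds = CELL_BITS / CHANNEL_RATE
clock_single_copy = total_cycles / slot_seconds   # finish within one time slot
clock_two_copies = clock_single_copy / 2          # two copies, alternate time slots

def copies_needed(clock_hz):
    """Hardware copies required so that a new time slot can be accepted
    every slot when the allocation takes total_cycles at clock_hz."""
    return math.ceil(total_cycles / clock_hz / slot_seconds)
```

Under these assumptions the total is 792 cycles, the single-copy clock rate is a little under 290 MHz, and the two-copy clock rate is a little under 145 MHz, consistent with the figures quoted in the text.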
References 
[1] J.S. Turner, "Design of a broadcast packet switching network", IEEE Trans. Commun., Vol. COM-36, no. 6, pp. 734-743, June 1988. 
[2] A. Huang and S. Knauer, "Starlite: a wideband digital switch", Globecom '84 Conference Record, pp. 121-125, Nov. 1984. 
[3] J.N. Giacopelli, W.D. Sincoskie and M. Littlewood, "Sunshine: a high performance self-routing broadband packet switch architecture", Proc. of the International Switching Symposium, Stockholm, 1990, vol. III, pp. 123-129. 
[4] H. Kuwahara, N. Endo, M. Ogino and T. Kozaki, "A shared buffer memory switch for an ATM exchange", Proc. ICC '89, pp. 118-122. 
[5] T.T. Lee, "A modular architecture for very large packet switches", IEEE Trans. Commun., Vol. COM-38, no. 7, pp. 1097-1106, July 1990. 
[6] C. Clos, "A study of non-blocking switching networks", Bell System Tech. Journal, vol. 32, no. 2, pp. 406-424, Mar. 1953. 
[7] A. Pattavina, "Multichannel bandwidth allocation in a broadband packet switch", IEEE J. Select. Areas Commun., 1988. 
[8] K.Y. Eng and C-L. I, "Performance analysis of a growable architecture for broadband packet (ATM) switching", Globecom '89 Conference Record, pp. 
[9] K.Y. Eng, M.J. Karol and Y.S. Yeh, "A growable packet (ATM) switch architecture: design principles and applications", Globecom '89 Conference Record. 
[10] A. Cisneros, "Large packet switch and contention resolution device", Proc. of the International Switching Symposium, Stockholm, 1990, vol. III, pp. 
[11] R. Proctor and T. Maddern, "Synchronous ATM switching fabrics", Proc. of the International Switching Symposium, Stockholm, 1990, vol. IV, pp. 
[12] C. Day, J. Giacopelli and J. Hickey, "Applications of self-routing switches to LATA fiber optic networks", Proc. ISS '87, pp. 519-523. 







V: channel rate (155 Mb/s) 
L1 (L2) : the number of input (output) modules 
n1 (n2) : the number of input (output) ports per input 
(output) module 
m: the number of intermediate switch modules 
S1 (S2) : the number of channels in the channel group 
connecting each input (output) module to each 
intermediate switch module 
Fig. 1: A three-stage switch with intermediate channel grouping. 
routing via ISM1 ISM0 ISM3 ISM2 
no. of paths 
Table I: An example of path allocation. 
Fig. 2: Processor contents during iteration 0. 
MUX: Multiplexor 
D: Synchronisation Delays 
BR: Bit Reversal 
TC: Token Count 
Fig. 3: The atomic() processor implementation. 
Xij: Processor (input module i, output module j) 
AG: Address Generator 
Routing Packet Generator 
Batcher sorter (n1 + m inputs) 
Fig. 4: Request count, routing tag generation. 
Fig. 6: Address generator (AG) operations 
during routing tag assignment. 
Each column represents one iteration of the algorithm. Each row represents the contents of one address generator after it has received a token. An 'x' indicates a null token. 
Fig. 5: An example of route assignment. 
Table II: An example of routing tag assignment. 
