Abstract-Multistage interconnection networks (MIN's) have been widely used in parallel computer systems and are also recognized as an efficient switching fabric for digital communication. In this paper, we propose a new switching mechanism for MIN's, called unit step buffering (USB), which significantly improves network performance. Each cell is allowed to move only one buffer entry position per network cycle, which permits a short network cycle. The proposed USB scheme is compared to the traditional scheme by analytical modeling and computer simulation. The results show that throughput and delay improve by about 60%-80% for practical size MIN's with reasonable traffic in an asynchronous transfer mode (ATM) switching environment. The improvement for parallel computer systems with larger packets is more significant, at about 100%. More importantly, the scheme does not require any additional hardware or operational overhead.
I. INTRODUCTION
MULTISTAGE interconnection networks (MIN's) [1] have been recognized as an efficient interconnection architecture for parallel computer systems and as a switching fabric for digital communication [2]. A MIN consists of a number of switching elements (SE's) connected in multiple stages, and allows flexible end-to-end connections at relatively low cost. Designs with fault tolerance capability or high reliability [3], [4] are also available. Due to these properties, numerous commercial and experimental parallel computer systems [5], [6] and switching architectures for the integrated service digital network (ISDN) [7], [8] have been developed based on MIN's.
The performance of a MIN is determined mostly by the switching mechanism adopted. In MIN's, buffer modules are used to hold the packets (or cells) which cannot proceed due to link conflict or buffer space unavailability. In what follows, a buffer module refers to a queue with first-in first-out (FIFO) operation, and a buffer refers to an individual buffer entry. In the traditional switching mechanism of buffered MIN's [9]-[12], the buffer space availability of the last stage is first passed to the preceding stage in each network cycle, and this process continues back to the first stage. Then all the cells proceed one stage forward at the same time. We call this the big clock cycle (BCC) scheme. It was later identified [13], [14] that the performance of MIN's can be significantly improved if packet movements are decided based on the buffer availability of the succeeding stage alone, without considering the possibility that a buffer becomes empty because a packet leaves it. The buffer space availability therefore need not be propagated from the last stage to the first stage as in the BCC scheme, which results in a shorter network cycle. We call this the small clock cycle (SCC) scheme.
Another important aspect of the switching mechanism identified in this paper is the distance of packet movement. In the traditional design, when a packet moves forward to the SE in the succeeding stage in one network cycle, it is placed at the last empty buffer (tail) of the receiving buffer module. The tail of a buffer module changes according to the number of packets in it. The longest packet movement delay occurs when the receiving buffer module is empty and the incoming packet thus moves all the way to its front. We call this scheme multiple step buffering (MSB). A network cycle in the MSB scheme needs to be long enough to guarantee complete packet movement to the head. In reality, however, most buffer modules are almost always full for reasonable input load [15], and thus packets finish moving early in each network cycle. The unnecessarily long network cycle is then a substantial waste of bandwidth. If we let each cell always move only one buffer position by employing a much shorter network cycle (that is, faster switching), then the bandwidth of a network cycle can be fully utilized. This scheme is called unit step buffering (USB).
Combining the BCC/SCC and MSB/USB schemes yields four schemes: BCC-MSB (BM), BCC-USB (BU), SCC-MSB (SM), and SCC-USB (SU). The BM and SM schemes have already been well studied [12], [14]. Due to the nature and interrelationship of network cycle and switching mechanism, the SU scheme is expected to be the best choice. In this paper we identify the relative effectiveness of the schemes using analytical modeling and computer simulation. The performance enhancement of the BU scheme over the BM scheme is very significant: the throughput is increased by about 60%-80% for practical size MIN's with a reasonable amount of traffic in an ATM switching environment. Similar results are also obtained for delay. For parallel computer systems with a packet size of 1 kbyte (compared to 53 bytes for an ATM cell), the throughput and delay improvements are about 100%. The performance increase of the SU scheme over the SM scheme is more substantial. The SU scheme outperforms the BU scheme by about 20%-50% for ATM switching. Another important aspect is that the proposed USB scheme requires no additional hardware or operational overhead.
The rest of the paper is organized as follows. The packet switching mechanism required in MIN's is discussed in more detail in Section II, which motivates the proposed approach. In Section III, analytical models of the proposed approach based on the big and small clock cycles are developed. The schemes are then evaluated and compared using both the analytical models and computer simulations for various operational conditions in Section IV. Section V concludes the paper.
II. PACKET SWITCHING MECHANISM
In this section, the packet movement mechanism employed for MIN's is examined first. The clock cycle design is then discussed.
A. Packet Movement
Let $m$ and $t$ denote the size of a buffer module and the time required for shifting the content of a buffer by one position, respectively. Since a buffer module is basically a shift register, it takes $(m-k)t$ time units for an incoming cell to be placed at the tail of a buffer module holding $k$ cells. If a buffer module is empty, the time required for an incoming cell to reach the head of the buffer module is $mt$. This is the longest time for a cell to move from one SE to another. Note, however, that the buffer modules of a MIN with a practical amount of traffic in the steady state are almost always full [15]. In this situation most of the cell movements will be completed much earlier than $mt$. The performance degradation due to the waste of network bandwidth will thus be substantial if the network cycle length is set to $mt$. To avoid this problem, a new mechanism is proposed where each cell is allowed to move only one buffer position in one network cycle. As a result, the network cycle length can be minimized, and the network bandwidth is fully utilized. Fig. 1 compares the cell forwarding mechanisms employed in the conventional MSB scheme and the proposed USB scheme. The thick lines indicate that the two head cells compete for the same output link. Here the head cell in the upper buffer module is assumed to win the contention. Observe from Fig. 1(b) that, with the USB scheme, the incoming cell is located not at the tail but at the end of the buffer module.
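The unit step movement described above can be illustrated with a minimal sketch of one buffer module over one network cycle. The function name `usb_step`, the list representation (index 0 is the front entry), and the flag `front_can_leave` are illustrative assumptions and do not come from the paper. The sketch also assumes BCC-style behavior, in which a cell may step into an entry vacated in the same cycle.

```python
from typing import List, Optional, Tuple

Cell = Optional[int]  # a cell id, or None for an empty buffer entry


def usb_step(buffers: List[Cell], front_can_leave: bool,
             incoming: Cell = None) -> Tuple[List[Cell], Cell, bool]:
    """Advance one buffer module by one USB network cycle (illustrative).

    buffers[0] is the front entry; buffers[-1] is the entry at the end
    of the module, where an incoming cell is placed.
    Returns (new_buffers, departed_cell, incoming_accepted).
    """
    slots = list(buffers)
    departed: Cell = None
    # The front cell leaves only if it won contention and the next stage
    # has room; both conditions are folded into front_can_leave here.
    if slots[0] is not None and front_can_leave:
        departed, slots[0] = slots[0], None
    # Every remaining cell moves at most one position. Scanning from the
    # front lets a cell step into an entry vacated in this same cycle
    # (an assumption matching BCC-style operation).
    for i in range(1, len(slots)):
        if slots[i] is not None and slots[i - 1] is None:
            slots[i - 1], slots[i] = slots[i], None
    # An incoming cell enters at the end entry, not at the tail.
    accepted = False
    if incoming is not None and slots[-1] is None:
        slots[-1] = incoming
        accepted = True
    return slots, departed, accepted
```

With an empty three-entry module, a cell accepted in one cycle needs two further cycles to reach the front entry, whereas in a full module every cell advances one position as soon as the front cell departs.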
B. Clock Design
A network cycle of the conventional MSB scheme consists of two phases, as shown in Fig. 2(a). In Phase 1, each buffer module decides whether the receiving buffer module has buffer space for accepting a cell. Assuming a unit time for one cell movement between stages, Phase 1 takes $n$ time units for an $n$-stage MIN with the BCC scheme. It always takes 1 time unit for the SCC scheme regardless of the number of stages. In Phase 2, each buffer module sends its head cell. The duration of this phase is set to $mt$ in the MSB scheme and to $t$ in the USB scheme, where $m$ is the size of a buffer module and $t$ is the time required to shift a cell by one buffer position. A network cycle for the USB scheme is thus as shown in Fig. 2(b). Table I summarizes the length of a network cycle for the four combinations of the schemes.
The standard ATM cell is 53 bytes long, and the ATM switch fabric is usually designed around a $w$-byte-wide time-multiplexed bus [16]. Shifting a cell by one buffer position then takes $\lceil 53/w \rceil$ time units. For example, assume a 6-stage 4-buffered MIN with an 8-byte-wide time-multiplexed bus, so that one shift takes $\lceil 53/8 \rceil = 7$ time units. The network cycle of the BU scheme is $6 + 7 = 13$, while that of the BM scheme is $6 + 4 \times 7 = 34$. Hence, the network cycle of the BM scheme is about 2.6 times as long as that of the BU scheme. For the SU scheme, it is $1 + 7 = 8$. Note that the network cycle of the BU scheme is about 1.63 times as long as that of the SU scheme. The difference in the network cycle lengths thus needs to be accounted for when the performances of the BU and SU schemes are compared.
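The cycle lengths in this example can be reproduced with a short sketch. It assumes that Phase 1 takes one time unit per stage under BCC and one time unit under SCC, and that Phase 2 takes one shift time under USB and one shift time per buffer entry under MSB. The function names and the two-letter scheme codes used as arguments are illustrative choices.

```python
import math


def shift_time(cell_bytes: int, bus_bytes: int) -> int:
    """Time units to move one cell by one buffer position over the bus."""
    return math.ceil(cell_bytes / bus_bytes)


def cycle_length(scheme: str, stages: int, buffers: int,
                 cell_bytes: int = 53, bus_bytes: int = 8) -> int:
    """Network cycle length for scheme 'BM', 'BU', 'SM', or 'SU'."""
    if scheme not in ("BM", "BU", "SM", "SU"):
        raise ValueError(f"unknown scheme: {scheme}")
    clock, step = scheme[0], scheme[1]
    # Phase 1: availability propagates over all stages (BCC) or only
    # between adjacent stages (SCC).
    phase1 = stages if clock == "B" else 1
    # Phase 2: worst-case move across the whole module (MSB) or a single
    # buffer position (USB).
    t = shift_time(cell_bytes, bus_bytes)
    phase2 = buffers * t if step == "M" else t
    return phase1 + phase2


if __name__ == "__main__":
    for name in ("BM", "BU", "SM", "SU"):
        print(name, cycle_length(name, stages=6, buffers=4))
```

For the 6-stage 4-buffered MIN with an 8-byte bus, this gives 34, 13, and 8 time units for the BM, BU, and SU schemes, which matches the ratios of about 2.6 and 1.63 quoted above.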
The ratio of the length of Phase 1 to that of Phase 2 depends on the packet size and the bus width. Phase 1 is usually much shorter than Phase 2 for practical size MIN's. Referring to Table I, the length of a network cycle is $n + t$ in the BU scheme and $1 + t$ in the SU scheme, where $n$ is the number of stages and $t$ is the time required to shift a cell by one buffer position. Therefore, the network cycle in the BU scheme is $(n + t)/(1 + t)$ times as long as that of the SU scheme for an $n$-stage MIN. In general, the difference between the MSB and USB cycle lengths gets larger as the number of buffers or the packet size increases, or as the bus width decreases. Due to the smaller network cycle length and the fact that almost all buffer modules are nearly full, the USB approach outperforms the conventional MSB approach, as verified by analytical modeling and computer simulation in Section IV. We next develop the analytical models for MIN's with the proposed USB approach. The analytical models for the MSB approach can be found in [12] and [14].
III. ANALYTICAL MODELS FOR THE USB SCHEME
In this section, we develop analytical models for the synchronous multibuffered MIN's of $2 \times 2$ SE's employing the USB approach. The buffer modules in each SE are located at the input port. The following assumptions and definitions are used in the development of the models.
A. Assumptions and Definitions
Assumptions:
• The cell arrival process is Bernoulli.
• The probability of cell arrivals at each network inlet is the same for all network inlets.
• The cells generated in each input node are uniformly distributed over all the output nodes as usually assumed in earlier conventional MIN models [9] - [14] .
• Each cell has the same probability to win the contention, and the blocked cell is resubmitted to the originally destined buffer module.
• The output nodes are fast enough to receive a cell in one network cycle from the SE's in the last stage.
In what follows, front cell denotes the cell residing in the front buffer entry, while head cell denotes the frontmost cell in a buffer module. Note that a buffer module may not have a front cell even though it has some cells in it. In other words, the front cell is always the head cell, but the reverse is not always true. In each network cycle, the front cells in an SE contend with each other for a link. Therefore it is important to know whether the front cell of a buffer module (if it has one) has been blocked or not. The state of a buffer module is defined as follows based on the status of the head cell and the number of cells.
States:
• State 0: It is empty.
• State : It has cells and the head cell moved into the current position in the previous network cycle. Note that the head cell is not necessarily the front cell except for State .
• State : It has cells and the front cell could not move forward either due to contention or the buffer unavailability of the destined buffer module in the previous network cycle. Therefore the front cell has stayed there at least for one network cycle.
When the front cell is blocked, all the adjacent cells directly behind it cannot move forward. This is called head of line blocking [17], [18]. When a cell that experienced the head of line blocking becomes the front cell, it does not have any predetermined destination buffer module. This is because it has not participated in any contention before. It is thus not necessary to distinguish the cells that experienced the head of line blocking from the newly arriving cells. The blocked cell is resubmitted to its original destined buffer module in the subsequent network cycle. If a cell is not a blocked cell, it is considered to be a normal cell.
Some variables are defined next that are used to construct the analytical model.
Variables: Fig. 3 .
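The traffic assumptions listed in this subsection (Bernoulli arrivals with the same probability at every inlet, and destinations uniformly distributed over all outlets) can be sketched as a per-cycle traffic generator. The function name `generate_arrivals` and the parameter `load` for the offered traffic are illustrative assumptions.

```python
import random
from typing import List, Optional


def generate_arrivals(num_inlets: int, load: float,
                      rng: random.Random) -> List[Optional[int]]:
    """Return one entry per network inlet for a single network cycle.

    Each entry is the destination outlet of a newly generated cell, or
    None if no cell arrives at that inlet in this cycle.
    """
    arrivals: List[Optional[int]] = []
    for _ in range(num_inlets):
        # Bernoulli arrival with the same probability at every inlet.
        if rng.random() < load:
            # Destination drawn uniformly over all outlets.
            arrivals.append(rng.randrange(num_inlets))
        else:
            arrivals.append(None)
    return arrivals
```

Passing an explicit `random.Random(seed)` instance keeps simulation runs reproducible.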
B. Analytical Model for the BU Scheme
1) State Transition:
The state transition diagram for a buffer module can be developed using the states and variables defined in the previous subsection, and shown in Fig. 4 , assuming the buffer size of 3. The state transitions are determined as follows.
Assume State . This is the state in which the buffer module has one cell, the location of which is one of the three buffers. Also if the cell is the front cell, it must have arrived at that position in the current network cycle. Therefore the transition to State 0 requires the three conditions that the single cell is at the front buffer, no new cell arrives, and finally the front cell moves forward. The probability of each of these condition is 1/3, , and , respectively. Consequently, the transition probability from State to State 0 is . Let us also see how the transition probability from State to State is obtained. One scenario allowing the transition is that the single cell is at the front buffer (1/3), a new cell arrives ( ), and the front cell moves forward ( ). The probability for this is . The other scenario is that the cell is not at the front buffer (2/3), and no new cell arrives ( ), whose probability is . Hence, the transition probability from State to itself is . Other transition probabilities are obtained similarly.
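The two transition probabilities worked out above can be checked numerically with a small sketch. The symbols did not survive in this copy, so the names `q` (probability that a new cell arrives) and `r` (probability that the front cell moves forward) are placeholders introduced here, and the single cell is assumed to be equally likely to occupy any of the buffer entries, as in the text.

```python
from fractions import Fraction


def single_cell_transitions(q: Fraction, r: Fraction, size: int = 3):
    """Transition probabilities out of the one-cell normal state.

    q: probability that a new cell arrives (placeholder name).
    r: probability that the front cell moves forward (placeholder name).
    Returns (to_empty, stay_in_same_state).
    """
    at_front = Fraction(1, size)
    # To State 0: the cell is at the front, no new cell arrives, and the
    # front cell moves forward.
    to_empty = at_front * (1 - q) * r
    # Back to the same state: either the front cell leaves while a new
    # cell arrives, or the cell is not at the front and nothing arrives.
    stay = at_front * q * r + (1 - at_front) * (1 - q)
    return to_empty, stay


if __name__ == "__main__":
    print(single_cell_transitions(Fraction(1, 2), Fraction(1, 2)))
```

Exact fractions are used so that the results can be compared directly with hand calculations.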
The generalized state equations can be obtained using the state transition diagram of Fig. 4 . Here is omitted from all the state variables at the right hand side of the equations to simplify the notation. For of (2), the five source states are , , , , and . The first term of (2), , is obvious. The second term was explained above. The term covers the transition from State to State , and it represents that the blocked front cell leaves the buffer module while a new cell arrives. The term is for the transition from State to State , and it is obtained from the condition that one of the two cells in the buffer module is the front cell, no new cell arrives, and the front cell moves out. For the last term of the transition from State to State , the blocked front cell moves out and no cell arrives.
1) Performance Measures: Two performance measures-normalized throughput and mean delay per cell-are used to evaluate the performance of MIN's. Normalized throughput is the average number of cells leaving an output node in one network cycle, and mean delay is the average time each cell spends in the network. We first present the procedure for computing these measures. As seen in the procedure above, MIN performance in terms of throughput and delay is estimated by repeatedly calculating the required variables from the last (rightmost) to the first (leftmost) stage. This is because the network status starts to change from the last stage in a specific network cycle by transmitting the cells to the destined outlets. Such changes propagate backward to the first stage, and then one network cycle is finished.
Note that the closed form equations of Throughput and Delay are extremely difficult to obtain due to the complexity of network and randomness of cell movement. Therefore, as is usually done, the values of the performance measures are determined by iteratively computing the related parameters until the system reaches the steady state. Next we present how , , , and of Steps 4, 5, and 6 of the procedure above are calculated.
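The iterative evaluation described above follows a generic fixed-point pattern: apply the state equations once per network cycle until the state probabilities stop changing. The sketch below shows that pattern on a toy two-state chain. The `toy_step` function, the tolerance, and the cycle limit are illustrative assumptions and are not the paper's state equations.

```python
from typing import Callable, List, Tuple


def iterate_to_steady_state(step: Callable[[List[float]], List[float]],
                            state: List[float],
                            tol: float = 1e-12,
                            max_cycles: int = 100_000
                            ) -> Tuple[List[float], int]:
    """Apply `step` once per cycle until the largest change is below tol."""
    for cycle in range(1, max_cycles + 1):
        nxt = step(state)
        if max(abs(a - b) for a, b in zip(nxt, state)) < tol:
            return nxt, cycle
        state = nxt
    raise RuntimeError("no steady state reached within max_cycles")


def toy_step(p: List[float]) -> List[float]:
    """One cycle of a toy two-state Markov chain (not the MIN model)."""
    return [0.9 * p[0] + 0.5 * p[1],
            0.1 * p[0] + 0.5 * p[1]]


if __name__ == "__main__":
    steady, cycles = iterate_to_steady_state(toy_step, [1.0, 0.0])
    print(steady, cycles)
```

In the actual model, `step` would recompute the per-stage variables from the last stage back to the first and then update every buffer module state probability.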
2) Calculation of : is the probability that the normal front cell in BM( ) can reach the destined buffer module in Stage ( ) during . For this, it has to win the contention for a channel, and a buffer space in the destined buffer module must be available. The possible state of the other buffer module competing with it is: 1) empty; 2) with the normal front cell (if it exists) destined to a different buffer module; 3) with the blocked front cell destined to a different buffer module; 4) with the normal front cell (if it exists) destined to the same destination; or 5) with the blocked front cell destined to the same destination. Recall that when two cells compete for the same destination, they are assumed to have the same probability of winning the contention. The buffer availability of the receiving buffer module in Stage ( ) for a front cell of BM( ) is determined by whether there was a cell movement to that buffer module during the previous network cycle, and by the current state of the buffer module contending with BM( ). We first obtain the probability that the normal front cell moves forward in Case 1, .
Case 1: The contending buffer module is empty. In this case, (8) Here and represent the probability that one cell and no cell was received in the previous network cycle, respectively.
represents the probability that a buffer is available in the destined BM( ) at time given that the two buffer modules in the SE at Stage are currently in State and State 0, and it received one cell in the previous network cycle. To obtain we need to know in which state the buffer module can be in now using the condition that it received a cell. The buffer module cannot be in State 0 since it received a cell. It cannot be in State either since the newly arrived cell cannot have been blocked yet and it cannot be the front cell. All other states are possible. This is more easily understood if the state transition diagram of Fig. 4 is referred to. Observe that only State 0 and do not have any incoming arc whose transition probability has the factor . Recall that is the probability that a cell is ready to arrive in a buffer module. Note that even though the receiving buffer module is full (either State or State ), it is still possible to receive a cell by transmitting the front cell to the succeeding stage simultaneously, as shown in (8a) at the bottom of the page. In this manner, is obtained for the case that the receiving buffer module received no cell in the previous network cycle and a buffer space is available. The buffer module can be in any state except State in this condition. Again, the states can be found from the state transition diagram as the ones with the incoming arc whose transition probability contains the factor , as shown in (8b) at the bottom of the page.
Case 2: The other buffer module is in a normal state and the front cell is destined to a different buffer module.
is obtained by the same way as for . Here the normal front cell can move forward without any contention.
(9)
The term 1/2 is the probability that the front cell of the other buffer module is destined to a different buffer module. Since nonempty buffer modules do not always have a front cell, the term 1/2 underestimates this probability. However, the buffer modules are mostly full (and so are highly likely to contain a front cell), and the approximation error will be negligible.
Case 3: The other buffer module is in a blocked state and the front cell is destined to a different buffer module.
is easily obtained as
Case 4: The other buffer module is in a normal state and the front cell is destined to the same buffer module.
When the two front cells compete, each cell has the same probability of winning the contention, which is 1/2. The probability of this case is then (11)
Case 5: The other buffer module is in a blocked state and the front cell is destined to the same buffer module.
Similarly,
Here represents the buffer availability of the buffer module in Stage at time given that the two buffer modules in the SE at stage are currently in State and State ( ) while the buffer module received no cell in the previous network cycle. This implies that the receiving buffer module has been full, and thus only the possible state of the buffer module is State . Hence, . Finally, we obtain as follows: (13)
3) Calculation of : is the probability that a blocked cell in BM can move to its destined buffer module in Stage ( ) during . It can be calculated basically by the same approach as for . We again consider each of the five cases separately.
State 1: The other buffer module is empty. In this case, the probability that the blocked cell can be transmitted is The probability that the normal cell heads to the same module is , and the probability of winning the contention is 1/2. The probability is then as follows: (17) In this case of both the blocked buffer modules, no cell must have been received in the destined buffer module in Stage in the previous network cycle. The probability of winning the contention while the other blocked cell is heading to the same destination is . Therefore (18) Note that the destined buffer module was full at time , and thus the state of the buffer module at time is . Consequently, is the probability to be included. Finally, we obtain as follows: is the probability that BM receives a cell during . In other words, it is the cell arrival rate of the input port of the SE at stage at time . The probability can be calculated using the fact that a cell is received only when it is ready to come from a buffer module in the preceding stage, and a buffer space is available for accommodating it. The probability of a cell arrival at the first stage, , is obtained by considering the initial offered traffic to the network and the buffer availability of the stage. Thus, (22) The probability of a cell arrival in other stages is equal to the cell departure probability of the preceding stage. Using and , we obtain (23)
Note here that the factor represents the probability that a front cell exists when there are cells in a buffer module of buffers. Recall that and were derived assuming that a front cell exists. Finally, is obtained as shown in (24) at the bottom of the page.
5) Throughput and Delay:
Normalized throughput of a MIN is defined as the number of cells leaving the network per network cycle when the MIN is in steady state. Hence, the normalized throughput is obtained as follows:
Throughput (25) where is the time for reaching the steady state.
The mean delay is defined as the average amount of time a cell spends in the MIN. Note that some empty buffer entries can exist between the cells in a buffer module with the proposed scheme. Consequently, the delay of a cell in a buffer module depends not only on the number of cells but also on the distribution of the cells in it. This disallows direct application of Little's formula for computing the delay.
Fig. 5. $D_s$ versus $n_c$.
To model the delay, we first define $D_s$ as the time required for a cell to move forward one buffer entry position. It is a function of the number of cells in a buffer module, $n_c$. Note that $D_s(0) = 1$ since a cell can always move forward in one cycle if the buffer module is empty.
since the front cell can leave it with the probability of .
$D_s$ is not linear in $n_c$. The smaller $n_c$ is, the smaller the chance of head of line blocking, and consequently the smaller the delay. On the other hand, as $n_c$ increases, the delay increases more sharply. A quadratic function with a positive slope represents this relationship well, and thus it is adopted for $D_s$. It is shown in Fig. 5. Assuming , and using the two boundary conditions of and , we obtain and as
The time delay at the th stage, , is then obtained as (27) where , i.e., the expected number of cells in a buffer module in . In the equation above, the first term is the delay for a newly arriving cell to reach the front buffer, while the second term represents the time for the front cell move out. The mean delay is finally obtained as Delay .
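The exact quadratic and its coefficients did not survive in this copy, so the following is only a sketch of how two boundary conditions can fix a two-coefficient quadratic. It assumes the form $D_s(n_c) = a n_c^2 + b$, the boundary value $D_s(0) = 1$ for an empty module, and a caller-supplied value `d_full` for a full module of `m` buffers. The form, the names, and `d_full` are assumptions made for illustration and may differ from the paper's equations.

```python
from typing import Tuple


def quadratic_delay_coefficients(m: int, d_full: float) -> Tuple[float, float]:
    """Solve a and b in D_s(n_c) = a * n_c**2 + b (assumed form).

    Boundary conditions: D_s(0) = 1 and D_s(m) = d_full, where d_full is
    a hypothetical per-step delay for a full module of m buffers.
    """
    if m <= 0:
        raise ValueError("m must be positive")
    b = 1.0                      # from D_s(0) = 1
    a = (d_full - b) / (m * m)   # from D_s(m) = d_full
    return a, b


def step_delay(n_c: float, a: float, b: float) -> float:
    """Per-step delay for a module holding n_c cells (assumed form)."""
    return a * n_c * n_c + b


if __name__ == "__main__":
    a, b = quadratic_delay_coefficients(m=3, d_full=4.0)
    print([step_delay(n, a, b) for n in range(4)])
```

Under this assumed form the delay grows slowly for lightly occupied modules and more sharply as the module fills, which is the qualitative behavior the text describes.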
C. Analytical Model for the SU Scheme
An important basic assumption of the BCC-based network cycle regarding buffer space availability is that the front cell in a sending buffer module can move forward even though the destined buffer module is full, provided the front cell of the destined buffer module can move forward in the same network cycle. Here the control information carrying the buffer space availability must have been passed across the MIN from the last stage to the first stage as discussed earlier. Therefore, Phase 1 of a network cycle must be long enough to allow the propagation, and it is $O(n)$ [13] for $n$-stage MIN's. MIN's with different numbers of stages would thus require different clock lengths, causing poor system expandability. In the SCC-based clock design, the control information is exchanged only between two adjacent stages, and thus the length of Phase 1 of the network cycle is $O(1)$ for $n$-stage MIN's. Previous studies [13], [14] show that the SCC-based clock design outperforms the BCC-based design for practical operational conditions.
In this section, we develop an analytical model of the MIN's employing the SU scheme. The same assumptions and definitions employed in the previous section are also used here. The state transition diagram is shown in Fig. 6 
Here the two buffer modules in the SE at stage are currently in State and State 0, and the destined buffer module received a cell in the previous network cycle. To obtain we need to know in which state the buffer module can be in now using the condition that it received a cell. The buffer module cannot be in State 0 since it received a cell. It cannot be in State since the newly arrived cell cannot be blocked yet. Besides, it cannot be the front cell. All other states are possible. Refer to Fig. 6 . Notice that the receiving buffer module should not be full in order to be able to receive a cell unlike in the BCC based scheme, as shown in (37a) at the bottom of the page. In this manner, is obtained for the case that the receiving buffer module has received no cell in the previous network cycle. The buffer module can be in any state except State in this condition. Again, the states can be found from the state transition diagram as the ones with the incoming arc of factor . Thus,
We obtain other probabilities similarly.
(38)
Here represents the buffer space availability of BM at time given that the two buffer modules in SE( ) are currently in State and State ( ) while the buffer module received no cell in the previous network cycle. This implies that the receiving buffer module has been full, but depending on the movement of the front cell in the previous network cycle, it is in either State or State in the current network cycle. Hence, . Finally, we obtain as follows:
Calculation of :
In the case of both the blocked buffer modules, , no cell must have been received in the destined buffer module in Stagein the previous network cycle ( ). The probability of winning the contention while the other blocked cell is heading to the same destination is . This is the case that the destined buffer module was full at time , and it is in either State or State at time . Consequently, is the probability to be included. Finally, we obtain as follows.
Calculation of and : The probability can be calculated using the fact that a cell is received only when it is ready to come from a buffer module in the preceding stage, and a buffer space is currently available for accommodating it. To obtain for the first stage and the other stages, we use each of the following two equations, respectively:
(49) (50) (37a) Finally, is obtained as follows:
We use the same equations derived in the previous section to obtain the normalized throughput and mean delay.
IV. PERFORMANCE EVALUATION
The BU and SU schemes are compared and evaluated in this section. They are also compared to the BM and SM schemes.
A. BU Scheme
Figs. 7-10 plot the throughput and delay data of the BU scheme for MIN's containing buffer modules with capacity to hold three and five cells (three-buffered and five-buffered) of six stages ($64 \times 64$) and ten stages ($1024 \times 1024$), respectively. The offered traffic load varies from 0.1 to 1, and data are collected from both the analytical model and computer simulation. Here each analytical and simulation datum is obtained by averaging 10 runs. In each run, 1 000 000 iterations are taken to collect reliable data, and the variations in the last 100 000 iterations are less than 0.1%.
The performance of the BU scheme is compared to the BM scheme in Table II(a) and (b). As we mentioned in Section II, if the bus is $w$ bytes wide and the ATM cell is 53 bytes long, then the time to move one cell by one buffer position in our scheme is $\lceil 53/w \rceil$. Note that the network cycle of the conventional MIN with the BM scheme is $(n + m\lceil 53/w \rceil)/(n + \lceil 53/w \rceil)$ times as long as that of the proposed BU scheme, where $n$ is the number of stages and $m$ is the number of buffer entries per module. Therefore, the throughput of the BM scheme is divided by this factor for fair comparison with the BU scheme. For the same reason, the delay of the BM scheme is multiplied by the factor. Note that the normalization factor is decided by the bus width, the number of stages (network size), and the number of buffer entries. Here we consider two bus widths, 4 and 8 bytes.
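The normalization described above can be sketched as follows, assuming the BM and BU cycle lengths are the number of stages plus, respectively, the buffer count times the shift time and a single shift time. The function names are illustrative, and any throughput or delay values passed in are the caller's own, since no numbers from Table II are reproduced here.

```python
import math
from typing import Tuple


def bm_over_bu_factor(stages: int, buffers: int, bus_bytes: int,
                      cell_bytes: int = 53) -> float:
    """Ratio of the BM network cycle length to the BU cycle length."""
    t = math.ceil(cell_bytes / bus_bytes)  # one-position shift time
    return (stages + buffers * t) / (stages + t)


def normalize_bm(throughput: float, delay: float,
                 factor: float) -> Tuple[float, float]:
    """Express BM results in units of the shorter BU network cycle.

    Throughput per BM cycle is divided by the factor; delay measured in
    BM cycles is multiplied by it.
    """
    return throughput / factor, delay * factor


if __name__ == "__main__":
    factor = bm_over_bu_factor(stages=6, buffers=4, bus_bytes=8)
    print(round(factor, 3))
```

The factor grows with the number of buffer entries and with narrower buses, which is why it has to be recomputed for each network configuration being compared.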
Observe from Table II(a) that the throughput of the BU scheme is about two times higher than that of the conventional scheme for both the 4- and 8-byte channel cases. The delay of the BU scheme is also much smaller than that of the conventional BM scheme for relatively high traffic load. However, the delay of the BU scheme is slightly higher than that of the BM scheme for relatively low traffic. This is because the buffer modules of MIN's with low traffic are almost empty in the steady state, so a cell entering at the end of a module needs several network cycles to reach the front. The SU scheme, however, solves this problem by using the small clock cycle, as shown in the next subsection.
Table II(b) compares schemes for five-buffered MIN's. Notice that the improvement is much more significant than with the three-buffered MIN's. This is an expected result since the effect of single step movement on the performance will be more significant as the size of buffer increases. In other words, the waste of cycle time in the conventional scheme will be more substantial as the size of buffer module increases. The effectiveness of the proposed BU scheme for general parallel computer systems with larger size packet of 1 Kbyte is demonstrated in Table II (c). Notice that performance enhancement in both the throughput and delay is more significant than with ATM switching.
B. SU Scheme
Figs. 11-14 plot the throughput and delay data of the SU scheme for three- and five-buffered MIN's of six and ten stages. The SU scheme is then compared to the SM scheme in Table III. Note here that the network cycle with the SM scheme is $(1 + mt)/(1 + t)$ times as long as that of the proposed SU scheme, where $m$ is the number of buffer entries per module and $t$ is the time to move one cell by one buffer position. The throughput and delay of the SM scheme are thus normalized by this factor as discussed earlier.
Observe from Table III that the throughput of the SU scheme is much higher than the SM scheme. Delay of the SU scheme is also much smaller for most of the range of the input load. Notice that the improvement for the five-buffered MIN is much more significant than for the three-buffered MIN.
Figs. 15 and 16 plot the throughput and delay data of the BU and SU scheme obtained from the analytical models. The network cycle of the BU scheme is about 1.33 and 1.63 times longer than that of the SU scheme with the 4-and 8-byte time multiplexed bus, respectively. The performance enhancement of the SU scheme over the BU scheme is about 20-50% for the throughput. The delay improvement is more extensive. For other sizes of MIN's, we observe similar results.
The comparisons show that the proposed scheme can significantly enhance the performance of MIN's for ATM switching as well as for parallel computer systems. More importantly, this improvement is obtained without any extra hardware or operational overhead. The proposed scheme is thus well suited to practical implementation.
V. CONCLUSION
In this paper we have identified the inefficiency of the conventional switching mechanism employed for buffered MIN's. We therefore developed a new approach based on unit step packet movement, which improves MIN performance in terms of throughput and delay. We also developed analytical models for estimating the performance of multibuffered MIN's of $2 \times 2$ switching elements employing the proposed approach. The models were verified by computer simulation, and the evaluation confirms that the proposed approach outperforms the conventional schemes for practical traffic loads. Unit step buffering with the small clock cycle turns out to be the most effective.
It has been determined that the closer to the output stages, the worse the buffer utilization, even under uniform traffic condition [15] . Effectiveness of the proposed approach for other operational conditions such as nonuniform traffic [16] , [17] , multicasting, and ATM LAN [18] needs further investigation.
