I. Introduction
To achieve significant performance in parallel computing, it is necessary to keep the communication overhead as low as possible. The communication overhead of a multiprocessor system depends to a great extent on the underlying interconnection network. An interconnection network (IN) can be either static or dynamic. Dynamic networks can connect any input to any output by enabling some switches. They are applicable to both shared memory and message passing multiprocessors. In a strictly hierarchical bus architecture [1], there are a number of buses connected in the form of a tree between the processors and the memories. The use of multiple buses makes hierarchical bus-based systems more scalable than the popular single-bus multiprocessors. However, the bandwidth of this interconnection decreases as one moves toward the top of the tree. Thus, the scalability of a hierarchical bus system becomes limited by the bandwidth of the topmost level bus.
The bandwidth problem can be alleviated through the fat tree design [5]. The simplicity of the bus-based designs and the availability of a fast broadcasting mechanism are factors that make bus-based systems very attractive.
The MINs, on the other hand, offer a uniform bandwidth across all stages of the network. The bandwidth of the network increases in proportion to the increase in system size, making the MIN a highly scalable interconnection. The switches in a MIN are made up of small crossbar switches.
When the system size grows, bigger switches can be used to keep the number of stages and, hence, the memory latency low [6]. However, the complexity of a crossbar switch grows as the square of its size, and therefore the total network cost becomes predominant in larger systems. We have observed that the traffic in the network is very low, making the crossbar-based MIN switches highly underutilized.
In a system using private caches, which is common in today's shared memory multiprocessors, the effective traffic handled by the switches in the network is further reduced.
A novel interconnection scheme, called the Multistage Bus Network (MBN), is introduced in this paper that combines the positive features of hierarchical buses and MINs. The MBN consists of several stages of buses with an equal number of buses at each stage. This provides a uniform bandwidth across the stages and forms multiple trees between processors and memories. Unlike hierarchical bus networks, the MBNs comprise multiple buses at higher levels, reducing the traffic at higher levels. Maintaining cache coherence is a major problem in shared memory multiprocessors. Unlike MINs, the MBN allows snoopy cache coherence protocols to be applied [7], which can improve the performance to a large extent. Also, the MBN provides much better fault tolerance and reliability compared to a conventional MIN [8].
It is known that a distributed shared memory organization has better scalability than a centralized organization. We also develop equations for the probabilities of taking each path based on the memory requests. In order to make a realistic comparison with MINs, we introduce the design and analysis of a corresponding Bidirectional MIN (BMIN) in this paper. The BMIN allows U-turns, and a packet can be routed using the same techniques presented in this paper for the MBN. Recently, Xu and Ni [9] have discussed a U-turn strategy for bidirectional MINs as applicable to the IBM SP architecture [4]. However, the MIN employed in SP architectures is cluster-based and works differently from the proposed MBN or BMIN.
In this paper, we analyze the performance of an MBN for distributed shared memory multiprocessors based on different self-routing techniques. Unlike the previous analysis [8], the present analysis is based on routing along the shortest of the four paths for a given source and destination pair. The MBN has some inherent fault tolerance capabilities due to a number of switch-disjoint paths between any source and destination pair. In this paper, we concentrate only on the routing and performance evaluation. The rest of the paper is organized as follows. We present the structure and introduce four types of self-routing techniques for the MBN in Section 2. We define the routing tags required to implement the four routing strategies in Section 3 and present an algorithm for finding the optimal path in the network for a given source-destination pair in the same section. A performance analysis of the MBN and BMIN is then presented in Section 4. Results and comparison with the conventional and bidirectional MINs are presented in Section 5. Section 6 presents the execution-driven simulation specifications and results. Finally, Section 7 concludes the paper.
II. Structure of the MBN
We will consider a distributed shared memory (DSM) architecture throughout this paper. In such an environment, the memory modules are directly connected to the corresponding processors, as shown in Figure 1, but the address space is shared. An example of a hierarchical bus interconnection with two levels of buses is shown in Figure 2a [1]. In this example, there are 16 processors, 4 memories, four level 1 buses and one level 2 bus. Naturally, the top level bus is the bottleneck in the system. In order to improve the performance, a number of buses must be connected at the top level with an interleaved memory design. Such a connection is shown in Figure 2b for a $16 \times 16$ system with two levels of buses.
We propose that each bus along with its controller be placed in a switch analogous to a MIN switch.
Such a network is called a Multistage Bus Network (MBN).
In an $N \times N$ multistage network using $k \times k$ switches, there are $l = \log_k N$ stages of switches, numbered from stage 0 to stage $l - 1$, as shown in Fig. 3a. Every switch has a set of left connections closer to the processor side and a set of right connections closer to the memory side. The construction of a $4 \times 4$ MBN switch incorporating a bus, a bus access controller and output buffers is shown in Figure 3b.
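The stage count can be checked with integer arithmetic. The following is a minimal sketch, not part of the original design: the function names are illustrative, and the figure of $N/k$ switches per stage is an assumption that follows from each stage having $N$ ports on a side and each switch serving $k$ of them.

```python
def num_stages(n: int, k: int) -> int:
    """Return l = log_k(n) for an n x n network built from k x k switches.

    Raises ValueError when n is not an exact power of k, i.e. when the
    network cannot be built from k x k switches alone.
    """
    if k < 2 or n < k:
        raise ValueError("need k >= 2 and n >= k")
    stages, size = 0, 1
    while size < n:
        size *= k
        stages += 1
    if size != n:
        raise ValueError(f"{n} is not a power of {k}")
    return stages


def switches_per_stage(n: int, k: int) -> int:
    """Assumed count of k x k switches in one stage (n ports / k per switch)."""
    num_stages(n, k)  # validates that the pair (n, k) is buildable
    return n // k


if __name__ == "__main__":
    for n, k in [(16, 2), (64, 4), (64, 8), (512, 8), (1024, 2)]:
        print(n, k, num_stages(n, k), switches_per_stage(n, k))
```

Integer multiplication is used instead of a floating-point logarithm so that combinations that are not exact powers are rejected rather than rounded.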
There are control lines associated with each port to carry arbitration information to the bus access controller. Suzuki et al. have studied a similar bus structure in [11]. We also propose a Bidirectional MIN (BMIN) structure for comparison. The difference between the switch architectures of the BMIN and MBN is evident from Figs. 3b and 3c. The BMIN switch is a crossbar whereas the MBN switch is a bus. For both networks, a packet from stage $i$ is passed on to stage $i + 1$, or vice versa, using the destination tag digits. For a $k \times k$ MBN switch there will be $2k$ packets ($k$ inputs from either side) potentially competing for the bus in a cycle. When there is more than one such packet, the bus access controller chooses one of them at random. The others are queued to be transmitted later. On the other hand, in a $k \times k$ BMIN switch all the $2k$ inputs can be connected to the $2k$ outputs if the requests are to different destinations. The $k \times k$ MBN and BMIN switches support forward, backward, and turn-around connections, as explained in the next section. We describe the structure of the MBN below. The structure of the BMIN is similar.
A request goes to the local memory when the source tag and the destination tag of the request are the same. If the tags are different, the request travels to a remote memory through the MBN.
As an example, a $16 \times 16$ MBN with $2 \times 2$ switches is shown in Figure 4. There may or may not be a shuffle interconnection before the first stage of switches. Our routings are developed based on Figure 4, where there is no shuffle before the first stage. Hence, a set of processors with their memories are connected to one switch at the first stage and to another switch at the same position at the last stage.
If there exists a $k$-shuffle connection before the first stage, a different set of processors will be connected to the first stage and last stage switches. In Figure 4, a request travels in the forward direction when it starts from the processor side and passes through stages $0, 1, \ldots, l - 1$, in that order. It travels in the backward direction when it starts from the memory side and passes in the reverse direction through stages $l - 1, \ldots, 1, 0$, as shown in Figure 5. A packet can also travel from left to right and make a U-turn in an intermediate stage, as shown in Figure 6. This is called Forward-U (FU) routing. Similarly, Figure 7 shows Backward-U (BU) routing, where a message enters the network from the right and makes a U-turn. These four routings provide four distinct paths between a source and a destination in the MBN. As a result, the fault tolerance and reliability of the MBN are much better than those of a conventional MIN. Exact expressions for the MBN reliability are derived in [8]. They are also valid for the BMINs introduced in this paper.
In conventional MINs like the Omega, Delta and GSN [6], the destination tag is used for self-routing of a request only in the forward direction. In the case of the MBN, the destination tag can also be used for self-routing in the forward direction. Since the stage 0 connections are straight instead of a $k$-shuffle, the destination tag itself cannot be used for self-routing in the backward direction.
As explained later, the routing tag in the backward routing case is obtained by reverse shuffling the destination tag by one digit. In order to determine where to take a turn in the two routing techniques involving U-turns, we need to combine the source tag and the destination tag to form a combined tag. The following definitions are needed to develop exact routing algorithms later.
B. Optimal Path Algorithm
The distance between a source and a destination in an MBN is defined as the minimum number of switches that the packet has to traverse. For a conventional MIN, this distance is always equal to $l$, the number of stages in the network. In the case of an MBN, however, the distance may be less than $l$ if FU or BU routing is chosen. The FU and BU (Forward-U and Backward-U) routings are used when the turning stage happens to be less than the center stage of the network. Therefore, there will be net savings in terms of distances between a given source and all the destinations. Detailed expressions for the overall savings in distances for such an MBN are given in Section 4. We present below an algorithm to choose the optimal routing for a given source-destination pair.
These path length equations can be used to form a table for a given source and destination. As an example, Table 1 shows the path lengths from source 0 to different destinations $i$ in a $1024 \times 1024$ network, for $i \le 1023$. The path length for each routing is quite different, and thus a routing algorithm is required to route the request through the optimal path. For example, if the destination is 2, then Backward-U routing will result in the optimal path length. On the other hand, if the destination is 256, then Forward-U routing will result in the optimal path length. The other two requests should use the forward or backward routing strategies.
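The selection step can be illustrated with a short sketch. It is not the authors' algorithm: it assumes the forward and backward turning stages (FTS and BTS) have already been derived from the routing tags, it uses the FU path length $2\,\mathrm{FTS} + 1$ given in Section 4, and it assumes by symmetry that the BU path length is $2(l - 1 - \mathrm{BTS}) + 1$. Ties are resolved in favor of forward routing, which is also an assumption of this sketch.

```python
from typing import Dict, Tuple


def path_lengths(l: int, fts: int, bts: int) -> Dict[str, int]:
    """Path length, in switches, of each of the four routings.

    l   : number of stages
    fts : forward turning stage (0 .. l-1), assumed precomputed from the tags
    bts : backward turning stage (0 .. l-1), assumed precomputed from the tags
    """
    return {
        "FW": l,                      # forward routing crosses every stage
        "BW": l,                      # backward routing crosses every stage
        "FU": 2 * fts + 1,            # stated in the text
        "BU": 2 * (l - 1 - bts) + 1,  # assumed mirror image of FU
    }


def choose_routing(l: int, fts: int, bts: int) -> Tuple[str, int]:
    """Return (routing, length) with the minimum path length.

    min() keeps the first minimal item, so the tuple order below is the
    tie-break preference: FW, then BW, then FU, then BU.
    """
    lengths = path_lengths(l, fts, bts)
    best = min(("FW", "BW", "FU", "BU"), key=lambda name: lengths[name])
    return best, lengths[best]


if __name__ == "__main__":
    print(choose_routing(10, 4, 3))  # U-turn at stage 4 beats the full path
    print(choose_routing(10, 5, 4))  # both U-turns are longer than l
```

With $l = 10$, a forward turning stage of 4 gives a path of 9 switches, one fewer than full forward routing, while a turning stage of 5 would give 11 and is therefore not used.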
IV. Performance of the MBN
The Multistage Bus Network (MBN) is analyzed here in a distributed shared memory environment, shown in Figure 1. We also analyze the BMIN and compare its results with those of the MBN. In both cases, the memory module $M_i$ is directly connected to the processor $P_i$ and is called the local memory of $P_i$. Requests from a processor to its local memory are called internal requests and are carried over the internal bus between the processor and its local memory. A memory can also receive external requests that originate from other processors and are carried over the MBN.
A. Network operation
In a distributed memory system, there are $k - 1$ processors that can be reached through the switch of size $k$ at the first stage or the last stage to which $P_i$ is connected. These processors and their memories are referred to as the cluster processors and cluster memories of $P_i$. Thus an external request destined to a cluster processor or memory returns from the first stage (Forward-U routing) or the last stage (Backward-U routing) without going through the whole MBN. However, if the request is neither to a local nor to a cluster memory, the request may take one of the four routings described earlier. Both internal and external requests arrive at a memory queue. Only one of them is selected for service on an FCFS basis while the remaining requests are queued at the buffer of the memory. After receiving a request, a memory module will send a reply packet either directly to its local processor or to another processor through the network, depending on whether the request is internal or external.
We will compare the performance of the MBN with that of a BMIN. The transmission of request and reply packets goes through the network following the routings given earlier in the paper. We shall assume a synchronous and packet-switched system for analyzing the multistage networks. Since a buffer size of four or more gives the same effect as an infinite buffer [12], [13], for simplicity we shall assume an infinite buffer for the MBN and BMIN. The analysis can be extended to finite buffers, but the equations will be fairly complicated [13]. The performance analysis of the MBNs and BMINs will be carried out under the following assumptions [12], [13]. Packets are generated at each source node by independent and identically distributed random processes. At any point of time a processor is either busy doing some internal computations or is waiting for the response to a memory request. If there is no pending request, each busy processor generates a packet with probability $p$ at each cycle. The probability that this request is to the local memory (internal request) is $m$, and the probability that it is to any other memory module (external request) is $1 - m$.
A reply from memory travels in the opposite direction through the same path in the MBN or BMIN. It may be noted that in the case of a MIN like the Butterfly [3], a reply has to traverse in the same direction (i.e., from processor to memory side) to reach the requesting processor because the MIN has unidirectional links. In [9], bidirectional links are used between stages, and hence the request and reply messages may travel in the forward and backward directions, respectively.
The messages from processor to memory are generated using probabilities as specified below:
Request Probability ($p$): The request probability is defined in Section 3 and is used as a means of estimating the processor behavior in terms of memory requests. When a processor is busy in computation, i.e., no request is outstanding in switches or a memory module, it can send a memory request. At each cycle, the processor decides whether or not a message is to be sent based on this probability. On average, it takes $1/p$ cycles to send out a request from the processor.
Local memory request probability (m): Given that a request is to be made to memory, a probability (m) is used to decide whether the request is for local or external memory.
Though simple, the above probabilities play an important role and are the only inputs to the analysis.
After each request to memory, the processor waits for an acknowledgment. Once an acknowledgment is received, the processor does useful computation for one cycle and then based on the above probabilities decides whether to continue or to send another request to the memory.
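The $1/p$ figure quoted for the request probability follows from the geometric distribution. As a brief worked step, assuming the per-cycle decisions are independent and writing $T$ (a symbol introduced here for illustration) for the number of cycles until a busy processor issues a request:

```latex
\Pr[T = t] = p\,(1 - p)^{t - 1}, \qquad t = 1, 2, \ldots
\qquad\Longrightarrow\qquad
E[T] = \sum_{t = 1}^{\infty} t\,p\,(1 - p)^{t - 1} = \frac{1}{p}.
```

For example, $p = 0.5$ gives an average of two cycles between the end of one memory access and the issue of the next request.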
Processor utilization: The processor utilization, $P_u$, defined as the fraction of time a processor is busy, will be determined by the waiting time and service time faced by a request at various service centers. In a number of applications, a large portion of the requests are made to the cluster processors.
In [8], we studied the performance of the MBN with varying probabilities for cluster requests. In that study, forward-U and backward-U routings were allowed only at the first and last stages. All other requests were routed by forward (FW) routing. The processor utilization for such a case is given by
In this paper, a message in the MBN or BMIN will be sent along the minimum distance. In such a case,
where the four delay terms correspond, in order, to:
the expected delay for a local memory request to be served;
the expected delay for serving requests to cluster memories;
the expected delay for serving all requests, except those to cluster memories, that follow FU or BU routing;
the expected delay for serving all requests that follow forward routing (FW) or backward routing (BW).
The derivation of these four terms is presented below. These terms depend on (a) the routing probabilities along each path, (b) the amount of traffic in the network, and (c) the service demand at individual service centers. Thus we get a nonlinear equation with $P_u$ as the single variable, which is solved by iteration.
B. Routing probabilities and path delays
The routing probabilities and path delays are derived here for the MBN and BMIN under the assumption that all the non-local memories are equally addressed by a processor. These equations can be modified in the case of nonuniform remote memory references. Since the path length of backward (BW) routing is the same as that of FW routing, we derive the term based on FW routing and multiply it by 2 to include BW routing. A similar method is used for FU and BU routing as well.
Local memory requests: A local memory request does not involve switch traversal. Thus the only delay is that in servicing the request in the memory module ($d_m$). Given that the probability for a processor to request a memory is $p$ and that to request a local memory is $m$, we can deduce that
Cluster routing: Requests to cluster processors travel to the first or last stage switch and take an FU or BU routing to the destination processor. All those source-destination pairs where all bits except the least significant $\log_2 k$ bits of the CT are zero entail this type of routing. Thus the number of cluster memories for a given source is $k - 1$, since $k \times k$ is the size of an MBN or BMIN switch.
The switch at stage 0 is traversed once for reaching the cluster memory and once for sending back the acknowledgment. Here, given that an external memory is requested, the probability for requesting cluster memories can be expressed as
Non-cluster FU or BU routing: In forward-U and backward-U routing, the request traverses in one direction up to a particular stage (as explained in Section 2) and makes a U-turn to reach the destination processor. Thus, given the turning stage FTS, the path length is $2\,\mathrm{FTS} + 1$. This is because the stage FTS is traversed only once while all stages to the left of FTS are traversed twice, but not necessarily through the same switch. We should have $\mathrm{FTS} < \lfloor l/2 \rfloor$ and $\mathrm{BTS} \ge \lceil l/2 \rceil$ for an optimal path length. As we have already covered cluster memories ($\mathrm{FTS} = 0$, $\mathrm{BTS} = l - 1$), we will start with $\mathrm{FTS} \ge 1$ and $\mathrm{BTS} \le l - 2$.
Consider $\mathrm{FTS} < \lfloor l/2 \rfloor$. A similar derivation can be done for $\mathrm{BTS} \ge \lceil l/2 \rceil$ also.
We know that the total number of destinations is $N - 1$. The number of ways in which a bit in the RCT can be 1 is $k - 1$. Thus, given that an external memory is requested, we have the equation for the probability of non-cluster FU and BU routing as,
where $d = \lfloor l/2 \rfloor$.
The delay in such a routing is dependent on the stage at which the U-turn is going to take place.
Thus, within the summation of the above equation, we should include the delay for each switch traversed in that particular path. As discussed above, for a turning stage FTS we traverse all stages to the left of FTS twice. Thus, with $i$ denoting the turning stage, the delay except for that in the turning stage is $2\left(\sum_{j=0}^{i-1} 2 r_j\right)$.
Forward routing: Finally, for all those source-destination pairs which do not fall into the above categories, FW or BW routing is used. In this type of routing all switches are traversed, thus giving a summation of all switch response times for $d_n = \sum_{i=0}^{l-1} r_i$. Thus the expected delay for all such routings can be expressed as,
where the probability of FW or BW routing equals 1 minus the probabilities of cluster routing and of non-cluster FU or BU routing (equation 9); these two probabilities are given by equations 4 and 6, respectively.
The equations 0 through 9 are valid when the local memory is accessed with a probability of $m$ and all other memories are addressed with equal probabilities, i.e., $(1 - m)/(N - 1)$. In an actual case, there will be more interaction between the tasks within a cluster. The equations can be easily extended to include such cases.
C. Queueing delays in switches
In order to make the analysis simpler, each stage in the network is considered in isolation from the other stages. Consider a queueing center with $n$ inputs. Let the probability that there is a packet at one of the inputs at any given cycle be $q$, and the service demand of a packet at the service center be $t$ cycles. The number of requests arriving at the queue during the service time of any previous request follows a binomial distribution with $nt$ trials and success probability $q$. The mean number of arriving requests is $E = ntq$ and the variance is $V = ntq(1 - q)$. The average queue length $Q$ at the queueing center can be found using the Pollaczek-Khinchine (P-K) mean value formula [14],
The throughput of these requests is $E/t$. Hence, by using Little's law, the mean response time of the center, $r$, can be derived as $r = Q\,t/E$. The network delay, $d_n$, will be the sum of the response times of the stages a packet visits while routed through the network.
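The response time computation can be sketched numerically. The exact P-K expression used by the authors is not reproduced in the text above, so this sketch assumes the commonly used discrete-time mean value form $Q = E + (V + E^2 - E)/(2(1 - E))$; the function name and the guard for $E = 0$ are illustrative choices, not part of the original analysis.

```python
def response_time(n: int, t: float, q: float) -> float:
    """Mean response time r = Q * t / E of a queueing center.

    n : number of inputs feeding the center
    t : service demand of a packet, in cycles
    q : probability of a packet at one input in a given cycle
    Assumes Q = E + (V + E^2 - E) / (2 * (1 - E)), a standard discrete-time
    mean value form; the paper's own expression may differ.
    """
    e = n * t * q                # mean arrivals during one service time
    v = n * t * q * (1.0 - q)    # variance of the binomial arrival count
    if not 0.0 <= e < 1.0:
        raise ValueError("the center is unstable unless 0 <= n*t*q < 1")
    if e == 0.0:
        return float(t)          # no traffic: response time is pure service
    queue_len = e + (v + e * e - e) / (2.0 * (1.0 - e))
    return queue_len * t / e     # Little's law with throughput E / t


if __name__ == "__main__":
    # A single input with unit service time never queues behind anyone.
    print(response_time(1, 1.0, 0.5))
    # Two inputs sharing one server introduce waiting.
    print(response_time(2, 1.0, 0.25))
```

Under this assumed form the expression simplifies to $r = t\,\bigl(1 + (E - q)/(2(1 - E))\bigr)$, so the waiting component vanishes when a single input feeds a unit-time server and grows without bound as $E$ approaches 1.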
In the case of the BMIN, there are $2k$ inputs and $2k$ outputs in a switch. The request probability at an input or output of a BMIN switch at stage $i$ will be $P_u\,p\,(1 - m)\,p_i$. Following the model shown in Fig. 8b, we can calculate the response time of a BMIN switch by using $n = 2k$, $t = t_s$ and $q = P_u\,p\,(1 - m)\,p_i/n$.
The total network delay $d_n$ will be the sum of the response times of switches at different stages.
In both networks, the mean number of arriving requests at a memory module is $E_m = q_i + t_m q_e$, where $q_i$ and $q_e$ are the internal and external requests for that memory module, respectively. A packet (or request) takes the optimal path from a source to the destination. The number of switches traversed would depend on the nature of the CT and RCT. The delays derived here are inserted into equations 3, 5, 7 and 8, which in turn are plugged into equation 2 to obtain the processor utilization and response time of the network. We then get a nonlinear equation with $P_u$ as the single variable, which is solved by iteration.
The iteration technique used to compute the processor utilization, $P_u$, can be presented as follows:
1. Initialize $P_u$ with a guess of the expected processor utilization. The better the guess, the fewer iterations the computation needs.
2. Calculate the request probabilities at each stage of the network and at the memory module. An intermediate step might be to calculate the static values for $p_i$ (the probability that stage $i$ of the network is traversed).
3. Calculate the mean switch response times and the memory response time, $r_i$ and $r_m$ respectively.
4. Based on the above values, calculate the network delay and memory delay using the equations provided for the four delay terms.
5. Based on these values, calculate a new processor utilization, $P_u$.
6. Repeat steps 2-5 until the new $P_u$ is within some tolerance of the last $P_u$.
An initial value of 0.5 for $P_u$ and an accuracy of 0.00001 were used to generate the analytical results presented in the next section.
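The iteration amounts to a fixed-point solve. The sketch below shows only the control flow, using the initial value 0.5 and tolerance 0.00001 mentioned above. The update function is a hypothetical stand-in, since the utilization equation itself is not reproduced here: `toy_update` and its delay model are invented purely so that the loop can be run.

```python
from typing import Callable


def solve_utilization(update: Callable[[float], float],
                      guess: float = 0.5,
                      tol: float = 1e-5,
                      max_iter: int = 10_000) -> float:
    """Iterate P_u <- update(P_u) until successive values differ by < tol.

    `update` stands for steps 2-5: given the current P_u it recomputes the
    request probabilities, response times and delays, and returns a new P_u.
    """
    pu = guess
    for _ in range(max_iter):
        new_pu = update(pu)
        if abs(new_pu - pu) < tol:
            return new_pu
        pu = new_pu
    raise RuntimeError("utilization did not converge; try another guess")


def toy_update(pu: float) -> float:
    """Hypothetical update rule, NOT the paper's equation.

    Models a processor that issues a request with probability p per busy
    cycle and then waits for a delay that grows linearly with the load.
    """
    p = 0.5
    delay = 2.0 + 4.0 * pu       # invented load-dependent delay, in cycles
    return 1.0 / (1.0 + p * delay)


if __name__ == "__main__":
    print(round(solve_utilization(toy_update), 4))
```

Plain iteration converges here because the toy update is a contraction; for an update that oscillates, averaging the old and new values at each step is a common remedy, and the `max_iter` guard prevents an endless loop in either case.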
V. Results and Discussions
We performed extensive cycle-by-cycle simulations to verify that the proposed routings work and measured the routing probabilities and network delays [16]. The simulation was done using a synchronous packet-switched distributed memory environment. The simulation specifications are the same as those of the analysis and are detailed below with a view to making the network operation clearer.
Fig. 9. Comparison of analysis and simulation for processor utilizations of the MBN, varying m (horizontal axis: request probability; curves: simulation and analysis for m = 0.1 and m = 0.9).
In our simulations each cycle was considered to be the time required for the transmission of a packet from one output buffer of a switch to the next stage output buffer. This includes the transmission of the packet through the link and the time a switch takes to route it to the corresponding destination buffer. The minimum time taken for a packet to reach memory is based on the number of switches that the routing covers.
All four routings discussed in Section 2 are used in the simulations. The simulation compares each source and destination by running the optimal routing algorithm and then chooses the proper routing.
The choice between backward and forward routing is made as follows. All memory requests that could use either forward (FW) or backward (BW) routing use forward routing. All acknowledgement packets use backward routing to keep the load distribution the same on both routings. Apart from these differences, the routing decisions are based solely on the tags generated by the optimal routing algorithm. The probabilities $p$, $m$, etc. are fed to the simulation as input parameters. All the memories except for the local memories are equally likely to be addressed upon a memory request.
In this section we present the relative performance of the BMIN and MBN.
(Figures: curves for the MBN, BMIN, and CMIN plotted against request probability, for m = 0.1 and m = 0.9.)
Finally, we present the processor utilization and the response time of the MBN obtained for different switch sizes and different numbers of processors in Table III. Some places in the table are left empty because an $N \times N$ MBN cannot be built using only those $k \times k$ switches. Both the request probability ($p$) and the local memory request probability ($m$) are fixed at 0.5. For $N = 64$, there is a decrease in $P_u$ when $k$ is increased from 4 to 8. This is because the MBN is less efficient due to increased contention and delay in an $8 \times 8$ bus-based switch. On the other hand, in a $512 \times 512$ system, when $k$ is increased from 2 to 8, there is a good improvement. The number of switches in the entire network is still quite high, keeping the contention low enough to gain in performance.
If we compare the MBN's performance to that of the BMIN, we can see that as the switch size increases, the BMIN gives a higher processor utilization and a lower response time. This increase in performance is due to lower contention in the crossbar switch. However, the BMIN gives this increased performance at the expense of cost. In [8], a cost parameter based on the number of connection points in a switch is presented. The number of connections is $k^2$ for a $k \times k$ crossbar switch, whereas for a bus the number of connections is $2k$. Thus, the total costs of the BMIN and MBN are $kN\log_k N$ and $2N\log_k N$, respectively. If we include these parameters along with the processor utilization and the response time, the cost-effectiveness of the MBN is higher than that of the BMIN, as shown in [8]. A $4 \times 4$ switch size works out to be the most cost-efficient for different network sizes and workload inputs.
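The two cost expressions can be tabulated with a few lines of code. This is an illustrative sketch: the function names are invented here, and the count of $N/k$ switches per stage is an assumption, chosen because it reproduces the totals $kN\log_k N$ and $2N\log_k N$ quoted above.

```python
def stage_count(n: int, k: int) -> int:
    """Integer log_k(n); raises ValueError if n is not a power of k."""
    stages, size = 0, 1
    while size < n:
        size *= k
        stages += 1
    if size != n or stages == 0:
        raise ValueError(f"{n} is not a positive power of {k}")
    return stages


def bmin_cost(n: int, k: int) -> int:
    """Connection points in an n x n BMIN: k^2 per crossbar switch."""
    return (n // k) * stage_count(n, k) * (k * k)


def mbn_cost(n: int, k: int) -> int:
    """Connection points in an n x n MBN: 2k per bus switch."""
    return (n // k) * stage_count(n, k) * (2 * k)


if __name__ == "__main__":
    for k in (2, 4, 8):
        print(k, bmin_cost(64, k), mbn_cost(64, k))
```

For $N = 64$ the two networks cost the same at $k = 2$ (768 connection points each), while at $k = 8$ the BMIN needs 1024 against 256 for the MBN, which illustrates why the crossbar cost dominates as switches grow.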
VI. Execution-driven Simulation and Results
The execution time of an application on a multiprocessor architecture is the ultimate parameter that indicates the performance. In order to show that the MBN performs similarly to the Bidirectional MIN (BMIN), we study their performance by using an execution-driven simulation of various applications.
Our simulator is based on Proteus [10], originally developed at MIT. However, the original simulator modeled indirect interconnection networks using an analytical model. We have modified the simulator extensively to model the BMIN and the MBN exactly, using $2 \times 2$ switches and a packet-switching strategy. The system considered in this paper has private cache memories that operate under a directory-based cache-coherence protocol [15].
The matrix multiplication was done between two $128 \times 128$ double-precision matrices. The principal data structures are four shared two-dimensional arrays of real numbers: two input matrices, a transpose matrix, and one output matrix. The shared data size is about 512 Kbytes.
For Floyd-Warshall's algorithm, we used a graph of 128 nodes with random weights assigned to the edges. The principal data structures are two shared two-dimensional arrays of integers: one distance matrix and one predecessor matrix. The shared data size is about 128 Kbytes. The program goes through as many iterations as the number of vertices, and during an iteration a particular row of the distance and predecessor matrices is read by all the processors. Each iteration is followed by a barrier.
The blocked LU decomposition application was done on a $256 \times 256$ matrix using $8 \times 8$ blocks. The principal data structure is a two-dimensional array in which the first dimension is the block, and the second contains all data points in that block. In this manner, all data points in a block (which are operated on by the same processor) are allocated contiguously, and false sharing and line interference are eliminated.
We implemented the Cooley-Tukey 1-D FFT algorithm. The simulations are done on an input of $2^{14}$ points. The principal data structures are two 1-D arrays of complex numbers. There is no data sharing during the first $\log_2 (N/P)$ stages, where $N$ is the number of data points and $P$ is the number of processors. In the remaining $\log_2 P$ stages, every data point is shared by two processors. During these stages, instead of using two separate input and output arrays, we interleave these arrays to avoid a large number of conflict misses.
MP3D is a three-dimensional particle simulator used in rarefied fluid flow simulation. We used 16000 molecules with the default geometry provided with SPLASH [18], which uses a $14 \times 24 \times 4$ (2646-cell) space containing a single flat sheet placed at an angle to the free stream.
B. Simulation Results
The characteristics of the applications, as measured in the simulation, are given in Table V. The table shows the number of shared memory references, the cache misses on the shared memory references, and the total number of messages generated during the execution. We begin by presenting the average message latencies experienced when the above applications are run in this shared memory environment.
The values in Table VI show that the response time of messages does not differ significantly between the MBN and the BMIN. However, the response time of the conventional MIN (CMIN) is significantly higher than that of the MBN or BMIN for all applications. Thus the time taken for serving cache misses is higher using the CMIN than using the MBN.
The use of the U-turn routing strategies is one of the factors in reducing the average latencies in both the interconnection networks. The introduction of U-turns not only reduces the response times of the messages but also distributes the messages throughout all the switches in the IN. An evaluation based on the execution time of these applications gives the performance of the interconnection networks in a more realistic environment. The simulator gives the execution time of an application in millions of cycles. Also, we measured times for different activities, as shown in Fig. 16.
The timings are shaded differently for the following activities.
The time spent in computation and synchronization.
The read stall time: This is the maximum read stall time experienced by any processor.
The write stall time: This is the maximum write stall time experienced by any processor.
The graph in Fig. 16 is divided into 5 sets of 3 bars each. Each set is for a different application and shows the execution time using the MBN vs. the BMIN vs. the CMIN. The figure shows that the BMIN and MBN give much better performance than the CMIN. The improvement in overall execution time is mainly due to the reduction in read stall time and write stall time, which is directly related to the network latency.
It can be seen from the results that the MBN performance is very similar to the BMIN performance.
It can be further observed that, for all the applications considered, the write stall time is lower than the read stall time because of fewer write misses.
VII. Conclusions
The paper presented a multistage bus network (MBN) that offers better performance and reliability than a conventional multistage interconnection network (MIN). An equivalent Bidirectional MIN (BMIN) was also presented for performance comparison. Self-routing schemes for different paths were developed based on routing tags. An algorithm was presented to find the path of minimum distance between a source and a destination. Probabilities of taking these different paths were derived, and a queueing analysis was presented for evaluating the performance of the MBN and the BMIN. The analysis was verified by simulations, and comparisons of results with the MIN were made. It was shown that the performance of the MBN was very similar to that of the BMIN in terms of response time and processor utilization, but much better than that of a conventional MIN. To emphasize the potential of the MBN, an execution-based evaluation was also presented in a cache-coherent shared memory multiprocessor environment that uses a directory-based cache coherence protocol. The performance of the MBN in such an environment is also shown to be almost equal to that of the BMIN in terms of message latency and execution time of five applications. This fact, along with simplicity in hardware and better fault tolerance, makes the MBN a viable alternative to the existing MIN.
