Performance analysis of multiple-bus systems is usually carried out under the assumption of a uniform memory request model. Hot spots arising in multiprocessor systems give rise to non-uniform memory requests. It is known that a hot spot memory request pattern results in a significant degradation in the performance of a buffered multistage interconnection network. The aim of this research is to study the effect of hot spots on the performance of buffered multiple-bus systems, and to compare the performance of buffered and unbuffered systems. Analytical models based on Markov chains have been developed to determine the bandwidth of buffered multiple-bus systems in the presence of a hot spot memory request pattern. The models assume that unsuccessful memory requests are queued in the buffers at the memory modules. Furthermore, processors with outstanding requests are not allowed to generate new requests. The models provide a fast and inexpensive method to evaluate the performance of a buffered multiple-bus system in the presence of a hot spot memory request pattern.
Introduction
A multiprocessor system essentially consists of a number of processors and memories interconnected by an interconnection network. Networks have topologies such as multiple-bus, multistage, hypercube, tree, etc. Multiple-bus systems have been found to be suitable for small and medium sized systems. Multiple-bus systems are modular, easily expandable and fault tolerant. However, the performance of multiple-bus systems is limited due to bus and memory contention.
Computers and Electrical Engineering 27 (2001) 293–308, www.elsevier.com/locate/compeleceng
Different conflict resolution strategies are used to manage such contention [1, 2]. Commonly used criteria for the performance measurement of multiple-bus systems include bandwidth, probability of request acceptance, and processor utilization.
Multiple-bus systems can be buffered or unbuffered. In an unbuffered system, processor requests that are unsuccessful (due to contention) during a cycle are discarded. The processors have to resubmit the requests in subsequent cycles. In a buffered system, the unsuccessful requests are queued in buffers at the memory modules, and are resubmitted to the memory modules in the next cycle. Performance evaluation of both unbuffered [3–9] and buffered [10, 11] crossbar and multiple-bus systems has been reported in the literature. Good surveys of crossbar and multiple-bus systems appear in Refs. [12, 13].
In evaluating the performance of crossbar or multiple-bus systems, most authors assume a uniformly distributed memory request pattern, whereby a memory request generated by a processor is equally likely to be directed to any of the memory modules [6, 14–19]. However, the uniformly distributed memory request pattern is rather restrictive in real-world situations. Studies have shown that non-uniform memory request patterns may arise in multiprocessor systems. The favorite memory request pattern, a particular type of non-uniform traffic pattern, has been analyzed for multiple-bus systems in Ref. [20], while a consecutive request pattern has been studied in Ref. [21]. Successive requests with local referencing have been discussed in Ref. [22]. A hot spot request pattern in multistage interconnection networks has been studied in Refs. [23–27]. Other non-uniform request patterns can be found in Ref. [28].
One of the reasons for hot spot memory request patterns is the use of shared variables for locking, synchronization, pointers to shared queues, etc. These are indivisible primitives and must be stored in a single shared location, thereby giving rise to a hot spot memory request pattern in the system. Combining has been shown to alleviate the problem caused by a hot spot in multistage interconnection networks when the hot spot is due to a single memory location. It is, however, not applicable if a memory module is itself a hot module. Therefore, it is important to study the performance of a multiple-bus system in the presence of a hot spot memory request pattern.
Performance evaluation of unbuffered multiple-bus and crossbar systems under a hot spot request pattern has been reported in Refs. [29, 30], respectively. It was found that the performance degrades significantly with an increase in the proportion of hot spot requests. The objectives of this paper are to
· develop analytical models to study the performance of buffered multiple-bus systems in the presence of a hot spot memory request pattern,
· compare the performance of buffered and unbuffered systems.
The commonly used temporal assumption of processor requests in consecutive memory cycles is not realistic because a processor will not issue another request until the previous request has been satisfied. The model removes this temporal assumption. Unsuccessful requests, in such a system, are buffered in first-in-first-out queues at the memory modules. Moreover, processors with outstanding requests are blocked from issuing further requests. A discrete-time Markov chain has been used for the modeling, and simulations have been carried out for those cases where the state space of the Markov chain becomes too large to be handled with reasonable effort.
The rest of the paper is organized as follows: The modeling assumptions regarding the operation of the system are described in Section 2, followed by the Markov chain performance models in Section 3. Performance results obtained from models and simulations are presented and compared in Section 4, and concluding remarks appear in Section 5.
Modeling assumptions
To keep the Markov chain model simple and tractable, several simplifying assumptions regarding the operation of the system are made. The following are the assumptions that are used in the Markov chain model described in Section 3:
· The system (Fig. 1) consists of a multiple-bus system with N processors $P_0, P_1, \ldots, P_{N-1}$, M memory modules $M_0, M_1, \ldots, M_{M-1}$, and B busses. Without loss of generality, we assume $M_0$ to be the hot memory module.
· We assume synchronous operation of the system, i.e., the generation of requests by the processors and the servicing of the requests by the memory modules occur at clock cycles. The clock cycle $\tau$ is split into two phases $\tau_1$ and $\tau_2$ corresponding to the generation and servicing of requests. Processors generate requests at the beginning of $\tau_1$, and the memory modules complete the servicing of the requests at the end of $\tau_2$.
· Requests are assumed to be spatially independent, i.e., a request generated by a processor during a cycle is independent of requests generated by other processors during the same cycle.
· A processor having an outstanding request is blocked, and cannot generate a new request until the previous one has been served. A processor with no outstanding request generates a new request at the beginning of $\tau_1$ with probability $r$. We call this the static request rate. It will be shown in Section 4 that the effective request rate of a processor is significantly less than $r$ in the case of high blocking.
· The probability that the generated request is for the hot memory module (HM) or for a given non-hot memory module (NHM) is $h$ or $(1-h)/(M-1)$, respectively.
· Processor requests generated during $\tau_1$ of a cycle are put in the memory queues corresponding to the requested modules. During $\tau_2$, a memory module with outstanding requests services one request from its buffer.
· Processor requests which cannot be served during the same cycle as they were generated remain queued at the buffers for servicing at a later cycle. The buffers are considered to be of infinite size.
· In the case of bus conflicts, memory requests to be serviced are chosen at random from the outstanding requests. A total of B requests are presented to B different memory modules.
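The assumptions above can be turned into a small Monte Carlo sketch. This is not the simulator used for the results in Section 4; it is a minimal illustration written under stated assumptions. The function name `simulate_bandwidth` and its parameters are hypothetical, module 0 stands for the hot module, only queue lengths are tracked (FIFO order does not affect the bandwidth), and bus conflicts are resolved by picking B busy modules uniformly at random, which is one possible reading of the last assumption.

```python
import random


def simulate_bandwidth(n_proc, n_mem, n_bus, r, h, cycles=50_000, seed=0):
    """Estimate the average number of memory modules served per cycle.

    Module 0 plays the role of the hot module M_0 (requires n_mem >= 2).
    """
    rng = random.Random(seed)
    queue = [0] * n_mem          # outstanding requests buffered at each module
    served_total = 0
    for _cycle in range(cycles):
        # Phase tau_1: only processors without an outstanding request may issue one.
        free = n_proc - sum(queue)
        for _proc in range(free):
            if rng.random() < r:              # a request is issued with probability r
                if rng.random() < h:          # it goes to the hot module with probability h
                    queue[0] += 1
                else:                         # otherwise to a uniformly chosen NHM
                    queue[rng.randrange(1, n_mem)] += 1
        # Phase tau_2: each busy module serves one request, at most n_bus in total.
        busy = [m for m in range(n_mem) if queue[m] > 0]
        if len(busy) > n_bus:
            # Assumption: bus conflicts are resolved by a uniform random choice of modules.
            busy = rng.sample(busy, n_bus)
        for m in busy:
            queue[m] -= 1
        served_total += len(busy)
    return served_total / cycles
```

With `h = 1.0` and `r = 1.0` every request targets the hot module, so the estimate is exactly one request per cycle, which matches the limiting case discussed in Section 4.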
Performance evaluation
In this section, we develop a Markov chain based analytical model for the performance evaluation of a buffered multiple-bus system operating according to the assumptions described in Section 2. It is well known that Markov chain modeling of an N × M system, even for a uniform memory request pattern, results in a large number of states, which makes it analytically intractable [31, 32]. In the presence of a hot spot memory request pattern, the number of states is even larger. Therefore, for a multiple-bus system in the presence of a hot spot memory request pattern, we develop Markov chain models for systems having a small number of processors or memories [31]. Due to the large number of states in the chain, we rely on simulation results for systems having a large number of processors and memories. In the next few sections, we separately model systems having a large number of processors or memories. We then show that the model, for systems having a large number of processors and memories, is too complex to be solved with reasonable effort.
Modeling a 2 × M system
In this section, we develop a Markov chain model for the average memory bandwidth of a system having two processors, M memory modules, and two busses. We assume that each memory module has a buffer of size N. The state of the system is defined by an M-tuple $(S_0, S_1, S_2, \ldots, S_{M-1})$, where $S_i$, $0 \le i \le M-1$, is the number of outstanding memory requests for $M_i$ at the end of $\tau_1$. Note that $\sum_{i=0}^{M-1} S_i \le 2$. Since we are interested only in the bandwidth of the system, we reduce the states of the system to seven equivalent states $\pi_0, \pi_1, \ldots, \pi_6$. Note that an equivalent state is composed of several states of the system. For example, the equivalent state $\pi_6$ is composed of the following states: $(0, 2, 0, \ldots, 0)$, $(0, 0, 2, 0, \ldots, 0)$, $(0, 0, 0, 2, 0, \ldots, 0)$, $\ldots$, $(0, 0, \ldots, 0, 2)$.
Having defined the states of the system, the transition probabilities between the states will be determined next. The transition probabilities, $P_{2,M} = \{p_{i,j};\ 0 \le i, j \le 6\}$, for a 2 × M system will be represented by a 7 × 7 matrix. The transition from state i to state j is therefore represented by $p_{i,j}$. The Markov chain, with all the possible transitions, is shown in Fig. 2. Each arc represents a transition from one state to another. When the system is in state $\pi_0$, a hot memory request is serviced. In the next cycle, the processor whose request was served in the previous cycle places a request to the HM or an NHM with probabilities $rh$ and $r(1-h)$, respectively. The probability that it does not issue a request is $1 - r$, so that the three probabilities sum to one. Therefore, the transition probabilities are given by $p_{0,0} = rh$, $p_{0,2} = r(1-h)$, and $p_{0,1} = 1 - r$. Other transition probabilities can be derived similarly, and together they form the transition probability matrix $P_{2,M}$.
It is possible to make a transition from any state back to the same state in a finite number of transitions, and hence the chain is aperiodic. Since the chain is also irreducible, it is ergodic and hence possesses a unique stationary probability distribution of the states. Let the stationary probability distribution of the states be $\pi_0, \pi_1, \ldots, \pi_6$.
The number of memory modules that can be accessed concurrently in state $\pi_i$, $0 \le i \le 6$, will be represented by $\mu_i$. For example, $\mu_1 = 1$ and $\mu_4 = 2$. The average memory bandwidth (AMBW) of the buffered 2 × M system having two busses is therefore given by

$$\mathrm{AMBW}(2, M, 2) = \sum_{i=0}^{6} \mu_i \pi_i,$$

where $\mathrm{AMBW}(N, M, B)$ is the average memory bandwidth of an N × M system having B busses.
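As an illustration of how such a chain can be solved numerically, the sketch below builds the Markov chain over the full (non-reduced) states $(S_0, \ldots, S_{M-1})$ directly from the assumptions of Section 2, finds its stationary distribution by power iteration, and weights each state by the number of busy modules. It is not the seven-state reduction described above: the function name `exact_bandwidth` is hypothetical, power iteration is a swapped-in solution technique, and the code assumes there are enough busses that no bus conflicts occur (true for the 2 × M system with two busses).

```python
from itertools import product


def exact_bandwidth(n_proc, n_mem, r, h, tol=1e-12, max_iter=100_000):
    """Average memory bandwidth of a small buffered system without bus conflicts.

    Assumes n_mem >= 2 and a bus count of at least min(n_proc, n_mem).
    """
    # Choices of one unblocked processor in phase tau_1; None means "no request".
    choice_prob = {None: 1.0 - r, 0: r * h}
    for m in range(1, n_mem):
        choice_prob[m] = r * (1.0 - h) / (n_mem - 1)

    # Full states: queue length per module, with at most n_proc requests in total.
    states = [s for s in product(range(n_proc + 1), repeat=n_mem)
              if sum(s) <= n_proc]

    def transitions(state):
        after = tuple(max(q - 1, 0) for q in state)  # every busy module serves one
        free = n_proc - sum(after)                   # processors allowed to request
        out = {}
        for picks in product(choice_prob, repeat=free):
            prob = 1.0
            nxt = list(after)
            for c in picks:
                prob *= choice_prob[c]
                if c is not None:
                    nxt[c] += 1
            if prob > 0.0:
                key = tuple(nxt)
                out[key] = out.get(key, 0.0) + prob
        return out

    table = {s: transitions(s) for s in states}

    # Power iteration for the stationary distribution, starting from a uniform guess.
    dist = {s: 1.0 / len(states) for s in states}
    for _ in range(max_iter):
        new = dict.fromkeys(states, 0.0)
        for s, p in dist.items():
            for t, q in table[s].items():
                new[t] += p * q
        delta = max(abs(new[s] - dist[s]) for s in states)
        dist = new
        if delta < tol:
            break

    # Bandwidth: expected number of modules with a non-empty queue.
    return sum(p * sum(1 for q in s if q > 0) for s, p in dist.items())
```

For example, `exact_bandwidth(2, 10, 0.5, 0.3)` evaluates a 2 × 10 system with r = 0.5 and h = 0.3. Enumerating full states is only practical for small systems, which is the reason the paper works with reduced, equivalent states.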
Modeling an N × 2 system
In this section, we develop, for $r = 1$, a model to determine the average memory bandwidth of a system having N processors, two memories, and two busses. Since there are only two memories, the state of the system can be described by a two-tuple $(S_0, S_1)$, where $S_0$ and $S_1$ are the number of requests for the HM and the NHM, respectively. For an N × 2 system, the total number of states is equal to the number of possible partitions of N into two groups. Since there can be $N + 1$ such partitions, the total number of possible states is $N + 1$. For example, a 4 × 2 system has the states $\pi_{4,2} = \{(4,0), (3,1), (2,2), (1,3), (0,4)\}$ and a 9 × 2 system has the states $\pi_{9,2} = \{(9,0), (8,1), (7,2), (6,3), (5,4), (4,5), (3,6), (2,7), (1,8), (0,9)\}$. The states of an N × 2 system are $\pi_{N,2} = \{(N,0), (N-1,1), (N-2,2), \ldots, (1,N-1), (0,N)\}$, as shown in the Markov chain diagram in Fig. 3. When the system is in state $(N, 0)$, a hot memory request is serviced and it goes to the intermediate state $(N-1, 0)$. The next request is to the HM or the NHM with probabilities $h$ and $1 - h$, respectively. Thus, the next state of the system is $(N, 0)$ or $(N-1, 1)$ with probabilities $p_{0,0} = h$ and $p_{0,1} = 1 - h$, respectively. Other transition probabilities can be derived similarly. Two memory requests are serviced per cycle in all the states except $(N, 0)$ and $(0, N)$, when only one request is serviced. Therefore, $\mu_0 = \mu_N = 1$ and $\mu_i = 2$, $1 \le i \le N-1$. The transition probability matrix for the Markov chain in Fig. 3 is built from these probabilities, with the shorthand $a = (1-h)/h$. The bandwidth is then found by substituting the values of $\mu_i$ and $\pi_i$ in

$$\mathrm{AMBW}(N, 2, 2) = \sum_{i=0}^{N} \mu_i \pi_i$$

to obtain

$$\mathrm{AMBW}(N, 2, 2) = 1 + \sum_{i=1}^{N-1} \pi_i,$$

where the second form uses $\mu_0 = \mu_N = 1$, $\mu_i = 2$ otherwise, and the fact that the $\pi_i$ sum to one.
The AMBW for $r = 1$ is obtained as described above. Because of the large number of possible states, the Markov chain for $r < 1$ is too complex for an N × 2 system. To illustrate the complexity of the chain, the number of transitions of the chain for a 3 × 2 system, for $r < 1$, is shown in Fig. 4. Consequently, we are forced to obtain the results for the $r < 1$ case using simulations.
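Since the N × 2 chain for r = 1 only moves between neighbouring states, its stationary distribution can be sketched with detailed balance. The code below is an illustrative derivation under the assumptions of Section 2, not a reproduction of the paper's closed form: the function name `bandwidth_n_by_2` is hypothetical, and states are indexed here by k, the number of requests queued at the HM, so that k = N corresponds to the state (N, 0). In an interior state both memories serve a request and the two freed processors both choose the HM with probability h² or both choose the NHM with probability (1 − h)², which makes the ratio of neighbouring interior probabilities equal to a² with a = (1 − h)/h.

```python
def bandwidth_n_by_2(n_proc, h):
    """AMBW(N, 2, 2) for r = 1, 0 < h < 1 and N >= 2, via detailed balance."""
    if n_proc < 2 or not 0.0 < h < 1.0:
        raise ValueError("requires n_proc >= 2 and 0 < h < 1")
    up = h * h                # interior state: both freed processors pick the HM
    down = (1.0 - h) ** 2     # interior state: both freed processors pick the NHM

    # Unnormalised stationary weights, indexed by k = requests queued at the HM.
    w = [1.0]                                  # k = 0, i.e. state (0, N)
    w.append(w[0] * h / down)                  # balance between k = 0 and k = 1
    for k in range(1, n_proc - 1):             # balance between interior k and k + 1
        w.append(w[k] * up / down)
    w.append(w[n_proc - 1] * up / (1.0 - h))   # balance between k = N - 1 and k = N

    total = sum(w)
    # Two modules are busy in every state except the two end states.
    return 2.0 - (w[0] + w[-1]) / total
```

Under these assumptions the uniform case h = 0.5 gives a bandwidth of 2 − 1/N, so `bandwidth_n_by_2(4, 0.5)` evaluates to 1.75, and the result is symmetric about h = 0.5, in line with the behaviour described for Fig. 6.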
Modeling an N × M system
Consider first a 4 × 4 system under the uniform memory request pattern and $r = 1$. The number of states in such a system is equal to the number of equivalence classes in a decreasing list partition of four requests into four groups, where a group can be empty. The total number of partitions for the above case is five.
For an N × M system under the uniform memory request pattern, the total number of partitions of N into at most M parts is given [33] by $\sum_{n=1}^{M} P(N, n)$, where $P(N, n)$, the number of partitions of N into exactly n parts, is given by the recurrence relation

$$P(N, n) = P(N-1, n-1) + P(N-n, n),$$

where $P(N, 1) = P(N, N) = 1$. The N requests from the N processors can go to any number of memory modules within the M memory modules. If we denote the system state in a decreasing list format, then such a list $(i_1, i_2, \ldots, i_M)$, $i_1 \ge i_2 \ge i_3 \ge \cdots \ge i_M$, where the entries sum up to N, describes a unique partition of N. Each of the unique partitions will be called an equivalent state. There are 15 equivalent states for a 7 × 7 system in the presence of a uniformly distributed memory request pattern. For a hot spot request pattern, we redefine the representation of the states. A state is now represented by $(k, i_1, i_2, \ldots, i_{M-1})$, where $i_1, i_2, \ldots, i_{M-1}$ are arranged in a decreasing list. In the above state representation, $k$, $0 \le k \le N$, denotes the number of requests for the HM and $i_1, i_2, \ldots, i_{M-1}$ are the number of requests for the $M - 1$ NHMs. If k requests are for the HM, the remaining $N - k$ requests are to be partitioned among the $M - 1$ NHMs. The total number of states Q in such a representation is obtained by summing, over all values of k, the number of partitions of the remaining $N - k$ requests into at most $M - 1$ parts.
Fig. 4. Markov chain of a system having three processors and two memories, for r < 1.
The number of transitions from a state of the Markov chain varies from one to a maximum of Q. The transition matrix is of size Q × Q. It is tedious to solve $Q + 1$ equations to determine the stationary state probability distribution. Simulation was therefore used to determine the bandwidth of N × M systems, where N and M are greater than 2.
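The state counts quoted in this section can be checked with a short sketch of the partition recurrence. The function names below are hypothetical, and the boundary convention P(0, 0) = 1 is an assumption added so that the case of no remaining requests counts as one state; the standard recurrence for partitions into exactly n parts is used, which agrees with the boundary values P(N, 1) = P(N, N) = 1.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def partitions_exact(total, parts):
    """P(total, parts): partitions of `total` into exactly `parts` positive parts."""
    if total == 0 and parts == 0:
        return 1                      # assumed convention: the empty partition
    if parts <= 0 or total < parts:
        return 0
    return (partitions_exact(total - 1, parts - 1)
            + partitions_exact(total - parts, parts))


def uniform_states(n_proc, n_mem):
    """Equivalent states for r = 1 under a uniform request pattern."""
    return sum(partitions_exact(n_proc, n) for n in range(1, n_mem + 1))


def hot_spot_states(n_proc, n_mem):
    """Q: equivalent states for r = 1 when module 0 is tracked separately."""
    q = 0
    for k in range(n_proc + 1):       # k requests queued at the hot module
        rest = n_proc - k             # requests left for the n_mem - 1 NHMs
        q += sum(partitions_exact(rest, n) for n in range(0, n_mem))
    return q
```

Here `uniform_states(4, 4)` and `uniform_states(7, 7)` return 5 and 15, the values quoted above, and `hot_spot_states(N, 2)` returns N + 1, the state count of the N × 2 model.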
Results
In this section, we present performance figures for a buffered system in the presence of a hot spot request pattern. We also compare the performance of a buffered system with that of an unbuffered system. Results for an unbuffered system are obtained from the models developed in Ref. [34]. Results obtained for 2 × M, N × 2, and N × M multiple-bus systems using the models and simulators described in the previous sections are shown in Figs. 5–7.
Fig. 5. Bandwidth vs. hot spot probability for systems with two processors, 10 memories, two busses and r < 1.
Fig. 5 shows the variation in the average memory bandwidth of a 2 × 10 system having two busses, as a function of the hot spot probability and for different request rates. For a 2 × 10 system, $h = 0.1$ corresponds to the uniform memory request case. For high processor request rates ($0.6 \le r \le 1.0$), an increase in the hot spot probability results in contention at the hot memory module, resulting in a sharp decrease in the bandwidth. For low processor request rates ($0.1 \le r \le 0.5$), the bandwidth does not degrade significantly with an increase in the hot spot probability. This is due to low contention for the HM at low values of the processor request rate. It should be noted that processors with blocked requests cannot generate new requests, resulting in a drop in the effective processor request rate. The effective request rate decreases with an increase in the blocking at the memory modules. This decrease in the effective request rate, with an increase in the blocking, is analogous to a feedback approach for preventing the buffers from overflowing.
Due to the number of processors being two, the upper bound of the bandwidth is two. The bandwidth for a system consisting of two memories and two busses, for $r = 1$, is shown in Fig. 6. Average bandwidth is shown as a function of the hot spot probability for different numbers of processors. In such a system, $h < 0.5$ or $h > 0.5$ represents $M_1$ or $M_0$ being the HM, respectively, and $h = 0.5$ is the uniform memory request case. Therefore, the bandwidth falls off rapidly on either side of $h = 0.5$, and the maximum bandwidth is obtained for $h = 0.5$. Increasing the number of processors increases the number of requests, resulting in a higher bandwidth. The increase in bandwidth is not significant for $N > 6$ because the memories are continuously busy and an increased number of requests from the processors cannot be satisfied with only two memories. The number of memories has to be increased to achieve an increase in the bandwidth with an increasing number of processors.
Fig. 6. Bandwidth vs. hot spot probability for systems with N processors, two memories, two busses and r = 1.
Bandwidth for a system having 10 processors, 10 memories, and five busses is shown in Fig. 7 as a function of the request rate for different hot spot probabilities. For a uniform memory request pattern ($h = 0.1$), the bandwidth increases with an increase in the processor request rate. The bandwidth is limited to five by the five busses used. Doubling the number of busses, thereby making the system a crossbar, would only increase the bandwidth by 5% for the case of $h = 0.1$ [35]. As the hot spot probability increases, the maximum achievable bandwidth decreases due to increased contention at the HM. For example, the bandwidth for $h = 0.5$ is limited to approximately two due to memory contention at the HM. For $h = 1$, all memory requests are directed to the HM, and the bandwidth is limited to one.
Unbuffered multiple-bus systems, where the rejected requests are simply dropped, have been studied in the presence of a hot spot request pattern in Ref. [34]. A comparison of the bandwidths of unbuffered and buffered multiple-bus systems is shown in Fig. 8 for a system having 10 processors, 10 memories, and five busses. The buffered system has a higher bandwidth only in the case of a uniform memory request pattern ($h = 0.1$). As the hot spot probability increases, the bandwidth of the buffered system falls rapidly. The reason is that processors in a buffered system remain blocked until they receive service from the memory modules. Blocked processors cannot generate requests, resulting in a much lower effective request rate. For $h > 0.2$, the queue lengths become large, resulting in a sharp drop in the bandwidth. Since requests are discarded in an unbuffered system, the effective request rate in such a system is the same as the static request rate $r$. On the other hand, the effective request rate of a buffered system is given by $\mathrm{BW}/N$. For example, we find in Fig. 8 that a buffered system has an effective request rate of $2.8/10 = 0.28$ for $h = 0.8$ and a static request rate of 1.0. This illustrates the difference between the static and effective request rates for a buffered system, and hence accounts for the lower bandwidth of a buffered system when compared to an unbuffered system having the same static request rate. Due to the dropping of requests, an unbuffered system in fact gives an optimistic view of the memory bandwidth.
Fig. 7. Bandwidth vs. request rate for a system with 10 processors, 10 memories and five busses.
To show that a buffered system has a lower bandwidth than an unbuffered system due to a reduction in the effective request rate of a buffered system, we have compared the bandwidths of a buffered and an unbuffered system for low hot spot probabilities and low request rates in Fig. 9. It is seen that for uniform traffic ($h = 0.1$), the bandwidth of a buffered system is always higher than that of an unbuffered system. For hot spot probabilities of 0.2 and 0.3, the buffered system is better than an unbuffered system when the request rate is less than 0.63 and 0.35, respectively. Since a higher request rate and/or a higher hot spot probability results in blocking in a buffered system (thereby resulting in a drop in the effective request rate), the bandwidth of a buffered system in such a case is lower than that of an unbuffered system. The analytical model presented in Section 3 has been validated with results obtained from a stochastic simulator. The simulator was driven by a hot spot memory request pattern, and the number of memories accessed per memory cycle was observed for 50,000 memory cycles, the average of which gave the bandwidth. Results obtained from the analytical model were found to be in close agreement with the simulation results.
Conclusions
Analytical modeling permits a fast and inexpensive method to evaluate the performance of multiprocessor systems. It allows the designer to choose appropriate system parameters at the design stage. The model helps in choosing the number of memory modules required to achieve a given bandwidth for a given number of processors and processor request rates. Most of the previous models of multiple-bus systems are based on the assumption of a uniformly distributed memory request pattern. We have developed Markov chain-based analytical models to evaluate the performance of a multiple-bus system in the presence of a single memory hot spot in the system. The average memory bandwidth was taken as the performance measure. We have also removed the assumption of temporal independence of requests used by most researchers. We have illustrated that the number of states in a Markov chain model increases rapidly when the number of processors and memories is large in a multiple-bus system having a non-uniform request pattern. Computer simulations have been used to validate our analytical models. The proposed model can be used to evaluate the performance of multiple-bus systems in the presence of other non-uniform traffic, such as the favorite memory and hierarchical request patterns. The bandwidth of a buffered system has been compared with that of an unbuffered system in the presence of a non-uniform request pattern. Since processors with outstanding requests in a buffered system cannot generate a new request, it was shown that the effective request rate decreases drastically from the static request rate. In contrast, since the rejected requests are simply dropped in the case of an unbuffered system, the effective request rate is the same as the static request rate in such a system.
Mohammed Azhar Sayeed is working as a Senior Product Manager, Unified Communications in the ITD (Internet Technologies Division) with Cisco Systems, responsible for product management and rollout of unified communications solutions in Cisco software. Prior to working for Cisco he worked at Cabletron Systems as a Senior Product Manager and later as an ATM Marketing Manager. He started his networking career as a field service engineer working on installations in X.25 and Frame Relay. He has over nine years of experience in the networking industry and has designed, implemented and troubleshot LANs and WANs using multivendor gear. Prior to working with Cabletron he was at Digital Equipment Corporation as an ATM Technical Marketing Engineer (Aviator). He has represented Digital, Cabletron and now Cisco as a speaker at major conferences such as ATM Year, Integrated Broadband Networks and Comdex. His research interests include QoS for IP telephony, Interdomain QoS and IP telephony architecture and protocols. He can be reached at asayeed@cisco.com
