A queueing model for performance evaluation of cluster-based multiprocessors is proposed in this paper. Most system components are modelled as M/D/1/L queues to capture deterministic service time and finite buffer behavior. Various subsystems are analyzed independently and then integrated for the system-level analysis. Average delay, throughput, and processor utilization are the performance parameters studied in this analysis. The analytical results are first validated via simulation. Next, several design alternatives are discussed using the model. These include the effect of buffer length and identification of bottleneck centers for various design configurations.
I. INTRODUCTION
Cluster-based multiprocessors, also known as hierarchical systems, are designed to reduce the network complexity by incorporating hierarchies of interconnection networks [1] [2] [3]. These systems take advantage of the locality of reference exhibited by many programs. Hierarchical multiprocessors offer several advantages over single-level designs and provide various design alternatives [4]. The design of such a system is quite complex and needs careful study of many interacting performance parameters. A performance model is a useful tool for exploring the design space and examining various parameters. This paper reports a simple, yet powerful performance evaluation model for shared-memory cluster-based multiprocessors.
Performance evaluation of cluster-based multiprocessors has been studied by a few researchers [4, 5]. These models assume infinite buffer capacity and/or an exponential service time distribution. Modelling of the interconnection network (IN) is an essential part of multiprocessor analysis. Normally, three types of INs have been used for the shared-memory design: bus, crossbar, and multistage interconnection network (MIN). Different techniques for the performance evaluation of these INs are summarized in [6]. The MIN analysis, being more complex than that of a bus or crossbar, has drawn more attention. Performance of MINs with infinite buffers is analyzed in [7, 8]. Probabilistic analysis of finite-buffered networks has been studied in [9, 10] assuming a synchronous IN. A queueing model for finite-buffered MINs is developed in [11] assuming exponential service time and non-blocking capability.
The clusters in hierarchical systems operate independently and the inputs to the network or the memory are asynchronous in nature. This paper thus attempts to model the asynchronous behavior through queueing analysis. A detailed queueing network model of the complete system consisting of all the processors, memories, and network components could be prohibitive even for a small multiprocessor. A decomposition approach seems a natural choice to keep the analysis tractable. We use a hierarchical decomposition technique to model a two-level cluster-based system. The system is modelled as a queueing network where each subsystem is represented as a service/delay center. The subsystems are analyzed independently and then integrated to form the system-level model. The salient features of the proposed model are highlighted as follows.
The interconnection networks are modeled as finite-buffered service centers to reflect the practical system behavior.
Deterministic service time is considered for elements of the IN and the memory. A packet is blocked at a center due to the unavailability of buffer space at the destination center.
The interdeparture rate of one center affects the arrival rate of the next center in a finite-buffered queueing network.
The performance parameters discussed here are average delay, throughput and utilization. Due to the approximations included in the analysis, a simulation study is conducted to validate the model. System behavior with respect to different workloads and design constraints is examined. Contrary to previous results [8], it is shown that the length of the buffer could have a significant effect on the system performance. Large buffers are shown to be suitable for throughput-oriented systems, whereas smaller buffers should be used where response time is of high priority. The model can be used for identifying the bottleneck center by analyzing the utilization of various components.
The rest of the paper is organized as follows. The system architecture is described in Section II. In Section III, the model assumptions and system decomposition are presented. Queueing analyses for various subsystems are presented in Section IV. In Section V, integration of the decomposed subsystems is described. Numerical results from the analysis and simulation are given in Section VI, followed by the conclusions in Section VII.
II. SYSTEM DESCRIPTION
A generic organization of a two-level cluster-based shared-memory system is shown in Figure 1. The clusters of processing elements (PEs) form the first level and the connection of the clusters through a global network (GN) constitutes the second level. The depicted structure is an ($N \times N$) multiprocessor designed using $K$ clusters. Each cluster has $n$ PEs. A PE consists of a processor, a memory module (MM) and a processor node controller (PNC). A memory module is called the private memory of the processor present in the same node. For a particular processor, the MMs of other PEs of the same cluster are called local memories (LMs), and the MMs in other clusters are called global memories (GMs). The PNC is responsible for handling the requests from the local and global memories. Each cluster is connected to the GN by a bus which is shared by all the PEs of the cluster. This bus, termed the cluster-to-global bus (CGB), can be accessed directly by the PEs of a cluster without going through the local network (LN). A request through the GN goes to any MM via a bus which is referred to as the global-to-cluster bus (GCB). Each cluster is thus associated with two buses as shown in Figure 1: the cluster-to-global bus and the global-to-cluster bus. This type of architecture is also studied in [4].
A processor can access its private memory, local memory (MMs of its cluster) or global memory (MMs of other clusters). An access to the private memory does not go through any network. A local memory reference goes through the LN. A global memory request is first transmitted to the cluster-to-global bus which connects the cluster to the GN. Then, it passes through the GN and the global-to-cluster bus, and finally reaches the destination MM. After memory service, an acknowledgement is sent to the requesting PE through a return path, which includes the CGB, GN, and GCB.
III. SYSTEM LEVEL MODEL
The key concept in modeling the system is hierarchical decomposition: the process of splitting the system model into smaller submodels, each of which is analyzed in isolation. The solution of the original model is formed by combining the submodels, taking into account the dependency of various parameters. We consider packet-switching communication where all packets are of fixed length. The terms request and packet are used interchangeably throughout this paper. The model is based on the following assumptions.
(i) Each processor generates packets independently at a rate $\lambda$ and the intermessage times are exponentially distributed.
(ii) A request could be directed to one of the $n$ MMs of its own cluster (local request) or to a global MM. A local memory request is uniformly distributed among the local MMs and, similarly, a global memory request is uniformly distributed among the global MMs.
(iii) A conflict occurs when two or more packets are routed to the same port. Conflicts are randomly resolved by allowing only one packet to move to the destined port if there is buffer space.
(iv) If a request generated by a processor finds the buffer at the first service center full, then the packet is rejected. A packet is never lost in the network.
A. Overall System Model
A request generated by a PE traverses its own path while interfering with the traffic due to the requests generated by the other PEs. However, under the uniform traffic load, all the paths are statistically indistinguishable. The system behavior can therefore be obtained by modelling any one path.
The system is represented as a network of queues as shown in Figure 2, where the path of a packet through various queueing centers is illustrated. A request from a processor accesses the global memory with a probability $g$, and with a probability $(1 - g)$ the request is directed to the local memory. A global memory request or acknowledgement first accesses the CGB as shown in the figure. If the buffer at the CGB is full, all the arriving requests are rejected. In Figure 2, $\lambda_0$ represents additional requests to the CGB from other PEs of the same cluster. $\lambda_1$ reflects the traffic into the GN from other clusters. The effect of the GN is captured by representing it as a delay center. $\lambda_2$ denotes the additional packets in the path which are not directed to the particular global MM with respect to the request under consideration.
We have not shown the model for the global-to-cluster bus in Figure 2. The input to a global-to-cluster bus will be from the output of a queueing center of the GN. There can be only one packet coming into this bus in a cycle. A packet from the global-to-cluster bus is transmitted to an MM. Thus, the service time of the GCB can be merged with that of the GN subsystem. The traffic at an MM comes from three sources: the private processor (processor of the same node), local PEs, and PEs of other clusters (global requests). $\lambda_3$ represents the additional packets from the private processor and the local PEs. $\lambda_4$ represents the acknowledgements to the private processor and the local PEs ($\lambda_3 = \lambda_4$). A request to the global MM could be a read or a write. The memory sends the requested packet for a read request or an acknowledgement for a write. For the return path, $\lambda_0$, $\lambda_1$, and $\lambda_2$ are the same as described earlier.
A request to one of the local MMs goes through the LN as shown in Figure 2. In the return path, it again accesses the LN to get back to the node which originated the request. The additional inputs at the LN, $\lambda_5$, are due to the requests or acknowledgements generated by other PEs of the same cluster. $\lambda_6$ represents the requests or acknowledgements directed to the other MMs or PEs (except the one under consideration). At the local MM, there could be requests from the private processor and from other clusters, which are denoted as $\lambda_7$. The acknowledgements for the private and global requests are identified as $\lambda_8$, which is quantitatively equal to $\lambda_7$.
Let $D_l$ and $D_g$ represent the average delay due to local memory access and global memory access, respectively. Then the average delay $D$ for a request completion is given by
$$D = g D_g + (1 - g) D_l \tag{1}$$
B. System Decomposition
The system queueing model depicted in Figure 2 consists of four major subsystems: the LN, CGB, GN and memory. The LN is usually a crossbar and so there is no contention in the network. The delay is only due to the time taken for message transfer. Although we have neglected this fixed delay, it can be included in the LN subsystem. Local requests thus incur delay at the memory subsystem only. Bus or MIN-based LNs can be modelled as described later for the GN subsystem. For the GN, we need to model a bus or a MIN subsystem. Message transfer time in a crossbar GN can be included as a fixed delay in the GN subsystem. In summary, we need to model the following three subsystems: MIN, bus, and the memory.
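As a small illustration of equation (1), the following Python sketch combines a local and a global delay into the average request delay. The function name and the numeric delays are hypothetical and chosen only for illustration; they are not results from this paper.

```python
def average_delay(g: float, d_global: float, d_local: float) -> float:
    """Equation (1): D = g * D_g + (1 - g) * D_l."""
    if not 0.0 <= g <= 1.0:
        raise ValueError("g must be a probability in [0, 1]")
    return g * d_global + (1.0 - g) * d_local


if __name__ == "__main__":
    # Made-up delays (in cycles), only to show how the weighting behaves.
    for g in (0.1, 0.2, 0.5):
        print(g, average_delay(g, d_global=20.0, d_local=5.0))
```

Because the global path crosses the CGB, GN and GCB, $D_g$ is normally the larger term, so the average delay grows with the global access probability $g$.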
IV. QUEUEING MODELS FOR THE SUBSYSTEMS
Buses and the switching elements (SEs) of a MIN have finite-length buffers. Service times of these components are assumed deterministic (fixed). Hence, they are modeled as M/D/1/L queueing centers (exponential interarrival time, deterministic service time, single server, and finite-length buffer). A survey of queueing networks with blocking can be found in [12], and the detailed analysis of an M/D/1/L queue is reported in [13]. Here, we directly state the results required for this study.
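To make the M/D/1/L centre concrete, the sketch below simulates a single such queue in Python and estimates its blocking probability and mean delay. It is an illustrative simulation, not the analytical solution of [13]. It assumes that $L$ counts every packet at the centre, including the one in service, and that an arrival finding the centre full is rejected; the function name and parameter values are hypothetical.

```python
import random
from collections import deque


def simulate_md1l(lam, d, L, n_arrivals=200_000, seed=1):
    """Simulate one M/D/1/L centre.

    Assumption (not from the paper): L counts all packets at the centre,
    including the one in service; an arrival that finds L packets is rejected.
    Returns (blocking probability, mean delay of accepted packets).
    """
    rng = random.Random(seed)
    now = 0.0
    departures = deque()  # departure times of packets still at the centre (FIFO)
    blocked = 0
    accepted = 0
    total_delay = 0.0
    for _ in range(n_arrivals):
        now += rng.expovariate(lam)      # exponential interarrival time
        while departures and departures[0] <= now:
            departures.popleft()         # drop packets that have already left
        if len(departures) >= L:
            blocked += 1                 # centre full: reject the arrival
            continue
        # Service starts when the previous packet leaves, or immediately if idle.
        start = departures[-1] if departures else now
        finish = start + d               # deterministic service time d
        departures.append(finish)
        total_delay += finish - now
        accepted += 1
    return blocked / n_arrivals, total_delay / accepted


if __name__ == "__main__":
    for L in (1, 4, 20):
        p_block, delay = simulate_md1l(lam=0.8, d=1.0, L=L)
        print(f"L={L:2d}  blocking={p_block:.4f}  mean delay={delay:.3f}")
```

Under these assumptions, a longer buffer rejects fewer packets but lets accepted packets wait longer, which is the trade-off between throughput and response time examined later in the paper.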
Let $\lambda$ be the arrival rate at a queueing center. The traffic intensity, $\rho$, is equal to $\lambda d$, where $d$ is the service time. Let $L$ be the buffer length. The probability that there are $k$ customers at the center, denoted as $p_k^{(L)}$, is given as
where $x$ denotes the probability that the buffer is full (blocking probability), and is given as
where $L_s$ is the buffer length of the SEs, and $d_s$ is the switch service time. The blocking probabilities, $x_i$, are obtained using equation (3).
The above expression is used to compute $\lambda_i$ for $i = 1$ to $s$. The blocking at a stage affects the arrival rate at its preceding stage. Hence, equation (5) is solved iteratively until the network reaches a steady state ($\lambda_1 = \lambda_2 = \cdots = \lambda_s$). After each iteration, we start with a new value of $\lambda_1$ equal to the $\lambda_s$ obtained in the previous iteration. The average time spent at the $i$th stage can be obtained from equation (4), and the average delay for a packet is obtained by summing up the delays of all the stages.
B. Bus Analysis
The bus is modeled as an M/D/1/L queue. The steady state probabilities, p (6) where $\lambda_b$ is the traffic input rate to the bus, and $d_b$ is the bus service time.
C. Memory Analysis
Inputs to a memory come from the $n$ PEs of the cluster and from the non-local PEs (global requests) via the GN. Under the finite buffer assumption, when the memory buffer gets full, incoming packets generated by the processors of the same cluster are rejected, and the global requests are blocked in the GN. A blocked packet in the GN affects the packets in the CGB, which in turn affects the request generation rate from the PNC. This chain reaction is difficult to capture in a model without sacrificing simplicity. All the reported models therefore assume infinite buffer capacity at the PNC or memory queue. In order to keep the model tractable, we assume that the memory buffer is large enough to store any number of requests.
The requests to the memory from the GN do not have an exponential interarrival time distribution due to the M/D/1/L queues as discussed earlier. It is extremely difficult to characterize the input process to the memory modules. We have assumed the request interarrival time at an MM to be exponentially distributed. The validity of this assumption lies in the fact that the number of requests to an MM from the local processors is higher than the number of global requests, and thus dominates the arrival pattern.
A memory module can now be modelled as an M/D/1 service center. Let $\lambda_m$ be the total arrival rate to a memory module. Each processor of a cluster generates local requests at a rate $(1 - g)\lambda$. There are $n$ PEs in a cluster and the total generation rate is $n(1 - g)\lambda$. The request rate to any one of the $n$ MMs from local processors is $n(1 - g)\lambda$ (
The individually modelled subsystems need to be combined to reconstruct the system-level model. Interdependence of traffic flow must be taken into account while combining the subsystems. In essence, one needs to determine the request arrival rate at each of the subsystems.
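For the M/D/1 memory centre introduced above, the mean delay can be sketched with the standard Pollaczek-Khinchine result for deterministic service, in which the mean waiting time is $\rho d / (2(1 - \rho))$. The Python sketch below is an illustration under that textbook formula; the symbol `d_m` for the memory service time and the example values are assumptions, not notation or data taken from this paper.

```python
def md1_mean_delay(lam_m: float, d_m: float) -> float:
    """Mean time (waiting + service) at an M/D/1 centre.

    Pollaczek-Khinchine for deterministic service:
    W_q = rho * d / (2 * (1 - rho)), with rho = lam_m * d_m < 1.
    d_m is a hypothetical name for the memory service time.
    """
    rho = lam_m * d_m
    if lam_m < 0 or d_m <= 0 or rho >= 1.0:
        raise ValueError("need lam_m >= 0, d_m > 0 and lam_m * d_m < 1")
    return d_m + rho * d_m / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    # Illustrative values only: delay grows sharply as rho approaches 1.
    for lam_m in (0.1, 0.5, 0.9):
        print(lam_m, md1_mean_delay(lam_m, d_m=1.0))
```

Since the memory buffer is assumed unbounded, the centre is stable only while $\lambda_m d_m < 1$; the function raises an error outside that range rather than returning a meaningless value.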
The request generation rate of a processor is $\lambda$ requests/cycle. A local memory access has no contention in the crossbar LN and is not blocked because of the infinite buffer assumption in the MMs. A global request is rejected if the CGB buffer is full. The arrival rate at a CGB is affected by the newly generated global requests as well as the acknowledgements from the local MMs. A CGB serves $n$ PEs (of a single cluster), each of which generates global requests at a rate $g\lambda$ requests/cycle. The total global request rate from a single cluster is equal to $ng\lambda$. Global requests to a particular cluster can be from the remaining $(K - 1)$ clusters, which generate global requests at a rate $(K - 1)ng\lambda$. With a probability ( = $ng\lambda$. The arrival rate at a CGB, $\lambda_b$, is the sum of the global request rate and the acknowledgement rate, and is equal to $2ng\lambda$.
Traffic arrival rate at the GN depends upon the departure rate from the cluster buses. Equation (6) is used to compute $\lambda_o$, the output rate from the CGB. The arrival rate at an input port of a GN, denoted as $\lambda_{gn}$, is equal to $K\lambda_o$ for a bus-based GN, and $\lambda_o$ for a crossbar or MIN-based GN.
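The traffic bookkeeping for the CGB described above can be written out as a short Python sketch. It only restates the rates given in the text ($ng\lambda$ per cluster, $(K - 1)ng\lambda$ from the other clusters, and $\lambda_b = 2ng\lambda$); the function name, dictionary keys and example configuration are hypothetical.

```python
def cluster_traffic(n: int, K: int, g: float, lam: float) -> dict:
    """Unadjusted traffic rates (requests/cycle) seen around one CGB."""
    per_cluster_global = n * g * lam              # global requests leaving one cluster
    from_other_clusters = (K - 1) * n * g * lam   # generated by the remaining K-1 clusters
    cgb_arrival = 2 * per_cluster_global          # requests plus acknowledgements
    return {
        "per_cluster_global": per_cluster_global,
        "from_other_clusters": from_other_clusters,
        "cgb_arrival": cgb_arrival,
    }


if __name__ == "__main__":
    # Illustrative configuration: 16 clusters of 16 PEs (a 256-PE system).
    print(cluster_traffic(n=16, K=16, g=0.2, lam=0.05))
```

These are the rates before any correction for blocking; the adjusted rates follow from the iterative procedure described in the text.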
A packet is blocked at the CGB if the buffer at the entry to the GN is full. The probability that the buffer in the GN is full is p This new value $\lambda'_{gn}$ is the actual input rate to the GN. To have an effective input rate of $\lambda'_{gn}$ to the GN, the output rate of the CGB, $\lambda_o$, must also be adjusted to a new value $\lambda'_o$. $\lambda'_o$ is equal to $\lambda'_{gn}$ for a MIN-based GN, and $\lambda'_{gn}/K$ for a bus-based GN. In order to have an interdeparture rate of $\lambda'_o$, the input rate to the CGB needs to be modified. We get the adjusted input rate $\lambda'_b$ to the CGB from equation (6)
The system delay $D$ is obtained from equation (1). System throughput is equal to the average number of request completions per cycle. The number of jobs served at an MM in a cycle ($\lambda_m$) indicates the throughput per PE, or the normalized throughput. Let $X$ and $SX$ denote the normalized throughput and the system throughput, respectively. These are expressed as $X = \lambda_m = \lambda'$ and $SX = N X = N \lambda'$.
VI. RESULTS AND DISCUSSION
A. Model Validation
The modelling technique described in the previous sections has introduced a few approximations at certain stages of the analysis to preserve simplicity. In order to validate the technique and justify the approximations, the system was simulated. Requests are generated randomly by each processor with exponentially distributed interarrival times with a mean of $1/\lambda$. The destination MM is determined by using a uniform random number generator. Each packet is time-stamped after its generation. The request completion time is checked to compute the delay. Throughput is obtained by counting the request completions per cycle. The average number of request completions per cycle per processor gives the normalized throughput. The simulations were run until the 95% confidence interval was within 3% of the mean.
A performance model is useful not only to quantify a set of parameters for a given configuration, but also to investigate the effect of different parameters on system performance. The following discussions illustrate these concepts.
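The stopping rule used for the simulations (a 95% confidence interval within 3% of the mean) can be sketched in Python as follows. The paper does not say how the interval was computed, so this sketch assumes independent replications and a normal approximation with $z = 1.96$, and reads "within 3%" as a half-width of at most 3% of the sample mean. All function names and the stand-in replication are hypothetical.

```python
import math
import random
import statistics


def ci_half_width(samples, z=1.96):
    """Half-width of a normal-approximation confidence interval for the mean."""
    return z * statistics.stdev(samples) / math.sqrt(len(samples))


def precise_enough(samples, rel_tol=0.03, z=1.96):
    """True when the CI half-width is within rel_tol of the sample mean."""
    if len(samples) < 2:
        return False
    mean = statistics.fmean(samples)
    return mean != 0 and ci_half_width(samples, z) <= rel_tol * abs(mean)


def run_until_precise(replicate, min_runs=5, max_runs=1000):
    """Call replicate() until the stopping rule holds; return (mean, runs)."""
    samples = []
    while len(samples) < max_runs:
        samples.append(replicate())
        if len(samples) >= min_runs and precise_enough(samples):
            break
    return statistics.fmean(samples), len(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for one simulation run: a noisy delay measurement (made up).
    mean, runs = run_until_precise(lambda: rng.gauss(10.0, 1.0))
    print(f"mean={mean:.3f} after {runs} replications")
```

In a real study, `replicate` would run one independent simulation of the multiprocessor and return the measured delay or throughput.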
The effect of buffer length on delay and throughput of a ($256 \times 256$) system is depicted in Figure 4. It is mentioned in [8] that a small buffer length shows performance equivalent to an infinite buffer. It can be inferred from Figures 4(a) and 4(b) that this is true only when the input load is light. Under light traffic, i.e. for $\lambda \le 0.05$, a finite-buffered MIN with $L \ge 4$ mimics the performance of infinite-buffered MINs. Under heavy traffic, delay and throughput keep varying until the buffer length is considerably larger. The model can be used to determine the minimum buffer size needed to achieve performance equivalent to the infinite buffer case. For example, the minimum buffer length required to mimic the performance of an infinite buffer is approximately 20 at $\lambda = 0.1$.
Designers have the choice to increase or decrease the buffer length of various subsystems. This can be decided by comparing the relative utilization of various subsystems. Relative utilization can be obtained with respect to the utilization of the processor (considered as 1.0). If the traffic rate at a subsystem is $\lambda$ and the service time is $d$, the utilization $U$ is given as $U = \lambda d$. Using this formula, relative utilizations of various subsystems for two multiprocessor configurations are plotted in Figures 5(a) and 5(b), respectively. It is inferred from Figure 5(a) that the global bus is relatively over-utilized, as expected. The system throughput can thus be improved by increasing the buffer size of the global bus. Furthermore, as the global bus could be the bottleneck at high input load, a multiple-bus implementation may be required to improve the system performance. Increasing the buffer size of the CGB will have little effect on the performance as it is relatively under-utilized. Similarly, from Figure 5(b), one can deduce that the buffer length of the CGB and/or the global switch can be increased to improve the throughput. Here, the CGB and the MIN are the potential bottleneck centers. A multiple-path implementation from the LN to the GN can be used to alleviate this problem, as has been done in the Cedar design.
Performance of a cluster-based multiprocessor is dependent on several interrelated parameters and various constraints. It is extremely difficult to come up with an optimal design satisfying all requirements. The proposed model can be used to decide the trade-offs depending upon the priorities of various performance parameters. For throughput-oriented systems, large buffers should be used, and smaller buffers should be used where response time is of high priority (Figure 4). Increasing the buffer length at a bottleneck center improves the performance of the system. The system bottleneck can be pinned down by analyzing the relative utilizations (Figure 5).
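The bottleneck search based on $U = \lambda d$ can be sketched in a few lines of Python. The subsystem names, traffic rates and service times below are made-up inputs used only to show the calculation; they are not the configurations plotted in Figure 5.

```python
def utilizations(rates: dict, service_times: dict) -> dict:
    """U = lambda * d for every subsystem named in rates."""
    return {name: rates[name] * service_times[name] for name in rates}


def bottleneck(util: dict) -> str:
    """Name of the subsystem with the highest utilization."""
    return max(util, key=util.get)


if __name__ == "__main__":
    # Hypothetical traffic rates (requests/cycle) and service times (cycles).
    rates = {"CGB": 0.32, "GN": 0.32, "memory": 0.05}
    service = {"CGB": 1.0, "GN": 2.0, "memory": 4.0}
    util = utilizations(rates, service)
    for name, u in sorted(util.items(), key=lambda kv: -kv[1]):
        print(f"{name:7s} U = {u:.2f}")
    print("bottleneck:", bottleneck(util))
```

The centre with the highest utilization is the first candidate for a longer buffer or for replication, in line with the discussion above.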
VII. CONCLUSIONS
A performance model is an essential tool for predicting the behavior of a system. It can also be used to analyze intricate details and various design optimization issues. One such model is presented in this paper for predicting the performance of two-level cluster-based multiprocessors. System throughput, average delay, and processor utilization are computed using this model. The analysis captures the effect of finite buffers and deterministic service time on system performance. The novelty of the approach lies in the aggregation of subsystems by considering the interdependence of parameters. The effect of buffer length on the system performance has been analyzed. Various design alternatives are suggested based on performance requirements.
The proposed model could be extended for multi-level hierarchical designs and other types of hierarchical architectures. A complete system model of this nature provides better insight to design a well-balanced system as compared to an analysis of the network only.
