Abstract-Exact results are given for the processing power in a multibus multiprocessor with constant memory cycle times and [...] Nets (GTPN), that we have developed. We describe the GTPN and how it is applied to the multiprocessor interference question. We reach several new conclusions. A commonly used definition of processing power can lead to substantial underestimation of the true processing power of the system. If the real system has a constant memory access time and any number of buses, then assuming an exponential access time can lead to substantial errors when estimating processing power probability distributions. In multibus systems with only a few buses a critical memory interrequest time exists. Performance close to that with a crossbar is attainable when the interrequest time is larger than the critical value. Obtaining these results illustrates the advantages, for moderate size state spaces, of the GTPN over simulation with respect to both model design and running time.

Index Terms-Bus contention, Markov models, memory interference, multiprocessors, performance comparison, performance evaluation, Petri nets.

I. INTRODUCTION

[...] 4) a not necessarily uniform distribution of accesses across the memory modules.

[...] diagram (i.e., discrete time Markov chains). Previous researchers [3], [4] have concluded that this approach is computationally infeasible except for very small systems. We demonstrate that this approach is feasible for useful size systems if the Markov chain is properly formulated. In particular, we derive results for several models (with up to 16 processors and 16 memories) with the four properties listed above.

To aid in the definition and generation of the Markov chain we used a high-level description based on Petri nets. Petri nets are a graph model of computation [5]. Modifying Petri nets so that time is represented has recently been an active research area. We have generalized the method of representing time that has been suggested by Zuberek [6] and Razouk and Phelps [7]. We call our version of Petri nets Generalized Timed Petri Nets (GTPN's) [8], [9]. The [...] [8], [9]. In Section IV we present our multiprocessor model and results.
Section V summarizes the important contributions of our work and suggests some directions for future research.

II. BACKGROUND

A. Multiprocessor Characteristics

Fig. 1 illustrates the multiprocessor systems we consider. The shared memory is divided into independent modules, each of which permits only one access at a time. The processors are connected to the memory modules through a single-stage multibus (in contrast to multiple-stage networks such as banyan networks).

The process associated with each processor can be in three [...] or waiting for a memory). Processor productivity is the probability that a typical process is doing productive work (executing on its processor or accessing a memory, not waiting for a memory). Memory utilization summed over all memories is the expected number of busy memory modules or the effective memory bandwidth. Processor utilization summed over all processors is sometimes called processing power [10]-[13]. Effective memory bandwidth and processing power (as defined above) are the main performance estimates obtained in the studies cited below.

[...] model are:

1) all of the processors are statistically identical
2) the memory access time (memory cycle time) is a constant
3) MRP is 1, i.e., a process is never actively executing on its own processor. When its current request is satisfied, it always generates another request at the start of the next cycle.
4) the interconnection network is a crossbar
5) requests made by each processor have an independent and equal probability of being directed to any one of the modules, i.e., uniform memory access probabilities.

In spite of these simplifying assumptions, the state space of his Markov chain grows rapidly. For [...] uniform access probabilities, a crossbar, and a MRP of 1. One logical change is to consider to what extent performance is degraded due to having fewer buses than in a crossbar. Towsley [3] gives approximate solutions and simulation values for this case. Alternatively, a crossbar could be assumed and a MRP less than one considered. This is a reasonable change because presumably each processor has some local memory or a cache [...]. Baskett and Smith [16], Rau [17], Yen, Patel, and Davidson [4], and Towsley [3] give approximate solutions for this case.
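The five assumptions listed above can be illustrated with a small Monte Carlo sketch. This is a simulation, not the exact Markov chain analysis used in the studies cited, and the function name `simulate_bandwidth`, the cycle count, and the seed are hypothetical choices made only for illustration. It estimates effective memory bandwidth as the mean number of busy modules per cycle.

```python
import random

def simulate_bandwidth(n_procs, n_mems, cycles=100_000, seed=1):
    """Estimate effective memory bandwidth (mean busy modules per cycle).

    Assumes identical processors, a constant one-cycle access time,
    MRP = 1, a crossbar, and uniform access probabilities.
    """
    rng = random.Random(seed)
    # queue[m] = number of processors whose current request targets module m.
    queue = [0] * n_mems
    for _ in range(n_procs):
        queue[rng.randrange(n_mems)] += 1  # every processor always has a request
    busy_total = 0
    for _ in range(cycles):
        served = [m for m in range(n_mems) if queue[m] > 0]
        busy_total += len(served)  # crossbar: every requested module is busy
        for m in served:
            queue[m] -= 1  # each module completes exactly one access per cycle
        for _ in served:
            # The served processor issues a new uniform request for the next cycle.
            queue[rng.randrange(n_mems)] += 1
    return busy_total / cycles

if __name__ == "__main__":
    print(simulate_bandwidth(2, 2))
    print(simulate_bandwidth(8, 8))
```

For 2 processors and 2 memories the underlying two-state chain can be solved by hand and gives a bandwidth of 1.5, which the estimate should approach as the number of cycles grows.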
More recent studies combine these two changes, i.e., [...] [25], and Marsan, Balbo, and Conte [10]. Several of these studies [25], [13], [10] are of special interest because they use a form of Petri nets, called stochastic Petri nets, to derive their continuous time Markov chains. We defer further discussion until Section IV. Mudge and Al-Sadoun [26] is an approximate solution that allows the memory access time to be any discrete time random variable that has first and second moments.

The last group of studies consider nonuniform access probabilities. All assume constant cycle time and a crossbar. [...]

In Section IV we develop a GTPN model of multiprocessors which can be modified easily (primarily by changing a few parameters) to reflect the various assumptions made in the above studies. We will compare the performance estimates obtained from exact analysis of the GTPN with some of the results cited in this section. We will also use the GTPN model to obtain results not previously reported. First we introduce the GTPN.

III. THE GTPN MODEL

This section describes the Generalized Timed Petri Net (GTPN). The GTPN belongs to the class of deterministic t-timed Petri nets [33], [34], and removes restrictions on the net in earlier methods for analyzing performance [6], [7]. The GTPN model was introduced in [8] and described more completely in [9]. The reader is referred to both references for further details.

A. The Net

A GTPN is a Petri net which includes: 1) a deterministic firing duration associated with each transition, 2) a mechanism for specifying next state probabilities for conflicting transitions, and 3) a set of named resources associated with each transition which are used to calculate performance estimates. Thus, the GTPN is formally defined by the [...]

$F: T \times S \to \mathbb{R}^{+} \cup \{0\}$ (firing frequencies)
$C: T \to \{\text{yes}, \text{no}\}$ (CntComb Boolean flags)
$R: P \cup T \to \mathcal{P}(\{r_1, r_2, \ldots, r_k\})$ (resources).

The first four components of the tuple are identical to the constructs in an untimed Petri net (see [35] for more details). Important properties of the remaining four components are summarized below.

Petri nets are often illustrated graphically. Fig. 2 [...] value one if at least one firing of that transition is in progress in the current state and is otherwise zero), and arithmetic, relational, and logical operators. Thus, a transition's firing duration and frequency can be state-dependent, but for a given state they are deterministic. The CntComb Boolean flag is used in computing state probabilities as explained later. The set of resources is used to compute performance measures as explained later.

B. Next States and Their Probabilities

The multiplicity of an input place is the number of arcs from that place to that transition. An output place's multiplicity is defined analogously. N enablings of a transition exist if each of its input places contains a number of tokens equal to at least N times its multiplicity.

An enabling of a transition can start firing by removing from each input place a number of tokens equal to its multiplicity. After start firing, the firing is in progress until end firing. While the firing is in progress, the time to end firing, called the remaining firing time (RFT), decreases from the transition's firing duration to zero. At end firing a number of tokens equal to its multiplicity is put on each output place.

A marking and the set of current RFT's define a state of the net. Given a particular state, the basic rule for finding the [...] depends on the number of tokens on P2.
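The enabling, start firing, and end firing rules described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Transition` class, the dictionary-based marking, and the function names are hypothetical, and the sketch assumes every transition has at least one input place.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    inputs: dict     # place name -> multiplicity (arcs from place to transition)
    outputs: dict    # place name -> multiplicity (arcs from transition to place)
    duration: float  # deterministic firing duration

def enablings(t, marking):
    """Largest N such that each input place holds at least N * multiplicity tokens."""
    return min(marking.get(p, 0) // m for p, m in t.inputs.items())

def start_firing(t, marking):
    """Start one enabling: remove input tokens, return (new marking, initial RFT)."""
    if enablings(t, marking) < 1:
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for p, m in t.inputs.items():
        new_marking[p] -= m
    return new_marking, t.duration  # the RFT starts at the firing duration

def end_firing(t, marking):
    """End one firing: put multiplicity-many tokens on each output place."""
    new_marking = dict(marking)
    for p, m in t.outputs.items():
        new_marking[p] = new_marking.get(p, 0) + m
    return new_marking

if __name__ == "__main__":
    t = Transition(inputs={"P1": 2, "P2": 1}, outputs={"P3": 1}, duration=1.0)
    marking = {"P1": 5, "P2": 3}
    print(enablings(t, marking))        # number of simultaneous enablings
    during, rft = start_firing(t, marking)
    print(during, rft)
    print(end_firing(t, during))
```

A state in this sketch would be the pair of a marking and the list of RFT values for firings in progress; choosing among conflicting transitions with the firing frequencies is deliberately left out.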
[...] all firings in progress with the smallest RFT (Tmin). The time-in-state value [...]. In Fig. 3 and Table I [...]

Each transition has a set (possibly empty) of named resources. A named resource can be associated with more than one transition. Whenever one of those transitions is firing, the resource is in use. The number of those transitions firing simultaneously is the current number of usages of that resource.

By using the rules above for generating next states, we can determine the set of reachable states for a given initial state. By placing directed edges from parent states to child states, the set of reachable states can be viewed as a reachability graph.

[...] in the discussion below, we will call our measure speedup, since it is the same as the speedup measure used in the nonstochastic literature on multiprocessors. We will use the term processing power in the sense defined in previous studies. We argue that speedup is a more important single measure of system performance because the goal of multiprocessing is speeding up a program, not achieving high memory or processor utilization. Effective memory bandwidth and processing power are also easily computed for our GTPN models as we show below.
The GTPN model used in the analysis assuming uniform [...] Fig. 4 [...] and enforce that only one token on P4 at a time moves to the bottom row (with zero delay). As tokens move across the bottom, processors have their memory requests granted and return to P3. The last processor to use a memory module (T8) returns the bus token to P2.

The frequency expressions for the transitions along the bottom enforce that none of the transitions along the bottom start firing until all possible tokens on P4 are moved into the memory subsystem. In these frequency expressions a vertical bar represents a logical or. When the number of tokens on P2 or P4 or P5 is zero and the number of tokens on P3 is zero, and there are no firings of T1, the expression evaluates to one, otherwise it evaluates to zero. The Cnt Combinations column of Table II contains the value of the flag that determines how probabilities for maximals are to be calculated [...]

[...] The GTPN allows us to describe the system such that we can directly derive the smaller state space. This explains the difference in state space sizes. A similar direct method was used by the GSPN [10]. We feel that it is quite likely that deriving the smaller state space without the GTPN is possible. Our point is that the GTPN aids in seeing and expressing the symmetry which allows the direct derivation.

We use the power method, an iterative sparse matrix algorithm, to solve for our results. The iterations terminate when the sum over all states of the absolute value of the difference between the last two iterations is less than the convergence criterion. With our default convergence criterion, $5 \times 10^{-5}$, all of our values agreed with Bhandarkar's except in the 8 processor/8 memory case where we reached 4.9469. We repeated the analysis with a smaller convergence [...]

[...] results. Note that, again, the number of buses can be reduced substantially from a crossbar with minimal effect on performance.

Second, we consider a 16 processor/16 memory system with one or two buses. In Table VI we show Goyal and Agerwala's [2] values and ours. In this table we adopt their convention of using the mean interrequest time (mean IRT = 1/MRP - 1) instead of the memory request probability. Our values and theirs agree within the range of statistical error.

C. Comparison with Exponential Memory Access Time Models

Recall from Section II that several studies have assumed an exponential memory access time and have derived exact performance estimates using continuous time Markov chains. There are two possible reasons for the exponential assumption. One is that the multiprocessor under study has an exponential memory access time. Two is that the multiprocessor under study has a constant memory access time, but that assuming an exponential memory access time is a reasonable approximation which yields models that can be solved exactly.

[...] [10] gives exact results for a 12 processor/2 bus system. They vary the number of memories and the load. They assume that the interrequest time is exponentially distributed with rate $\lambda$ and the memory access time is exponentially distributed with rate $\mu$. The load is the ratio, $\rho$, of $\lambda$ to $\mu$. Our approach can be compared to theirs. We assume a constant memory access time and an interrequest time which is, strictly speaking, a modified geometric random variable. The important step is to make our models as similar as possible, so that only the difference in modeling the memory access time is observed. In particular, we need to represent the interrequest time distribution accurately.

In the limit, as the time length of a trial goes to zero, a modified geometric random variable is identical with the exponential random variable with the same mean. Consequently, if trials are "reasonably frequent," then a modified geometric random variable is a good approximation to the exponential random variable with the same mean. We can approximate the exponential memory interrequest time arbitrarily closely in our GTPN model by decreasing the duration of transition T2 and adjusting the frequency expressions for transitions T1 and T2 appropriately. Furthermore, for a selected duration of transition T2 (greater than zero), the variance of the modified geometric distribution is larger than the variance of the exponential distribution we are approximating. The increased contention due to this larger variability will result in lower [...]

Though expected values are important, the nature of the probability distribution of processing power is useful in characterizing multiprocessor behavior. In Fig. 6 [...] case of an arbitrary number of buses.
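The relation between the modified geometric and exponential interrequest times discussed in this section can be checked numerically. The sketch below assumes the modified geometric variable counts the failed trials before the first success, each trial lasting `delta` time units; this reading is consistent with mean IRT = 1/MRP - 1 for unit-length trials. The function name and parameters are hypothetical. Under that assumption the variance equals mean^2 + mean * delta, which exceeds the exponential variance mean^2 and approaches it as `delta` goes to zero.

```python
def modified_geometric(mean, delta):
    """Return (p, variance) of a modified geometric time with the given mean.

    The time is K * delta, where K is the number of failed trials before the
    first success and each trial succeeds with probability p.
    """
    p = delta / (mean + delta)                # chosen so delta * (1 - p) / p == mean
    variance = delta ** 2 * (1 - p) / p ** 2  # variance of K, scaled by delta^2
    return p, variance

if __name__ == "__main__":
    mean = 4.0
    exponential_variance = mean ** 2  # exponential with the same mean
    for delta in (1.0, 0.1, 0.01):
        p, var = modified_geometric(mean, delta)
        print(delta, p, var, exponential_variance)
```

With `delta = 1` the success probability `p` plays the role of the memory request probability, so a mean interrequest time of 4 corresponds to an MRP of 0.2.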
D. Critical Memory Request Probability
We now describe the analyses we conducted that are not comparisons with previous studies. Our first set of experiments measured speedup for a 10 processor/10 memory system. The memory request probability is varied from 0.1 to 1.0. The number of buses is 1, 2, 3, 4, and 10. Our results in Fig. 7 suggest an important conclusion about the effect of the number of buses on speedup. When the number of buses is small, a critical memory request probability appears to exist. The horizontal line drawn at speedup = 8.75 indicates approximately where this critical MRP lies on each curve. Below that probability, speedup is close to that with a crossbar (even for just two buses). Above that probability, speedup rapidly decreases and is equal to the number of buses in the limiting case. This rapid decrease is clearly due to the lack of buses. The drop is more gradual as the number of buses increases and is to a larger and larger extent due to memory contention instead of bus contention. We note that a functional relationship may exist between the number of processors, memory modules, and buses, and the critical MRP. Further study is required to determine whether this is true.

Our results are more specific than the conclusion reached by Lang, Valero, and Alegre. With respect to the measure of effective memory bandwidth, they concluded that good performance is possible with the number of buses equal to one half the number of processors with a MRP of 0.5. Note that their conclusion is supported by Fig. 7. Furthermore, we conclude that as long as the memory request probability stays below the critical value, only a few buses are needed to have close to the performance of a crossbar.

F. Nonuniform Access Probabilities

All of the experiments above assume uniform access probabilities. Many [...]

We conducted an experiment assuming a favorite memory. We considered a system with 6 processors, 6 memories, and 3 buses. Results are given in Fig. 8 for memory request probabilities of 0.3, 0.4, and 0.5. Each curve has seven data points for when zero, one sixth, two sixths, up to six sixths of the memory requests are directed to the favorite memory. Note that a modest favoritism (i.e., two sixths) has only a small effect on speedup. As expected, the speedup decreases as the favoritism increases and as the memory request probability increases. In addition, as the memory request probability increases the importance of favoritism increases, causing speedup to decrease more rapidly.

This favorite memory experiment illustrates the need to develop approximate solution techniques based on the GTPN. Identifying one memory module as favorite causes a significantly larger state space than when all the modules are identical. For example, this 6 processor/6 memory/3 bus system has 3384 states while a 6 processor/6 memory/3 bus system without a favorite memory has only 496 states.

V. CONCLUSIONS

We have presented exact performance estimates for models of multiprocessors for which only approximate and simulation estimates existed. These models include the important properties of constant memory access time, memory request probabilities less than one, and bus contention. One form of non-uniformity in the memory access probabilities was also [...] [38], [39], then the SPN models may be more advantageous due to smaller state spaces [8], [9].

The previous stochastic modeling studies of multiprocessor memory and bus interference have measured effective memory bandwidth, and processing power as defined by: processor utilization times the number of processors. We suggest a better measure of processing power which is equivalent to the measure of speedup that is commonly used in other bodies of literature on multiprocessors.

Our multiprocessor performance estimates provided several important insights. One is that assuming an exponential access time for a model of a multiprocessor with constant memory access time and any number of buses causes only a small underestimation of the expected value of processing power. However, the probability distributions for processing power differ substantially. The distribution assuming an exponential access time has a higher variance.

Two is that at low request rates only a few buses are needed to have almost the performance of a crossbar. However, when only a few buses are used, a critical request rate exists.
