This paper is concerned with processor degradation produced by access conflicts in multiprocessor, multi-memory bank computer systems. Hardware and software parameters which influence performance are outlined and discussed. The incorporation of these parameters in analytic and simulation models is discussed with a view to the predictive merit of the model. Finally a general simulation model of a multi-processor is presented, and results based on its use are analyzed and compared with those of other models.
INTRODUCTION
In this paper we discuss various topics concerning the development of models (analytic and simulation) of multi-processor multi-memory bank computer systems.
In all instances we are interested in the degradation in processor performance as a result of memory bank contention.
To that end we are interested in various bank configurations including interleaved (partial and complete), separate banks for processor programs and shared data across a collection of banks, shared program banks and separate data banks, etc.
The above considerations affect the manner in which references are generated by the processors in the configuration. Since the queuing discipline employed by the banks also affects system (processor) performance we will be interested in studying the effects of several. For example, the first come first served discipline as assumed by the queuing model of the next section (as well as by several other models) is examined. Since many systems employ priority schemes (see, for example, [7]), especially when input-output controllers are included, we shall also be interested in strict priority as well as round-robin service disciplines as memory port disciplines for the banks.
Many systems also provide separate access ports for data and instruction references and on contention give priority to the data reference access port. Finally the distribution of processor inter-request times is important, especially on machines for which a variable number of instructions may be held in a machine word.
As a point of departure we next briefly review the literature.
Of particular interest are the models discussed in references [5]-[9]. The study in [5] considers the contention in a system consisting of a processor and an input-output controller to be a function of the number of instruction "look-aheads".
Employing the degree of "look-ahead" as a parameter, the expected value of processor waiting time due to input-output controller memory requests is derived.
The model further assumes that instruction fetches are completely sequential. The models in [7] and [8] calculate the expected memory bandwidth (defined as the average number of requests serviced per memory cycle by a collection of interleaved memory banks). Instruction and data requests are modeled separately and overall bandwidth is calculated as the average of the two individual bandwidths.
No interrelationship between data and instruction references is assumed. The implication of this assumption is that the instruction requests can race ahead of the data requests, resulting in an overly optimistic model. This latter deficiency is considered in the model reported in [6]. Unlike the previous models, that of [6] is based upon the theory of Markov chains and in some sense emends the deficiency previously mentioned. More specifically the following assumptions are made:
• Each memory bank operates continuously and cyclically.
• The operation of all memory banks is synchronized.
• No distinction is made between instruction and data references.
• Each processor makes only one memory request per synchronized memory cycle. This implies that under optimal conditions of no conflict two memory cycles are required to fetch an instruction and its associated operand.
• The request pattern of the processors is a sequence of Bernoulli trials. This implies that instructions are not executed sequentially or, alternately, that interleaving is not modeled.
• If a processor fails to access a bank on a given cycle due to contention, it automatically returns on the ensuing cycle (hence the use of Markov chains).
• Request contentions are decided on a probabilistic basis. By appropriately setting the associated probabilities a partial priority discipline can be modeled.
The steady state probabilities generated allow a processor degradation factor to be computed. Since statistically non-homogeneous processors are modeled, establishing the transition matrix probabilities is extremely difficult for systems consisting of as few as four processors and two memory modules.
The model is not, therefore, amenable to the manipulation of its parameters.
Many of these deficiencies are overcome in the model reported in [9]. The price to be paid is that the set of processors modeled must be considered statistically identical (except as noted below).
This reduces the enormous number of states in the transition matrix of the model of [6] to a manageable number.
Several memory bank configurations are studied including interleaved, separate program banks and interleaved shared data banks.
Among the assumptions made (some for analytic tractability) the following are pertinent.
• Overlap is modeled.
That is, requests for the current operand and the next instruction are issued simultaneously.
• The memory banks operate cyclically and synchronously as indicated above for the model of [6]. This essentially implies that all instructions require one cycle for execution.
• Bank contentions are resolved by a random selection. This implies that if a processor has a single outstanding request, it is with equal probability an instruction or data reference.
• Each instruction requires one machine word.
• Data references are independent and are made from the allowable banks on an equiprobable basis.
• An instruction requires an operand with a probability b, (Bernoulli trials).
• Program jumps are made with a probability a, (Bernoulli trials). The next bank is selected on an equi-probable basis from among the allowable banks.
For the complement of models reviewed, that of [9] most faithfully reproduces the general characteristics of existing systems.
We shall comment more on this in a later section.
The models discussed above all view the set of memory banks and processors as a unified system. In some instances it is convenient to consider the delay encountered by a single processor at a given bank.
Then given information on the number of references made by that processor, the expected program execution time can be calculated. This approach views each memory request as a request for service from a single server with constant service time.
A queuing phenomenon results owing to the generation of similar requests by the remaining processors. Since for many actual systems it is possible for several requests (at least two) to arrive simultaneously, queuing models with batch arrivals are appropriate. We can then consider that during a memory cycle a batch of requests arrives at the memory bank and is allowed to enter a "service buffer" associated with the memory bank at the end (beginning) of each memory cycle. For simplicity the queuing discipline of the buffer may be taken as first-come-first-served.
Let E(w) denote the expected waiting time, expressed in terms of the following quantities: $\lambda$ = arrival rate, $\sigma_a^2$ = variance of the interarrival time distribution, $\sigma_s^2$ = variance of the service time distribution. For our purposes $\sigma_a^2 = 0$ and $\sigma_s^2$ is calculated as follows.
Let k be the batch size and t the cycle time. Then the expected batch service time is E[k]t, $\sigma_s^2 = \sigma_k^2 t^2$, and $\mu = 1/E(k)$.
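As a small numerical illustration of the two service-time quantities just defined, the following sketch computes the expected batch service time E[k]t and the variance $\sigma_k^2 t^2$ from a batch-size distribution. The function name, the distribution, and the cycle time are hypothetical values chosen only for illustration; they are not data from this study.

```python
def batch_service_moments(batch_probs, cycle_time):
    """Return (mean, variance) of the batch service time.

    batch_probs[k] is the probability that a batch holds k requests;
    each request occupies the bank for one cycle of length cycle_time.
    """
    mean_k = sum(k * p for k, p in enumerate(batch_probs))
    second_k = sum(k * k * p for k, p in enumerate(batch_probs))
    var_k = second_k - mean_k ** 2
    # Mean scales with t, variance with t squared.
    return mean_k * cycle_time, var_k * cycle_time ** 2


# Hypothetical batch-size distribution over k = 0, 1, 2 and a unit cycle time.
mean_s, var_s = batch_service_moments([0.5, 0.3, 0.2], 1.0)
```

Because the service time of a batch is simply k cycles, only the first two moments of the batch size are needed.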
In [1] the Laplace-Stieltjes transform of the waiting time distribution is computed for a batch input general service time model, for which the batch inter-arrival times are assumed to follow the exponential distribution.
The transform is given in terms of the transform of the service time distribution and the generating function of batch sizes. Differentiation of the expression and repeated use of L'Hopital's rule produces E(w) in terms of $\lambda$, E(k), E(k^2), etc.
Other pertinent remarks on batch queues may be found in [3, 11, 12]. For our purposes, a more convenient and somewhat simpler model (which does not require Poisson input assumptions) can be developed by considering that batches of size j arrive at the memory bank with probability $C_j$, j = 0, 1, ..., n, during each memory cycle.
An analysis of such a system is presented next.
QUEUING MODEL
Consider a memory bank with cycle time t. During each service period, batches of requests arrive from the remaining processors.
The batch of requests is allowed to join the memory bank queue at the end of each memory cycle. Let $C_j$ equal the probability that j requests arrive during a service period, with j = 0, 1, ..., l, l < 2p, where p is the number of processors. The situation is described by the model depicted in Figure 0. In statistical equilibrium eqn (1) implies
It is easily seen that
so that substitution of (3) into (2) yields
solving for $G_Y(Z)$ yields
Since it is required that $G_Y(1) = 1$, a single application of L'Hopital's rule shows that $P_0 = 1 - E(J)$, which further indicates that $0 < E(J) < 1$ is a necessary and sufficient condition for statistical stability.
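The route to $P_0 = 1 - E(J)$ can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not a restatement of the numbered equations: we assume that exactly one request is served per memory cycle, that $P_0$ denotes the probability that Y = 0, and that the arrivals $J_i$ are independent of the queue contents; the relation $Y_i = X_{i-1} + J_i$ is the one used in the text.

```latex
\begin{align*}
X_i &= \max(Y_i - 1,\, 0)
  && \text{(assumed: one request served per cycle)} \\
G_X(Z) &= P_0 + \frac{G_Y(Z) - P_0}{Z},
  \qquad P_0 = \Pr\{Y = 0\} \\
G_Y(Z) &= G_X(Z)\, G_J(Z)
  && \text{(from } Y_i = X_{i-1} + J_i\text{)} \\
\Rightarrow\quad G_Y(Z) &= \frac{P_0\,(Z - 1)\, G_J(Z)}{Z - G_J(Z)} \\
\lim_{Z \to 1} G_Y(Z) &= \frac{P_0}{1 - E(J)} = 1
  \quad\Rightarrow\quad P_0 = 1 - E(J)
\end{align*}
```

The limit in the last line uses one application of L'Hopital's rule together with $G_J(1) = 1$ and $G_J'(1) = E(J)$.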
For reasons analogous to those given for (1) and (2) it is seen that $Y_i = X_{i-1} + J_i$ so that $G_Y(Z) = G_X(Z) G_J(Z)$, or by (4) that
Since $E[X] = G_X'(1)$, a single differentiation and two applications of L'Hopital's rule yields
and since $V(J) = E(J^2) - E^2(J)$
Let $w_n$ denote the delay, measured in memory cycles, experienced by the nth request for service.
During the period $w_n + 1$ we expect $E(k)(w_n + 1)$ requests to be generated.
Following service of the nth request we expect $E(Y) = E(X) + E(k)$ requests to be in the system. Therefore in the mean
$$E(k)\,(E(w) + 1) = E(X) + E(k).$$
Solving for E(w) yields
$$E(w) = \frac{E(X)}{E(k)},$$
which after algebraic manipulation, use of eqn (5), and multiplication by t yields
By considering the steady state equations of detailed balance the probabilities $P_n$ can be calculated, giving
Since $P_0$ is given above, the process is easily mechanized.
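One way such a mechanization might look is sketched below. It assumes that one request is served per memory cycle, so that the number in the system evolves as Y' = max(Y - 1, 0) + J, and it starts the recursion from $P_0 = 1 - E(J)$. The function name and the example arrival probabilities are hypothetical, and the forward recursion is only one of several possible schemes.

```python
def stationary_queue_probs(c, n_terms):
    """Return [P_0, ..., P_{n_terms-1}] for the number Y in the system.

    c[j] is the probability that j requests arrive in one memory cycle.
    Assumes c[0] > 0, 0 < E(J) < 1, and one request served per cycle.
    """
    mean_j = sum(j * cj for j, cj in enumerate(c))
    if not (0.0 < mean_j < 1.0) or c[0] <= 0.0:
        raise ValueError("need 0 < E(J) < 1 and c[0] > 0")

    def arrival_prob(j):
        return c[j] if j < len(c) else 0.0

    p = [1.0 - mean_j]  # P_0 = 1 - E(J), as given in the text
    for n in range(n_terms - 1):
        # Balance equation for state n:
        #   P_n = P_0*C_n + sum_{m=1}^{n+1} P_m * C_{n+1-m}
        # Everything except the P_{n+1}*C_0 term is already known.
        known = p[0] * arrival_prob(n) + sum(
            p[m] * arrival_prob(n + 1 - m) for m in range(1, n + 1)
        )
        p.append((p[n] - known) / c[0])
    return p


# Hypothetical arrival probabilities C_0, C_1, C_2 with E(J) = 0.7.
probs = stationary_queue_probs([0.5, 0.3, 0.2], 60)
mean_y = sum(n * pn for n, pn in enumerate(probs))
```

The list is truncated at `n_terms` entries, so the tail mass beyond that point is ignored; for lightly loaded banks a few dozen terms are typically ample.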
It must be noted that although $C_j = 0$ for $j > l$, the same is not necessarily true for $P_j$. If we let $E(Y) = \sum_n n P_n$ and $\tilde{C}_j = j\,C_j / E(J)$* then it is also true that
Eqns (6) and (7) are numerically equivalent. The value of V(W) is easily computed using the values of $P_n$ as, for example,
*E(J) is the arrival intensity and thus $\tilde{C}_j$ represents the intensity for batches of size j.
To investigate the behavior of programs as regards bank references, data was collected on program behavior for several widely differing systems. Using a simulator designed for research purposes (see [2]) data was collected on the behavior of programs running on the CDC 6600. In particular the behavior of an operational APL* processor, a heavily used FORTRAN* compiler and several application programs were investigated. In all cases complete address traces were collected as program execution was simulated.** The traces included all program code executed including input-output and file manipulation subroutines. In all cases the program behavior was in no way altered due to the simulation process.
The trace information provided several statistics.
First operation code use distributions were established. Such data is pertinent to the simulation of models providing for asynchronous program, memory module behavior. Secondly it provided for the construction of certain transition matrices relating the banks from which successive instruction words and operands are fetched. A summary of instruction use is given by Figure 4. Two types of transition matrices were constructed, one for data, one for instruction fetches.
Each is a matrix of the form $T^K = [t^K_{ij}]$, i, j = 1, ..., 16, K = D (data), I (instruction), where $t^K_{ij}$ equals the probability that, if the last reference of type K came from bank i, the next reference comes from bank j. A similar but less comprehensive study was made on operational code for the machine described in [4]. Each trace accounted for well over 500,000 memory references.
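The construction of such a matrix from a trace can be sketched as follows. The sketch assumes the trace has already been reduced to a sequence of bank indices for one reference type, and it numbers banks from 0 rather than 1; the function name and the short example trace are hypothetical and are not data from the study.

```python
def bank_transition_matrix(bank_trace, n_banks=16):
    """Estimate T[i][j] = P(next reference is to bank j | last was to bank i).

    bank_trace is a list of bank indices in 0..n_banks-1 for one reference
    type (data or instruction). Rows with no observations stay all zero.
    """
    counts = [[0] * n_banks for _ in range(n_banks)]
    # Count each consecutive pair (previous bank, next bank).
    for prev, nxt in zip(bank_trace, bank_trace[1:]):
        counts[prev][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical trace of bank indices for a three-bank system.
example = bank_transition_matrix([0, 1, 1, 2, 0, 1], n_banks=3)
```

Running the function once on the data references and once on the instruction references of a trace would give the two matrices described above.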
Several points of information can be gleaned from the data.
In summary: • The probability of a jump ranged from
• with the remaining entries quite uniform in nature. This predominant behavior can be attributed to symbol table manipulation. It can be argued that the above pattern is obvious and to be expected.
It was also expected that a similar pattern would emerge for APL, as it too does considerable table manipulation.
In fact no dominant pattern was observed for APL, i.e., its data reference matrix most nearly indicated a uniformly distributed reference pattern.
Many application programs demonstrated the same type of behavior.
Several produced main diagonal dominant matrices, while several others produced uniform like matrices much as for APL.
In several cases, the a priori prediction of behavior was
See reference [13]
The structure of the processing elements (PEs) is based on the concept of an instruction cycle. Within the cycle, various transitions may occur which result in memory requests.
An individual PE has an associated program state s. This state is an integer value $1 \le s \le m$ and can be considered the memory module location $M_s$ of the current instruction word.
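The transitions within the instruction cycle of a PE are described by a function F(T,s), where T is a transition matrix over the memory modules. If all instructions come from the same module with a probability of .9 then, for three modules and with each row written as cumulative probabilities,

    .90   .95   1.0
    .05   .95   1.0
    .05   .10   1.0

might be an appropriate transition matrix.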
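How a matrix of this kind might drive the choice of the next module can be sketched as follows. The sketch assumes a three-module system whose rows hold cumulative probabilities with a 0.9 chance of remaining in the current module; the matrix values, the function name, and the inverse-transform selection are illustrative assumptions rather than the definition of F(T,s).

```python
import random

# Hypothetical three-module matrix; row s-1 belongs to current module s,
# and each row holds cumulative probabilities over modules 1..3.
T = [
    [0.90, 0.95, 1.00],
    [0.05, 0.95, 1.00],
    [0.05, 0.10, 1.00],
]


def next_module(cumulative_rows, s, u=None):
    """Return the next module (1-based) given the current module s.

    u is a uniform draw in [0, 1); it is drawn here when not supplied.
    """
    if u is None:
        u = random.random()
    # Pick the first module whose cumulative probability exceeds u.
    for module, threshold in enumerate(cumulative_rows[s - 1], start=1):
        if u < threshold:
            return module
    return len(cumulative_rows)  # guards against rounding in the last entry
```

Storing the rows in cumulative form makes each selection a single uniform draw followed by a short scan.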
Because the function F(T,s) and the transition matrix T control the behavior of PEs towards the memory modules of the system, it is necessary that individual PEs have unique characteristics. In terms of a simulation program, it must be possible for differing PEs to have differing matrices T. The approach taken was to define groups of PEs denoted $G_n$. Each $G_n$ contains N PEs all identical in behavior. This structure enables the simulation model to handle differing processor types; for example, one type of PE may be for compute while another might process I/O requests.
To describe the actions during the instruction cycle of a PE we need to define the following probabilities.
Let $\alpha$ represent the probability of a jump instruction.
If an instruction is not a jump, then it issues memory requests for an instruction word and operand word with probabilities $\gamma$ and $\delta$ respectively.
The cycle is completed by an execution delay which overlaps the memory request processing and may or may not extend beyond.
The execution delay is determined by a draw from an empirical distribution obtained from actual machine data.
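The instruction cycle just described can be sketched as a single function. The parameter names `alpha`, `gamma`, and `delta` stand for the jump, instruction word, and operand probabilities; the treatment of a jump as issuing one instruction word request, the function name, and the delay table format are assumptions made for illustration only.

```python
import random


def instruction_cycle(alpha, gamma, delta, delay_values, delay_weights, rng=random):
    """Simulate one instruction cycle; return (is_jump, requests, delay)."""
    is_jump = rng.random() < alpha
    requests = []
    if is_jump:
        # Assumption: a jump fetches the instruction word at its target.
        requests.append("instruction")
    else:
        if rng.random() < gamma:
            requests.append("instruction")
        if rng.random() < delta:
            requests.append("operand")
    # Execution delay drawn from an empirical distribution given as
    # parallel lists of delay values and their relative frequencies.
    delay = rng.choices(delay_values, weights=delay_weights, k=1)[0]
    return is_jump, requests, delay
```

A seeded `random.Random` instance can be passed as `rng` to make a run repeatable.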
Individual memory requests provide information linking them to the PE which generated them. In addition, each memory request has an associated attribute called its priority.
The linkage to the generating PE enables the processing of the request to be synchronized with the PE. The priority value is used to order the requests in the memory module queue to model priority bussing.
The memory module removes memory requests from the waiting queue and processes them using a unit of time called a memory cycle.
The queue disciplines simulated include random selection and priority.
The algorithm for random selection is obvious.
The priority queue is characterized by an ordered set of memory requests $\{R_{s,kj}\}$. Associated with each $R_{s,kj}$ is a priority $P_{kj}$. The selection discipline chooses the earliest request with the smallest $P_{kj}$.
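A minimal sketch of this selection rule, using a binary heap keyed on priority and arrival order, is given below. Operand references carry priority 1 and instruction references priority 2, the values used for the priority bussing in this study; the class and method names are hypothetical.

```python
import heapq
import itertools

OPERAND, INSTRUCTION = 1, 2  # smaller value is served first


class ModuleQueue:
    """Waiting queue of one memory module under the priority discipline."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: earlier arrivals first

    def add(self, priority, pe_id):
        """Queue a request from processing element pe_id."""
        heapq.heappush(self._heap, (priority, next(self._arrival), pe_id))

    def pop(self):
        """Remove and return (priority, pe_id) of the request to serve next."""
        priority, _, pe_id = heapq.heappop(self._heap)
        return priority, pe_id

    def __len__(self):
        return len(self._heap)
```

Keeping the generating PE with each entry mirrors the linkage that lets request processing be synchronized with that PE.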
For the priority bussing studied here, operand memory references were set to $P_{kj} = 1$ and instruction memory references to $P_{kj} = 2$.
The first decision determines if the command is a jump instruction.
A jump instruction has no operand other than the address to which the jump is being made. In the case of the non-jump instruction, an operand memory reference occurs with probability $\delta$ and an instruction memory reference occurs with probability $\gamma$. The delay E is a random draw from an empirical distribution supplied as data to the simulation. As is evident from Figure 2, the advance to each next instruction generates a memory reference.
The interaction between commands and instruction words is controlled by an empirically supplied distribution of the number of commands per word.
This distribution determines the value of IPW.
An additional difference between this model and that in Figure 1 is the interaction between operand memory references and the instruction execution time. An operand memory reference occurs with probability $\delta$. However, if an operand memory reference does occur, the retrieval of the operand from the appropriate bank constitutes the execution time of the instruction.
This model difference implies that machines exhibiting many accumulators with simple memory loads and stores are well represented.
SIMULATION EXPERIMENTS AND RESULTS
This section discusses the simulation experiments and summarizes the results of these experiments. The simulation experiments were based on three machine descriptions and the two PE models described above.
The results are displayed as a series of graphs and tables which we discuss below.
The different PE models were described earlier.
In addition, three modes of instruction word memory referencing were used, two types of memory queue disciplines were used to explore bussing effects, and the memory modules were synchronized and nonsynchronized.
The transition matrices used for jumps and operand references were established as uniformly random.
The three modes of memory referencing for instruction memory requests are uniformly random, individually banked, and interleaved.
For uniformly random references, all modules are equally probable for the next instruction word reference.
In the case where the instruction references are individually banked, all instruction memory references are to the same module until a jump occurs.
Interleaved references go to each memory module in sequence.
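The three referencing modes can be sketched as a single selection function. Modules are numbered from 0 here, the function and mode names are hypothetical, and the uniformly random jump target follows the uniform jump matrices stated above.

```python
import random


def next_instruction_module(mode, current, n_modules, jumped=False, rng=random):
    """Return the module (0-based) holding the next instruction word."""
    if mode == "uniform":
        # Every module is equally likely for each instruction word.
        return rng.randrange(n_modules)
    if mode == "banked":
        # Stay in the same module until a jump occurs.
        return rng.randrange(n_modules) if jumped else current
    if mode == "interleaved":
        # Step through the modules in sequence, wrapping around;
        # a jump re-draws the module uniformly at random.
        return rng.randrange(n_modules) if jumped else (current + 1) % n_modules
    raise ValueError(f"unknown mode: {mode}")
```

Only the banked and interleaved modes depend on the current module; the uniform mode ignores it.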
Recall that the individual memory modules are modeled as servers of a queue.
Each module operates concurrently with the other memory modules of the system.
With random selection as the queue discipline, all memory requests have immediate and equal chance at the memory modules.
The priority queue discipline ensures that operand references are served ahead of instruction word references.
In the case of synchronized memory modules, all modules initiate their service cycles together.
If a memory request enters a queue during a memory cycle and that particular module is inactive, the request must wait for the start of the next synchronized cycle.
If modules operate asynchronously, only a small (5%) delay is assumed. This assumption is arbitrary.
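The difference between the two cases, for a request that finds its module idle, can be sketched as follows. The function name is hypothetical, and interpreting the 5% delay as a fraction of one memory cycle is an assumption of this sketch.

```python
import math


def service_start(arrival, cycle, synchronized, async_delay_fraction=0.05):
    """Time at which an idle module begins serving a request.

    arrival is the request's arrival time and cycle the memory cycle
    length, both in the same time units.
    """
    if synchronized:
        # Wait for the next common cycle boundary (no wait if exactly on one).
        return math.ceil(arrival / cycle) * cycle
    # Assumption: the small delay is a fixed fraction of one memory cycle.
    return arrival + async_delay_fraction * cycle
```

For arrivals spread uniformly over a cycle, the synchronized wait averages half a memory cycle.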
The figures and tables discussed below display the instruction execution rate (IER).
This quantity is the number of instructions executed in the total configuration per memory cycle.
In Table I various IER values are presented for an idealized machine.
All instructions require one memory cycle. These results compare very well with those obtained from the analytic study of Sastry [9]. Figure 3 shows the plot of the uniform case, with some points from the work of Sastry plotted for comparison.
The favorable comparison indicates general validity of the behavior of the program.
As indicated in earlier sections, this study involves two real machines and their instruction mixes.
Data describing a CDC 6600 program execution was used to generate the distribution detailed in Figure 4. Figure 5 shows the distribution of instruction mix for a single address computer, the UYK-7.
In Table 2 we summarize the results obtained using the UYK-7 instruction mix and memory cycle (1.5 microseconds) time.
The values indicate the resulting IER.
Several observations can be made from this table. First, the effect of uniform, banked, or interleaved instruction word references is minor compared to altering the number of memory modules or processors.
Secondly, the change from uniform to priority bussing also has minor effects on the IER.
Finally, Table 2 shows the effect of synchronizing the memory cycles of the memory modules. In Figure 6 we have applied the symbol "s" for the corresponding points with synchronized memory modules. Notice that these points remain close to the asynchronous points for a small number of memory modules.
As the number of processors increases so does the point at which the synchronous and asynchronous points significantly differ. This can be explained in terms of the utilization of individual memory modules.
As the utilization of a memory module decreases, the probability of a memory request finding the module idle increases.
In the synchronous case, an idle memory module is characterized by a latency of half a memory cycle.
This delay causes a degradation in the response of the individual modules which, in turn, is displayed by the lower IER.
In Figures 6, 7, and 8 we have plotted the values for random bussing.
Finally, Figure 9 shows the results of the simulation using the CDC 6600 data and the modified PE model. This plot demonstrates the effect of allowing multiple instructions per instruction word.
Rather than a probability of instruction word reference y, the modified PE definition uses a distribution describing the number of instructions per word. Figure 9 indicates the values used.
The increase in performance due to the multiple instruction words is significant.
CONCLUSION
Several of the results given here have been qualitatively known.
In essence, this study has attempted to quantify these qualitative results. Our results relate to the mode of instruction referencing, to the type of bussing between processors and memory modules, to the synchronization of memory modules, to the number of commands per instruction reference, and to the relationship between probabilistic and simulation models. Specifically, for the loadings assumed:
• Assuming no dedicated memory module assignments, the mode of instruction reference to the memory modules has a minor effect on the total configuration performance.
• Priority bussing for operand memory references caused no significant improvement over random selection from all memory requests.
• The effect of synchronous memory modules is significant degradation in performance when individual modules exhibit lower utilizations. Configurations with a large number of processors will perform significantly better if individual memory modules can initiate memory cycles at the arrival of the request.
• Multiple commands per instruction word significantly increase performance, as quantitatively reported in Figure 9.
• Markovian and simulation models produce results which faithfully predict actual system performance. Simple queuing models, such as the one reported, provide the machinery to develop simple first estimates of system performance. In some instances the quantification of the parameters may be easier for such models than for more elaborate models.
When properly parameterized, the presented queuing model yielded results consistent with those of the model reported in [9].
Figure 9. IER for CDC 6600 loading with interleaved instruction references and uniform operand and jump references. One plotting symbol indicates points using N instructions per word, with P(N) taken from Table 3, while the other indicates points for single-word instructions.
