A number of recent vector supercomputer designs have featured main memories with very large capacities, and presumably even larger memories are planned for future generations. While the memory chips used in these computers can store much larger amounts of data than before, their operation speeds are rather slow when compared with the significantly faster CPU (central processing unit) circuitry in new supercomputer designs. A consequence of this speed disparity between CPUs and main memory is that memory access times and memory bank reservation times (as measured in CPU ticks) are sharply increased from previous generations.
Introduction
In recent years advances in fields such as computational fluid dynamics and plasma physics have outstripped the main memory capacity of even the largest scientific computer systems. Furthermore, users have found that using disk drives or other external storage media for temporary data storage in these large problems is seldom satisfactory, as it often increases their wall clock run time by several orders of magnitude. Thus many scientific programmers are now clamoring for vector supercomputers with vastly increased main memory.
Fortunately for such users, the semiconductor industry has been remarkably successful in recent years in producing memory chips with burgeoning capacity. 256 kilobit chips are now readily available from suppliers, and prototypes of one megabit chips have recently been displayed. Thus it is not too surprising that a number of recently announced supercomputers have featured main memories as large as 256 million 64-bit words, and presumably even larger memories are in the works for the next generation.
While the emphasis in the development of memory chips has been increased capacity, the emphasis in the design of supercomputer CPU circuitry has been increased speed. CPU clock "ticks" of ten nanoseconds or so are now commonplace, and supercomputers with four nanosecond or even one nanosecond CPU cycle times are on the horizon. This disparity in speed between CPU circuitry and memory bank circuitry means that the memory bank reservation time and the memory access time, as measured in CPU clock ticks, are sharply increased for new supercomputers. While it has been recognized for some time that these long operation times would lower the scalar performance of supercomputers, it is only recently that the potential for vector performance reduction has come to light.
The reason for this potential reduction in vector performance is memory bank contention, that is, delays encountered when a CPU attempts to access a bank of main memory that has been reserved from a previous access by another (or even the same) CPU. This article will analyze the phenomenon of memory bank contention and discuss both the potential for performance reduction and techniques for ameliorating this reduction.
A Markov Chain Model for Memory Bank Contention
The memory bank operation of a multiprocessor vector computer system may be approximately modeled using a relatively simple Markov chain model. While such a model cannot precisely describe the phenomenon of memory bank contention in a real vector computer, it does serve as a good introduction to the problem, and in fact some quantitative conclusions can be drawn from this simple model that do carry over to a more realistic model.
In order to facilitate analysis, certain simplifying assumptions will be made. It will be assumed that the computer system being modeled has m CPUs and n banks of interleaved memory (i.e., successive data words are in successive memory banks). It will be assumed that the cycle time for a complete memory access is t CPU ticks. In particular, it will be assumed that whenever one of the CPUs initiates an access to a word of memory (either to store or recall), a reservation of t ticks is placed on the bank containing that word. This means that for the next t system ticks, any CPU wishing to initiate an access to a word in that bank of memory must wait before it may begin. Once a CPU has initiated a memory fetch or store, it is free to initiate another at the next CPU clock period. Note that a single CPU may be simultaneously in the process of accessing up to t separate memory banks, provided no bank busy conflicts are encountered.
At each system clock tick, it will be assumed that each CPU that is not waiting tosses a coin with probability of heads equal to q, and attempts to initiate a memory access (from a memory bank chosen at random) if the coin turns up heads. It will be assumed that when a CPU attempts to access a bank that is busy from a prior reservation, the remaining reservation on that bank is uniformly distributed between 1 and t. The case where more than one CPU is waiting to access a single reserved bank will be ignored in this Markov model. A final approximating assumption is that the fraction of memory banks that are busy at any time is approximately a constant x. Such an assumption may be made provided that the process has achieved a steady state.
It should be mentioned that in real vector computer operation, a CPU is typically either attempting to access memory cells every tick, as part of a long vector fetch or store, or else it is "crunching" and not attempting to access memory at all. Further, most memory accesses are from consecutive memory banks, instead of from randomly chosen memory banks. This last deviation appears to be the most serious in the model. By comparison, the assumption that no more than one CPU is queued waiting to access a single busy bank does not appear to be a serious limitation, based on the results of empirical simulations.
The operation of each CPU may now be approximately modeled by a Markov chain on the t + 1 states s_0, s_1, s_2, ..., s_t. Here s_0 denotes the free state and s_k denotes the state of waiting for a bank that has a reservation of k ticks remaining. Let T denote the Markov transition matrix for this model (i.e., T_ij is the probability that the next state is j, given that the current state is i). Then
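Under the assumptions above, the entries of T can be written out explicitly. The display below is a reconstruction from those assumptions, not a transcription of the original: a free CPU attempts an access with probability q, finds the chosen bank busy with probability x, and then sees a remaining reservation uniform on 1, ..., t; a waiting CPU simply counts down one state per tick, and it is assumed here that s_1 returns directly to s_0. Rows and columns are indexed 0 through t.

```latex
T =
\begin{pmatrix}
1 - qx & qx/t   & qx/t   & \cdots & qx/t   \\
1      & 0      & 0      & \cdots & 0      \\
0      & 1      & 0      & \cdots & 0      \\
\vdots &        & \ddots &        & \vdots \\
0      & \cdots & 0      & 1      & 0
\end{pmatrix},
\qquad
T_{00} = 1 - qx,\quad
T_{0k} = \frac{qx}{t},\quad
T_{k,k-1} = 1 \quad (1 \le k \le t).
```

Each row sums to one, as a transition matrix requires.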
It may easily be verified that the Markov chain described by this transition matrix is a regular (ergodic) process. This means that the a priori probability of any state is equal to the limiting frequency of appearance of that state (for almost every sample sequence). Let p = (p_0, p_1, p_2, ..., p_t) denote the vector of a priori probabilities of the t + 1 states. These probabilities may be determined from the relationship pT = p. Since it was assumed that the fraction x of banks that are in a reservation cycle is constant, the expected number of banks initially reserved at any instant must equal the number whose reservation expires at that instant. This can be expressed by the relation q m p_0 = n x / t, where it is assumed that at each time 1/t of the busy banks are freed. This relation combined with the above yields the solution
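The closed form can be recovered by a short calculation. The derivation below is a sketch under two assumptions that are not confirmed by the surviving text: that the only nonzero transitions are T_00 = 1 - qx, T_0k = qx/t and T_k,k-1 = 1 for 1 ≤ k ≤ t, and that the efficiency statistic E is identified with the free-state probability p_0.

```latex
\begin{align*}
pT = p \;&\Longrightarrow\;
  p_t = \frac{qx}{t}\,p_0, \qquad
  p_k = p_{k+1} + \frac{qx}{t}\,p_0 \quad (1 \le k < t) \\
&\Longrightarrow\;
  p_k = \frac{(t-k+1)\,qx}{t}\,p_0 \quad (1 \le k \le t), \\
\sum_{k=0}^{t} p_k = 1 \;&\Longrightarrow\;
  p_0\left[\,1 + \frac{qx\,(t+1)}{2}\right] = 1, \\
q m p_0 = \frac{nx}{t} \;&\Longrightarrow\;
  x = \frac{q m t}{n}\,p_0
  \;\Longrightarrow\;
  \frac{q^2 m\,t(t+1)}{2n}\,p_0^2 + p_0 - 1 = 0, \\
E = p_0 &= \frac{\sqrt{1 + 2q^2 m\,t(t+1)/n}\; - 1}{q^2 m\,t(t+1)/n}.
\end{align*}
```

The result depends on the parameters only through the combination q^2 m t(t+1)/n, which is consistent with the scaling statements made about m, n and t in the discussion that follows.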
Implications of the Markov Chain Model
This formula for the efficiency statistic does not, unfortunately, agree closely with most actual vector supercomputer operation. The main problem appears to be, as mentioned above, that most vector computer memory accesses are from consecutive banks (or at least from banks differing by some constant stride) instead of from randomly chosen banks. The term stride here refers to the increment in memory between successive elements in a vector fetch or store. Only in the case where a computer is running programs with uncorrelated nonunit strides does this formula closely agree with actual memory performance.
In spite of these limitations, the above formula does contain implicit relationships between the number of processors, the number of banks, and the bank reservation time that do carry over to the more realistic model in the next section. First of all, one can conclude from this formula that if the number of processors m is increased by a factor k, then the number of banks n must also be increased by a factor k to preserve the same level of efficiency. Secondly, if the bank reservation time t is increased by a factor k, then the number of banks must be increased by a factor of about k^2 to maintain the same memory efficiency.
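These two scaling rules can be checked numerically. The sketch below assumes a closed form that is not stated explicitly in the text as it stands: the efficiency E is taken to be the positive root of a E^2 + E - 1 = 0 with a = q^2 m t(t+1)/(2n), which is what the balance relations pT = p and q m p_0 = n x / t give if E is identified with p_0. The function names and the values q = 0.5 and E = 0.75 are illustrative choices only.

```python
import math


def markov_efficiency(q, m, n, t):
    """Free-state probability p_0, taken here (as an assumption) to be E."""
    # a is the coefficient in the assumed balance equation a*p^2 + p - 1 = 0.
    a = q * q * m * t * (t + 1) / (2.0 * n)
    # Positive root, written as 2 / (1 + sqrt(1 + 4a)) to avoid cancellation.
    return 2.0 / (1.0 + math.sqrt(1.0 + 4.0 * a))


def banks_for_efficiency(E, q, m, t):
    """Banks n at which markov_efficiency(q, m, n, t) equals E (0 < E < 1)."""
    a = (1.0 - E) / (E * E)  # solve a*E^2 + E - 1 = 0 for a
    return q * q * m * t * (t + 1) / (2.0 * a)


if __name__ == "__main__":
    q, E = 0.5, 0.75  # hypothetical illustrative values
    base = banks_for_efficiency(E, q, m=4, t=40)
    print("banks ratio, m doubled:", banks_for_efficiency(E, q, m=8, t=40) / base)
    print("banks ratio, t doubled:", banks_for_efficiency(E, q, m=4, t=80) / base)
```

Under this assumed form, doubling m doubles the required number of banks exactly, while doubling t from 40 to 80 multiplies it by (80 · 81)/(40 · 41), about 3.95, which is the "about k^2" behavior described above.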
Monte Carlo Simulations of Memory Bank Contention
A more sophisticated (and realistic) model of memory bank contention will now be presented. Above it was assumed that each free CPU tosses a coin with a certain probability and attempts to access a single randomly chosen memory bank if the coin turns up heads. It will now be assumed that each free CPU instead initiates a vector access (fetch or store) of a certain length if its coin turns up heads. The starting bank number for this vector access is assumed chosen at random, but thereafter the bank number advances with some constant stride through the duration of the vector access. The length of the vector access is assumed chosen at random according to a distribution that is uniform on the set {1, 2, ..., V}, except that a specified larger fraction v of the vector lengths have the maximum value V. Similarly, the memory stride is assumed to be chosen from a uniform distribution on the set {1, 2, ..., n}, except that a certain specified larger fraction s of the strides are 1.
It should be noted that strides greater than n do not need to be considered, because such strides are equivalent for our purposes to their remainder when divided by n. It should also be noted that the mean restart time R between vector accesses is merely the reciprocal of the coin toss probability r, a fact that can be easily demonstrated from elementary probability theory.
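The statement about the mean restart time is the standard mean of a geometric distribution. As a worked step, assuming that a free CPU tosses its coin independently at every tick with success probability r, the number of ticks K until the next vector access begins satisfies:

```latex
\Pr(K = k) = r\,(1-r)^{k-1}, \qquad k = 1, 2, 3, \ldots
\\[4pt]
R = \mathbb{E}[K]
  = \sum_{k=1}^{\infty} k\, r\,(1-r)^{k-1}
  = r \cdot \frac{1}{\bigl(1 - (1-r)\bigr)^{2}}
  = \frac{1}{r}.
```

The middle step uses the identity \sum_{k \ge 1} k y^{k-1} = 1/(1-y)^2 for |y| < 1, with y = 1 - r.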
As in the Markov chain model, it will be assumed that a reservation of t ticks is placed on any memory bank once a CPU initiates a memory access. Unlike the Markov chain model, this model will not ignore the case where two or more CPUs are waiting to access the same memory bank; it will be assumed that the CPUs merely take turns until all accesses have been completed. Observe that if no conflicts are encountered, a single CPU can be simultaneously accessing up to t separate memory banks.
This model is not intended to exactly mimic the memory operation of any actual supercomputer. Instead it is intended to enable the general problem of memory bank contention to be simulated and analyzed. However, variations of this model have been shown to quite closely mimic a number of real computers. For example, the author has analyzed the Cray-2 memory by using this basic model with an enhancement that mimics the operation of the Cray-2 quadrants. The Cray X-MP/48 memory has also been studied using this model with enhancements that handle the multiple memory ports from each CPU. One result of interest from these studies is that actual performance reductions on real codes (as measured in run time) on these systems closely parallel the reductions in memory efficiency (as determined by simulations). In particular, the actual performance slowdown is typically about 70% of the memory efficiency reduction in the cases studied.
Unfortunately, it does not seem possible to analyze this model with the elementary Markov chain techniques of the previous section. It is possible, however, to run Monte Carlo simulations based on such a model. Such a simulation program has been written, and numerous runs with it have been made on the Cray X-MP/12 belonging to the NAS (Numerical Aerodynamic Simulation) program at the NASA Ames Research Center. Each separate choice of the above parameters was run for one million ticks. It has been observed that empirical efficiency figures are reliable to within a percent or two when the simulation is run to this length.
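A minimal sketch of a simulation in the spirit of this model is given below. It is not the author's program, and several details are assumptions made for illustration: the efficiency is measured as the fraction of CPU ticks not spent stalled on a busy bank, a reservation placed at tick k frees the bank at tick k + t, the fractions v and s are applied by first deciding whether to force the special value (V or 1) and otherwise drawing uniformly, and "taking turns" is modeled by rotating CPU priority each tick. The function name and defaults are hypothetical.

```python
import random


def simulate(n=256, m=4, V=128, r=0.01, t=40, v=0.75, s=0.75,
             ticks=100_000, seed=0):
    """Estimate memory efficiency as 1 - (stalled CPU ticks) / (m * ticks)."""
    rng = random.Random(seed)
    free_at = [0] * n      # first tick at which each bank is free again
    remaining = [0] * m    # words left in each CPU's current vector access
    bank = [0] * m         # next bank each CPU will try to access
    stride = [1] * m       # bank stride of each CPU's current vector access
    stalled = 0
    for now in range(ticks):
        for j in range(m):
            c = (now + j) % m  # rotate priority so the CPUs take turns
            if remaining[c] == 0:
                if rng.random() >= r:
                    continue   # free CPU, coin came up tails: no access
                # Start a vector access: a fraction v of lengths is forced
                # to V, and a fraction s of strides is forced to 1.
                remaining[c] = V if rng.random() < v else rng.randint(1, V)
                stride[c] = 1 if rng.random() < s else rng.randint(1, n)
                bank[c] = rng.randrange(n)
            b = bank[c]
            if free_at[b] <= now:
                free_at[b] = now + t            # reserve bank for t ticks
                bank[c] = (b + stride[c]) % n   # advance by constant stride
                remaining[c] -= 1
            else:
                stalled += 1                    # bank busy: CPU waits
    return 1.0 - stalled / (m * ticks)


if __name__ == "__main__":
    print("estimated efficiency:", simulate())
```

The default arguments correspond to the generic parameter set used in the results section (n = 256, m = 4, V = 128, R = 1/r = 100, t = 40, v = s = 0.75), but with a shorter run than the one million ticks used in the text, so the estimate will be noisier.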
Results of the Monte Carlo Simulation Runs
Several plots displaying important simulation results are shown in the pages following the end of the article. Except where indicated otherwise, these results are for the case n = 256, m = 4, V = 128, R = 1/r = 100, t = 40, v = 0.75, s = 0.75. These parameters were chosen for a "generic" vector computer, roughly a composite of a number of current and projected supercomputers. Figure 1 shows how the memory efficiency E decreases as the reservation time t increases. The four separate curves represent results for various numbers of CPUs. Figure 2 shows how efficiency increases as the fraction s of unit stride varies from zero to one. Each curve in this figure represents results for different reservation times. Figure 3 shows how efficiency decreases with large numbers of processors. The four curves on this figure are for different numbers of banks. Figures 4 and 5 present a different slant on the problem: with other parameters held fixed, the number of banks necessary to preserve a constant level of memory efficiency (75%) is shown as a function of increasing reservation time (figure 4) and as a function of increasing numbers of processors (figure 5). In figure 4 the separate curves represent results for different numbers of banks, and in figure 5 each curve gives results for different reservation times.
It should be mentioned that in figure 2, the efficiency values on the left edge of the graph (i.e., the case where all strides are chosen at random) correspond closely to values computed using the formula derived from the Markov model. In particular, if q in the formula is set to the value of V/(V + R) (the approximate fraction of the time that a free CPU initiates a fetch), then the empirical efficiency results are within two percent of the formula values.
Several definite trends can be quickly identified from these plots. First of all, from figure 5 it is clear that the relationship between banks and processors is exceedingly close to linear; in fact the number of banks necessary to compensate for an increasing number of processors appears to be very closely proportional to the number of processors minus 1. This relation, except for the minus 1, matches the relation found in the Markov chain analysis above. Secondly, although it is not immediately clear from figure 4, logarithmic regression of the simulation results shows that the number of banks necessary to compensate for an increase in the bank reservation time t is proportional to approximately t^1.85. The corresponding relation from the Markov chain analysis is t(t + 1), which is equivalent to approximately t^1.96 over the range of the data in question. Relationships quite close to these were also found in other cases that were run with the simulator program.
The reduction of memory efficiency whenever the fraction of strides that are equal to one is not 100% (see figure 2) presents a dilemma of sorts to designers of supercomputers. It is clear that significantly less memory bank contention would result from designing hardware that does not allow strides other than one on most vector memory accesses. This approach has been taken, for example, by CDC in its Cyber 205 design. However, such a restriction reduces the ability of a computer to efficiently process Fortran data arrays by other than the first dimension. As a result most supercomputer users, particularly those who run codes with large multidimensional arrays, feel that a variable memory stride is a definite advantage in a vector computer design. Nevertheless, it is clear from these simulation results that memory efficiency will be lower with a variable stride architecture.
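The remark that t(t + 1) behaves like a power of t slightly below 2 can be illustrated with a log-log least-squares fit. The sketch below is only an illustration: the range of reservation times actually used for the regression is not stated here, so the range 10 to 80 ticks is a hypothetical choice, and the fitted exponent will shift somewhat with the range chosen.

```python
import math


def fitted_exponent(ts):
    """Least-squares slope of log(t * (t + 1)) against log(t)."""
    xs = [math.log(t) for t in ts]
    ys = [math.log(t * (t + 1)) for t in ts]
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx


def local_exponent(t):
    """Derivative of log(t * (t + 1)) with respect to log(t)."""
    return (2.0 * t + 1.0) / (t + 1.0)


if __name__ == "__main__":
    ts = range(10, 81)  # hypothetical range of reservation times, in ticks
    print("fitted exponent:", fitted_exponent(ts))
    print("local exponent at t = 40:", local_exponent(40))
```

Because the local exponent equals 2 - 1/(t + 1), any fit over a range of moderate t gives a value a little under 2, which is in line with the figure of roughly 1.96 quoted above.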
Conclusions
The analysis of the phenomenon of memory bank contention does indeed indicate the potential for substantial reductions in performance in new generations of supercomputers. For example, suppose a supercomputer were to be designed with eight central processing units and a two nanosecond clock. A number of the current technology DRAM (dynamic random access memory) chips now in production dictate a bank reservation time of roughly 120 nanoseconds, or 60 ticks. According to simulation runs based on the generic vector computer model described above, more than 5000 memory banks would be necessary to achieve an average efficiency of roughly 75%. This number is considerably greater than the 64 or 128 that characterize current designs. Thus it appears that future generations of vector computers must either be designed with memory chips substantially faster than those available today, or else they must feature much larger numbers of independent banks of memory. Failure to address this problem will result in catastrophic performance reductions.
In the future, it is likely that memory chips significantly faster than today's typical DRAM chips will be available for supercomputer memories. For example, a number of supercomputers feature static RAM chips with faster operation speeds than dynamic RAM chips. However, such fast chips cost considerably more and have only a fraction of the capacity of equivalent generation DRAM chips. This pattern can be expected to continue for the foreseeable future. Thus it seems probable that future designers and purchasers of supercomputers will have to make painful tradeoffs between performance and memory size. Systems may be available with either a smaller memory of faster chips and minor memory contention, or with a much larger memory of slower chips and substantial memory contention.
Increased memory size does of course have a number of advantages. In addition to the ease of programming large-scale scientific application codes, larger memories reduce the amount of time a supercomputer must spend transferring jobs in and out or waiting for I/O requests to be handled. Even with these advantages, though, it seems clear that computational performance will generally be degraded with the larger (slower) memories.
One possible solution to this dilemma is to design vector supercomputers with large main memories of the slower chips, but with a "cache" of much faster memory private to each CPU. With such a scheme some programs could still access very large data arrays, while other programs not requiring a large amount of data storage could use the cache memory, thus lessening the competition for main memory banks. However, problems arise with this design also. For one thing, swapping jobs in and out of a CPU could be much more time consuming if it is necessary to save the entire cache memory. In addition, nonstandard constructs may be required for the Fortran programmer to control which type of memory is assigned to his or her data arrays. Finally, unless a fairly large amount of cache memory is provided, most programs will be unable to perform a significant amount of their required computation using cache memory.
In any event, it is clear that both manufacturers and potential users of supercomputers must pay close attention to the problem of memory bank contention. It is thus hoped that the techniques described in this paper will be of assistance to such persons and will help prevent unacceptable reductions in supercomputer performance.
