Abstract
A memory design based on logical banks is analyzed for shared memory multiprocessor systems. In this design, each physical bank is replaced by a logical bank consisting of a fast register and subbanks of slower memory. The subbanks are buffered by input and output queues which substantially reduce the effective cycle time when the reference rate is below saturation. The principal contribution of this work is the development of a simple analytical model which leads to scaling relationships among the efficiency, the bank cycle time, the number of processors, the size of the buffers, and the granularity of the banks. These scaling relationships imply that if the interconnection network has sufficient bandwidth to support efficient access using high-speed memory, then lower-speed memory can be substituted with little additional interconnection cost. The scaling relationships are shown to hold for a full datapath vector simulation based on the Cray Y-MP architecture. The model is used to develop design criteria for a system which supports 192 independent reference streams, and the performance of this system is evaluated by simulation over a range of loading conditions.
Keywords: Buffered memories, logical memory banks, memory conflicts, vector processors, Cray Y-MP.
I. Introduction
The gap between memory speed and processor request rate is increasing rapidly in high performance systems. This gap is due to a decrease in processor cycle time, the use of superscalar and other multiple issue mechanisms, the increase in the number of processors in shared memory systems, and the demands of gigabit per second network communication. In addition, designers have sought to replace expensive SRAM memories with cheaper, slower DRAMs in order to support dramatically increased main memory sizes at a reasonable cost.
In the face of these demands, several manufacturers have introduced more complex circuitry on their DRAM chips in order to reduce the effective memory access time. Mitsubishi, for example, has introduced a proprietary cached DRAM. This chip has a small SRAM which reduces the memory access time if the reference is contained in the SRAM [9]. Another cached DRAM has been developed by Rambus [6]. Other approaches include synchronous DRAM technology [14] and enhanced DRAM technology [4]. Pipelined DRAM memories have also been proposed [11]. The effect of such memory hierarchies in high-performance memory systems has not been extensively studied.
In an ordinary interleaved memory, the memory cycle time is the minimum time required between successive references to a memory module. The cycle time regulates how quickly a processor can fill the memory pipeline. Conflicts due to bad reference patterns can cause the processor to block. The latency is the time it takes for a read request to navigate the memory pipeline and return a value to the processor.
In hierarchical memory systems, such as those which contain caches at the bank level, the memory cycle time and the latency are no longer constant. Caching can reduce both the effective cycle time and the latency. This paper explores buffering as an alternative or as a supplement to caching at the chip level. The proposed design is based on a buffering scheme called logical bank buffering in which physical banks are subdivided and buffered as described in Section II. The principal contribution of this paper is the development of a simple model and the derivation of scaling relationships among the efficiency, the bank cycle time, the number of processors, the size of the buffers, and the granularity of the banks. The goal of the logical bank design is to provide a mechanism for using large, slower memories with a moderate number of high performance processors while maintaining current operating efficiency.
A second contribution of this work is the full data-path simulation with register feedback for a realistic interconnection network. High performance machines, such as the Cray Y-MP, have a separate interconnection network for read return values. When simple memory banks are replaced by a memory hierarchy, references arrive at the return network at unpredictable times. Under moderate loading, the resulting contention does not appear to be a problem. Several approaches for equalizing performance of reads and writes under heavy loading are examined.
Buffering has been proposed by a number of authors as a possible solution to the problem of memory conflicts. A simulation study by Briggs [2] showed that buffering at the processor level in pipelined multiprocessors can improve memory bandwidth provided that the average request rate does not exceed the memory service time. Smith and Taylor [20] explored the effects of interconnection network buffering in a realistic simulation model. The simulations in this paper were based on a similar, but simpler interconnection network. The buffering in this paper is at the memory modules rather than within the interconnection network, and the emphasis of the simulations is on the verification of the scaling relationships.
Other proposals to reduce memory conflicts have been made. Skewing and related techniques [7, 8, 12, 13] have been shown to be effective in reducing intraprocessor conflicts. A recent simulation study by Sohi [21] explores skewing and input and output buffering for single reference streams consisting of vectors of length 1024 with fixed strides. Skewing techniques are not as effective for the situation considered in this paper where conflicts between processors are the main cause of performance degradation. Skewing can be used in conjunction with logical banks to reduce intraprocessor contention.
This study uses memory efficiency and throughput as its primary measures of memory performance. The efficiency of a memory system is defined as the ratio:
$$E = \frac{\text{Successful memory requests}}{\text{Total memory requests}}$$
The total number of memory requests includes those requests which are denied because of a conflict such as a bank conflict. It is assumed that when such a conflict occurs, the processor attempts the reference on the next cycle. This memory efficiency is measured from the viewpoint of the processor. It indicates the degree to which processors will be able to successfully issue memory references. The efficiency is essentially $P_A$, the probability of acceptance, as calculated by Briggs and Davidson [3] in their models of L-M memories without buffering.
Following Sohi [21], the throughput is defined as the ratio:
$$T = \frac{\text{Time for vector references in a conflict-free system}}{\text{Actual time for vector references}}$$
Sohi argues that this ratio is the appropriate throughput measure when comparing memory designs in a vector processing environment. The throughput is the fraction of the optimal rate at which entire vectors are delivered through the system. The vector element read latency is defined as the time between the first attempt to access a vector element and the availability of that element at the vector register.
The efficiency depends on the number of processors, the number of banks, the bank cycle time, and the load. Unbuffered designs for multiprocessor machines give a quadratic relationship between memory speed and number of banks for fixed performance [1]. If the memory cycle time is doubled relative to the processor speed, the interconnection costs must be quadrupled to maintain the same memory performance. The proposed design is a two-tier system. The results show that if the interconnection network bandwidth is sufficient to support the processors using high-speed memory, then lower-speed memory with buffers can be substituted for little additional interconnection cost or performance degradation.
Section II describes the logical bank design. An analytical model for writes is developed in Section III and is shown to be in reasonable agreement with random reference simulations in Section IV. The relationship between the efficiency and system parameters such as the bank cycle time, number of banks and number of processors is then analyzed. Several design criteria are developed which are applied in later sections to vector systems.
Section V introduces a simulation model which uses synthetically generated references for a vector multiprocessor system similar to the Cray Y-MP. The model incorporates a full data-path simulation including return conflicts and register feedback as recommended in the simulation study by Smith and Taylor [20]. The vector results are compared with model predictions under moderate processor loading in Section VI, and it is shown that even a small number of buffer slots can result in significant gains in performance. In Section VII the design criteria are applied to a 64-processor system (192 independent reference streams) under heavy processor loading for a range of stride distributions. Performance for writes is excellent, but there is degradation for reads. Several approaches for reducing this degradation are examined including subbank output buffering, additional lines, optimal arbitration, port handshaking, port-line buffers, and increased return bandwidth. It is found that only the last alternative completely eliminates the degradation. A final discussion and conclusions are presented in Section VIII.
II. Logical Banks
A logical bank [16] consists of a fast register, the logical bank register (LBR), and a number of subbanks each with a queue of pending requests as shown in Figure 1. The memory within the logical bank is divided into equal subbanks and addressed using the standard interleaving techniques so that consecutive addresses go to consecutive subbanks. A reference to the logical bank can be gated into the LBR in $T_{la}$ cycles if the register is free. $T_{la}$ is the logical bank access time. If there is room in the queue for the specified subbank, the reference is then routed to the queue. Otherwise, the LBR remains busy until a slot is available. Only one reference to a logical bank can occur during an interval $T_l$. $T_l$ is called the logical bank cycle time. It is the minimum time interval between successful references to a logical bank. The interval may be longer if the LBR is waiting for a queue slot. If reference streams attempt to access the same logical bank while the LBR is busy, a logical bank conflict occurs and all but one reference are delayed. In the model and all of the simulations discussed later, it is assumed that $T_l = T_{la}$.
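The structure described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the class name `LogicalBank`, its methods, and the rule that picks a subbank as `address % k` are hypothetical, and all timing ($T_l$, $T_d$, $T_c$) is deliberately left out.

```python
from collections import deque


class LogicalBank:
    """Structural sketch of one logical bank: an LBR plus k bounded subbank queues."""

    def __init__(self, k, m):
        self.k = k                      # subbanks per logical bank
        self.m = m                      # slots in each subbank input queue
        self.lbr = None                 # reference currently held in the LBR
        self.queues = [deque() for _ in range(k)]

    def offer(self, address):
        """Try to gate a reference into the LBR; False models a logical bank conflict."""
        if self.lbr is not None:
            return False
        self.lbr = address
        return True

    def drain_lbr(self):
        """Move the LBR contents to its subbank queue if a slot is available."""
        if self.lbr is None:
            return False
        queue = self.queues[self.lbr % self.k]   # assumed interleaving rule
        if len(queue) >= self.m:
            return False                # LBR stays busy until a slot frees up
        queue.append(self.lbr)
        self.lbr = None
        return True

    def service(self, subbank):
        """Remove the oldest pending reference from a subbank queue (timing omitted)."""
        queue = self.queues[subbank]
        return queue.popleft() if queue else None
```

With `k=2` and `m=1`, a second reference to a subbank whose queue is already full stays in the LBR, and every later `offer` is refused until `service` frees a slot, which is the blocking behaviour the text attributes to a busy LBR.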
The parameters which define a multiprocessor system with shared memory organized into logical banks are shown in Table 1.
The default values are used in later simulations except where otherwise indicated. For single-port processors the number of reference streams, $n$, is also equal to the number of processors. In the vector simulations, the processors are allowed to have multiple ports so there can be more reference streams than processors. Reference streams are assumed to be either read or write. Read streams are more difficult to handle because values must be returned to the processor. The return fan-in network for reads requires additional hardware for arbitration because read values do not arrive at the fan-in network at a predictable time.
The return values must include tag bits indicating the destination. This hardware is also required if cached DRAMs are used since the effective access time is no longer constant in that case either. In fact, cached DRAMs can be used in conjunction with logical banks to reduce the effective physical subbank cycle time, $T_c$, with little additional hardware.
Logical banks were introduced by Seznec and Jegou to support the Data Synchronized Pipeline Architecture (DSPA) [19]. Their design includes a reordering unit so that data flows out of the logical bank in chronological order. In the scheme proposed in this paper, a reordering unit is not required at the logical bank, because reordering occurs at the processor.
Buffering is distinct from caching in that there is no miss penalty and no overhead for cache management. The proposed circuitry would take up a very small chip area if it were incorporated on a chip. Alternatively, it could be built as an interface between off-the-shelf memory chips and the system interconnection network. It is particularly appropriate in situations in which average utilization is below maximum capacity, but where there are periods of maximal loading. In addition to reducing average access time, logical banks can smooth the type of bursty memory traffic which is typical of highly vectorized programs [17].
III. A Model for Random Writes
A model for the efficiency of logical bank memories is now derived. In later sections the throughput and latency are related to the efficiency. It is assumed that a reference stream can initiate at most one reference per clock cycle and that when a reference attempt fails, it is retried by that reference stream on the following cycle. As long as the LBR is available, the processor sees a memory consisting of logical banks with a cycle time of $T_l$, the logical bank cycle time. The efficiency in this case is given by $E_l$. This efficiency is determined by the inter-reference-stream conflicts at the logical bank level. When the queues are full, the memory behaves almost as though there were no logical banks. The effective memory cycle time in this case is $T_p = T_l + T_d + T_c$ where $T_d$ is the minimum delay incurred in transferring a reference from the queue to the subbank and $T_c$ is the physical memory cycle time. The efficiency in this case is denoted by $E_p$.
A simple probabilistic argument shows that if the probability of a successful reference is $E$, the expected number of attempts per successful reference is:
$$\sum_{j=1}^{\infty} j\,E\,(1-E)^{j-1} = \frac{1}{E}$$
$1/E$ is the average number of cycles that it takes a reference stream to initiate a reference from the viewpoint of the processor. In contrast, the average reference time from the viewpoint of the physical memory is directly related to the bank cycle time and other delays.
Let $P$ be the probability that the logical bank register (LBR) is available when a reference is first initiated. A successful reference will take $1/E_l$ attempts with conditional probability $P$ and $1/E_p$ attempts with probability $1 - P$. The effective efficiency is then a weighted average of the two cases depending on the probability that there are slots available in the appropriate subbank queue. The average number of cycles for a successful reference can then be estimated by:
$$\frac{1}{E} = \frac{P}{E_l} + \frac{1-P}{E_p}$$
where $E$ is the combined or effective efficiency. This relationship can be written as:
$$E = \frac{E_l E_p}{P E_p + (1-P) E_l}$$
This expression for the effective efficiency will be called the logical bank model in the remainder of the paper. The probability, $P$, that the LBR is not full can be estimated by considering each logical bank as a system of $k$ independent queues under the M/D/1/B queuing discipline. This queuing model has Poisson arrivals (exponentially distributed interarrival times), deterministic service time, one server, and a finite queue. For fixed queue size, the distribution of the number of references in the queue depends on the parameter $\rho = \lambda T_p$ where $\lambda$ is the average arrival rate and $T_p$ is the effective queue service time. $\lambda$ can be estimated as $nq/b$ where $q$ is the probability that a free stream initiates a reference, $n$ is the number of independent reference streams, and $b$ is the number of physical subbanks ($b = kl$). A simple simulation is used to compute a table of probabilities for a given value of $\rho$ and queue size $m$.
Once the probabilities that the individual queues are free have been determined, the value of $P$ for the entire logical bank can be estimated as follows. If there are $k$ subbanks per logical bank, the LBR will be busy if any of the $k$ queues has $m + 1$ slots filled, the extra slot being from the LBR itself. Thus, if $f$ is the probability that a queue of size $m + 1$ is not full, then the probability that the LBR is free is $f^k$. This method is used to calculate $P$ for the graphs given later.
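The estimation procedure described above can be sketched as follows. This is a minimal illustration, not the simulation actually used for the graphs: the function names, the event-driven structure, the default run length, and the seed are all assumptions. It estimates the probability $f$ that one subbank queue of capacity $m + 1$ is not full under M/D/1/B with discarded arrivals, forms $P = f^k$, and combines $E_l$ and $E_p$ through the weighted average of $1/E_l$ and $1/E_p$.

```python
import random
from collections import deque


def prob_queue_not_full(rho, capacity, service_time=1.0, arrivals=100_000, seed=1):
    """Estimate the chance that an arriving reference finds fewer than
    `capacity` references in one subbank (M/D/1/B, blocked arrivals discarded)."""
    rng = random.Random(seed)
    arrival_rate = rho / service_time
    now = 0.0
    departures = deque()            # departure times of references still present
    found_room = 0
    for _ in range(arrivals):
        now += rng.expovariate(arrival_rate)
        while departures and departures[0] <= now:
            departures.popleft()    # these references have already left
        if len(departures) < capacity:
            found_room += 1
            # Service starts when the reference ahead of it departs, or immediately.
            start = departures[-1] if departures else now
            departures.append(start + service_time)
    return found_room / arrivals


def lbr_free_probability(f, k):
    """P = f**k: the LBR is free only if none of the k subbank queues is full."""
    return f ** k


def effective_efficiency(P, E_l, E_p):
    """Logical bank model: 1/E is the P-weighted average of 1/E_l and 1/E_p."""
    return 1.0 / (P / E_l + (1.0 - P) / E_p)
```

For example, `prob_queue_not_full(0.5, 3)` corresponds to two queue slots plus the LBR at a load of 0.5, and `effective_efficiency(lbr_free_probability(f, k), E_l, E_p)` then gives the combined estimate for a bank with `k` subbanks.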
Estimates for $E_p$ and $E_l$ will now be derived. The situation in which there are no logical banks has been analyzed for random accesses by Bailey [1] for systems with $n$ single-port processors and $b$ memory banks. Let $T$ be the memory bank cycle time. Each processor can be modeled as a Markov chain on $T + 1$ states $s_0, s_1, \ldots, s_T$, where $s_i$ represents the state in which the processor is waiting for a bank which will be busy for $i$ more cycles and $s_0$ denotes the state in which a processor accesses a free bank. Bailey derived a steady state expression for the efficiency in such a system as:
$$E_B(q, T, n, b) = \frac{2q}{2q - 1 + \sqrt{1 + 2q^2 nT(T+1)/b}}$$
Here $q$ represents the probability that a free processor will attempt a reference on the current clock cycle. For relevant values of $T$, $n$, and $b$, the efficiency is dominated by the expression in the square root:
$$E_B \approx \frac{1}{T}\sqrt{\frac{2b}{n}}$$
The efficiency is inversely proportional to $T$ and drops off fairly rapidly. If the bank cycle time is doubled, the number of banks must be increased by a factor of four to maintain the same efficiency. The Bailey model can be used to estimate $E_p$ by using $T_p = T_c + T_d + T_l$ for the bank cycle time:
$$E_p = E_B(q, T_p, n, b)$$
The logical bank efficiency, $E_l$, can also be estimated using the Bailey model as $E_l = E_B(q, T_l, n, l)$ where $l$ is the number of logical banks in the system. $T_l$, the logical bank cycle time, is assumed to be one for much of the discussion in this paper. Due to the assumptions made in the derivation of the Bailey model, it performs poorly when there are a small number of processors or when the bank cycle time is near one. Unfortunately the effective efficiency is very sensitive to the value of $E_l$ when $P$ is close to one, so another model will be developed for $E_l$ when $T_l = 1$. This model will be called the direct model and is derived below using Markov chains in a manner similar to that used by Bailey.
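The Bailey expression is easy to evaluate directly. The sketch below assumes the reading $E_B = 2q / (2q - 1 + \sqrt{1 + 2q^2 nT(T+1)/b})$; the function name is hypothetical. Under that reading it gives about 0.68 for $q = 0.4$, $T = 5$, $n = 24$, $b = 256$, and about 0.56 for $q = 1$, in line with the unbuffered efficiencies quoted in Section IV.

```python
import math


def bailey_efficiency(q, T, n, b):
    """Bailey's steady-state efficiency for n streams, b banks, bank cycle time T."""
    return 2.0 * q / (2.0 * q - 1.0 + math.sqrt(1.0 + 2.0 * q * q * n * T * (T + 1) / b))


# Unbuffered baseline for a Cray Y-MP-like configuration.
base = bailey_efficiency(0.4, 5, 24, 256)

# Doubling the cycle time at a fixed number of banks lowers the efficiency;
# quadrupling the number of banks brings it back to roughly the same level.
slower = bailey_efficiency(0.4, 10, 24, 256)
compensated = bailey_efficiency(0.4, 10, 24, 1024)
```

The same function can be reused for $E_p$ by passing $T_p$ and $b$, or for $E_l$ by passing $T_l$ and $l$.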
Assume that each reference stream is in one of three states: the free state (1), a state in which it is making a successful reference (2), or the state in which it is attempting a reference which is unsuccessful (3). Let:
$q$ = probability that a given free stream will attempt a reference
$\alpha$ = probability that the stream is in the free state
$\beta$ = probability that the stream is making a successful reference
$\gamma$ = probability that the stream is making an unsuccessful reference
$\varepsilon$ = probability that a reference attempt will be successful
The following probability conservation equation holds:
$$\alpha + \beta + \gamma = 1$$
The matrix, $\Gamma$, of state transition probabilities is given in Table 2. $\Gamma_{i,j}$, the entry in the $i$-th row and the $j$-th column, represents the conditional probability that the next state is $i$ given that the current state is $j$.
A reference attempt will be successful if no higher priority stream is making a reference to the same bank. On the average half of the remaining streams will have a higher priority than a given stream. Since only one higher priority stream can make a successful reference to a bank, the probability that one of the $(n - 1)/2$ higher priority streams is making a successful reference to one of $l$ banks may be estimated as:
$$\frac{\beta (n - 1)}{2l}$$
and so $\varepsilon$ is given by:
$$\varepsilon = 1 - \sigma\beta$$
where:
$$\sigma = \frac{n - 1}{2l}$$
Let $p = (\alpha, \beta, \gamma)$ be the vector of a priori probabilities of the three states. The steady state probabilities can be obtained from the relationship $p = \Gamma p$.
These equations plus the conservation equation can be used to obtain an expression for $\varepsilon$.
The direct model is in better agreement with the simulations when $T_l = 1$, and it will be used to estimate $E_l$ for the remainder of the paper.
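The direct model can be sketched numerically under explicit assumptions. Table 2 is not reproduced here, so the transition structure below is assumed rather than taken from it: a stream that has just completed a successful reference behaves like a free stream (consistent with $T_l = 1$), and a stream whose attempt failed always retries. With the success probability written as $\varepsilon = 1 - \sigma\beta$ and $\sigma = (n-1)/(2l)$, the steady state then reduces to the quadratic $(1-q)\varepsilon^2 + (2q - 1 + \sigma q)\varepsilon - q = 0$. The function name and the symbols `sigma` and `eps` are illustrative.

```python
import math


def direct_model_efficiency(q, n, l):
    """Success probability of a reference attempt for T_l = 1 (assumed transitions).

    Solves (1 - q) * eps**2 + (2q - 1 + sigma*q) * eps - q = 0 for the root in [0, 1],
    written in a form that stays stable as q approaches 1.
    """
    if q <= 0.0:
        return 1.0                      # no traffic, so no conflicts
    sigma = (n - 1) / (2.0 * l)
    b = 2.0 * q - 1.0 + sigma * q
    return 2.0 * q / (b + math.sqrt(b * b + 4.0 * q * (1.0 - q)))
```

At full load ($q = 1$) the expression collapses to $1/(1 + \sigma)$, which matches the closed form for $E_l$ used in the design discussion of Section IV.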
IV. Predictions of the Model
Buffering can produce fairly dramatic improvements in efficiency provided that the memory system is not close to saturation. In this section, the logical bank model is compared with model simulations for random references. The excellent agreement of this comparison validates the model and suggests relationships between the design parameters which are necessary for achieving a particular level of efficiency. In the following sections a vector simulation model is compared with the random-reference model, and the relationships suggested by the analytical model are tested.
Consider a multiprocessor system which has 24 independent reference streams and a shared memory consisting of 256 banks. (These parameters represent an eight-processor Cray Y-MP with three ports per processor and a maximal memory configuration.) The performance of this system is now compared with that of an augmented system in which each physical bank is replaced by a logical bank consisting of a single subbank with a queue size of two. This case corresponds to adding buffering at the physical bank level without adding any additional logical bank structure other than the buffers and the LBR. The reference streams are assumed to generate writes only and $T_l = T_d = 1$.
In Figure 2 the efficiency for random reference streams is plotted versus subbank cycle time. Following Bailey [1] a reference rate of $q = 0.4$ is selected to give a base operating efficiency in the unbuffered case of 0.67. The logical bank model agrees well with results from scalar simulations for operating efficiencies above 0.6. A significant improvement in performance is observed with buffering. When the bank cycle time is 18, the simulation shows an efficiency of 0.22 without buffering and an efficiency of 0.66 with buffering. (The memory efficiency of a real Cray Y-MP is higher than the predicted 0.67, because selected buffering mechanisms are incorporated at various stages in the Cray Y-MP interconnection network as described by Smith and Taylor [20].)
In Figure 3 the efficiency is plotted versus the number of reference streams. Again there is excellent agreement between the model predictions and those of random reference simulations. The base efficiency of 0.67 can be maintained with as many as 96 reference streams when buffering is introduced at the bank level. The logical bank model overestimates the efficiency near saturation because the M/D/1/B queuing model assumes that references are thrown away when the queues are full. In the model, the processor attempts to initiate a reference on the next cycle with a certain probability $q < 1$. In the real system and in the simulation the reference is retained and tried again on the next cycle.
A simple analysis is now presented which shows that queue sizes which are quite small can give substantial improvements in performance. Table 3 shows the probability that the number of items in the queue is less than the queue size, $m$, for different values of $m$ and different system loadings. If $\rho = 0.5$, the probability that a queue will have fewer than three slots (two queue slots plus the LBR) filled is 0.9731. This value is indicative that small queues will suffice. The estimate may not be completely accurate near saturation, because references which are not fulfilled are thrown away in the model. Hence, the infinite queue case is now considered.
In the infinite queue model, references are never blocked, but are always queued. The expected queue size for each subbank in the infinite queue case is [10]:
$$\text{Ex(Queue Size)} = \frac{\rho^2}{2(1 - \rho)}$$
For $\rho < 0.5$, the expected queue size is less than 0.25 for each subbank. Table 4 shows the probability that a queue contains no more than $x$ items in the infinite queue case. The probability that two or fewer slots are filled is 0.947 for $\rho = 0.5$. This probability is on the borderline of reasonable performance. The probability of having four or fewer elements in the queue is 0.9957. A subbank input queue size of four should be adequate to handle most references ($P \approx 1$), and the effective efficiency will be $E_l$. For fixed load $q$, $E_l$ is constant for constant:
$$\sigma = \frac{n - 1}{2l}$$
Furthermore, $E_l \to 1$ as $\sigma \to 0$, and $E_l > 0.9$ when $\sigma < 0.1$ since $0 \le q \le 1$. This result is obtained by noting that $E_l$ is a decreasing function of both $q$ and $\sigma$. The condition $\sigma < 0.1$ means there should be at least five times as many logical banks as there are reference streams for efficient performance.
The buffered and unbuffered cases can be compared in the case where there is one subbank per logical bank so that $b = l$. Consider the effect of buffering on efficiency when $\rho = qT_p n/b$ is held constant at 0.5 as the subbank cycle time and the number of banks are both increased. In this paper it is assumed that $T_l = T_d = 1$, so $T_p \approx T_c$. $\sigma$ will be decreasing since $\sigma$ only depends on the number of processors and the number of banks. When the queue size is four, the probability of a logical bank hit is 0.9957 so the efficiency is approximately $E_l$. Since $E_l$ is independent of $T_c$ and is an increasing function of $l = b$, the logical bank model predicts that the efficiency will actually increase slowly as the bank cycle time and number of banks are increased with $\rho$ held fixed at 0.5. Thus, for a fixed reference rate, $q$, a doubling of $T_c$ can be compensated for by doubling the number of banks or by halving the number of processors (reference streams). This is in contrast to a system without logical banks where $T_c\sqrt{n/b}$ must remain fixed to maintain the same efficiency. In systems without logical banks one would have to quadruple the number of banks in order to compensate for doubling the bank cycle time. When the above argument is applied to the same system with a queue size of two, the probability of a logical bank hit is at least 0.947. The efficiency is now a weighted average of the relatively constant logical bank efficiency, $E_l$, and the unbuffered efficiency, $E_p$. (The latter efficiency drops off rapidly with bank cycle time.)
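The infinite-queue figures quoted above can be checked with a short calculation. The sketch below is an assumption-laden illustration: it uses the standard embedded-chain recursion for an M/D/1 queue, with $a_k = e^{-\rho}\rho^k/k!$ arrivals per service time, and it interprets the tabulated counts as including the reference in service, an interpretation chosen because it reproduces the quoted 0.947 and 0.9957 at a load of 0.5. The function names are hypothetical.

```python
import math


def md1_expected_queue(rho):
    """Expected number of waiting references in an M/D/1 queue."""
    return rho ** 2 / (2.0 * (1.0 - rho))


def md1_occupancy(rho, max_items):
    """Stationary probabilities of 0..max_items references in an M/D/1 system.

    Uses pi_j = pi_0*a_j + sum_{i=1..j+1} pi_i*a_{j-i+1}, solved for pi_{j+1}.
    """
    a = [math.exp(-rho) * rho ** k / math.factorial(k) for k in range(max_items + 1)]
    pi = [1.0 - rho]                    # probability the subbank is idle
    for j in range(max_items):
        carried = sum(a[j - i + 1] * pi[i] for i in range(1, j + 1))
        pi.append((pi[j] - a[j] * pi[0] - carried) / a[0])
    return pi


probs = md1_occupancy(0.5, 4)
two_or_fewer = sum(probs[:3])
four_or_fewer = sum(probs[:5])
```

Evaluating `md1_expected_queue(0.5)` gives 0.25, the bound on the expected queue size used in the argument for small queues.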
To confirm these relationships in the models with and without logical banks, the load $q$ is fixed at 0.4, and the number of reference streams is fixed at 24. In Figure 4 the efficiency is plotted versus the subbank cycle time when the number of banks is varied so that $\rho$ is held constant at 0.5. The logical bank model maintains an almost constant efficiency as the bank cycle time is increased as predicted for queue size of four. The system with queue size two shows a slight fall-off. The efficiency is initially lower than the asymptotic value because when the bank cycle time is small and $\rho$ is held at 0.5, there are so few banks that logical bank conflicts become significant. When the Bailey model is run for the same parameter values, the efficiency drops dramatically as predicted by the model. Similar scaling relationships can be derived when the bank cycle time is fixed and the number of banks and the number of reference streams are varied.
One can use the relationship between $\rho$ and $E$ to determine design parameters required to achieve a specified level of performance. The Cray Y-MP has eight processors and three ports per processor (24 independent reference streams), a bank cycle time of five, and 256 physical banks. If a maximum reference rate of $q = 1.0$ is assumed, then $\rho = 0.468$ and $E_l = 1/(1 + \sigma)$. With a queue size of four, the probability of a logical bank hit is nearly one. The efficiency can be simply estimated from the previous expression for $E_l$. When there are 256 logical banks (one subbank per logical bank), $\sigma = 0.09$ and $E_l = 0.92$. The logical bank model predicts that buffering with four queue slots will result in a high efficiency. The unbuffered efficiency is predicted by the Bailey model for these parameters to be 0.56. These model predictions are tested in Section VII for a vector simulation.
The results of the logical bank model can be summarized as follows. For a fully loaded system ($q = 1.0$) consisting of $n$ reference streams, $l$ logical banks, a logical bank cycle
$$\frac{T_p n}{b} < 0.5 \quad \text{and} \quad \frac{n}{2l} < 0.1$$
For a queue size of two, $\rho$ should be chosen to be less than 0.2.
Notice that the first relationship depends on the total number of subbanks, $b$, while the second relationship depends on the number of logical banks, $l$. The per processor interconnection costs depend on $l$. As long as there are enough logical banks to adequately field requests from the processors, an increase in bank cycle time can be compensated for by an increase in the number of subbanks without a significant increase in the interconnection costs. There is a point, however, at which the data bus arbitration scheme will not be able to handle read return traffic. This point is discussed more fully in Section VII.
An increase of $T_l$ above one has the effect of lowering the overall efficiency, but the curves have the same shape. Design parameters can be determined in this case by using the Bailey model to estimate $E_l$ when $q = 1$.
The efficiency now depends on the parameter $nT_l(T_l + 1)/l$. An efficiency of 0.90 can be obtained provided that this parameter is less than 0.25 and $\rho < 0.5$.
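The design rules of this section can be collected into a small checker. This is a hedged sketch: the function names are hypothetical, the thresholds are the ones stated in the text for a queue size of four ($T_p n/b < 0.5$ together with $n/(2l) < 0.1$ when $T_l = 1$, or $nT_l(T_l+1)/l < 0.25$ when $T_l > 1$), and $T_p$ is approximated by $T_c$ as in the worked Cray Y-MP example.

```python
def queue_load(T_p, n, b, q=1.0):
    """Load offered to each subbank queue: q * T_p * n / b."""
    return q * T_p * n / b


def meets_design_criteria(n, l, b, T_p, T_l=1):
    """Check the stated rules for a fully loaded system with a queue size of four."""
    if queue_load(T_p, n, b) >= 0.5:
        return False                    # subbank queues would fill too often
    if T_l == 1:
        return n / (2.0 * l) < 0.1      # enough logical banks for the streams
    return n * T_l * (T_l + 1) / l < 0.25


# 24 streams, 256 logical banks with one subbank each, T_p taken as T_c = 5.
ymp_like = meets_design_criteria(n=24, l=256, b=256, T_p=5)
```

Because the first rule involves $b$ and the second involves $l$, a slower memory can be accommodated by raising the number of subbanks per logical bank while leaving $l$, and hence the interconnection, unchanged.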
V. Vector Simulation Model
In order to test the performance of the logical bank organization and the predictions of the logical bank model, a simulation study based on the Cray Y-MP architecture interconnection network was developed. The Cray Y-MP architecture was selected because its highly pipelined interconnection network can provide an effective $T_l = 1$. This is accomplished by having references issue immediately to the interconnection network and block later if conflicts should arise. A complete data path simulation of this system with processor register feedback was performed with reference streams which were generated randomly under realistic assumptions. The simulation includes ports, lines, sections, and subsections as described below. The vector simulation model assumes there are $n_p$ processors each with $p$ ports. Each processor can initiate up to $p$ memory operations on a cycle. These ports are assumed to generate independent reference streams ($n_p p = n$). Each port is designated either as a read stream or a write stream.
The interconnection model is a simplified version of the network described by Smith and Taylor [20]. Each processor has four lines which are direct connections to particular sections of memory. The ports from a particular processor access memory through a crossbar connection to the processor's four lines. The section number is determined by the lowest two bits of the address, so consecutive references are directed to different sections of memory. Each section is divided into eight subsections and the individual subsections are further subdivided into banks. In the case of the Cray Y-MP which has 256 banks, each subsection contains eight banks. In simulations of systems with $n_p$ processors and $l$ logical banks, the number of subsections is fixed at eight and the number of banks per subsection is increased. The interconnection can then be described by:
$$n_p \text{ processors} \to 4 \times 4 \to n_p \times 8 \to 1 \times l/8 \to l \text{ memory banks}$$
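The section selection rule stated above is simple enough to show directly. The helper name `section_of` is hypothetical, and only the rule given in the text (the lowest two address bits select one of four sections) is modeled; the mapping of the remaining bits to subsections and banks is not specified here and is left out.

```python
def section_of(address):
    """Section number taken from the lowest two bits of the address."""
    return address & 0b11


# A stride-one vector visits the four sections in rotation.
stride_one = [section_of(a) for a in range(8)]

# A stride-four vector, by contrast, keeps returning to the same section.
stride_four = [section_of(a) for a in range(0, 32, 4)]
```

This illustrates why a stride of one spreads a processor's references evenly over its four lines, while strides that are multiples of four concentrate them on a single line.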
In the Cray Y-MP, a processor can access a particular subsection once every $T_c$ cycles where $T_c$ is the physical bank cycle time. This means that when a processor accesses a memory bank, the processor is blocked from issuing additional references to the entire subsection containing this bank for the full bank cycle time. Such a conflict is called a subsection conflict and, like the section conflict, is strictly an intraprocessor conflict. References from different processors to the same subsection can proceed without conflict provided that they are addressed to banks which are not already in use.
The simulation for the logical banks is based on the conflict scheme described above. When a particular memory location is referenced, the line, subsection, logical bank, and subbank numbers are calculated. If the line is free, it is reserved for $T_r$ cycles and the subsection is checked. If the subsection is free, it is reserved for $T_s$ cycles, and the logical bank is checked. If the logical bank register (LBR) is free, it is reserved for $T_l$ cycles and the reference is initiated. The reference generates a hold and fails to issue if a conflict occurs at any level.
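The issue check described above can be sketched as a chain of reservations. This is a minimal illustration, not the simulator itself: the class `IssueModel`, its default reservation times, and the use of caller-supplied keys for the line, subsection, and logical bank are assumptions. Following the text literally, a level that is found free is reserved before the next level is checked, so a reference that fails at a later level has still occupied the earlier ones.

```python
class IssueModel:
    """Tracks when each line, subsection, and LBR next becomes free."""

    def __init__(self, T_r=1, T_s=1, T_l=1):
        self.T_r, self.T_s, self.T_l = T_r, T_s, T_l
        self.line_free_at = {}
        self.subsection_free_at = {}
        self.lbr_free_at = {}

    def try_issue(self, now, line, subsection, bank):
        """Return 'issued' or the name of the conflict that holds the reference."""
        if self.line_free_at.get(line, 0) > now:
            return "line conflict"
        self.line_free_at[line] = now + self.T_r
        if self.subsection_free_at.get(subsection, 0) > now:
            return "subsection conflict"
        self.subsection_free_at[subsection] = now + self.T_s
        if self.lbr_free_at.get(bank, 0) > now:
            return "bank conflict"
        self.lbr_free_at[bank] = now + self.T_l
        return "issued"
```

Lines and subsections would be keyed per processor, for example `("p0", 2)`, because those conflicts are intraprocessor, while the logical bank key is shared by all processors.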
Once a reference has occupied the LBR for $T_l$ cycles, it can be moved to the appropriate subbank queue if that queue is not full. The reference must spend $T_d$ cycles in the queue before it can be processed by the physical memory. It is assumed that the reference must occupy the subbank for at least $T_c$ cycles before the subbank can accept another reference. If the operation is a write, the subbank is free to accept another reference after $T_c$ cycles. Reads are complicated by the return trip to the processor as now described.
A vector read reference is not considered to be completed until all of the element values have arrived at the processor. Read data values must be routed from the physical memory bank to the appropriate processor vector register. Additional conflicts may occur because more than one value may become available on a particular cycle. Each logical bank has a single output latch. If the latch is free, the value is moved from the subbank to the latch and the subbank is freed. If the latch is busy, the subbank must wait until the latch is free before accepting another value. If an output queue is included for each subbank as shown in Figure 1, the value is moved from the subbank to the output queue and blocking of the subbank due to return conflicts is less likely to occur.
All of the data values latched for a particular processor line compete for processing on the return interconnection network. The real system has separate forward and return interconnection networks. To simplify the simulation, the return interconnection network is modeled as a pipeline which can accept one value per section per cycle. The pipeline length is assumed to be ten which accounts for the length of both the forward and return pipelines. When the last value for a vector read has emerged from the pipeline, the read is considered to be complete.
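The simplified return path can be sketched as a fixed-length shift pipeline, one per section, that accepts at most one value per cycle. The class name `ReturnPipeline` and its `step` interface are assumptions made for this illustration; the default length of ten follows the text.

```python
from collections import deque


class ReturnPipeline:
    """Fixed-length pipeline that accepts at most one value per cycle."""

    def __init__(self, length=10):
        self.stages = deque([None] * length)

    def step(self, value=None):
        """Advance one cycle: insert `value` (or a bubble) and return what emerges."""
        emerged = self.stages.popleft()
        self.stages.append(value)
        return emerged
```

A vector read would be marked complete on the cycle in which `step` returns its last element.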
The simulation also incorporates the feedback loop between the vector registers and memory. Each processor has a certain number of vector registers (eight was assumed for the runs in this paper). When a vector operation is initiated in the simulation, a free processor vector register is randomly selected and reserved for the duration of the operation. If no register is available, the operation holds until a register becomes available. For a vector read, the register reserved for the operation is not freed until all of the elements of the vector have arrived at the processor. In contrast, a vector write is considered to be completed when the last element operation has been issued. The vector register is freed at that time, although the actual memory value may not be inserted until sometime later because of buffering.
Priority in the simulation is rotated among the processors in a circular fashion so that no processor is favored. This scheme is similar to the priority scheme used on the Cray X-MP. The Cray Y-MP uses a fixed subsection priority scheme which does not lend itself to modification when the number of processors is varied. The priority scheme should have little effect on the results of the simulation.
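A circular priority rotation of this kind can be sketched in a few lines. The function name `priority_order` and the choice to advance the highest-priority processor by one on every cycle are assumptions; the text does not state how often the rotation advances.

```python
def priority_order(cycle, num_processors):
    """Return processor indices from highest to lowest priority for this cycle.

    Assumption: the highest-priority position advances by one each cycle.
    """
    start = cycle % num_processors
    return [(start + i) % num_processors for i in range(num_processors)]
```

Over any window of `num_processors` consecutive cycles, each processor holds the highest priority exactly once.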
The simulation generates a representative reference stream for vectorized code as described below. All memory operations are assumed to be vector operations with an associated stride and length. Gather/scatter operations are not considered in this simulation. The stride is a fixed interval between successive references within a single vector operation. A stride of one is assumed to be the most probable, with other strides up to a maximum stride being equally probable. A default probability of stride-one vectors of 0.75 is used unless otherwise indicated. The effect of type of load on performance is examined in Section VII.
The maximum length of the vector operations is determined by the length of the vector registers in the processor. When an operation on a very long vector is required, the compiler splits it into several vector operations, all but one of which use the maximum vector register length. All possible vector lengths are assumed to be equally probable, except that the maximum length is assumed to occur more frequently.
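The stride and length distributions described above can be sketched as two sampling functions. The function names are hypothetical, and two details are assumptions rather than statements from the text: non-unit strides are drawn uniformly from 2 up to `max_stride`, and non-maximal lengths are drawn uniformly from 1 up to `v_l - 1`, with `p_l` taken as the probability of a maximum-length operation.

```python
import random


def draw_stride(rng, p_s, max_stride):
    """Stride 1 with probability p_s, otherwise uniform on 2..max_stride.

    Requires max_stride >= 2.
    """
    if rng.random() < p_s:
        return 1
    return rng.randint(2, max_stride)


def draw_length(rng, p_l, v_l):
    """Maximum length v_l with probability p_l, otherwise uniform on 1..v_l - 1.

    Requires v_l >= 2.
    """
    if rng.random() < p_l:
        return v_l
    return rng.randint(1, v_l - 1)


# Example: one vector operation with the default stride-one probability.
rng = random.Random(1)
operation = (draw_stride(rng, 0.75, 8), draw_length(rng, 0.5, 64))
```

The values 8, 0.5, and 64 in the example are placeholders for illustration, not parameters taken from the runs in this paper.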
The system load is determined by the operation initiation rate. When a port is free there is a certain probability, p_f, that a memory operation will be initiated on the current cycle. The value of p_f may be different for read and write ports and is a measure of the system load. A relationship between p_f and the scalar reference rate q is now derived in order to compare the vector case to the scalar case already discussed. Let V_L be the maximum allowed vector length and p_l be the probability of a maximum length reference. The average length of a vector reference is then: Table 5. The scalar simulations of Section IV were performed by setting V_L = 1, r = 0, T_d = 0, T_c = 1, f_l = 9, and T_r = T_s = 0. In the unbuffered case, T_l is the physical bank cycle time as seen by the interconnection network.
The vector data path simulator was written in C and run on a network of SUN workstations. Most of the vector simulations done for this paper were run for one hundred thousand cycles, although some runs were as long as ten million cycles. Each run was divided into blocks of cycles and the statistics were computed over each block in addition to over the entire run. The statistics for the longer runs did not vary significantly from those of the shorter runs. This lack of variation is not unexpected since each port on each processor can initiate a reference on each cycle, so there are a large number of independent reference streams over which the statistics are averaged.
The agreement between model and simulation is better for loads which have a random component. The case where three fourths of the strides are one and the remainder are randomly distributed (p_s = 0.75) is also shown in Figure 5. In this case, the efficiency is 0.68 for a subbank cycle time of 20. The overall efficiency is slightly below the model, but the fall-off occurs in roughly the same place as predicted by the model. The logical bank model, which corresponds to random reference streams, falls in between the simulations for the two different stride distributions. The agreement between the logical bank model and the vector simulation is quite good considering that the vector simulation includes lines, subsections, and register feedback. The dip in efficiency at bank cycle times which are integral multiples of four is a real phenomenon which is preserved over very long simulation runs.
In Figure 6 the efficiency is plotted versus the number of reference streams when the subbank cycle time is five. The remaining parameters are the same as in Figure 5. The number of reference streams could be quadrupled from 24 (eight processors) to 96 (32 processors) while still maintaining an efficiency of 0.67.
The analysis of Section IV for random references predicts that the efficiency will be high and will increase slightly as the number of banks and the bank cycle time are increased, keeping the quantity q T_c n / b constant at a value of 0.5. This scaling relationship holds in the case of vector references as well. In Figure 7 the number of reference streams is fixed at 24 (eight processors) and the number of subbanks per logical bank is one. The number of banks is varied linearly with the bank cycle time in order to keep this quantity constant at 0.5. The condition < 0.1 is satisfied when the number of logical banks is greater than 120, which is the case provided that T_c > 3 for q T_c n / b held constant at 0.5. For comparison, a simulation with the same set of parameters was run without buffering and the efficiency dropped dramatically when the subbank cycle time was increased. The runs are shown for two stride distributions (p_s = 0.75 and p_s = 0.25).
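As a worked illustration of this scaling, the helper below computes the number of banks b needed to hold q T_c n / b at a target value as the bank cycle time grows. The function name `banks_for_constant_ratio` is hypothetical, and the value q = 1.0 in the sweep is an assumption chosen for illustration, since the reference rate used for Figure 7 is not restated here.

```python
def banks_for_constant_ratio(q, t_c, n, target=0.5):
    """Return the bank count b for which q * t_c * n / b equals `target`."""
    return q * t_c * n / target


# Illustrative sweep: 24 reference streams, q = 1.0 (assumed), ratio held at 0.5.
for t_c in (5, 10, 20):
    print(t_c, banks_for_constant_ratio(1.0, t_c, 24))
```

Because b is proportional to T_c at a fixed ratio, doubling the bank cycle time doubles the required number of banks.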
The previous vector runs were performed with no read ports. In an unbuffered memory there is no difference in memory efficiency between reads and writes. Because buffering introduces unpredictable delays, more than one result can become available on a particular cycle for the interconnection network for a particular section. When conflicts of this type arise, some of the references are delayed, which in turn causes subbanks to block. Return conflicts can thus cause an effective increase in bank cycle time and a corresponding drop in efficiency. The addition of a single output queue slot at each subbank eliminates the difference in performance between reads and writes for moderate loading. However, as the load is increased (either through increasing the initiation rate or the percentage of stride-one vectors), the port return bandwidth may be insufficient to handle the load. This problem is addressed in the next section.
The analytical model derived in Section III predicts the efficiency of memory writes. Other possible indicators of performance include throughput and latency. The throughput, defined in Section I, is the fraction of the optimal rate at which entire vectors are delivered through the system [21]. The vector element read latency is defined as the time between the first attempt to access a vector element and the availability of that element at the vector register. Figure 8 compares the efficiency, throughput, and read latency when the load is varied. A bank cycle time of 20 was chosen because it is near the knee of the curves in Figure 5. The parameters are the same as in that figure except that the bank cycle time is fixed and the initiation rate is varied. Probability of OP (p_f) refers to the probability that a vector operation will be initiated by a free port. The value p_f = 0.0118 corresponds to the value in Figure 5. The throughput is slightly higher than the efficiency in the unbuffered case and slightly lower than the efficiency in the buffered case. The buffered throughput is still considerably better than the unbuffered throughput.
In the unbuffered case, the read latency is just a constant plus the number of attempts it takes to issue the request. As mentioned in Section III, the number of attempts is just the reciprocal of the efficiency. In fact, in the unbuffered case the read latency curve in Figure 8 can be predicted to better than 0.3 percent from the efficiency curve. When p_f = 0.01, the unbuffered read element latency is 36 and the buffered read element latency is 51. If the efficiency is at least moderately good, this indicates that the last element of a vector read is delayed about 15 cycles over what it would be without buffering. With buffering, the read latency is affected by return conflicts, and so is dependent on the return scheme used. Different return schemes are discussed in the next section.
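The unbuffered relationship stated at the start of this paragraph can be written compactly. The symbols L_read, C, and E are notation assumed here for illustration; the value of the constant C is not given in the text.

```latex
% L_read : unbuffered read element latency in cycles (symbol assumed here)
% C      : constant part of the latency (symbol assumed; value not given in the text)
% E      : memory efficiency, whose reciprocal is the expected number of issue attempts
\[
  L_{\mathrm{read}} \approx C + \frac{1}{E}
\]
```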
Since writes do not require a return path, the write element latency is directly related to the number of attempts to issue the element write operation. The write latency curves (not shown) can be predicted from the efficiency curve to within a few percent for both the buffered and the unbuffered case.
VII. Results for Maximal Loading
The extreme case where each port attempts a memory reference on each cycle is now considered. First an analysis is done for writes. The logical bank model is used to pick design parameters for a 64-processor system. The performance for reads is then analyzed and improvements in the return interconnection network are considered to equalize the performance of reads and writes.
The logical bank model can be used as a guide in picking design parameters for a 64-processor Cray Y-MP which minimize the per-processor interconnection cost while achieving an efficiency of at least 0.90. Assuming a load of q = p_f = 1.0, the condition < 0.1 gives l > 960. Assuming a bank cycle time of five, the condition q T_c n / b < 0.5 gives b > 1920. The numbers of logical banks and subbanks should be powers of two. Thus for 64 processors, the configuration which operates with minimum per-processor interconnection cost and high efficiency has 1024 logical banks and 2048 physical subbanks. If q T_c n / b is approximately 0.5, a queue size of four is required to have a 0.99 probability of a free slot available in the queue according to Table 4. If the queue size is two, four subbanks per logical bank are required to bring this ratio down to 0.3. Figure 9 shows the performance of these designs as a function of the percentage of stride-one vectors. The case of two subbanks per logical bank with a queue size of four is indistinguishable from the case of four subbanks per logical bank with a queue size of two, as predicted by the logical bank model. The case of 2048 banks with a queue size of two is shown for comparison. The logical bank model is based on the assumption of random references and does not account for the presence of bad strides. Buffering has been shown to reduce the effect of bad strides under moderate loading [7], and the designs here do not preclude the use of address mapping to alleviate intraprocessor conflicts [5, 8].
Reads are more difficult to handle because hot spots can develop on the return lines as shown in Figure 10. The performance difference between reads and writes as a function of stride is quite dramatic. Even more surprising is the fact that the performance for reads actually drops as the percentage of stride-one vectors in the load is increased.
The drop occurs because the arbitration method used for the return in the simulations is a simple round robin priority scheme on the logical banks. When a reference for a particular port is delayed, it causes all of the banks waiting for that port to be delayed. When the system is operating at a sustained maximal initiation rate, the ports can never catch up.
Various solutions to the hot spot problem have been examined, including additional output buffering at the physical subbanks, additional lines, optimal arbitration, port handshaking, port-line queues, and additional return ports. It was found that additional lines alone do not solve the problem, while queue depths of 30 or more at each subbank are required to make a significant difference in efficiency.
In optimal arbitration, the logical banks are examined in round robin succession, but if the port to which a reference is made is already in use, that reference does not block the line and another reference can be issued. Port handshaking is a control mechanism by which a particular port is blocked from initiating a reference if there was a return conflict for that port on the previous cycle. Both methods improve performance, but they do not bring read performance up to write performance when most of the vector strides are one.
Since the drop in read performance appeared to be due to insufficient return port bandwidth, two alternative approaches were developed to increase the bandwidth. The first approach involved adding port-line queues. In this design modification, each port contains a queue for each line so that each line can deposit a result at a port on each cycle. The port services one queue per cycle in round robin succession. The second approach involved doubling the number of return ports. One possible design is to have a return port for the odd-numbered vector elements and another return port for the even-numbered vector elements. An alternative design is to designate one return port for each pair of return lines. The second alternative would simplify the port-line interconnection switches but would complicate the vector register bus structure internal to the processors. Figure 11 compares the port-line buffering to double-return ports for a variety of strides at maximal loading. Output buffering at the subbanks has been eliminated. Port-line buffering improves efficiency but does not increase the port bandwidth. The doubling of the number of return ports brings the read efficiency in line with the write efficiency.
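The port-line queue modification can be sketched as a port that holds one queue per return line and delivers one value per cycle. The class name `PortWithLineQueues` and its methods are hypothetical. Two further assumptions are made: the queues are unbounded, and the round robin scan skips empty queues so that a cycle is not wasted when some other queue holds a value.

```python
from collections import deque


class PortWithLineQueues:
    """Return port with one queue per line, serviced round robin."""

    def __init__(self, num_lines):
        self.queues = [deque() for _ in range(num_lines)]
        self.next_line = 0  # queue examined first on the next service cycle

    def deposit(self, line, value):
        """A line places a result at the port; several lines may do so per cycle."""
        self.queues[line].append(value)

    def service(self):
        """Deliver one value this cycle, or None if every queue is empty."""
        n = len(self.queues)
        for offset in range(n):
            idx = (self.next_line + offset) % n
            if self.queues[idx]:
                self.next_line = (idx + 1) % n
                return self.queues[idx].popleft()
        return None
```

Since `service` still delivers only one value per cycle, the sketch reflects the observation above that port-line buffering absorbs bursts without raising the port bandwidth.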
VIII. Discussion
The main result of this paper can be summarized as follows. If a shared memory system has sufficient bandwidth to achieve high efficiency with fast memory, the replacement of the physical banks by logical banks will allow the same efficiency to be achieved using considerably slower memory without significantly affecting the interconnection costs. This type of buffering is particularly useful in vector multiprocessors because vector memory operations are naturally pipelined, and increases in memory latency can be partially amortized over an entire vector operation.
The logical bank model and the detailed vector simulations given in this paper show that the number of logical banks scales with the number of processors and that the bank cycle time scales with the number of physical banks. Consequently, slower memory can be used if the logical banks are divided into more subbanks. This is in contrast to the unbuffered case, where b/(T_c^2 n) must be constant for constant efficiency. The change from an inverse quadratic to an inverse linear dependence between bank cycle time and the number of banks is particularly important.
A drawback of memory systems with variable bank cycle time is that the values become ready for return at unpredictable times. The approach of Seznec and Jegou to reorder values at the bank level does not solve the problem of ensuring that values arrive at the processor in the order issued. The problem can be addressed in the Cray Y-MP architecture by the addition of a tag to each return value. The Cray Y-MP architecture allows three independent vector memory operations to proceed simultaneously. These vector memory operations can be chained to the vector registers, and vector registers can in turn be chained to functional units. Chaining allows the results produced by one vector operation to be used as input to a succeeding operation before the first instruction has completed. The component results from the first instruction can be used by the second instruction as they become available. Pipeline setup can occur before any component of the previous operation is available [15]. In the current architecture values arrive in order, so each vector register keeps track of the last value to have arrived. When the values arrive out of order, each vector element must have a bit indicating whether that value has arrived. Some additional hardware can be incorporated to chain forward when the next element has arrived. The relative order among different registers is assured by the existing reservation and issue mechanisms.
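The per-element arrival bit and the chain-forward step can be illustrated in software. The class name `TaggedVectorRegister` and its interface are hypothetical, and the sketch assumes that the tag carried by each return value is simply the element index; it is an analogy for the hardware described above, not a description of an existing design.

```python
class TaggedVectorRegister:
    """Vector register that accepts elements out of order and chains in order."""

    def __init__(self, length):
        self.values = [None] * length
        self.arrived = [False] * length  # one arrival bit per element
        self.next_to_chain = 0           # lowest element not yet chained forward

    def deliver(self, index, value):
        """Store a tagged return value; return the elements that can now chain."""
        self.values[index] = value
        self.arrived[index] = True
        ready = []
        while (self.next_to_chain < len(self.values)
               and self.arrived[self.next_to_chain]):
            ready.append(self.values[self.next_to_chain])
            self.next_to_chain += 1
        return ready
```

A dependent operation consumes the elements returned by `deliver`, so it still sees the vector in element order even though memory returned the values out of order.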
It is well known [1] that the memory performance of shared memory vector processors is strongly dependent on the type of load which is generated. Since the choice of load distribution affects the results, it is desirable to test the design against realistic loading conditions. Address-trace collection methods [22] are useful for generating statistical information about the load, but the information collected from these types of investigations is difficult to use directly in testing new designs because the efficiency depends not only on the actual addresses, but on the exact time the references were issued. Because of these difficulties, the approach taken in this paper has been to develop guidelines which are applicable over a range of load distributions. The larger the number of reference streams, the less serious the impact of the details of one reference stream on the overall efficiency.
Buffering greatly reduces the dependence of memory efficiency on the type of load for writes, as illustrated by the runs in the two previous sections. Most of the dependence on load occurred because of return conflicts for reads. A number of alternative designs were evaluated in an effort to reduce the performance degradation due to return conflicts. It was found that doubling the number of return ports eliminated the difference between reads and writes over a range of loads.
