Delays caused by memory access conflicts were measured for vector operations on the CRAY Y-MP and CRAY X-MP, and on two versions of the CRAY-2 with static and dynamic memory. The delays were measured as functions of vector length and the number of active processors. The observed delays were lowest for memory access operations with stride one. Considerably higher delays were observed for mixed strides or random access. For access operations with mixed strides, measurements indicate that memory access was slowed down by a factor of up to 1.7 for the 8-processor CRAY Y-MP. For the 4-processor machines, the factors were 3.1 for the CRAY X-MP, 4.4 for the CRAY-2 with dynamic memory, and 2.5 for the CRAY-2 with static memory. The results are compared with a queueing model of memory bank conflicts.
Introduction
The purpose of this paper is to study delays encountered when several vector processors access a common memory. Such delays may be caused by access conflicts at the memory or by contention within the network connecting processors to common memory.
Because such delays are potentially the limiting performance bottleneck of shared memory architectures, it is important to understand how they depend on design parameters, in particular on the number of processors serviced by the common memory, the degree of memory interleaving, and the speed of the memory chips. Such understanding will make it possible to assess the performance of systems with increasing numbers of processors and new memory technologies. It is essential for any assessment of the future potential of shared memory systems.
There are three possible approaches to the study of this problem: analytical models, simulations, and measurements on existing machines.
All three approaches are necessary because each has advantages and limitations. Analytical models owe their effectiveness to simplifying assumptions that need to be checked by the other two methods. In particular, stochastic methods and queueing theory offer no techniques to deal with regular memory access patterns that are common in scientific and engineering applications. Simulations can generally approximate reality more closely; however, even simulations do not replace real measurements.
They are often very slow and contain assumptions that need to be tested. This paper presents a set of measurements of memory access delays performed on several computers of the Cray family under carefully controlled conditions. We will compare the results with previously published modeling work [1, 2].
We will outline in the next section essential features of the hardware and software architectures that are needed to understand the rationale of our approach and the interpretation of results. Section 3 is a description of measurement techniques and necessary precautions to make results reproducible.
Section 4 shows baseline results for a single active processor, and Section 5 presents measurements with several processors active, a discussion, and comparison to previously published models. We outline our conclusions in Section 6.
Characteristics of Hardware and Software
The measurements described in this paper were performed on three types of Cray computers with differing memory architectures.
The systems are a CRAY X-MP with 4 processors, a CRAY Y-MP with 8 processors, and two CRAY-2 systems with 4 processors, all with common memory.
Some essential system parameters are listed in Table 1. Cray processors, like all current vector architectures, are register-to-register computers.
Vector arithmetic units get their operands from and deposit their results into vector registers, which on Crays contain sixty-four 64-bit words. On the CRAY X-MP and Y-MP there are three access paths per processor connecting vector registers with main memory: two for load operations and one for store operations. Thus, for the vector operation studied in this paper, V = V + V, three memory streams are active simultaneously: the two operand loads and the store of the result vector proceed concurrently with the add operation, only slightly offset in time. The CRAY-2, in contrast, has only one memory access path per processor.
Therefore, load and store operations cannot proceed concurrently.
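The difference between the two port arrangements can be made concrete with an idealized count of how many sequential passes over memory one vector strip needs. The following Python sketch is only an illustration under strong assumptions: it ignores chaining details, startup times, and conflicts, and the function names are hypothetical.

```python
from math import ceil


def passes_split_ports(loads, stores, load_ports=2, store_ports=1):
    """Idealized passes per strip when loads and stores use separate ports.

    Assumption: all ports can be busy at the same time, as described for
    the CRAY X-MP and Y-MP (two load paths, one store path).
    """
    return max(ceil(loads / load_ports), ceil(stores / store_ports))


def passes_single_port(loads, stores):
    """Idealized passes per strip when one path serves loads and stores.

    Assumption: every memory stream must wait for the previous one, as
    described for the CRAY-2.
    """
    return loads + stores


if __name__ == "__main__":
    # V = V + V: two operand loads and one result store.
    print(passes_split_ports(2, 1))   # streams overlap in one pass
    print(passes_single_port(2, 1))   # streams run one after another
```

In this simplified count, V = V + V needs one pass with three ports and three passes with a single port.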
Vector operations of most high-performance vector computers access memory directly without an intermediate cache. To increase bandwidth, memory in such systems is subdivided into many parallel independent units, commonly called memory banks. Each bank, after being accessed successfully, cannot be accessed again for a period called the bank reservation time. It is distinct from the memory access time, which is the time between the issue of a memory access request and the arrival of the requested data item at the processor register. Bank reservation times are in general much longer than processor cycle times (see Table 1). To enable vector processors to access arrays in main memory at rates of one element per processor clock cycle, adjacent array elements are stored in different memory banks. Usually, the low-order bits of the memory address determine the bank address. Thus, a vector access stream to contiguous locations does not encounter memory bank conflicts. However, several such access streams may have requests issued to a particular bank less than the bank reservation time apart, leading to a bank conflict and delays.
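As a rough illustration of the interleaving scheme described above, the following Python sketch simulates access streams competing for interleaved banks. It is a deliberately simplified model and not a description of any Cray memory system: each stream issues at most one request per cycle, a stalled stream retries the same element, streams are served in fixed priority order, and the interconnection network is ignored. The function name `simulate` and all parameter values (64 banks, a reservation time of 4 cycles) are assumptions chosen for illustration.

```python
def simulate(streams, num_banks, reservation, length):
    """Count stall cycles per stream in a toy interleaved-memory model.

    streams: list of (start_address, stride) pairs, one per access stream.
    The bank of an address is taken as address modulo num_banks.
    """
    free_at = [0] * num_banks        # first cycle at which each bank is free
    issued = [0] * len(streams)      # elements issued so far, per stream
    stalls = [0] * len(streams)      # cycles lost to busy banks, per stream
    cycle = 0
    while min(issued) < length:
        for i, (start, stride) in enumerate(streams):
            if issued[i] >= length:
                continue             # this stream has finished
            bank = (start + issued[i] * stride) % num_banks
            if free_at[bank] <= cycle:
                free_at[bank] = cycle + reservation
                issued[i] += 1
            else:
                stalls[i] += 1       # bank still reserved: retry next cycle
        cycle += 1
    return stalls


if __name__ == "__main__":
    # One contiguous stream, then two contiguous streams starting at the same bank.
    print(simulate([(0, 1)], num_banks=64, reservation=4, length=256))
    print(simulate([(0, 1), (0, 1)], num_banks=64, reservation=4, length=256))
```

With these assumed values a single contiguous stream never stalls, while a second contiguous stream that starts at the same bank stalls for four cycles and then follows the first without further conflicts.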
The memories of the computers under study differ in the number of memory banks and the bank reservation times, as shown in Table 1. The structure of the CRAY-2 memory with dynamic memory chips is slightly more complicated than described above. Each of its 128 memory banks is subdivided into two pseudo banks. After a successful access to a location within a pseudo bank, that particular pseudo bank is inaccessible for the period of the long bank reservation time $\tau_L$ (45 cycles in the system tested). In addition, its companion pseudo bank is inaccessible for the period of the short bank reservation time $\tau_S$ (17 cycles in the tested system). For memory traffic distributed randomly, both companion pseudo banks are accessed with equal probability.
Therefore, the memory should behave very much like a system with 128 banks and a bank reservation time of $(\tau_L + \tau_S)/2$ (31 cycles in our case).
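The averaging argument above amounts to a one-line expected value. The sketch below restates it in Python; the function name is hypothetical, and the equal probabilities are the assumption stated in the text for randomly distributed traffic.

```python
def effective_reservation(long_cycles, short_cycles):
    """Mean blocking time seen by a random follow-up access to the same bank.

    Assumption: the follow-up access hits the same pseudo bank (long time)
    or its companion (short time) with probability 1/2 each.
    """
    return 0.5 * long_cycles + 0.5 * short_cycles


if __name__ == "__main__":
    print(effective_reservation(45, 17))  # 31.0
```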
In addition to memory bank conflicts, access conflicts within the network connecting processors to main memory may slow down memory access. The architectures of the interconnection networks are quite different for the three Cray computers under study. While the CRAY X-MP is equipped with a bus type network [3], which is prone to access conflicts, the CRAY Y-MP provides an independent access path for each processor to each bank [4]. Here conflicts between the three paths for each processor are resolved within the network. The CRAY-2 network differs considerably from the two others and is highly buffered. Major access delays occur when these buffers fill up [5]. Of course the number of access requests waiting in network buffers will be strongly dependent on the number of bank conflicts at the memory.
In addition to differences in the hardware architecture, the three machines exhibit different strategies of conflict handling. Operands have to arrive in sequence on the CRAY X-MP and Y-MP.
Therefore each memory access conflict causes a whole access stream to halt. If the arrival of a single vector element is delayed all subsequent elements are delayed by at least that amount. Thus, the delay in the arrival of the 64 word content of a vector register is the sum of all individual delays. On the CRAY-2 vector elements do not have to arrive in sequence. The access stream will halt only if the interconnection network buffers are full. Thus, the total delay for the arrival of 64 words will be less than the sum of individual delays for light memory traffic.
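The contrast between the two conflict-handling strategies can be sketched with a toy calculation. The Python code below is an illustration under assumptions that are not taken from the machines themselves: one element is requested per cycle, each element has a fixed individual delay, and in the out-of-order case the network buffers never fill. The function names are hypothetical.

```python
def in_order_delay(delays):
    """Total delay when every stalled element holds up all later ones."""
    return sum(delays)


def out_of_order_delay(delays):
    """Total delay when elements may arrive out of sequence.

    Element i is requested in cycle i and arrives in cycle i + delays[i];
    the register is complete when the last element has arrived.
    """
    n = len(delays)
    last_arrival = max(i + d for i, d in enumerate(delays))
    return last_arrival - (n - 1)


if __name__ == "__main__":
    delays = [0] * 64
    delays[10] = 5   # two isolated conflicts early in the strip
    delays[40] = 3
    print(in_order_delay(delays), out_of_order_delay(delays))
```

For the example values the in-order total is 8 cycles, whereas the out-of-order total is 0 because both delayed elements still arrive before the last element of the strip.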
The CRI CFT77 compiler running on all four systems practices vector loop unrolling [6] to hide vector strip startup times at least partially. The level of unrolling for simple vector loops is two for the CRAY X-MP and Y-MP, and four for the CRAY-2. Vector loop unrolling is responsible for the unusual dependence of vector execution times on vector length seen in the measurements of this paper and is discussed in more detail in Section 4.
Measurement Technique
We have measured execution times of simple vector operations as a function of vector length and the number of "interfering processes." While one processor was running the instrumented "test process," the other processors of the system were either kept idle or running an "interfering process." We will report results of only one "test process" in this paper, V = V + V, the addition of a contiguous vector to another contiguous vector. This operation involves two vector load operations, a vector add, and the subsequent store of the contiguous result vector. Results obtained for other test operations are very similar to those reported for V = V + V and will therefore not be discussed. The interfering processes were vector add operations involving very long vectors. Operands and results of interfering processes were either in contiguous memory locations, randomly scattered across memory, or stored in nonadjacent locations with stride 23. We did not access memory with strides equal to powers of two, which can cause severe access conflicts within a single vector access stream. We feel that such access conflicts can generally be avoided.
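The remark about power-of-two strides follows from how addresses map to banks. The Python sketch below assumes the common mapping of bank = address modulo number of banks and one request per cycle; the bank count of 64 and the reservation time of 4 cycles are illustrative assumptions, and the function names are hypothetical.

```python
from math import gcd


def banks_visited(stride, num_banks):
    """Number of distinct banks touched by a single strided stream."""
    return num_banks // gcd(stride, num_banks)


def self_conflict(stride, num_banks, reservation):
    """True if the stream returns to a bank before its reservation ends.

    With one request per cycle, the stream revisits a bank every
    banks_visited(stride, num_banks) cycles.
    """
    return banks_visited(stride, num_banks) < reservation


if __name__ == "__main__":
    for stride in (1, 23, 32, 64):
        print(stride, banks_visited(stride, 64), self_conflict(stride, 64, 4))
```

An odd stride such as 23 cycles through all banks just like stride 1, whereas strides of 32 or 64 concentrate the stream on two banks or one bank and conflict with themselves.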
All measurements were performed on dedicated systems by necessity. However, a dedicated system was frequently not sufficient to assure correctness or reproducibility of results. In one case, the idle loop of the operating system accessed memory at rates comparable to those of our interfering processes. We therefore found it necessary either to disable idle processors by turning the hardware off or to run on them a loop written in assembly language that spent its time accessing registers only. With these precautions our timing results without interfering processes were reproducible with very few exceptions. The exceptions were values that were far out of their range and thus could be identified and discarded easily. With increasing numbers of interfering processes the noise in the collected data increased. However, we could still readily identify far out values.
The hardware cycle counter was used for the measurements reported here. This timing call constitutes a single machine instruction on Cray computers.
We timed single vector operations and corrected the results for the duration of the timing operations. As noted in [7] , execution times for short vectors may depend on the way times are measured. We timed each operation several times and eliminated outlying values as discussed above.
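A timing procedure of this kind can be sketched as follows. The text does not state how outlying values were identified, so the median-based rule and the 20% tolerance below are assumptions made only for illustration, as is the function name.

```python
from statistics import median


def corrected_time(samples, timer_overhead, tolerance=0.2):
    """Average repeated timings after dropping far-out values.

    samples: raw cycle counts from repeated runs of one operation.
    timer_overhead: cycles spent in the timing calls themselves.
    tolerance: assumed relative distance from the median beyond which a
               sample is treated as an outlier.
    """
    mid = median(samples)
    kept = [s for s in samples if abs(s - mid) <= tolerance * mid]
    return sum(kept) / len(kept) - timer_overhead


if __name__ == "__main__":
    # Made-up cycle counts; the last value is a far-out reading.
    print(corrected_time([100, 101, 99, 100, 500], timer_overhead=2))
```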
All measurements reported here were performed on standard Fortran code compiled by various versions of the Cray Research, Inc., CFT77 compiler.
Special precautions are necessary with optimizing compilers that like to eliminate code considered to be unnecessary. We have used a variety of tricks to outwit compilers. Examples are putting vectors in common blocks, printing some results, and constructing infinite loops with if tests. In any case, we considered it necessary to inspect the code produced by the compiler to make sure we were timing the operations we intended to measure.
Even with only a single processor active, execution times of vector operations may depend on the relative location of operands and results in memory as shown in [7], with differences as high as 50% of the average execution times. This is due to access conflicts between access operations originating from a single processor. Even on the CRAY-2 with only a single access path to memory for each processor, the start of a vector access operation may interfere with the tail end of the previous one. To eliminate these effects we have timed our test codes for all possible relative locations of operands and result vectors. Assuming that all relative locations are equally likely to occur, we are presenting averages of these measurements. For computer systems with $m$ memory banks and the test operation V = V + V, $m^2$ measurements were averaged for a single data point.
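The averaging over relative locations can be sketched as a double loop over the starting banks of the two operands relative to the result vector. In the Python sketch below, `measure` is a hypothetical callback standing in for one timed run; the toy timing function in the usage example is invented purely to exercise the code.

```python
from itertools import product


def average_over_placements(measure, num_banks):
    """Average measure(a, b) over all num_banks**2 relative placements.

    a and b are the start-bank offsets of the two operand vectors
    relative to the result vector.
    """
    total = 0.0
    count = 0
    for off_a, off_b in product(range(num_banks), repeat=2):
        total += measure(off_a, off_b)
        count += 1
    return total / count, count


if __name__ == "__main__":
    # Toy stand-in: runs are slower when both operands start in the same bank.
    toy = lambda a, b: 110 if a == b else 100
    print(average_over_placements(toy, 4))
```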
The above precautions are necessary to obtain meaningful data not only on Cray computers but are equally applicable to similar studies on other machines.
Baseline Results for a Single Active Processor
Figures 1 and 2 represent measured execution times of simple vector operations as a function of vector length for the CRAY Y-MP and CRAY-2, respectively. These measurements were performed with only one processor accessing memory and no conflicts from multiple access streams originating from that processor. The graphs are typical for vector operations on both machines and on the CRAY X-MP, which is very similar to the CRAY Y-MP. The discontinuities at vector lengths equal to multiples of 64, the vector register length, may be ascribed to the stripmining process. Long vectors, on register-to-register vector computers, are processed as strips equal to the register length, plus a remainder.
We have previously [8] developed a model to describe this behavior. The execution time $T$ of a vector operation is given by

$$T = t_0 + N_s t_s + N t_e,$$

where $N$ is the vector length, $t_e$ the time for processing one vector element, $N_s$ the number of strips, $t_s$ the strip startup time, and $t_0$ the startup time for the strip loop, an outer loop.
The number of strips is given by $N_s = \lceil N / R_L \rceil$, where $R_L$ is the vector register length. [Equations for the number of unrolled loops $N_u$ and the number of leftover single strips (Eq. 6) missing in source.] The asymptotic execution rate for very long vectors is given by

$$r_\infty = \frac{1}{t_u/(u R_L) + t_e}, \qquad (7)$$

where $u$ is the level of loop unrolling and $t_u$ the startup time of the unrolled loop.
Thus, the asymptotic execution rate depends on the level of loop unrolling, the unrolled loop startup time, and the vector element time. Because the startup times for the strip and the unrolled loop are approximately equal, the effect of startup on the asymptotic rate is reduced by a factor of $1/u$ by the technique of loop unrolling.
This can be seen by comparing Eqs. 3 and 7.
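The strip-mining model can be explored numerically with a short Python sketch. It reads the model as execution time = outer-loop startup + (number of strips) x (strip startup) + (vector length) x (element time), with the number of strips taken as the vector length divided by the register length, rounded up; that strip count and all numerical values below are assumptions for illustration, not measured Cray parameters. The function names are hypothetical.

```python
from math import ceil


def execution_time(n, t0, ts, te, reg_len=64):
    """Model time (cycles) for a vector operation of length n without unrolling."""
    strips = ceil(n / reg_len)          # assumed strip count
    return t0 + strips * ts + n * te


def asymptotic_rate(tu, te, unroll, reg_len=64):
    """Results per cycle for very long vectors with the given unrolling level.

    tu is the startup time of the unrolled loop, paid once per
    unroll * reg_len elements.
    """
    return 1.0 / (tu / (unroll * reg_len) + te)


if __name__ == "__main__":
    # Crossing a multiple of the register length adds one strip startup.
    print(execution_time(64, t0=10, ts=20, te=1),
          execution_time(65, t0=10, ts=20, te=1))
    # A higher unrolling level spreads the startup over more elements.
    print(asymptotic_rate(tu=32, te=1, unroll=1),
          asymptotic_rate(tu=32, te=1, unroll=2))
```

The first line of output shows the jump in execution time at a strip boundary; the second shows the asymptotic rate rising as the unrolling level increases while the startup time stays the same.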
Clearly, vector performance has become highly dependent on software architecture and the quality of software optimization.
In addition to dependence on the level of loop unrolling, execution times depend on three startup times which are affected by software functions. They can be reduced considerably by keeping loop control variables in registers, thus avoiding unnecessary scalar memory access operations which are not only time consuming by themselves but disrupt the sequence of vector access operations. It is important in the following sections to compare results obtained with identical software settings.
Results
Plots of measured vector execution times as a function of vector length are presented in Figs. 3 through 6 for the four systems under study: the CRAY X-MP, the CRAY Y-MP, and two versions of the CRAY-2, one with static and one with dynamic memory. The lowest curve in each plot represents data collected with only the test process running on a single processor. The remaining curves represent results with one or more interfering processes running in addition to the test process on the multiprocessor system, up to seven on the CRAY Y-MP and up to three on the other systems. The interfering processes in each case are the vector addition of two very long vectors. While the test process always involves stride-1 operations, i.e., vectors stored in contiguous memory locations, interfering processes do not all follow this regular memory access pattern.
In parts a of all figures, interfering processes A proceed with stride 1. In parts b of all figures, interfering processes B access memory with stride 23. The attempted memory access rate is slightly lower than for contiguous access on the CRAY X-MP and CRAY Y-MP; the measured factor is 0.96. For the CRAY-2, attempted memory access rates are the same for contiguous and strided access. As evident from the figures, performance degradation is very serious, especially for the CRAY X-MP and CRAY-2 systems. We took measurements for other prime strides, including mixed strides, with similar results. In parts c of all figures, one operand in each interfering process C is stored at random memory locations. In addition to the gather operation, which accesses memory in a random fashion, and the fetch of the other operand and store of the result, which access memory with stride 1, the index vector has to be loaded, again a stride-1 operation.
Therefore, the processing of each strip involves a total of 4 vector memory access operations. With only three memory access ports available on the CRAY X-MP and CRAY Y-MP, not all of these operations can proceed concurrently.
The attempted memory access rate of interfering process C is therefore much less than that of process A. The predicted factor is 2/3; we measured 0.58. The attempted access rate is unchanged for the CRAY-2 with only one access port per processor. In spite of the lighter interfering memory traffic, performance is degraded more severely by interfering process C, with one random memory access out of four, than by interfering process A, with all regular access operations.
All four systems tested show severe performance degradations due to memory access conflicts. The figures indicate that heavy vector traffic can cause slowdowns by more than a factor of three. In normal machine operation memory traffic is generally less intense and therefore affects performance to a lesser extent.
Figs. 3 through 6 represent raw measured data. To isolate the delays, we have subtracted the execution time of the undisturbed test process from the times measured with one or more interfering processes. Figures 7 through 10 therefore represent the measured delays as functions of vector length for one or more interfering processes. The most striking observation from these plots is the relatively modest performance degradation for interfering processes A (with stride 1) in parts a of all figures as compared to the very severe degradation due to strided access operations in parts b of all figures, or even interfering processes C with only one of two operands accessed at random locations. For the CRAY-2 the ratio of delays can be higher than ten. However, all systems measured show the same trend.
While the delays for strided and random access (parts b and c of figures) are approximated well by a linear function of vector length, the delays for interfering process A (contiguous access) are not. They show a structure similar to that of the execution time of the test process with marked discontinuities at integer multiples of 64, the vector register length. In addition, the slope of the curves decreases for long vector lengths. We interpret these differences as follows.
While interfering processes B and C interfere with memory access operations of the test process roughly at random, interfering process A and the test process will synchronize their memory access operations in such a way that after an initial startup delay very few collisions will occur during the processing of a strip. The curves will therefore show discontinuities at integer multiples of the vector register length and saturation for very long vectors. The overall disturbance rate will be very low.
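The delay curves discussed above are obtained by subtracting the undisturbed execution times from the disturbed ones, and where a delay curve is close to linear in vector length, its slope gives an average delay per computed result. The Python sketch below shows both steps with an ordinary least-squares slope. The function names are hypothetical and the numbers in the usage example are invented, not measured values.

```python
def delays(disturbed, baseline):
    """Per-length delay: disturbed time minus undisturbed time."""
    return [d - b for d, b in zip(disturbed, baseline)]


def fit_slope(lengths, values):
    """Least-squares slope of values versus lengths."""
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, values))
    return sxy / sxx


if __name__ == "__main__":
    lengths = [64, 128, 192, 256]
    baseline = [100, 180, 260, 340]     # invented undisturbed times (cycles)
    disturbed = [132, 244, 356, 468]    # invented times with interference
    d = delays(disturbed, baseline)
    print(d, fit_slope(lengths, d))     # slope = delay per element
```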
We have fitted straight lines to the data points of parts b and c of Figs. 7 through 10 and computed the initial slopes of the curves in parts a. These slopes represent the average delays per computed result and are plotted as functions of the number of interfering processes, together with values computed from a queueing model of memory bank conflicts. For random memory access patterns the model predicts linear dependence of delays on $p/m$, the ratio of the number $p$ of active memory access ports to the number $m$ of memory banks. Delays should be proportional to the square of the bank reservation time in cycles for light memory traffic and depend on it linearly for heavy traffic. Agreement between the measured data and the queueing results is not expected to be close because the model represents delays due to memory bank conflicts only and not delays within the access networks. It should therefore set a lower limit for delays achievable for a very well designed interconnection network. Figure 12 indicates that the CRAY Y-MP comes close to this ideal. The measured delays for strided and random access (curves B and C) are close to those predicted by the model for random memory access patterns. The model predicts zero delays for regular access operations (curve A). It does not handle startup delays. It is interesting to note that the slope of the curves A for stride-1 access operations is initially close to zero and increases with increasing number of processes. We ascribe this to the fact that with increasing memory traffic and increasing number of delays, memory access for interfering process A becomes less regular.
Therefore, curve A should approach the delays for random access for very heavy memory traffic (large numbers of interfering processes).

[Figure caption, figure number and system not recoverable: Curve A is for interfering processes with contiguous memory access, curve B for processes with stride 23, and curve C is for processes involving one gather operation. The dashed lines represent values computed from a queueing model for memory bank conflicts, the upper curve for the memory access rate of curve B, the lower one for the rate of curve C.]

[Figure caption: Average delay per memory access as a function of the number of interfering processes for the CRAY-2 with dynamic memory. Curve A is for interfering processes with contiguous memory access, curve B for processes with stride 23, and curve C is for processes involving one gather operation. The dashed line represents values computed from a queueing model for memory bank conflicts for random access.]

The CRAY X-MP shows delays that are considerably higher than those predicted by the model. We therefore conclude that the network contributes considerably to these delays.
Because vector elements do not have to arrive in sequence at the vector registers of the CRAY-2, memory access delays due to bank conflicts should be considerably less than those predicted by the simple model, unless the network contributes significantly to the delays. Because calculated and measured values are of similar magnitude, we conclude that the contribution of the network is not negligible.
The ratio of delays measured for the two CRAY-2 systems is higher than the ratio of their bank reservation times, indicating that the dependence of delays on bank reservation time is stronger than linear.
Our measurements yield memory access delays of roughly the same magnitude as those reported previously by Calahan and Bailey [9] under similar test conditions.
Conclusions
We can draw the following conclusions from our experiments:
• Memory access conflicts can significantly degrade the performance of common memory multiprocessor systems. We have seen slowdowns by more than a factor of 3 in a four-processor system. Good network design is crucial, as demonstrated by our results for the CRAY Y-MP. In spite of its greater number of processors and higher processor speed, the performance of its memory is superior to that of its older siblings, the CRAY X-MP and CRAY-2. Hopefully, its network design can be extended to larger numbers of processors. Careful modeling of interconnection networks will be essential for the design of future systems.
• Delays are much worse for mixed-stride or random access operations than for access with uniform stride. For light to moderate memory traffic, delays for mixed strides exceed those for stride 1 by approximately a factor of ten on all systems tested except the CRAY X-MP. For light memory traffic and normalized to equal attempted memory access rates, delays are higher by a factor of approximately five on the CRAY Y-MP and CRAY-2 with dynamic memory if only one quarter of access operations are addressed to random instead of consecutive locations. Thus performance degradations for random access and for mixed strides are of the same order of magnitude. These results are important for storage schemes that have been proposed to avoid single vector stream access conflicts for operations with stride divisible by certain powers of two, as described for example in [10, 11, 12]. Although these schemes will significantly reduce such conflicts, they will severely degrade performance of multiprocessor systems due to multiple access stream conflicts because they scramble stride-1 access operations. Because stride-1 access is the predominant mode of accessing vectors in most algorithms and because these operations run smoothly with current storage schemes, as shown by our measurements, this feature should not be sacrificed to the solution of a problem which is due to the design of some access networks. Instead, network design should avoid these conflicts.
• In spite of existing problems, common memory multiprocessors are not at the end of their path. There is a lot of room for improvement to avoid memory access conflicts: better network design, caches for scalar data to avoid the disruption of regular vector access operations, increasing the number of memory banks, and last but not least, latency hiding techniques.
