Abstract. The serial and parallel performance of one of the world's fastest general purpose computers, the CRAY-2, is analyzed using the standard Los Alamos Benchmark Set plus codes adapted for parallel processing. For comparison, architectural and performance data are also given for the CRAY X-MP/416. Factors affecting performance, such as memory bandwidth, size and access speed of memory, and software exploitation of hardware, are examined. The parallel processing environments of both machines are evaluated, and speedup measurements for the parallel codes are given.
Introduction
In 1985, the first Cray Research Incorporated CRAY-2 supercomputer was installed at the National Magnetic Fusion Energy Computational Center (NMFECC). Since that time this series of machines has undergone many changes, both in hardware and software. This paper evaluates some of these changes by observing their effect on a series of computationally intensive benchmark codes. We measured the performance of three models of the CRAY-2 that differ in their common memory hardware. The first two models we measured had common memory implemented with dynamic random-access memory (DRAM) with chip access times of 120 and 80 nanoseconds (ns), respectively. These machines are Serial 2003, located at the University of Minnesota, and Serial 2011, located at the Air Force Weapons Laboratory in Albuquerque, New Mexico. The third model, Serial 2012, located at Cray Research, uses static random-access memory (SRAM) with a chip access time of 55 ns.
In Section 2, we present a brief outline of the architectural and functional features of the CRAY-2, with emphasis on those features that affect performance. For comparison, corresponding architectural features from another Cray Research product, the X-MP/416, are included. Later sections present benchmark data with single-processor and multiprocessor results discussed separately.
Comparison of Architectures
The CRAY-2 is a general-purpose parallel/vector supercomputer system. There are four central processing units (CPUs), each with vector and scalar capabilities. Up to 256 million words of dynamic CMOS memory give the CRAY-2 one of the largest memory capacities of any supercomputer on the market today. For a schematic of the mainframe configuration, see Figure 1. For comparison, the X-MP is also a 4-CPU machine, each CPU having vector and scalar capabilities, but with a common memory of up to 16 million 64-bit words of static bipolar memory.
CPUs
The CPU clock period on the CRAY-2 is 4.1 ns, while on the X-MP/416, the CPU clock period is 8.5 ns. The effect of this difference is not always as large as it at first seems. Instructions can issue from the instruction buffer on the X-MP every clock period (CP), while on the CRAY-2 the rate is one every other clock period. This gives the CRAY-2 an effective clock period of 8.2 ns with respect to instruction issue, nearly equal to that on the X-MP. After an appropriate start-up, however, arithmetic results are produced every CP on both machines.
The CPUs on both machines contain three sets of registers that serve as source and destination for computations in the functional units. These are address registers, scalar registers, and vector registers, referred to as A-, S-, and V-registers, respectively. In addition, the CRAY-2 has 16K words of local (or fast) memory that can be used by these registers as temporary storage. The access time between local memory and A- and S-registers is 5 and 4 CPs, respectively. The access time for V-registers is 8 CPs plus the length of the vector. Instead of local memory, the X-MP has an extra set of 72 temporary storage registers called B- and T-registers. The access time for these registers is 1 CP. The V-registers on the X-MP have no corresponding temporary registers. In addition to these registers, the CRAY-2 has eight semaphore flags to enable synchronization of common memory during multitasking. Only one of these semaphores can be assigned to a job. In contrast, the X-MP has five sets of shared registers (shared among the four CPUs), including 32 semaphores. Arithmetic on both machines is done in fully segmented (pipelined) functional units. This pipelining allows the functional units, some of which can also operate in parallel, to deliver a result every clock period, after a suitable start-up time. Chaining, which allows the output of one arithmetic operation to serve as the immediate input to a subsequent operation, is not available on the CRAY-2. There are also different numbers of functional units on the two machines: the X-MP has 14 while the CRAY-2 has 9, including the reciprocal square root unit. Table 1 gives some representative times for arithmetic operations. For a more complete explanation of the CPU organization, see [Kampe and Nguyen 1986].
Memory
As mentioned previously, the CRAY-2 has up to 256 million 64-bit words of common, or shared, memory, interleaved up to 128 ways. The memory is organized into quadrants with 32 banks in each quadrant. Each quadrant has a data path to four common memory ports, one for each processor. The four quadrants are accessed by the four processors in phase time. This means that each processor can access one particular quadrant every fourth clock period. The quadrants are accessed in a round-robin fashion.
While the large size of the memory on the CRAY-2 is an asset, the memory cycle time on the early CRAY-2 models of 234 ns (57 CPs) was slow enough to be a detriment. Cray's first solution to the memory speed problem was pseudobanking, a technique that allows access to a physical memory in less time than the memory chip cycle time. Cycle time is made up of two parts, access time and off-chip time. The logic chips are busy for a time equal to the access time, while the memory chips are busy for an additional time equal to the off-chip time. Pseudobanking uses the simple trick of addressing alternate planes of chips within the module. This can be done in a time equal to the access time, effectively reducing the cycle time by nearly half. Using this approach the effective cycle time on the 256-Mword DRAM decreased from 57 CPs to 33 CPs [Numrich 1985]. Pseudobanking is only needed on CRAY-2s with DRAM. Later solutions have involved the use of faster memory chips and static rather than dynamic memories. The X-MP uses chips with a scalar memory access time of 14 CPs (119 ns).
Several factors affect the rate of data transfer between common memory and the vector registers on the CRAY-2. The first is the rate of instruction issue for vector reads and writes. With only one common memory port per processor, each read or write instruction must wait for the port to be free before it can issue. If one word transfers per clock period, then the next instruction can issue VL + 8 CPs later [Numrich 1985], where VL is the requested vector length. The minimum transfer time per word, T(min), is approximated by

$$T(\mathrm{min}) = \frac{VL + 8}{VL}\ \text{CP/word}.$$
For a vector length of 64, this time is 72/64 or 1.125 CP/word, giving a maximum transfer rate of 217 Mwords/s (assuming a 4.1-ns clock). This rate is highly optimistic and assumes no quadrant or bank conflicts. Memory conflicts also reduce the data transfer rate. Memory conflicts on the CRAY-2s occur at two levels: quadrant conflicts and bank conflicts. Quadrant conflicts are caused by the difference in the rate at which memory-request addresses arrive at the port and the rate at which the port can process these requests. Recall that each memory quadrant on the CRAY-2 can be addressed by each processor only once every 4 CPs. The time between memory requests to the same quadrant is called the quadrant period. Quadrant periods of four cause no conflicts to occur, while periods of two or one can cause conflicts. Vectors with odd strides, including strides of one, have a quadrant period of four, and thus cause no conflicts. Even strides, however, cause conflicts of varying severity. The worst case is a stride divisible by four; here the quadrant period is one. Even in the absence of other conflicts, memory quadrant conflicts can cause performance degradation.
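The two relations in this passage can be checked numerically. The following is a minimal sketch, assuming that T(min) = (VL + 8)/VL, which reproduces the 72/64 figure quoted above, and assuming that the quadrant of a word is its address modulo 4 with one request issued per CP. Neither assumption is stated explicitly in the text, and the function names are illustrative only.

```python
from math import gcd

CLOCK_NS = 4.1        # CRAY-2 clock period quoted in the text
STARTUP_CPS = 8       # extra CPs per instruction, from the VL + 8 issue rule
NUM_QUADRANTS = 4     # common memory quadrants


def min_transfer_cp_per_word(vl):
    """Minimum clock periods per word for one vector read or write of length vl."""
    return (vl + STARTUP_CPS) / vl


def max_transfer_mwords_per_s(vl, clock_ns=CLOCK_NS):
    """Conflict-free peak transfer rate in millions of words per second."""
    return 1.0e3 / (min_transfer_cp_per_word(vl) * clock_ns)


def quadrant_period(stride):
    """CPs between successive requests to the same quadrant.

    Assumption (not from the text): quadrant = word address mod 4,
    with one memory request issued per clock period.
    """
    return NUM_QUADRANTS // gcd(stride, NUM_QUADRANTS)


if __name__ == "__main__":
    print(f"T(min) at VL=64: {min_transfer_cp_per_word(64):.3f} CP/word")
    print(f"peak rate:       {max_transfer_mwords_per_s(64):.0f} Mwords/s")
    for stride in (1, 2, 3, 4, 8):
        print(f"stride {stride}: quadrant period {quadrant_period(stride)}")
```

Under these assumptions, odd strides give a quadrant period of four, strides of the form 4k + 2 give a period of two, and strides divisible by four give a period of one, matching the cases described above.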
Bank conflicts, like quadrant conflicts, are caused by attempts to access data in the same bank within too small a time period. The bank conflict effect is a function of bank cycle time and number of banks. See Table 2 for a list of cycle times for the machines we tested. When a bank conflict occurs, the address in the quadrant buffer requires more than the 4-CP quadrant access time to clear. Memory backup then occurs because these quadrant buffers remain full until the requested bank is free.
Description of Benchmark Programs
The Computing and Communications Division at Los Alamos National Laboratory maintains a set of portable benchmark programs representing characteristic tasks that a large supercomputer would be required to run at the Laboratory. This benchmark set has been run on a wide range of both scalar and vector machines [Brickner et al. 1986; Griffin and Simmons 1984; Lubeck et al. 1985, 1987; Simmons and Lubeck 1986; Simmons and Wasserman 1987; Wasserman et al. 1987]. A database is maintained containing results of past runs of these programs on a variety of computers. A report from the National Research Council [1986] has characterized supercomputer benchmarks in terms of a hierarchy. Using the Council's characterization, the Los Alamos benchmark set consists of tests at the levels of hardware demonstration programs, basic routines, and stripped-down applications. A description of the codes is given in the appendix. Additional information can be found in [Wasserman 1988]. The programs are coded in ANSI Fortran for portability and typically can be run on a new machine with little or no change. Execution rates will be indicative of the potential initial usefulness of a new machine.
Single-Processor Results
Table 3 shows the effect of the faster memory hardware on our benchmark codes. The results for the 80-ns DRAM CRAY-2 are not always consistent with results from the other two machines. That is, the times for some benchmarks increase in going from the 120-ns CRAY-2 to the 80-ns CRAY-2. We believe this is due to different implementations of the CFT77 compiler and, in particular, the implementation of CFT77 under CTSS on Serial 2011. For this reason, and also because we wish to illustrate the maximum performance gain that could be realized from faster memory, in this discussion we focus on the difference in performance between the 120-ns DRAM CRAY-2 and the 55-ns SRAM CRAY-2S. Speedups due to the static memory are in the range of 7-16%. The two scalar codes SCALGAM and GAMTEB show identical speedups of 13%. HYDRO, which is nearly 100% vectorizable, shows the largest speedup. Note that a twofold change in memory chip access time should not yield anything close to a twofold speedup in the codes. The more pertinent hardware feature is the memory latency, which is the time to do loads from common memory. On the DRAM machine, the scalar access latency is 59 CPs, while on the SRAM machine, the latency for scalar loads is 43 CPs. Thus, the maximum speedup we could observe here is about 37%. That the maximum observed speedup is still smaller than this may suggest that the compiler could hide some of the memory latency, perhaps by more use of the local memory.
Table 4 shows a comparison of execution times on the fast memory CRAY-2S (Serial 2012) using the two CFT77 compilers (1.3) and (2.0). Version 2.0 yields a dramatic improvement on some of the codes. The FFT code speeds up by a factor of 2.3 relative to CFT77 1.3.
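The 37% bound quoted above for the DRAM and SRAM machines follows from the ratio of the two scalar load latencies. A minimal sketch of that arithmetic is given below. It assumes a code whose run time is entirely determined by scalar load latency, which is an idealization, and the function name is illustrative.

```python
DRAM_LATENCY_CPS = 59   # scalar load latency on the DRAM CRAY-2 (from the text)
SRAM_LATENCY_CPS = 43   # scalar load latency on the SRAM CRAY-2S (from the text)


def latency_bound_speedup(old_latency_cps, new_latency_cps):
    """Upper bound on speedup for a purely latency-bound code (idealized)."""
    return old_latency_cps / new_latency_cps


if __name__ == "__main__":
    bound = latency_bound_speedup(DRAM_LATENCY_CPS, SRAM_LATENCY_CPS)
    print(f"latency-bound speedup: {bound:.3f} ({(bound - 1) * 100:.0f}%)")
    # Observed speedups reported in the text lie between 7% and 16%.
    for observed in (1.07, 1.13, 1.16):
        print(f"observed {observed:.2f} is {observed / bound:.0%} of the bound")
```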
Comparison of Three Types of CRAY-2 Hardware

Comparison of Compilers
All previous versions of CFT77 vectorized several loops in FFT conditionally; the repeated execution of the conditional code at run time caused much slower execution rates. In FFT the conditional code is generated because a loop bound is passed as an argument to a subroutine. These loops are now fully vectorized in version 2.0. Using CFT77 2.0, HYDRO speeds up by 35%. HYDRO contains one minor loop that conditionally vectorized with CFT77 1.3 and now fully vectorizes with CFT77 2.0. HYDRO also has three loops, in the time-consuming subroutines VSETUV and VQTERM, that had
Comparison of CRAY-2S with CRAY X-MP/416
In this section we compare only the CRAY-2S with the CRAY X-MP/416. We used a pre-release of CFT77 2.0 (BF185) on a CRAY X-MP/416. The results are given in Tables 6 and 7. These tables contain rates, in Mflops, for various elementary vector operations as a function of vector length. All operations were carried out with unit stride except for the second and third operations in both tables. All measurements were done on a dedicated system using a single processor. On the X-MP, an in-register infinite loop was also used to keep the idle processors occupied. The X-MP/416 tests used bidirectional memory. For the nonstrided, nonscatter/gather operations in Tables 6 and 7, the differences between the two machines at vector length 1000 can generally be reconciled with the rate at which each machine is capable of producing results. For example, on the first operation, V = V + S, we expect comparable rates, and we observe 83 Mflops for the CRAY-2S and 100 Mflops for the X-MP. As another example, on the fourth operation, V = V * V, we expect the asymptotic rate on the CRAY-2S to be less than that of the X-MP by about a factor of 1.5; at vector length 1000, the observed ratio is 1.78 (51 Mflops for the CRAY-2S and 93 Mflops for the X-MP). However, the CRAY-2S compiler has to unroll all these loops (to a depth of four) to achieve this performance. At shorter vector lengths the X-MP is faster than the CRAY-2 by about a factor of 2. Comparison of the first and second operations in Table 6 shows that, as expected, the CRAY-2S suffers no performance degradation with odd strides. However, with stride 8, performance on the CRAY-2S is about one-fourth of the nonstrided rate. The minimum time for memory transfer on the CRAY-2S is slightly more than 1 CP/word. However, with stride 8 all words of data reside in the same quadrant. Therefore, the minimum transfer time, delayed by quadrant conflict only, is about 6.5 CP/word.
With a stride of 8, there are no bank conflicts on the machine we used.
Scatter/gather operations, the last two rows in Tables 6 and 7, are much more efficient on the X-MP than they are on the CRAY-2, over the entire range of vector lengths. The gather operation on the CRAY-2 is subject to a special hardware delay so that references are allowed roughly once every 4 CPs.
Benchmark Codes.
A comparison of the current CRAY-2S results with the CRAY X-MP/416 for the rest of the benchmark codes is shown in Table 8. Two sets of results are given for the X-MP: one from a pre-release of CFT77 2.0 and one from the production compiler, CFT 1.14. The first thing to notice in Table 8 (comparing columns two and three) is that on the X-MP, CFT77 2.0 now produces better code than CFT 1.14 (with no compiler options) for all but one benchmark. The only (minor) exception is MATRIX, for which CFT 1.14 with the BTREG option (shown in parentheses in Table 8) is slightly faster than CFT77 2.0.
The X-MP has a significant performance advantage over the CRAY-2S on seven of the ten codes. Of the seven, four are highly vectorizable: HYDRO, LSS, MATRIX, and WAVE. In HYDRO, LSS, and MATRIX, the predominant loop length is about 100. The VECOPS data in Tables 6 and 7 showed that the X-MP ran loops at vector length 100 nearly twice as fast as the CRAY-2S did. In WAVE, the predominant loop length is 256. WAVE also involves many gathers for which, as shown above, the X-MP is superior.
Interestingly, in contrast with the VECOPS data, the X-MP is only about 5% faster on the FFT code, a highly vectorized code with short vector lengths on which the X-MP should be fastest. An important aspect of vectorization on the CRAY-2S concerns the way in which arrays are dimensioned. Because of quadrant conflicts that can have a noticeable effect on performance, arrays with even dimensions will suffer performance degradations relative to arrays with odd dimensions. This fact is highlighted in the performance of the codes LSS and MATRIX relative to the X-MP. Both codes spend most of their time in SAXPY, and both have loop lengths of 100. Yet MATRIX runs nearly 63% faster on the X-MP than it does on the CRAY-2S, whereas LSS runs about 45% faster on the X-MP. In MATRIX, two of three critical arrays have even dimensions, while in LSS all critical arrays have odd dimensions. Thus, relative to the X-MP, one must be far more careful of program array dimensions on the CRAY-2S.
The relationship between the X-MP and the CRAY-2S on codes not overwhelmingly vector in nature is harder to explain. Of the two Monte Carlo photon transport codes, one, SCALGAM, runs about 28% faster on the X-MP, while the other, GAMTEB, runs about 18% faster on the CRAY-2S. ESN, a totally scalar code, runs about 28% faster on the X-MP. But on MCNP, the X-MP is only 7% faster than the CRAY-2. The reason for this is not clear.
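The link between array dimensions and quadrant conflicts can be illustrated with a small model. The sketch below assumes Fortran column-major storage, so that stepping along a row of an array with leading dimension LDA uses a memory stride of LDA words, and it assumes that the quadrant of a word is its address modulo 4. Both the quadrant mapping and the choice of a row traversal are assumptions made for illustration; the text does not say which loops in MATRIX or LSS access memory this way.

```python
NUM_QUADRANTS = 4  # common memory quadrants on the CRAY-2


def row_quadrants(leading_dim, n_cols, row=0):
    """Quadrant index of each element in one row of a column-major array.

    Assumption (not from the text): quadrant = word address mod 4.
    """
    return [(row + col * leading_dim) % NUM_QUADRANTS for col in range(n_cols)]


def distinct_quadrants(leading_dim, n_cols=100):
    """Number of different quadrants touched while walking along one row."""
    return len(set(row_quadrants(leading_dim, n_cols)))


if __name__ == "__main__":
    for lda in (100, 101, 102):
        print(f"leading dimension {lda}: "
              f"{distinct_quadrants(lda)} quadrant(s) used")
```

In this model a leading dimension of 100 sends every row element to the same quadrant, while padding the dimension to 101 spreads the accesses over all four quadrants.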
X-MP/416 External Storage Performance.
The larger central memory on the CRAY-2 is an important asset for this machine. However, the X-MP can be equipped with an external solid-state storage device (SSD) that can also offer potential for large codes. An obvious question is: if a problem can be programmed with an out-of-core algorithm, how does the X-MP with SSD perform relative to the same problem run in-memory on the CRAY-2?
The WAVE code can be so programmed. We ran a job requiring about 20 Mwords of storage on the CRAY-2 (Serial 2011, 80-ns memory). We ran the same code on an X-MP/416 running CTSS and equipped with a 512-Mword SSD using one channel (1250 Mbyte/s). Both machines used the CFT77 version 2.0 compiler. The X-MP/416 version transferred to the SSD in block sizes of 204,800 words. The CRAY-2 ran the job in 461 seconds, while the X-MP required 355 seconds of CPU time and 360 seconds of elapsed (wall-clock) time. Although we did not run this code on the 55-ns CRAY-2S, we can approximate what the performance will be. Using the CFT77 version 2.0 compiler, the standard WAVE benchmark runs about 12% faster on the 55-ns CRAY-2S than it does on the 80-ns CRAY-2, so the best CRAY-2S time for the 19-Mword job would be about 411 seconds. This value is still larger than the X-MP wall-clock time. Note that although I/O to the SSD does not require particularly difficult coding (as might I/O to a disk) other than ensuring a large block size, the CRAY-2 version requires no extra coding.
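The CRAY-2S estimate above can be reproduced with a short calculation. This sketch reads "about 12% faster" as a run-time ratio of 1.12 between the 80-ns and 55-ns machines, which is an interpretation rather than something the text states, and it assumes the ratio measured on the standard WAVE benchmark carries over to the larger job.

```python
CRAY2_80NS_SECONDS = 461.0   # measured on Serial 2011 (from the text)
CRAY2S_GAIN = 1.12           # assumed run-time ratio for "about 12% faster"
XMP_WALL_SECONDS = 360.0     # X-MP/416 with SSD, elapsed time (from the text)


def estimated_cray2s_seconds(measured=CRAY2_80NS_SECONDS, gain=CRAY2S_GAIN):
    """Scale the measured 80-ns CRAY-2 time by the assumed CRAY-2S gain."""
    return measured / gain


if __name__ == "__main__":
    estimate = estimated_cray2s_seconds()
    print(f"estimated CRAY-2S time: {estimate:.1f} s")
    print(f"X-MP wall-clock time:   {XMP_WALL_SECONDS:.1f} s")
    print(f"ratio CRAY-2S / X-MP:   {estimate / XMP_WALL_SECONDS:.2f}")
```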
Multitasking Results
The four processors of the CRAY-2 can simultaneously be brought to bear on a single job through the multitasking environment. We ran our large Monte Carlo transport code, MCNP, in this environment on the Serial 2011 CRAY-2 at the Air Force Weapons Laboratory in February 1988. The compiler was CFT77 2.0, the operating system was CTSS, and the multitasking library was Multilib. We ran a problem size of 60,000 source particles. For comparison, we ran the same problem on an X-MP/416 at Los Alamos using the CFT77 2.0 compiler, the CTSS operating system, and a locally developed multitasking library. We used a parallelization method called macrotasking developed at Cray Research and adapted for CTSS on the CRAY-2 by the NMFECC. This method operates at the granularity level of the subroutine. Multitasking runs on both machines were done during dedicated time. The times are given in Table 9. Note that the X-MP is about 40% faster than the CRAY-2 for one to four processors. The serial times differ by 34%, which is comparable to the differences observed for the other scalar serial codes. Speedup is defined as

$$S_n = \frac{T_s}{T_n},$$
where $T_s$ is the serial execution time and $T_n$ is the execution time using n processors. The speedups for MCNP are plotted in Figure 2. The CRAY-2 shows a speedup of 3.53 for four processors, while on the X-MP speedup is 3.65. This difference might be attributed to several factors, one of which is the availability of only a single semaphore per job on the CRAY-2. The X-MP has 32 semaphores available to a job. Another factor affecting speedup is the implementation of synchronization primitives. The Los Alamos system has implemented spin-wait locks while the Cray Research/NMFECC implementation is somewhat less efficient. Since Monte Carlo algorithms are considered to be ideal candidates for parallel processing, one might expect a speedup for four processors that is somewhat closer to four. One reason that we do not see this for this set of runs is that the time spent in the serial sections, such as the setting up of the problem, is constant and independent of the number of source particles. This means that as more and more processors are brought to bear on a problem of fixed size, the serial portion takes a larger percentage of the time. This is, of course, Amdahl's law [Amdahl 1967].
If we interpret Ware's model [Ware 1972] (of Amdahl's law) of vector performance as applying to multiprocessor performance, we can also define speedup as

$$S_p = \frac{1}{(1 - f) + f/p},$$
where f is the fraction of the code that can be executed in parallel and p is the number of processors. For MCNP, which is about 98% parallel, we get a predicted speedup of 3.77 for four processors. This is somewhat higher than our measured speedup of 3.53 on the CRAY-2. There are several reasons for the difference in these two speedups. One is the effect of multiprocessor synchronization overhead [Buzbee 1984]. Another is the additional time required for system overhead in the multiple processor runs. The serial version of MCNP, for example, is not stack based and so incurs no overhead associated with stack management.
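The predicted speedup can be reproduced with a few lines of code. The sketch below assumes the usual form of Ware's model, S = 1/((1 - f) + f/p), which yields the 3.77 quoted above for f = 0.98 and p = 4. The inverse function, which reports the parallel fraction that would account for a measured speedup under the same model, is an illustrative addition and not part of the original analysis.

```python
def ware_speedup(f, p):
    """Speedup predicted by Ware's model for parallel fraction f on p processors."""
    return 1.0 / ((1.0 - f) + f / p)


def implied_parallel_fraction(speedup, p):
    """Parallel fraction that would give the observed speedup under the same model."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / p)


if __name__ == "__main__":
    print(f"predicted, f=0.98, p=4: {ware_speedup(0.98, 4):.2f}")
    # Measured four-processor speedups reported in the text.
    for machine, measured in (("CRAY-2", 3.53), ("X-MP", 3.65)):
        f_eff = implied_parallel_fraction(measured, 4)
        print(f"{machine}: measured {measured:.2f}, implied f = {f_eff:.3f}")
```

Under this model the measured speedups correspond to effective parallel fractions slightly below 98%, which is one way to express the synchronization and system overheads discussed above.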
Conclusions
The faster memory chips on recent models of the CRAY-2 provide some improvement on our benchmarks, but do not, by themselves, allow the CRAY-2S to perform better than the X-MP/416 in single-processor mode. This is because too much of the memory bottleneck on the CRAY-2 is due to factors other than chip access time.
The biggest improvements we have observed during the evolution of the CRAY-2 are derived from compiler changes, not hardware changes. In particular, HYDRO and WAVE, two benchmark codes that closely resemble production codes at the Laboratory, benefit significantly from the combination of new hardware and a new version of CFT77 on the CRAY-2S.
The X-MP has a clear performance advantage over the CRAY-2S on our codes that are highly vectorized. However, the difference between these machines is less clear on codes that are not overwhelmingly vector. The significant factor here appears to be the longer memory latency on the CRAY-2. Although the CRAY-2 provides more central memory than the X-MP, we have shown that on one code that takes advantage of the X-MP SSD, the faster processor and high I/O rates can overcome the lack of X-MP memory.
In multitasking mode, the CRAY-2 performs about as well as the X-MP on the problem that we ran. While the overall times are not as fast as the X-MP, the speedups are comparable. The overhead observed for the problem we ran could be reduced either by running a larger problem or by using more efficient synchronization (microtasking).
Appendix
GAMTEB is a Monte Carlo photon transport code. This is a relatively small model code with a simple source and straightforward geometry. It is only slightly vectorizable.
SCALGAM is a Monte Carlo photon transport code that uses the methods of GAMTEB, but with more complicated geometry, more materials, and more statistics gathered. It requires 64-bit arithmetic for its random number generator, as does GAMTEB. It also does not vectorize.
LSS is a linear system solver from LINPACK [Dongarra et al. 1979] for systems of equations of order 100. It uses the method of Gaussian elimination. Although it is fully vectorizable, it is not optimized for supercomputers. Library routines supplied by supercomputer manufacturers will achieve considerably higher execution rates.
MCNP is a general-purpose Monte Carlo code [Booth et al. 1986], heavily used at the Laboratory and elsewhere, that does neutron, photon, or coupled neutron/photon transport. It includes the ability to calculate eigenvalues for critical systems. The code treats an arbitrary three-dimensional configuration of materials in geometric cells bounded by first- and second-degree surfaces and some special fourth-degree surfaces. Point-wise cross sections are used throughout. The test problem includes a fair sample of the commonly used features of the code. The code has been parallelized for several different parallel processors. Typically 100,000 source particles are started. The code does not vectorize.
ESN is a one-dimensional, discrete ordinates, particle transport code that solves the transport equation by the discrete ordinates method [Wienke 1982]. The current algorithm implemented in ESN was developed by Wienke and Hiromoto [1985]. Particles are described by a flux, defined at each point in space and time, and the flux is a function of particle energy and direction of flight. The discrete ordinates method involves discretizing all these variables (space, time, energy, and angle) and applying an iterative solution scheme. There are 16 energy groups involved. The code does not vectorize.
HYDRO is a two-dimensional Lagrangian hydrodynamics code based on an algorithm by W. D. Schulz [1964]. HYDRO is representative of a large class of codes in use at the Laboratory. The code is 100% vectorizable. A typical problem is run on a 100 x 100 mesh for 100 time steps.
WAVE is a two-dimensional, relativistic, electromagnetic particle-in-cell simulation code used to study various plasma phenomena [Morse and Neilson 1971]. WAVE solves Maxwell's equations and particle equations of motion on a Cartesian mesh with a variety of field and particle boundary conditions. The benchmark problem involves 500,000 particles on 50,000 grid points for 20 time steps; about 4 Mwords of memory are required.
