This report summarizes the early experience in using the Intel iPSC/860 parallel supercomputer at Oak Ridge National Laboratory. The hardware and software are described in some detail, and the machine's performance is studied using both simple computational kernels and a number of complete applications programs.
Introduction
Today's leading supercomputers are capable of performing over 1 billion floating point operations per second (GFLOPS) . The current generation of conventional supercomputers, typified by the CRAY-2 and CRAY Y-MP as well as a number of Japanese machines, attain such prodigious computational speeds by combining a small number of very powerful vector processors (typically 4 to 8) having a cycle time of a few nanoseconds (typically 4 to 10). At the opposite extreme, another means of providing GFLOP performance is through massive parallelism, in which tens of thousands of very small processors are employed. This approach is typified by SIMD architectures such as those available from Thinking Machines and MasPar. Because of the very limited power and memory of the individual processors, such machines require a very fine granularity of parallelism in applications programs.
An intermediate approach between these two extremes is that of medium-grain, distributed-memory multicomputers, in which a few hundred to a thousand 32or 64-bit microprocessors are combined by an interconnection network. Such medium-grain parallel machines potentially have a price-performance advantage over either of the other two approaches in that they require fewer custom parts. They use mostly commodity parts whose development and manufacturing costs are amortized over hundreds of thousands produced for the personal-computer and workstation markets. The most successful instances of this approach to date have been parallel computers called &dquo;hypercubes,&dquo; named for the topological structure of the network interconnecting the processors. The hypercube architecture, first practically realized at Caltech (Seitz, 1985) , has served as the basis for a number of commercial machines, some of which ultimately failed in the marketplace (Ametek/Symult and FPS T-seiies), but others of which have been notably successful (Intel and Ncube).
The aggregate performance of such a mediumgrain machine is determined by the performance of its constituent microprocessors, the bandwidth and latency of its interconnection network, and the total number of processors. Given their relatively low-powered processors and limited memory, the first one or two genera-tions of hypercubes were not bona fide supercomputers in that they were not yet competitive with the fastest conventional machines at the time unless extremely large numbers of processors were used. Nevertheless, these early machines were valuable tools for computational scientists to learn to deal with parallelism in applications. With the advent of RISC designs and other technological advances, very high performance microprocessors (with cycle times as small as 25 nsec) have become available and are now making their way into multiprocessor and multicomputer architectures. Consequently, this class of architectures has moved into the GFLOP performance range, with the commercial release of the Intel iPSC/860 and the Ncube 6400 hypercubes, and is now competitive with any other class of general-purpose supercomputers.
Oak Ridge National Laboratory (ORNL) has had a long association with commercial hypercubes. It was one of the first recipients of the Intel iPSC/1, iPSC/2, and original Ncube/ten (now called Ncube 3200) machines.
These machines have been used for basic research in parallel algorithms as well as for a variety of applications at ORNL (Heath, 1987) . In January 1990, ORNL received one of the two beta test machines of the new iPSC/860 hypercube produced by Intel (the other was delivered to NASA Ames Research Center). The iPSC/860 is also known as the Touchstone Gamma Prototype, since it represents an early phase of Intel's Touchstone project, whose development is funded in part by DARPA.
The purpose of this paper is to summarize ORNL's early experience with the Intel iPSC/860 machine. In the sections to follow, we present an overview of both the hardware and software of the iPSC/860, performance data for some basic computational kernels, and results of some initial applications implemented on the machine, including comparisons with performance of the same applications on more conventional supercomputers. One of these applications, superconductivity, is discussed in some detail in order to gain an appreciation for the work done in adapting it for parallel execution. The work reported in this paper involved the efforts of many others in addition to the listed authors. Contributions by key individuals are noted in the appropriate sections. The reader should keep in mind that our conclusions are based on experience with a beta release of the iPSC/860 during the 3-month period preceding the first shipments of regular production models to customers. Thus, one should naturally expect more bugs and instability in both hardware and software than might be considered tolerable in a mature product. It is also unrealistic to expect highly tailored and optimized development tools in such a new environment. Nevertheless, even during this early stage, the iPSC/860 has proved capable of world-class performance and shows great promise for tackling the grand challenges of computational science.
II
Intel iPSC/860 Hardware Each computational node in the iPSC/860 consists of an Intel i860 processor plus memory and communication components. The iPSC/860 at ORNL has 128 such nodes, the maximum configuration available. Each computational node has 8 megabytes of memory, for an aggregate total of one gigabyte of RAM. Each i860 processor features an integer core unit, pipelined floatingpoint units for addition and multiplication, a graphics unit, memory-management support, a large register set, separate instruction and data caches, and 64-bit data paths, all integrated into a single chip having about I million transistors (Kohn and Margulis, 1989; Margulis, 1989) . With a clock rate of 40 MHz, each i860 processor has a peak execution rate of 32 MIPS integer performance, 80 MFLOPS 32-bit floating-point performance, and 60 MFLOPS 64-bit floating-point performance. Thus, the aggregate peak performance rate of the 128processor iPSC/860 is more than 7 GFLOPS (64-bit) and 10 GFLOPS (32-bit). It should be kept in mind, however, that peak execution rates are based on optimal conditions that are difficult to realize or sustain in practice. In particular, the peak rate for the i860 assumes ideal instruction mix, cache utilization, data alignment, pipelining, and so forth. These issues are discussed in greater detail below.
The processors in the iPSC/860 are interconnected by a seven-dimensional hypercube network in which &dquo;wormhole&dquo; routing hardware is used to provide efficient message routing between nonadjacent processors.
The network essentially provides circuit switching (as opposed to packet switched, store-and-forward message routing), thereby effectively emulatinlg a fully connected network, with very little penalty for nonlocal communication. The peak data transfer rate across the hypercube interconnection network between any two nodes is 2.8 megabytes per second.
In addition to the 128 computational nodes, ORNL's iPSC/860 has four input/output nodes, each of which has an Intel 80386 processor and two 650-megabyte (formatted) disks, for an aggregate total of more than 5 gigabytes of disk space. These IIO nodes and the disks they support are directly accessible to the computational nodes over the interconnection network. Peak data transfer rate between a single computational node and the I/O node disks is about 1.5 megabytes per second. When more computational nodes access the I/O disks simultaneously, the aggregate throughput initially increases, peaking at about 3 megabytes per second, but eventually degrades as a result of contention as still more processors are used. For more detailed performance data for the iPSC/860 on basic I/O, communication, and arithmetic operations, see Dunigan (1990) .
Like most machines of its type, the iPSC/860 is not a stand-alone machine, but requires a host machine to serve as its interface to the outside world for program development, resource management, and external network access. The host machine, known in Intel tenminology as the System Resource Manager (SRM), is an Intel 301 microcomputer, which features an Intel 80386/387 processor pair running at 16 MHz, 8 megabytes of RAM, a 300-megabyte disk, a cartridge tape unit, and an Ethernet network connection. The SRM is attached to the hypercube network, and this link provides a peak data transfer rate of more than 1 megabyte per second.
Intel iPSC/860 Software The user interface and software environment for the iPSC/860 reside primarily on the SRM. The SRM runs Unix System V, Release 3.2, with support for TCP/IP networking and the Network File System (NFS) via Ethernet. The disk space on the I/O nodes is managed by a separate Concurrent File System (CFS) that is not cur-rently integrated with the SRM disk or NFS. A special shell is provided, however, for accessing and managing CFS files from the SRM.
The computational nodes in the iPSC/860 system run a simple operating system kernel called NX that supervises process execution and provides buffered, queued message passing (including communication to I/O nodes or the SRM). As with other MIMD hypercubes, the programming model for the iPSC/860 is based on adding explicit communication calls (send/recv) to serial code written in a conventional programming language (C or Fortran) . At present, there is no automation provided to aid in parallelizing programs, but a node debugger is available.
Compilation of either C or Fortran for the i860 node-processor target takes place on the SRM. The cross-compilers currently available do not take specific advantage of any of the special features of the i860 processor (dual instruction mode, etc.) that give it its unusually high performance. The result is that compiled code from high-level language source generally runs at about an order-of-magnitude lower performance than the peak rates expected for the i860. Specific performance data are detailed below.
The i860 development tools currently available (compilers, linker, assembler, and archiver) run very slowly on the SRM (much more slowly than their counterparts for the 80386/387 target), even for a single user. If several users run the i860 development tools on the SRM simultaneously, the SRM slows to a crawl. For example, at ORNL we have i860 application programs that cannot be built on the SRM in an 8-hour shift.
Thus, although the computational performance of the iPSC/860 is competitive with conventional supercomputers, it is not yet in the same league in terms of the program development cycle. Intel and a number of third-party software houses are working on enhanced compilers and other development tools for the i860 that should be much more efficient, both in building programs and in executing them on the i860. In addition, another obvious route to alleviating some of the SRM bottleneck would be to move program development elsewhere on the network onto higher-powered workstations via additional cross-compilers, or onto the hypercube itself. Such improvements will be necessary be-&dquo;Basic operations on vectors and ma-trGces are common in all areas of scientific computing. These fundamental building blocks fonn the inner loops of many numerical algorithms and are a dominant factor in determining the performance of many numerically intensive applications programs The performance of these computational kernels is therefore of great interests on any computer architecture, and they tend to be among the grst benchmarks on any new processor.&dquo; fore the iPSC/860 can become the same kind of everyday production workhorse that one expects of conventional supercomputers, such as the Cray series, for which compilations seem almost instantaneous.
Performance on
Computational Kernels Basic operations on vectors and matrices are common in all areas of scientific computing. These fundamental building blocks form the inner loops of many numerical algorithms and are a dominant factor in determining the performance of many numerically intensive applications programs. The performance of these computational kernels is therefore af great interest on any computer architecture, and they tend to be among the first benchmarks run on any new processor. The definitions and user interfaces for these low-level operations have been standardized in the Basic Linear Algebra Subprograms (BLAS) (Lawson et al., 1979) , which in turn form the basis of portable implementations of higher-level matrix operations, such as solving systems of linear equations (for example, see Dongarra et al., 1979) .
When implemented in a high-level programming language such as Fortran or C, the BLAS can be made portable across a wide range of computers, but their performance rarely approaches the theoretical peak on any individual machine. The usual approach, therefore, is to implement custom-coded versions of the BLAS in assembler language for each particular processor, while maintaining the same user interface across all implementations so that programs that call the BLAS will remain portable, while retaining the speed advantage of assembler coding in their inner loops. High-level language implementations of the BLAS are still of interest, however, in that the performance gap between them and optimized implementations in assembler language serves as an indication of the effectiveness of the compilers for a given machine.
We have implemented a number of the most important BLAS in assembler language for the Intel i860 processor and compared their performance with standard implementations in Fortran and C. Both in writing these codes and in testing their performance, we were confronted with a number of options regarding methodology. The i860 processor has several features, and corresponding instructions in its instruction set, that potentially enhance its performance, but whose exploitation may limit the general applicability of the resulting code. For example, the &dquo;quad load&dquo; feature allows the fetching of 128 bits of data from memory with a single instruction, but only if the data are aligned on a &dquo;quad word&dquo; boundary (i.e., a byte address that is a multiple of 16). The use of this capability substantially increases the effective bandwidth between processor and memory, but in many applications it is impractical or impossible to meet the concomitant restriction on data alignment. Thus, in writing a general-purpose code, one must either forgo using this special feature entirely, or else detect those (possibly rare) cases for which it is applicable and exploit it only in those cases. Clearly, this issue must be kept in mind when choosing benchmark tests and interpreting the results.
The theoretical peak performance figures cited earlier for the i860 are based on ideal conditions, including alignment of data on proper word boundaries, perfect pipelining, no cache misses, an instruction mix that exactly matches the functional units of the processor, optimal use of dual instruction mode, and so forth. Full realization of these conditions in real programs for any sustained period of time is undoubtedly rare, but they can conceivably be achieved in simple, artificial benchmark tests such as isolated tests of individual routines from the BLAS. However, we are confronted here by another thorny question of methodology regarding cache usage. An individual call to a single BLAS routine on a high-performance processor is too brief to yield reliable timing results. The usual approach to such a problem is simply to replicate the test many times, perhaps several thousand, so that overall execution times can be measured accurately. Unfortunately, a far higher percentage of cache hits is likely to occur in such a replicated test than would be experienced in one-time usage of the routine, thereby significantly skewing the results. On the other hand, there certainly are instances in actual practice, some of which are noted in the next section, in which data can be expected to remain in cache for sustained periods if the algorithm is carefully con-structed. The reader should keep these comments in mind when interpreting this section's results, which were obtained through replication in order to produce accurately measurable execution times.
In a high-level language such as Fortran or C, the user has little specific control over cache utilization, but in i860 assembler language, data traffic to and from memory can bypass cache at the programmer's option to obtain better overall cache utilization. The general principle, of course, is that reusable data should be cached, while nonreusable data should bypass cache so that they do not displace any reusable data that may already reside there. For example, one of the most important computational kernels in linear algebra is to compute the result of a scalar times a vector plus a vector, commonly known in BLAS terminology as axpy, y = ax + y, where x and y are vectors and a is a scalar. Cache management is particularly important for this operation because of the different roles played by the variables involved. In particular, y is both fetched and stored, while x is only fetched, so it may be advantageous to cache y, while bypassing cache with x. Figures 1 through 3 show results for the BLAS routines saxpy, daxpy, and zaxpy for data that are singleprecision real, double-precision real, and double-precision complex, respectively. In all cases the vector length is measured in bytes regardless of the precision involved. The execution rates shown were obtained using timing tests that make 10,000 successive calls of the basic routine, using a stride of 1, and with the same argument list for each call. Thus, after the first call, some data may remain in cache that are potentially reusable on successive calls, depending on the vector length involved. Some of the implementations shown use special instructions that load multiple words, but these require the input vectors to be aligned on special boundaries. The routines using the special instructions are indicated by an asterisk in the figures. These results are included for comparison, but it should be kept in mind that not all applications will be able to meet the restrictions on data alignment, and hence may not be able to take advantage of the higher performance offered by the kernel implementations that use these special features. The maximum execution rates achieved with assembler-coded axpy routines are about 55 MFLOPS with single-precision data, and about 27 MFLOPS with double-precision data. The 2:1 performance ratio between the two precisions for this computation is presumably due to the fact that the i860 can issue a single-precision multiply instruction on every clock cycle, but it can issue a doubleprecision multiply instruction only on alternate clock cycles. These maximum execution rates represent about 50% to 70% of the advertised peak performance of the i860, depending on the precision. Fortran implementations of axpy, on the other hand, achieve only 7% to 15% of theoretical peak performance, suggesting that the compiler is taking little or no advantage of the i860's high-performance features.
Cache effects can also be seen clearly in Figures 1 through 3. Performance falls off markedly when the vector size exceeds the 8 kilobyte cache size. This point is reached for vectors only half as large if both x and y are cached. And, of course, the longer word length of double-precision and complex data causes the cache to be saturated at a smaller vector size. As expected, caching only gives better performance than caching only x. The best performance occurs when y is cached and x is &dquo;quad aligned,&dquo; so that it can be piped from memory in 128-bit chunks. For very large vectors, the performance curves for all of the assembler routines converge to about the same level, which is little better than that of Fortran, suggesting that &dquo;strip mining&dquo; should be used to keep vector lengths within the cache size. Unlike the assembler routines, the Fortran routines are relatively insensitive to vector length and precision of data, but there is little virtue in this consistency. Figure 4 shows results for the BLAS routines sdot, ddot, and zdot, which compute the inner product of two single-precision, double-precision, and complex doubleprecision vectors, respectively. In our assembler implementations, one of the two vectors is cached, while the other is piped from memory, bypassing cache. Again we obtain some fairly clear-cut cache effects similar to those obtained earlier for axpy. Note, however, that we do not obtain the factor of 2 difference in performance between single precision and double precision for dot that we did for axpy. The relatively higher performance for zdot is presumably due to the larger ratio of arithmetic operations to memory accesses for complex arithmetic.
The assembler kernels whose performance is reported in this section were written by T. H. Dunigan, R. E. Flanery, and G. A. Geist.
Performance on Matrix Computations
Simple computational kernels such as those discussed in the previous section capture the flavor of the dominant inner loops of many scientific applications, but their performance as isolated modules does not necessarily reflect their performance when used within the context of a larger, more complicated program. Cache management, in particular, plays a major role in determining overall performance and is much less straightforward to optimize in a real program with complicated data structures, memory reference patterns, and multiple computational phases.
For benchmarking purposes, a useful compromise between simple computational kernels and fully detailed application programs is provided by more substantial operations on matrices, such as matrix factorization to solve systems of linear equations. With their more complex memory-reference patterns, such matrix operations place a more realistic and demanding load on data paths between processor and memory, and contain enough computation that performance can be measured accurately without replication. These features probably account for the popularity of the Linpack benchmark, in which a linear system of order 100 or 1,000 is solved as a basis for comparing floating-point performance of many computers (Dongarra, 1990) . At present, however, the Linpack benchmark code is limited to serial computers and a few shared-memory multiprocessors (with or without vector capability), so we have developed our own programs for matrix factorization for parallel execution on multiple processors of the distributed-memory iPSC/860.
Here we consider the Cholesky factorization A = LLT of a symmetric positive definite matrix A, where L is a lower triangular matrix. There are three basic algorithms for implementing this factorization, corresponding to different ways of arranging the triply nested loop that defines the computation (see George, Heath, and Liu, 1986 , for a full discussion and an explanation of the terminology we use here): row-Cholesky, for which the inner kernel is sdot, column-Cholesky, for which the inner kernel is saxpy, and submatrix-Cholesky, for which the inner kernel is saxpy.
Column-Cholesky and submatrix-Cholesky are both column-oriented and both use saxpy, but they differ in their memory-reference patterns. Column-Cholesky makes repeated calls to saxpy with the same y but different x vectors, whereas submatrix-Cholesky makes repeated calls to saxpy with the same x but different y vectors. Moreover, in a parallel implementation, column-Cholesky uses fan-in communication, while submatrix-Cholesky uses fan-out communication. Row-Cholesky makes repeated calls to sdot with one vector fixed and the other varying, and in a parallel implementation uses fan-out communication. These features suggest a different strategy for each of the algorithms in the use of cache by the underlying BLAS kernel.
The single-processor i860 performance of the three Cholesky algorithms is shown in Table 1. The variation in single-processor performance among the algorithms is due in part to differences in their effectiveness in exploiting cache. Note that we have not used &dquo;strip mining&dquo; in any of the algorithms to enhance the use of cache, so for large matrices the vectors involved in a given call to a BLAS routine may exceed the cache size. The row-Cholesky algorithm based on inner products is clearly the most effective algorithm for this computation on the i860 processor, probably because it requires no stores to memory (or cache) in accumulating the inner products, whereas the other two algorithms require stores in the inner loop. The performance of the serial Cholesky algorithms correlates well with the performance of their underlying BLAS kernels, as reported in the previous section.
Multicomputer performance of the row-Cholesky algorithms is shown in Figure 5 , using all 128 processors of the iPSC/860. In the multicomputer case, the use of double precision also doubles the necessary communication volume (measured in bytes), in addition to incurring a slightly lower arithmetic execution rate. We see that performance exceeds 1 GFLOP for single precision, and about 600 MFLOPS for double precision.
The work reported in this section was done by M. T. Heath, greatly assisted by the assembler kernels discussed in the previous section.
Superconductivity Computations
The discovery of high-temperature superconductivity in 1986 has provided the potential for spectacularly energy-efficient power transmission technologies, ultrasensitive instrumentation, and other devices. Each year new materials are added to the family of existing high-temperature superconductors. In general these materials are difficult to form and use, and some of the superconducting compounds are unstable. These difficulties are exacerbated by the lack of an accepted theory explaining superconductivity at higher temperatures. To further our understanding of the behavior of solids in general and of superconductors in particular, quantum-mechanical laws have been incorporated into sophisticated computer algorithms to predict from first principles the structural, vibrational, and electronic properties of matter.
Present calculations of the electronic structure of real materials usually employ a mean field approximation in which each electron is viewed as moving independently in a self-consistent potential due to all of the electrons and nuclei. According to density-functional theory, it is possible to express the energy of any system of electrons and nuclei as a unique functional of the electron density (Von Barth, 1984) . Since this functional is not known exactly, it is usually approximated by that of a homogeneous electron gas. This local density approximation to density-functional theory has been very successful when applied to metallic and semiconducting systems, but it appears inadequate to explain important physical phenomena such as optical band gaps and superconductivity found in transition metal oxides. More sophisticated treatments of the many-electron problem are possible, but have not been attempted previously because the Green's function and the susceptibility function needed to construct the electron self-energy are very difficult to calculate for real systems, especially those with narrow bands such as transition metal oxides.
The approach taken by researchers at ORNL is based on theoretical advances growing out of work on the Korringa, Kohn, and Rostoker coherent potential approximation (KKR-CPA) theory of alloys and magnetism (Korringa, 1947; Kohn and Rostoker, 1954) . The advantage in using the KKR-CPA approach is that it yields directly the Green's function for the system and thereby a direct way of calculating susceptibilities. The effects of disorder are treated in the CPA, which is an analytic technique for calculating the configurationally averaged Green's function (Soven, 1967) . The KKR approach is a natural context for implementing the CPA, because it is a Green's function method, and there is a natural separation between the lattice and the potential.
Researchers at ORNL and their colleagues have developed a self-consistent, semirelativistic KKR-CPA computer code that can handle multiple atoms per unit cell. The code has wide applicability to situations in which some form of substitutional disorder plays an important role, including metallic alloys, high-temperature superconducting compounds, metallic magnetism, and metalinsulator transitions. There are three primaiy reasons for parallelizing this code.
First, the KKR-CPA calculations are computationally intensive. A single KKR-CPA calculation commonly requires 10 hours of CPU time on a CRAY-2. An estimated 1,000 hours or more of CRAY CPU time would be needed to complete a single self-consistent computational experiment. The turnaround time for such experiments makes them prohibitive on conventional supercomputers.
Second, the KKR-CPA algorithm embodies natural parallelism that can be exploited to increase computational throughput. The feature we exploit is the calculation of the density of states (DOS) at a given energy level. Computation of the DOS at more than 100 energy levels is required to determine the Fermi level. Each DOS can be calculated independently of the other energy levels.
The availability of a parallel computer with GFLOP performance, the iPSC/860, has made it feasible and attractive to develop an efficient parallel version of the KKR-CPA code.
In adapting the KKR-CPA code to a parallel setting, the modifications to the code were made in such a way that it can still be run on Cray computers, and smaller problems can be run on scientific workstations. Having only one consolidated code to maintain has made software modifications much simpler to implement than if three versions of the code had to be kept up to date. Moreover, the user interface is identical across all the machines the code runs on, which has been an important factor in inducing scientists to use the code in a relatively unfamiliar parallel environment. Explicit parallelism is hidden from the user, with operations such as allocating a number of processors and loading programs onto these processors performed automatically by the code. The number of processors to be used in a given computational experiment is specified in the input file, making it easy for the user to control the degree of parallelism for each run of the program. Organizing the code in this way required writing only a few new routines to be added to the existing serial code. None of the additional routines involved new calculations, so exactly the same computational routines are called in the serial and parallel versions.
A master/slave paradigm is used in the parallel implementation. In this scheme, one processor controls work on the entire problem, and the remaining processors perform work requested by the master process. The master process in our implementation is called the pseudo-host and executes on one of the iPSC/860 nodes. We avoided using the SRM for the master process because of the computational imbalance between its 80386-based processor and the much more powerful i860-based computational nodes. The SRM is also burdened with executing the Unix operating system and program development tools for time-shared use by many users.
The KKR-CPA algorithm is organized in the following way. We start by inputting the atomic numbers of the species and initial estimates for the charge density and potentials. Since the Green's function for the system at any energy is independent of any other energy, this is a natural point in the algorithm at which to exploit parallelism. In the parallel implementation, the energies to be evaluated are held in a queue of tasks. The difficulty of each task is initially unknown, so a heuristic strategy is used to arrange the queue in order of approximately decreasing difficulty. Each idle processor selects the next task in the queue and returns the DOS to the master Fig. 4 Table 2 Average Execution Rate for Superconductor Test Problem on Several Computers process, which computes the integral over all energies. Load balancing is achieved naturally, with all the processors remaining busy as long as there are tasks left in the queue.
The most computationally intensive portion of the tasks assigned to the processors is integrating the KKR matrix inverse over the first Brillouin zone. To evaluate the integral, hundreds or possibly thousands of complex double-precision matrices of order between 80 and 300 must be formed and inverted. Each matrix corresponds to a different vertex of the tetrahedrons into which the Brillouin zone has been subdivided. The results of the integration are used to compute the Green's function for the system and the DOS for the given energy.
A further outer iteration is necessary to incorporate self-consistency of the charge density into the KKR-CPA code. This outer iteration involves integrating the Green's function over energy to obtain the charge density, which is used to derive the potential for the next iteration. Thus, the entire process described so far is repeated several times in the self-consistent version of the code, greatly magnifying the already substantial computational demands of the program.
Using the high-temperature superconductor B a 1 _ X KxBi03 as a test problem, we ran the consolidated KKR-CPA code on several computers. The test problem requires the calculation of the DOS for a fixed number of representative energies, without iterating to self-consistency. The average execution rate for a range of computers is shown in Table 2. The KKR-CPA code using only Fortran runs at a rate of about 130 MFLOPS on a single processor of the CRAY Y-MP. Performance increases to about 203 MFLOPS when assembler-coded BLAS are used. The addition of multitasking to make use of all eight processors of the CRAY Y-MP yields an aggregate performance of more than 1.5 GFLOPS. The rate shown for the iPSC/860 includes the time to load the problem onto 128 processors, all communication, file I/O (four fairly large output files are generated), and dynamic load-balancing overhead. The rate of 725 MFLOPS was attained using only compiled Fortran. This rate was increased to about 1.8 GFLOPS by using an assembler language zaxpy in the inversion routine and in the formation of the KKR matrix. The test problem used here is too small to attain the asymptotic execution rate of which the code is capable on the 1860. Larger problems are expected to yield execution rates of at least 2.5 GFLOPS.
The use of parallel computation on the iPSC/860 has led to more than an order-of-magnitude improvement in computational speed compared to a single processor of the Cray supercomputers previously used for the KKR-CPA code. From a research standpoint the improvement in turnaround time for computational experiments is even more substantial, since each subcube of the iPSC/860 is dedicated to only one user at a time, while the Cray machines are time-shared by many users. This greater computational power allows us to begin investigating many unanswered questions in superconductivity and materials science. For example, several studies are under way on the effects of alloying in the two perovskite superconducting compounds BaJ-x KxBi03 and BaPb~_xBix03.
The work reported in this section was done by G. A.
Geist, W. A. Shelton, and G. M. Stocks. For further details, see Geist et al. ( 1991 ) .
Plasma Flow Computations
One of the obstacles to the design of a magnetic-fusion reactor is an understanding of anomalous transport mechanisms that destroy plasma confinement. The onset of plasma turbulence can be studied numerically by considering the dynamics of the plasma edge. Detailed measurements are possible at the plasma edge using probes, so that experimental verification of numerical models is possible. The study of plasma edge turbulence faces many of the same challenges as the classical fluid turbulence problem. All the important time scales and length scales must be resolved in a numerical computation, and this strains the abilities of present supercomputers. In addition, since the plasma is not a perfect conductor, the turbulence can cause changes in the topology of the magnetic field that are critical for plasma confinement.
A code for studying plasma instabilities based on the reduced magnetohydrodynamic (MHD) equations has been in use by researchers in the Fusion Energy Division of ORNL for several years (Hicks et al., 1981) . This code has been optimized for use on the Cray machines available at the National Energy Research Supercomputer Center. As a pilot study, the code was implemented on the Intel 1PSC/I hypercube (Drake et al., 1987) . The MHD equations in a toroidal geometry are discretized by a pseudo-spectral method, with derivatives in the time and radial directions approximated by finite differences while functions of the two angular variables are expanded in Fourier series. Derivatives in the angular variables are performed analytically. All quantities are stored in spectral form. The nonlinear convection terms are taken to be explicit in time, while linear terms are treated implicitly. The convolutions arising from the nonlinear terms are performed analytically, rather than using a fast Fourier transform. This allows for explicit study of mode interactions during the nonlinear evolution.
Parallelism is incorporated by a spatial domain decomposition of the radial coordinate. This approach preserves data locality for the computationally intensive convolution calculation. The implicit terms require solving multiple tridiagonal systems distributed across the processors. A ring-based, pipelined solution strategy is used for this phase. Results on the iPSC/1 indicated that the calculation is well suited to large-scale parallel computation, attaining parallel efficiencies above 90%. But as a practical matter, on this early hypercube the run time was considerably longer than corresponding runs on Cray machines. The iPSC/ is also limited by its relatively small memory of 512 kilobytes per processor. Only small problems having 20 to 30 modes fit in memory, whereas interesting simulations involve 500 or more modes.
The second generation Intel hypercube, the iPSCl2, offers an improvement in both processor speed and memory size, but still falls short of the CRAY-2 in overall run time. The 4 megabytes of memory per processor accommodates much larger problems, but a 500mode case remains infeasible on the iPSC/2. The Intel iPSC/860, on the other hand, with 8 megabytes of memory per processor and vastly improved computational speed, has the potential to run full-sized simulations with times comparable to those achieved by any other supercomputer available. Figure 6 shows execution times on two generations of Intel hypercubes, the iPSC/2 and iPSC/860. Each data point represents the CPU time required to take 10 time steps using 16 processors. For the one problem size in common between the two series of runs (29 modes), the iPSC/860 is faster than the iPSC/2 by a factor of 7.4. This differences is consistent with the combination of computation and communication speeds of the two machines. Although the peak speed in Fortran of the i860 processor is at least a factor of 10 greater than that of the i386/387 processor pair of the iPSC/2, the communication speed of the iPSC/860 is only a factor of 5 or so faster than the iPSC/2 for small messages. Table 3 shows CPU times on the iPSC/860 compared with the corresponding single-processor CRAY-2 time for a 300-mode case. The Cray code uses 60-bit precision, while the iPSC/860 code uses 64-bit precision. The Cray code has unrolled loops and takes advantage of vectorization of the inner loop of the convolution calculation. We have begun to experiment with optimizing the inner loop on the iPSC/860. By rearranging the order of loop indices we can obtain a factor of 2 increase in performance. We find that the unoptimized inner loop executes at about 1.9 MFLOPS per processor. With compiler optimization, the execution rate increases to 4.5 MFLOPS. We expect to increase this speed significantly in the future by using assembler-coded routines.
Since the timing results from these tests appear favorable for large-scale computations simulating plasma edge turbulence, we are developing more extensive models, such as the KITE code (Garcia et al., 1986) , for use on the iPSC/860. The work reported in this section was done by J. B. Drake and V. E. Lynch. For further details see Carreras et al. (1990) .
Atomic Physics Computations
The collisions of heavy ions at high energy levels can be simulated using a quantum electrodynamic framework, which is somewhat simpler and more tractable than the full coupling of quantum chromodynamics. One of the structures to emerge from the collision is a strongly cou-pled lepton-antilepton pair. This structure usually decays in less than 10 -19 seconds. The simulation of the production and decay of leptons is a formidable computational challenge. The ability to simulate the production of such pairs is important in the design of experiments for colliders currently under construction. Until recently, most accelerator designers have worked in the domain where fundamental interactions can be decoupled from engineering considerations, with little or no involvement in the modeling of basic processes on supercomputers. It is now recognized, however, that beam stability and focusing in all advanced collider designs are strongly influenced by pair production. The theory of these processes is rapidly evolving both for high-energy heavy-ion colliders and for electron-positron linear colliders (Month, 1988 ).
The fundamental model for heavy-ion collisions is based on Dirac's equation (Umar et al., 1991) . The solution of this equation gives the motion of the leptons, which are assumed to move independently of the classical electromagnetic fields. Dirac's equation relates the time rate of change of the particle probability distributions to spatial derivatives of the distributions. This equation is linear, so that propagation of the distribution in time can be represented as an operator exponential. Methods developed for the simulation of lepton-pair production are also applicable to related nonrelativistic problems in nuclear, chemical, surface, and plasma physics.
A B-spline collocation method for Dirac's equation has been implemented on the iPSC/2 and 1PSC/860, as well as on the Cray and the FPS T-series computers. The collocation method uses basis splines of user-selectable order. The high-order splines have better approximation properties than low-order splines or the simple interpolants typically used in finite difference and fmite element discretizations. The B-spline collocation method thus has excellent convergence and accuracy properties. The method also allows implementation in a storageefficient tensor product style, in which the effects of the discrete operator in each coordinate direction are separated. The desired level of resolution is a 100 x 100 x 100 lattice. The computations typically involve solving an eigenvalue problem for the initial minimum energy state and then taking several thousand time steps through a transient pair production and decay. The eigenvalue calculation is an iterative procedure using only the operator ; the tensor form of the operator replaces explicit formation of a matrix representing the operator.
Parallelism is introduced by domain decomposition, with processors and data assigned in a two-dimensional grid or, with edges connected, a torus arr angement. Applying the discrete Dirac operator to a state vector requires three matrix-matrix products, one for each coordinate direction. Two of these directions, y and z, have data divided among the processors by the domain decomposition. The first matrix of the product represents the derivatives of the Dirac operator in the particular coordinate direction. This matrix is of order 100 for the desired resolution and can easily be formed and stored on each processor, so that only the state vector is then divided among the processors. A &dquo;roll&dquo; algorithm similar to the well-known parallel matrix-matrix product algorithm (Fox et al., 1989 ) is employed to accumulate the results. This phase requires nearest-neighbor communication on the grid of processors, accumulating the results for the y direction using rings of processors in one direction and then accumulating the results for the z direction using rings of processors in the perpendicular grid direction. The inner loop of the implementation is a saxpy, even though all of the state vectors are complex. Real and imaginary parts of the state vector have been rearranged to gain efficiency from the use of only real arithmetic.
Preliminary performance data for this code on the iPSC/860 are shown in Figure 7 , which shows dramatic improvements in computational speed with the use of an assembler-coded saxpy for the inner loop calculation. The crossover in performance between the two codes based on assembler-coded saxpy is due to the lower efficiency of the aligned code when the vector lengths are short, which is the case when a fixed-size problem is spread over more processors. Timing studies and operation counts show a dependence on the number of processors, ,~, and the grid resolution, ~a, that is proportional to n4/p. Physically interesting results computed at the rates shown and at the desired resolution can be obtained with runs lasting approximately 1 day. Efforts to optimize the performance of the code for the 1PSC/860 have thus far taken precedence over exploring the physics of pair production using the iPSC/860's computational power. The latter activity will begin in earnest during the next few months. The work reported in this section was done by C. Bottcher, J. B. Drake, R. E. Flanery, and M. R. Strayer.
Conclusion
In this paper we have summarized our early experience with a new supercomputer system, the Intel iPSC/860. We have described the hardware and system software and reported on performance, both for some simple computational kernels and for some complete application programs. We have seen that the iPSC/860 is capable of GFLOP-level performance and is competitive, at least on some applications, with the fastest conventional vector supercomputers available. The iPSC/860 is even more impressive in price-perf ormance, since its cost is an order of magnitude less than competitive conventional supercomputers. The iPSC/860 gains this performance by using a medium number (128, which, while not trivially small, is certainly not &dquo;massively parallel&dquo;) of relatively powerful processors. Although the machine is competitive with conventional supercomputers in its performance on some applications, its software environment is not yet sufficiently mature to be competitive in program development effort or cost.
An important issue that we have not discussed thus far is the speedup or efficiency with which the various application programs run on the iPSC/860. True speedup and efficiency are difficult to measure for large-scale applications on a distributed-memory parallel architecture, since there is usually insufficient memory on a single node to obtain a serial execution time for a full-size problem of scientific interest. A very conservative estimate of efficiency can be obtained by using the known peak execution rate of the individual processors, thereby determining what fraction of the total number of FLOPS available were actually used. By this measure, the superconductivity application code is quite efficient, running at about 14 MFLOPS per node for the test problem on 128 nodes. For this range of vector lengths, its zaxpy kernel should run at about 21 MFLOPS on one node (see the &dquo;x cached&dquo; curve in Fig. 3 ), assuming there are no cache misses. In fact, with cache misses taken into account, the kernel of the superconductivity code runs at about 18 MFLOPS on one node. The remaining difference between 18 MFLOPS and the observed 14 MFLOPS per node for this test problem using the parallel code is accounted for by I/O operations (reading and writing data files, which would be present in either a serial or parallel code) and by parallel overhead, such as interprocessor communication. It should be noted, however, that both the I/O and communication overhead grow much more slowly than the computational work as the problem size grows, so that by solving larger problems the efficiency of the superconductivity code becomes arbitrarily close to 100%, subject to the limitations of memory capacity. The other two applications we discussed, on the other hand, require a much greater frequency and volume of interprocessor communication, and thus execute with substantially lower parallel efficiency and at per node rates that are far below the peak speeds of their serial kernels. For example, for a problem small enough to solve on a single node, the atomic physics code runs at an efficiency of about 35% on 8 nodes, and 23% on 16 nodes. Of course, this efficiency improves with increasing problem size, but because of the greater communication load, it never reaches the very high efficiency of the superconductivity code.
There have been a number of developments concerning the iPSC/860 in the period since this paper was written that are worth noting. Intel now supplies a library of BLAS routines for the i860, including the axpy and dot families, that are similar in performance to those reported above, and thus users no longer have to write these assembler codes themselves. Intel also now supplies a new compiler for the i860 that runs on Sun workstations, thereby relieving the SRM host of this chore and greatly speeding up program development and compilation time. Unfortunately, the execution speed of the code produced by this compiler is little, if any, better than that produced by the previous compiler, so peak performance of the i860 still can be approached only through assembler coding. Finally, further tuning and solving larger problems have brought the performance of the superconductivity code up to 2.5 GFLOPS on the 128-processor iPSC/860 and 2.3 GFLOPS on the 8-processor CRAY Y-MP. 
BIOGRAPHIES

