This paper presents arguments supporting the thesis that clusters of SMPs interconnected with general purpose networks (perhaps several different networks to support both low-latency and bulk data transfers) are the architecture that will provide scalable high-performance computing environments. The paper presents an analysis of the high performance computing environment based on the state of technology in early 1996. The evolution of the technology base since that time has only strengthened the author's opinion that this architecture will be successful.
Summary
In this paper we argue that the next generation of supercomputer will be based on tight-knit clusters of symmetric multiprocessor systems in order to: i) provide higher capacity at lower cost; ii) enable easy future expansion, and; iii) ease the development of computational science applications. This strategy involves recognizing that the current vector supercomputer user community divides (roughly) into two groups, each of which will benefit from this approach. The first group, the "capacity" users (who tend to run production codes aimed at solving the science problems of today), will get better throughput than they do today by moving to large symmetric multiprocessor systems (SMPs). The second group, the "capability" users (who tend to be developing new computational science techniques), will invest the time needed to get high performance from cluster-based parallel systems.
In addition to the technology-based arguments for the strategy, we believe that it also supports a vision for a revitalization of scientific computing. This vision is that an architecture based on commodity components and computer science innovation will: i) enable very scalable high performance computing to address the high end computational science requirements; ii) provide better throughput and a more productive code development environment for production supercomputing; iii) provide a path to integration with the laboratory and experimental sciences, and; iv) be the basis of an on-going collaboration between the scientific community, the computing industry, and the research computer science community in order to provide a computing environment that is compatible with production codes and dynamically increasing in both hardware and software capability and capacity.
We put forward the thesis that the current level of hardware performance and sophistication of the software environment that is found in commercial symmetric multiprocessor (SMP) systems, together with advances in distributed systems architectures, make clusters of SMPs one of the highest performance, most cost effective approaches to computing available today. The current capacity users of the C90-like system will be served in such an environment by having more of several critical resources than the current environment provides: much more CPU time per unit of real time, larger memory per node and much larger memory per cluster; and the capability users are served by an MPP-like performance and an architecture that enables continuous growth into the future. In addition to these primary arguments, secondary advantages of SMP clusters include: The ability to replicate this sort of system in smaller units to provide identical computing environments at the home sites and laboratories of scientific users; the future potential for using the global Internet for interconnecting large clusters at a central facility with smaller clusters at other sites to form a very high capability system, and; a rapidly growing base of supporting commercial software.
The arguments made to support this thesis are as follows: 1) Workstation vendors are increasingly turning their attention to parallelism in order to run increasingly complex software in their commercial product lines. The "workstation" manufacturers' investment in hardware and software research and development is so large, and the resulting pace of development so rapid, that special purpose research aimed at just the high performance market is no longer able to produce significant advantages over the mass market products. We illustrate this trend and analyze its impact on the current performance of SMPs relative to vector supercomputers.
2) Several factors also suggest that "clusters" of SMPs will shortly out-perform traditional MPPs for reasons similar to those mentioned above: The mass produced network architectures and components that are being used to interconnect SMP clusters are experiencing technology and capability growth trends similar to commodity computing systems due to the economic drivers of the merging of computing and telecommunications technology, and the greatly increased demand for high bandwidth data communication. Very high speed general purpose networks are now being produced for a large market, and the technology is experiencing the same kinds of rapid advances as workstation processor technology. The engineering required to build MPPs from special purpose networks that are integrated in special ways with commercial microprocessors is costly and requires long engineering lead times that result in delivered MPPs with less capable processors than are being delivered in workstations at the same time.
3) Commercial software now exists that provides integrated, MPP-style code development and system management for clusters of SMPs, and software architectures and components that will provide even more homogeneous views of clusters of SMPs are now emerging from several academic research groups.
We propose that the next generation scientific supercomputer center be built from clusters of SMPs, and suggest a strategy for an initial 50 Gflop configuration, and incremental increases thereafter to reach a teraflop by just after the turn of the century.
While this cluster uses what is called "network of workstations" technology, the individual nodes are, in and of themselves, powerful systems that typically have several gigaflops of CPU and several gigabytes of memory.
The risks of this approach are analyzed and found to be similar to those of MPPs. That is, the risks lie primarily in software issues common to SMPs and MPPs: namely, in the provision of a homogeneous view of a distributed memory system. The argument is made that the capacity of today's large SMPs, taken together with already existing distributed systems software, will provide a versatile and powerful computational science environment. We also address the issues of application availability and code conversion to this new environment even if the homogeneous cluster software environment does not mature as quickly as expected.
The throughput of the proposed SMP cluster architecture is substantial. The job mix is more easily load balanced because of the substantially greater memory size of the proposed cluster implementation as compared to a typical C90. The larger memory allows more jobs to be in the active schedule queue (in memory waiting to execute), and the larger "local" disk capacity of the cluster allows more data and results storage area for executing jobs.
Goals and Strategy for the Future of Scientific Computing and Data Access
The goals that are addressed with the proposed approach and architecture are to provide:
1) Continuing and improved support for the current production scientific supercomputing community;
2) A powerful computing capability that will serve the scientific Grand Challenge community, attract computational science researchers, and promote new ways of thinking about solving scientific problems;
3) Definition and support for programming paradigms that are flexible, long lived, and vendor independent;
4) A distributed scientific computing and data handling / storage capability based on the expanded global Internet capability that has the potential to aggregate computational science facilities around the world into very large-scale distributed computation systems;
5) Integration with distributed scientific laboratory environments ("collaboratories" or remote access laboratories; see [Johnston 95a] and [Johnston 95b]) in order to explore the potential of closely coupling laboratory instrumentation and data collection with large-scale computational facilities;
6) An academic-industry-government collaboration that will incorporate useful computer science results into the scientific high performance computing environment on an on-going basis.
To reach these goals we propose an approach to high performance scientific computing (the niche previously occupied mainly by users of large vector supercomputers like the Cray C90) that is based on assembling hierarchical distributed systems from commercial symmetric multiprocessor (SMP) systems and network components. The arguments for this approach are:
• High performance computing can be achieved more easily and economically by utilizing a foundation of mass produced technology, both software and hardware;
• There is a great advantage to having the architecture of a large central facility (that provides a uniquely powerful computational resource) also be scalable incrementally, and be capable of being replicated in many different sizes in many different environments;
• Such an architecture lends itself to aggregating far-flung elements into even larger scale systems in the future;
• The parallel code development software issues for inter-cluster multiprocessor code development are roughly the same as for MPP systems: if they are solved for either architecture, they will be solved for both. The intra-cluster multiprocessor code development environment is driven by a large commercial market, and is already quite mature.
Our strategy for the evolution of scientific high performance computing is, in essence, based on the conclusion that 1) in the future the most cost effective computational capability will come from clusters of symmetric multiprocessors rather than MPPs, and 2) during the next two years the combination of intellectual support for algorithm redesign, plus an SMP hardware configuration that provides substantially more throughput than the current C90-like systems, will make SMPs the platform of choice even for legacy codes that have been optimized for the vector architectures over many years. This conclusion is supported in sections 2.0 through 6.0.
The Impetus of Technology and Economics
The rationale for the architectural approach of using commercial network components to interconnect commercial SMPs to form large-scale computing clusters is based on a set of observations about emerging directions in software and hardware computing systems and architectures.
"To realize the potential of NOWs, we need to move two MPP technologies into the workstation community: low latency networking and global system software that treats a collection of processors, memory, and disks as if they were a single machine. Our approach is to leverage off-the-shelf technology as much as possible: workstation hardware, standard workstation operating systems on each node, and local area network ATM switches. To this, we will add communications protocol software and a global system layer that together provide low overhead communication, a single view of operating system services across the cluster, parallel file I/O, and robustness to individual node failures."
One point of possible confusion related to the term "network of workstations" should be cleared up immediately. The approach described in this paper uses the architecture and potentially the software of the NOW work, but the individual nodes of the SMP cluster described here are no more "workstations" than an Indy race car is an automobile that one uses to commute to work. Each node of the hypothetical prototype cluster provides 3200 Mflops (in 12-16 CPUs), 4-6 Gigabytes of main memory, and 100-1000 Gigabytes of local disk. Each such node is itself a system with a list price of about $1.5M. The prototype implementation would consist of about 16 of the above nodes, interconnected with multiple very high speed networks, and would supply 50 Gflops of computing. The only thing this has in common with a "workstation" is some of the component technologies.
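As a rough check on these figures, the aggregate capacity of the hypothetical prototype can be tallied from the per-node numbers. The sketch below uses only the values quoted in this section; the names (`cluster_totals` and the constants) are illustrative, and the list price is the approximate figure given above rather than a vendor quote.

```python
# Per-node figures quoted in the text for the hypothetical prototype.
NODES = 16
MFLOPS_PER_NODE = 3200
MEMORY_GB_PER_NODE = (4, 6)       # low and high estimates
DISK_GB_PER_NODE = (100, 1000)    # low and high estimates
LIST_PRICE_PER_NODE_MUSD = 1.5    # "about $1.5M" per node


def cluster_totals(nodes=NODES):
    """Return aggregate capacity for a cluster of identical nodes."""
    return {
        "gflops": nodes * MFLOPS_PER_NODE / 1000.0,
        "memory_gb": tuple(nodes * m for m in MEMORY_GB_PER_NODE),
        "disk_gb": tuple(nodes * d for d in DISK_GB_PER_NODE),
        "list_price_musd": nodes * LIST_PRICE_PER_NODE_MUSD,
    }


totals = cluster_totals()
print(totals)
```

With 16 nodes the computed aggregate is 51.2 Gflops, which the text rounds to 50 Gflops.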
The validity of the basic economic assumption of the NOW approach is demonstrated, for example, in the results of the NAS parallel benchmark studies [Saini 95] that are presented in Figure 1 in a way that shows the increasing cost efficiency of commercial microprocessors for scientific calculations, and the high cost of custom integration of network components (the SPP1000, T3D, and SP-2).
Below we present a series of arguments to support our assertion that SMPs are not only capable of replacing C90-like systems and MPPs, but can do so economically, and in a much richer and faster growing software environment, while at the same time providing higher levels of both maximum performance for a wide mix of codes and higher throughput for a large number of users.
(The systems and number of processors are: Cray C90 ("C90-16"), Convex Exemplar SPP1000 ("SPP2000-64"), IBM SP2 (wide nodes) ("SP2-WN-54"), a Cray T3D ("T3D-128"), an IBM SP2 (thin nodes) ("SP2-TN-64"), Fujitsu VPP500 ("VPP500-51"), Silicon Graphics Power Challenge (75 MHz R8000s) ("SGI75MHz-16"), and a DEC 8400 (300 MHz EV-5 Alphas) ("DEC8400-8").)
3.0 The Computational Capacity of SMPs Relative to the Cray C90
Vector vs. Scalar Performance
The performance of modern SMPs relative to the C90 depends on the degree of vectorization of the code in question. However, several studies have shown that over a fairly diverse mix of codes the average performance ratio for the computation capacity of an SMP compared to a C90 falls between about 1:2.5 and 10:1 (that is, one SMP doing the work of 2.5 C90s at one end of the spectrum, and 10 SMPs doing the work of one C90 at the other end of the spectrum).
The data presented in Figure 2 is a composite from several different studies, and shows ratios of performance for several different benchmarks and several different CPU configurations. Case 1 shows one C90 CPU providing in the range of 2.2 to 5.6 times the performance of one SGI CPU. Case 2 shows a 16 CPU C90 system providing in the range of 3.6 to 6.0 times a 16 CPU SGI system using Linpack and the NAS parallel benchmarks. Case 3 shows the best case SGI comparison being 3:1 (2.7:1 for a parallel benchmark) in favor of the C90 (eighteen 90 MHz R8000s vs. sixteen C90 CPUs). These results reflect the degree of vectorization of the various codes. Applying this code-by-code comparison to the real world involves considering real workload mixes, and several organizations have done such studies. NCSA [Note 1] studies (comparing an older SGI to a Cray Y-MP) show that highly vectorized codes favor the C90 more than most benchmarks would indicate, and highly scalar codes favor the SMP more than benchmarks indicate.
Summarizing the NCSA study: Figure 4 shows a similar grouping of codes into scalar (all of which perform better on an SMP than on a C90), moderately vector (performance in the range of 20-90% of a C90), and probably highly vector codes (SMP performance is 10-20% of a C90). Note that the code achieving 5% of a C90 was completely untuned, and should probably be ignored; see Table 1 (Conversion Effort and Results for a Mix of Cray Fortran Codes). Overall, almost 90% of the test suite codes performed on the SGI SMP, on average, at 75% of the C90 performance.
The Technology Curve
The trends in the rate of microprocessor performance increase compared with rate of Supercomputer processor increase are shown in Figure 
Using an SMP Cluster as an MPP
In this section we argue that there are several key technology issues for SMP clusters vs. MPPs:
• Uniformity of memory
• I/O capacity
• Portability of code and environment
In addition to technology and usability, cost (in general) is always an issue, and in Section 7.0 of this paper we present a performance and financial model for using the SMP cluster approach to build a usable teraflop computer within five to six years.
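To give a feel for what the teraflop target implies, the following sketch computes the compound annual growth in aggregate capacity needed to go from the 50 Gflop initial configuration to 1000 Gflops in five or six years. It is simple arithmetic on the figures stated in this paper, not the financial model of Section 7.0; the function name is illustrative.

```python
def required_annual_growth(start_gflops, target_gflops, years):
    """Compound annual growth factor needed to go from start to target."""
    return (target_gflops / start_gflops) ** (1.0 / years)


# 50 Gflops initial configuration, 1 Tflop (1000 Gflops) target.
growth_5 = required_annual_growth(50.0, 1000.0, 5)
growth_6 = required_annual_growth(50.0, 1000.0, 6)

for years, factor in ((5, growth_5), (6, growth_6)):
    print(f"{years} years: capacity must grow by a factor of {factor:.2f} per year")
```

The required factor is roughly 1.8 per year over five years and roughly 1.65 per year over six, which would have to come from a combination of faster nodes and incremental expansion of the cluster.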
Beyond the economies of building on CPUs that come out of the research funded by mass produced technology, there are several other issues that need to be addressed in order to make practical the use of SMP clusters as a highly parallel system. The software issues are very similar for networks-of-workstations (NOWs) and MPPs. Much of the related software research started on MPPs is now being retargeted for NOWs.
Access to Memory
SMP clusters and MPPs have several similar issues with respect to memory: namely, mechanisms to provide a uniform view of memory and uniformly high speed access to memory. The research community is addressing these issues, originally from the point of view of MPPs, and now increasingly from the point of view of networks of workstation components. Both MPPs and NOWs currently support various message passing mechanisms to enable the distributed memory (many distinct address spaces) parallel computing model (e.g. PVM and MPI).
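The distributed memory model can be illustrated with a small simulation. The sketch below is not PVM or MPI code: it uses Python threads and queues to stand in for processes on separate nodes, and the names (`run_ranks`, `rank_main`, `inboxes`) are hypothetical. The point it illustrates is that each "rank" works only on its own private slice of the data and cooperates with the others solely by sending and receiving messages.

```python
import queue
import threading


def run_ranks(num_ranks, data):
    """Sum `data` across ranks that share nothing except message queues."""
    inboxes = [queue.Queue() for _ in range(num_ranks)]
    result = {}

    def rank_main(rank):
        # Each rank sees only its own slice (its private "address space").
        local = data[rank::num_ranks]
        partial = sum(local)
        if rank == 0:
            total = partial
            for _ in range(num_ranks - 1):
                total += inboxes[0].get()   # blocking receive
            result["total"] = total
        else:
            inboxes[0].put(partial)         # send partial sum to rank 0

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"]


total = run_ranks(4, list(range(100)))
print(total)
```

Note how the programmer must decide explicitly which rank owns which data and when partial results are exchanged; this is the architectural awareness that a global address space is meant to reduce.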
The hardware issues relate primarily to obtaining low latency, high bandwidth messaging between processors. While NOWs are at a slight disadvantage at this point in time for using general purpose networks for node interconnections, that disadvantage is rapidly disappearing due to the attention of the research community. (See, for example, the Active Messages discussion, below.)
The slight disadvantage of using general purpose networks is, however, offset by a tremendous advantage. Using a general purpose network for NOW node interconnection opens the internal communication paths to the outside world, providing a rich and very high bandwidth access to the cluster. MPPs, on the other hand, have traditionally suffered because inaccessible inter-node networks have forced the use of special purpose I/O nodes that are almost always bottlenecks.
Portable parallel programming paradigm
In this section we will comment on current software capabilities, and the future directions that will produce more easily used and higher performance NOW clusters. An insightful commentary on the future of software (with a section on several of the parallel system issues discussed here) has recently appeared in [Hill].
Uniform access to resources: Probably the primary issues for portable parallel programming are two problems shared by both SMP clusters and MPPs, namely how to provide uniform access to memory and secondary storage (disks). In the case of memory, this involves both global address space and "uniformly" high speed access. In the case of disks, intelligent caching and consistency of parallel I/O operations are probably the main issues. Neither of these is currently available in commercial SMP clusters or in most MPPs, but both are the norm in SMPs. Currently inter-cluster memory access is via message passing rather than global memory, which forces the programmer to be much more aware of the underlying architecture than should be the case.
"Split-C is a parallel extension of the C programming language that supports efficient access to a global address space on current distributed memory multiprocessors. It retains the 'small language' character of C and supports careful engineering and optimization of programs by providing a simple, predictable cost model. This is in stark contrast to languages that rely on extensive program transformation at compile time to obtain performance on parallel machines. Split-C programs do what the programmer specifies; the compiler takes care of addressing and communication, as well as code generation. Thus, the ability to exploit parallelism or locality is not limited by the compiler's recognition capability, nor is there need to second guess the compiler transformations while optimizing the program. The language provides a small set of global access primitives and simple parallel storage layout declarations. These seem to capture most of the useful elements of shared memory, message passing, and data parallel programming in a common, familiar context."
"Split-C is currently implemented on the Thinking Machines Corp. CM-5, the Intel Paragon, the IBM SP-2, and the Meiko CS-2, and is under development on the Cray T3D. [It has also been ported to NOWs.] All versions are built using the Free Software Foundation's GCC and the message passing systems available on each machine. Faster implementations are underway for the Meiko CS-2 using the Elan libraries and for networks of workstations using Active Messages. It has been used extensively as a teaching tool in parallel computing courses and hosts a wide variety of applications. Split-C may also be viewed as a compilation target for higher level parallel languages."
Much of the work in NOW started as research in how to more easily use MPP architectures, and much of the Berkeley NOW technology was originally designed and implemented on the nCUBE/2 and CM-5. It was ported to a Cray T3D, and now to workstation clusters. As indicated in the Software Environment section below, the GLUnix environment of the NOW project provides a rich set of operating system-like services, and the implementation strategy is to build GLUnix on top of standard, commercial operating systems [Anderson 94].
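The idea of a global address space layered over distributed memory, as in the Split-C quotation above, can be sketched in a few lines. The example below is a toy model in Python, not Split-C and not any NOW interface: the class `GlobalArray` and its methods are hypothetical, a block layout is assumed, and per-node Python lists stand in for node-local memory. It shows the address translation (global index to owning node and local offset) that such a layer performs on the programmer's behalf, and counts the accesses that would require network communication.

```python
class GlobalArray:
    """A block-distributed array addressed by a single global index."""

    def __init__(self, length, num_nodes):
        self.length = length
        self.num_nodes = num_nodes
        # Block size rounded up so that every element has exactly one owner.
        self.block = -(-length // num_nodes)
        # One private list per node stands in for that node's local memory.
        self.local = [[0] * self._local_len(n) for n in range(num_nodes)]
        self.remote_accesses = 0

    def _local_len(self, node):
        start = node * self.block
        return max(0, min(self.block, self.length - start))

    def owner(self, index):
        """Translate a global index into (owning node, local offset)."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        return divmod(index, self.block)

    def read(self, index, from_node):
        node, offset = self.owner(index)
        if node != from_node:
            self.remote_accesses += 1   # would be a network round trip
        return self.local[node][offset]

    def write(self, index, value, from_node):
        node, offset = self.owner(index)
        if node != from_node:
            self.remote_accesses += 1   # would be a network message
        self.local[node][offset] = value


ga = GlobalArray(length=10, num_nodes=4)
ga.write(7, 42, from_node=0)        # element 7 lives on another node
value = ga.read(7, from_node=2)     # node 2 owns it, so this is local
print(ga.owner(7), value, ga.remote_accesses)
```

Making the remote accesses countable reflects the "simple, predictable cost model" emphasized in the quotation: the programmer can still reason about which references are local and which cross the network.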
High level tools and abstractions:
Another important concept is that of layering capability abstractions in such a way that the underlying hardware systems are hidden, and that the layers are rich enough that as hardware becomes more capable it "floats" up to replace the software layers. This concept is being developed in ways that support both NOW / SMP clusters and MPPs in the U. C. Berkeley Castle project [Castle]:
"The Castle Project addresses the lack of software tools through a collaborative research program with four levels: scientific applications, parallel languages, parallel libraries, and low-level system support."
This concept is discussed in Section 5.0 of this paper.
The point is that capabilities are coming to both NOWs and MPPs that will greatly improve our ability to develop portable parallel code for which the parallelism of the problem is the driving force, rather than the capabilities provided by a particular underlying architecture.
Low latency, high bandwidth communication
When hardware does not support global memory, message passing with software-imposed cache consistency is necessary, and low latency, high bandwidth communication between processors is the key underlying technology requirement.
The "Active Message" communications technology of the NOW project [von Eicken 92] forms a foundation of high speed, low latency messaging, and at this point in time there are at least two compilers that use Active Messages to implement global addressing across distributed memory processors: Split-C from U. C. Berkeley and an HPF compiler done at the University of Maryland for the CM-5.
The Message Passing Interface (MPI) approach to providing architecture independent messaging is another example of technology that started on MPPs and is now moving to workstation clusters. MPI provides unicast, multicast, and asynchronous communication in the MPP environment, and is now implemented on several workstation architectures using Active Messages as the underlying mechanism.
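The central idea of Active Messages is that each message identifies a handler that the receiver runs when the message arrives, rather than buffering the data until a matching receive is posted. The toy sketch below illustrates only that dispatch idea in Python. It is not the NOW implementation or its interface; the `Node` class, the `handler` decorator, and the handler names are all hypothetical.

```python
HANDLERS = {}


def handler(fn):
    """Register a function so that a message can name it as its handler."""
    HANDLERS[fn.__name__] = fn
    return fn


class Node:
    def __init__(self, size):
        self.memory = [0] * size
        self.inbox = []

    def deliver(self, message):
        """Stand-in for the network placing a message at this node."""
        self.inbox.append(message)

    def poll(self):
        """Run the handler named in each pending message, in arrival order."""
        while self.inbox:
            name, args = self.inbox.pop(0)
            HANDLERS[name](self, *args)


@handler
def store(node, address, value):
    node.memory[address] = value


@handler
def accumulate(node, address, value):
    node.memory[address] += value


remote = Node(size=8)
remote.deliver(("store", (3, 10)))
remote.deliver(("accumulate", (3, 5)))
remote.poll()
print(remote.memory[3])
```

Because the handler moves the data directly into its destination, a layer such as a global address space or a message passing library can be built on top with little per-message overhead.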
Following the NOW philosophy of building on large-scale commercial bases, one obvious candidate for workstation or SMP interconnection is Asynchronous Transfer Mode (ATM) network technology. This is a flexible technology that, even apart from the very rapid development and deployment of ATM in LAN, MAN, and WAN networks, is also being used for desk area networks to provide a system component interconnect strategy [Hayter 91].
While the principal performance limiter of network communication is system memory bandwidth, the latency is a function of how "close" the network interface can be placed to the memory, and how to minimize the operations needed to get data from memory to network. Workstation memory bandwidth has been going steadily up for several years (though not nearly as fast as CPU speed) and in modern 64 bit data path systems like the DEC Alpha, the bandwidth is nearly sufficient to drive a 600 Mbit/s network from a single CPU.
"The U-Net communication architecture provides processes with a virtual view of a network device to enable user-level access to high-speed communication devices. The architecture, implemented on standard workstations using off-the-shelf ATM communication hardware, removes the kernel from the communication path, while still providing full protection. The model presented by U-Net allows for the construction of protocols at user level whose performance is only limited by the capabilities of network. The architecture is extremely flexible in the sense that traditional protocols like TCP and UDP, as well as novel abstractions like Active Messages can be implemented efficiently. A U-Net prototype on an 8-node ATM cluster of standard workstations achieves 15Mbytes/s TCP bandwidth with 1Kbyte buffers and demonstrates performance equivalent to Meiko CS-2 and TMC CM-5 supercomputers on a set of Split-C benchmarks."
And from [von Eicken 94]: " .... evaluates a prototype implementation of the low-latency Active Messages communication model on a Sun workstation cluster interconnected by an ATM network. Measurements show application-to-application latencies of about 20 microseconds for small messages which is roughly comparable to the Active Messages implementation on the Thinking Machines CM-5 multiprocessor."
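To see why both latency and bandwidth matter, consider a simple model in which the time to move a message is a fixed latency plus the message size divided by the peak bandwidth. The sketch below applies that model using the two figures quoted above (about 20 microseconds and 15 Mbytes/s). These come from different measurements on different systems, so combining them is purely illustrative; the model itself and the function names are assumptions, not something taken from the cited papers.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Time to move one message under a latency-plus-bandwidth model."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def effective_bandwidth(nbytes, latency_s, bandwidth_bytes_per_s):
    """Delivered bytes per second for a message of the given size."""
    return nbytes / transfer_time(nbytes, latency_s, bandwidth_bytes_per_s)


def half_power_size(latency_s, bandwidth_bytes_per_s):
    """Message size at which effective bandwidth is half of the peak."""
    return latency_s * bandwidth_bytes_per_s


LATENCY_S = 20e-6    # about 20 microseconds, from the quotation above
BANDWIDTH = 15e6     # 15 Mbytes/s, taken here as 15e6 bytes per second

n_half = half_power_size(LATENCY_S, BANDWIDTH)
print("half-power message size (bytes):", n_half)
for size in (64, 1024, 65536):
    rate = effective_bandwidth(size, LATENCY_S, BANDWIDTH)
    print(size, "bytes ->", round(rate / 1e6, 2), "Mbytes/s")
```

Under these assumptions a message must be about 300 bytes long before it achieves even half of the peak bandwidth, which is why low latency is essential for the fine-grained communication typical of global address space programs.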
As evidence of the economies-of-scale point of view, within a year of their introduction ATM network adaptors cost about half of what FDDI costs, and the cost of ATM switches has fallen about 20% per year since their introduction. The ATM market is growing much, much faster than other high speed network technologies, and this provides all of the factors needed to bring the cost down rapidly.
The point of this discussion is not to prove that one can go out today and buy a very high speed NOW for very little cost, but rather that the technology issues are close enough to resolution that this is the time to make the change in direction so that we will be positioned for this environment in the next few years.
SMP-based NOWs vs. MPP
Our suggested approach is not to introduce a new paradigm for parallel computing, but rather to use the enormous research investment of the commercial workstation vendors to produce very large, affordable, and scalable parallel systems by building on technology that is already being developed in the parallel computing arena. The U. C. Berkeley NOW project [NOW] provides one of several examples of this approach.
The approach is not technologically radical. The Intel Paragon, the CM-5, and the T3D are typical examples of distributed memory MPPs: they represent one end of an implementation spectrum characterized by its use of special purpose networks to interconnect microprocessors. The other end of the spectrum is represented by a collection of workstations interconnected by a general purpose network. In the middle are systems like the IBM SP2 (which uses a "custom" but not special purpose network interconnecting workstation mother boards), the Convex Exemplar (multiple high speed SCI nets connecting low CPU-count SMPs), etc.
One advantage of using a large number of SMPs as the nodes of this architecture, instead of a small number of MPPs, is that it avoids the I/O bottleneck problem. Using a general purpose, external network for most of the CPU interconnect provides many access points where data can be gotten into and out of the cluster. In an MPP it is generally not possible to access the internal network, so specialized processors are used to gateway data into and out of the internal network. Because these gateway processors are generally few in number, they create an I/O bottleneck.
The second advantage is that using general purpose networks operating through standard adaptors reduces integration costs and time (greatly increased communications requirements in the commercial world are driving computer manufacturers to design faster standard interconnect strategies).
Finally, the use of large, symmetric multiprocessor systems (SMPs) as the nodes of a NOW cluster has two (competing) advantages at the current state of technology. Intra-SMP memory access is uniform and very fast (resulting in the so-called "domain decomposition" advantage over MPPs). On the other hand, access to the inter-node network vastly improves the I/O capability of the overall system by removing the I/O node bottleneck of some MPP designs. (The "competing" aspect is that the more processors and memory in a single SMP, the fewer places you have to get data in and out of the system.)
Summary
The reasons for advocating NOW-like architecture over what is called MPP are several fold.
First, as noted above, MPPs and NOWs are almost identical architecturally, and indistinguishable in the case of systems like the SP-2. Most of the significant parallel environment issues have to be solved for both MPP and NOW, and in many cases in the same way. The principal architectural difference is that MPPs typically use customized network switches and topologies. This has the advantage of high speed, low latency, non-blocking inter-CPU communication. The disadvantage is the custom, non-scalable nature of the network, resulting in difficulty in getting high performance access to the outside world and in scaling the implementation.
There are two advantages gained from using external, general purpose networks for CPU (or SMP node) interconnects. The first is the aforementioned ability to provide access to many possible I/O ports when the network is accessible. The second is that building the intra-cluster connection from general purpose networks will permit growing the system incrementally and at low cost, and will also permit construction of practical and low cost small-scale implementations built by individual researchers so that they can replicate the architecture and software environments at their home institutions and in scientific laboratories.
Second is the argument that the cost and time of the engineering required to build the special purpose interconnect networks of the MPPs, and then integrate them with the processors, both drive the cost up and result in MPPs being one or more generations behind in their processor technology.
Software Environment
While the memory and CPU hardware, together with low-level communications functionality, enable high performance computing, languages and system services determine the usability of a computing environment. In this section we present a model for the minimum acceptable software environment, and again argue that the states of MPPs and NOWs are comparable, with NOWs having the advantage by being the focus of the research community. We cite as examples some of the directions of the research community and the environment supported by one commercial SMP cluster supplier.
The software environment of parallel systems requires several capabilities that are important for ease of code development, portability, job management, and, in several cases, performance. This functionality includes transparent use of all available processing power, memory, and disk and network bandwidth in the cluster. To provide this functionality a collection of systems services are .) In the context of the thesis being argued here, the presence of these functions in all of the common SMP environments serves to strengthen the argument supporting the general utility of SMPs.
The system services (1-6) are present in "global layer Unix" [GLUnix] by design, and the remaining services are being built on top of these basic functions. The goal of projects like GLUnix is to define a layered set of interfaces that can be provided on top of current operating systems, to present a standard "virtual multiprocessor" abstraction that supports (independent of the underlying physical architecture) the high level languages, libraries, and services needed by computational scientists [Castle]. Further, these services are intended to be the "natural" ones that future commercial operating systems and hardware will provide directly. This approach permits higher level services, languages, and libraries to be designed and built today in a way that they will remain useful in the future. In the case of clusters of SMPs, the approach of GLUnix is to layer what appears to be a standard set of operating system services on top of the available commercial operating systems to provide these services uniformly across the cluster. This allows each vendor to optimize the OS for the peculiarities of their hardware. All of the user-level functionality (points 8-15, above) is present in various MPP environments in varying degrees.
As an indication of the viability of currently available commercial SMP clusters, we note that all of the functions, 1-15 above, except global memory addressing, are supported on the SGI POWER CHALLENGEarray [SGI-1].
Case Study of Code Conversion, Performance, and Production Throughput [Note 2]
The results in this section strengthen the argument that conventional SMPs are easily capable of providing the maturity of software environment and system throughput necessary to support a scientific supercomputing workload. The basic results are that code conversion and tuning (from a C90 to an SGI system) is not nearly as difficult as many users believe, and that the production job mix throughput enabled by the large memory and fast system interconnect architecture of the SGI Power Challenge is very good.
This section presents the results of converting and tuning 24 codes from a C90 to an SGI Power Challenge SMP. The codes were chosen to reflect a typical workload for one C90 site. Virtually all of the benchmarks were initially optimized for the Cray C90 architecture. The overall benchmark was intended to determine the throughput of a large job-mix production environment. The initial effort was focussed on obtaining single job performance statistics. This was followed by tests to determine multiple job throughput timings. On average, approximately one person-week was spent per code to improve performance on the SGI Power Challenge architecture. Some codes required as much as three person-weeks. Some required no changes at all. Many of these changes resulted in several fold speedups in the codes under test, and there is still significant room for further benchmark performance improvement. Table 1 shows the type of code and its size, the effort to port the source and obtain correct numerical results, and then the resulting, un-tuned performance. Next is the time spent tuning the code, the tuned running time and the speedup factor over un-tuned code. The last column shows the ratio of the tuned performance compared to the run time on a C90.
The single job performance (no other jobs running on the system) indicated that the SGI offers a substantial price/performance benefit. The average performance for this set of benchmarks shows the R8000 to be 70% of the C90 on a single CPU to single CPU basis. (The price difference is more than a factor of 20 between the two CPUs.) For about 30% of the benchmarks, the R8000 exceeds the performance of the C90. This reflects the large scalar fraction that a typical high performance scientific application workload contains, and is evidence that the strong superscalar features of the R8000 provide good performance for these kinds of scientific applications.
The tuning was followed by a series of large throughput tests on a number of SGI Power Challenge SMP systems. From a list of about 30 benchmarks, mixes of multiple combinations of the various benchmarks were selected and run (including compile and link times for generating the executables).
Wall clock time from the start of the run to the end of the run was measured. The mix of runs was chosen to demonstrate throughput performance in a heavily loaded computing center.
The code mix used for the throughput tests represents real codes with large memory, CPU, and I/O requirements. Four of the codes used an average of 3 GBytes of memory each, and the rest averaged about 50-100 MBytes of memory. Many codes required multiple reads and writes to files in excess of 200 MBytes in size. Some codes wrote files over 2 GBytes in size. Some wrote files repeatedly for total I/O counts in excess of 6 GBytes. The throughput testing involved running different mixes of about 50 codes simultaneously on systems that had sufficient memory that no paging or swapping to disk was done. In no case did the overall performance of the system degrade by more than 7% over the timing that would have been obtained on an ideal system with no code interference. This is in spite of tremendous activity on the system bus. In general, the measured throughput result is within 2% of the single code performance, and in one or two cases exceeds the single code times. (These latter results can be attributed to I/O freeing up idle processors as well as parallelized code operating in scalar sections with idle processors.)
These results clearly demonstrate that: i) Code conversion from the C90 environment to the SGI SMP environment is not likely to be an intractable problem; ii) the performance of individual codes will probably be acceptable, and; iii) the performance of real codes running in real production environments (many codes running simultaneously) will almost certainly provide much better turnaround in the proposed multiple large SMP environment compared to a heavily loaded C90 environment. 
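The single-CPU price/performance comparison and the throughput-degradation bound reported above reduce to simple ratios. The following Python sketch works through them. The function names are illustrative, the factor of 20 is the lower bound quoted in the text, and the wall-clock timings in the second example are hypothetical values chosen only to show the calculation; they are not measurements from the benchmark.

```python
def price_performance_advantage(relative_performance, price_ratio):
    """Performance per dollar of the cheaper CPU relative to the costlier one.

    relative_performance: cheaper CPU's speed as a fraction of the costlier CPU's.
    price_ratio: how many times more the costlier CPU costs.
    """
    return relative_performance * price_ratio


def throughput_degradation(measured_seconds, ideal_seconds):
    """Fractional slowdown of a job mix relative to an interference-free run."""
    return (measured_seconds - ideal_seconds) / ideal_seconds


# From the text: the R8000 averages 70% of C90 single-CPU performance, and
# the price difference is "more than a factor of 20" (so 20 is a lower bound).
advantage = price_performance_advantage(0.70, 20)
print(f"R8000 price/performance advantage: at least {advantage:.0f}x")

# Hypothetical timings (NOT from the paper) that sit at the 7% worst case.
ideal = 10_000.0     # seconds expected with no interference between codes
measured = 10_700.0  # seconds of wall clock observed for the loaded mix
slowdown = throughput_degradation(measured, ideal)
print(f"Degradation relative to the ideal system: {slowdown:.1%}")
```

Expressing degradation as a fraction of the ideal time makes the 7% worst case and the 2% typical case directly comparable across job mixes of different lengths.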
Case Study of Software Availability [Note 3]
Another important issue for a production environment is the availability of commercially supported math, science, and engineering software. This section addresses this issue by evaluating the availability on SMPs of the commercially supported codes running on Crays at the Department of Energy's National Energy Research Supercomputer Center (NERSC).
Experience of other organizations and users who have already migrated to the new generation of supercomputer hardware and software (such as CERN, NCSA, and Cal Tech) suggests that most users and applications will be able to migrate to the new capacity environment with minimal change and disruption. The keys to success are increased throughput capacity, an integrated software environment, and good staff support.
Users who use standard packages (e.g., Gaussian or ANSYS) will be able to run most of those packages (sometimes with more recent versions) on the new architecture. Those whose programs use standard libraries will also find most of those same libraries on the new systems, along with translators to help automate code conversion from the C90 to the new environment. Users who have already migrated some of their codes to run experimentally on systems like the Cray T3D and SGI POWER CHALLENGEarray should likewise be able to easily retarget PVM or MPI to an SMP. For example, Table 2 shows a partial list of NERSC software that has already been optimized for the DEC Alpha and the Silicon Graphics' Power Challenge SMPs.
Risk Analysis for SMP Clusters vs. MPPs
The "risks" of adopting the multiple SMP approach rather than the more standard MPP approach are a matter of degree, and of small degree at that.
As described below, the prototype implementation strategy for a high performance NOW cluster that will provide a major computing resource involves an initial configuration on the order of 16-20 SMPs, each with 12-16 CPUs. Each of the SMP nodes will provide about 3000 Mflops of computing capacity in an advanced systems environment that includes almost all of the desired software characteristics described above. (See Table 3.)
Even with none of the NOW technology to homogenize the nodes, SMP clusters are still a very powerful and advanced parallel environment that provides high bandwidth, low latency inter-node communication through existing systems like MPI, PVM, Express, etc.
Existence Proof: A Commercial SMP Cluster: The following example should provide further evidence of the low risk of the multiple SMP strategy.
In an experiment that was done jointly by SGI, the University of Minnesota, and the Army High Performance Computing Research Center, the goal was to solve a fluid turbulence problem on a 1024×1024×1024 mesh (10^9 computational zones). The computation was done on an SGI cluster consisting of 16 nodes, each with 20 R-4400 CPUs, and a total of 28 Gbytes memory and 192 Gbytes of fast, local disk. Twenty FDDI rings were used to organize the 16 systems into a 3D torus interconnect geometry. This configuration solved the largest CFD problem done on any computer architecture at that time (1993), and achieved (using CPUs two generations older than today's) 15 Gflops peak and 5 Gflops sustained computing performance.
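The scale of this experiment can be made concrete with a few lines of arithmetic. The sketch below uses only the figures quoted above; it assumes that the memory and disk figures are cluster totals and that 1 Gbyte means 2**30 bytes, neither of which the text states explicitly.

```python
# Figures quoted in the text for the 1993 turbulence experiment.
nodes = 16
cpus_per_node = 20
peak_gflops = 15.0
sustained_gflops = 5.0
memory_gbytes = 28    # assumed to be the cluster total
disk_gbytes = 192     # assumed to be the cluster total

GBYTE = 2 ** 30       # assumption: binary gigabytes

total_cpus = nodes * cpus_per_node
zones = 1024 ** 3     # the 1024 x 1024 x 1024 mesh

sustained_fraction = sustained_gflops / peak_gflops
peak_mflops_per_cpu = peak_gflops * 1000.0 / total_cpus
memory_bytes_per_zone = memory_gbytes * GBYTE / zones
disk_bytes_per_zone = disk_gbytes * GBYTE / zones

print(f"CPUs in the cluster:        {total_cpus}")
print(f"Computational zones:        {zones:,}")
print(f"Sustained / peak:           {sustained_fraction:.0%}")
print(f"Peak Mflops per CPU:        {peak_mflops_per_cpu:.1f}")
print(f"Memory bytes per zone:      {memory_bytes_per_zone:.0f}")
print(f"Local disk bytes per zone:  {disk_bytes_per_zone:.0f}")
```

Under these assumptions the sustained rate is one third of peak, and main memory amounts to only a few tens of bytes per zone, which gives a sense of why the fast local disk is listed alongside memory in the configuration.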
In November of 1994, SGI announced the "POWER CHALLENGEarray" as a product. This commercial SMP cluster, based on the newest processor technology (90 MHz R-8000s) can provide 52 Gflops of peak computing performance, supported by more than 100 Gbytes of memory and 60 Tbytes of fast disk. 
An Implementation Strategy
The prototype strategy for implementing a scalable, large-scale scientific computing facility is to build NOW clusters from currently available commercial SMP systems, and to migrate the programming environment toward compiler-based support of a global view of distributed memory (e.g. the UCB Split-C and U. Md. HPF approaches).
The guiding principles for the implementation of the architecture are that it:
1) Be scalable in both computing and I/O capacity over the local and wide area;
2) Be capable of interconnection with systems at various scientific sites to form very large distributed systems, and to interface with distributed scientific laboratory systems;
3) Provide a "file" system that is scalable in capacity, geographic scope, and performance;
4) Provide a good operational environment for both production and Grand Challenge computational science;
5) Accommodate incorporating academic computer science research into early prototype production systems;
6) Provide continuous functional and performance evolution in an environment that supports on-going computational science research and production.
Base the hardware architecture on a cluster of SMPs
Very high speed systems can be constructed by hierarchical-parallel aggregation of workstation-like components into a single, logical system. In addition to the commercial offerings, a significant amount of research is aimed at this goal. For example:
• UCB's NOW project (uniform view of memory, cluster-wide system services) [NOW]
• U. Maryland's CHAOS project (compilers to provide automatic parallelization across distributed memory systems) [CHAOS]
• U. Toronto's Hurricane/Hector project (operating systems designed for hierarchical clusters) [Hurricane]
• MAGIC testbed distributed-parallel storage system (high performance parallel I/O) [DPSS]
• Univ. of Illinois's Portable Parallel File System [PPFS]
The commercial clusters that form the core of the prototype facility would be built from SMP nodes, each of which consists of from 4 to 32 CPUs, has high network I/O capacity to support both intra- and inter-cluster communication, has local disks to support the OS, swap, checkpointing, etc., and has cluster-based disks to provide the large capacity working storage needed by executing programs. As noted above, these clusters provide all of the required functionality except a global address space (which is the target of several of the current research projects).
Using large SMPs as nodes of a cluster will guarantee that the environment will be useful from the start, even without the technology to provide global memory addressing.
The two to four year goal is to construct a scalable and tightly coupled cluster of SMP nodes interconnected by multiple high bandwidth ATM switches. The target is that a single cluster will provide 100 Gflops of computing, 100 Gbytes of memory, 500 Gbytes of "local" disk per node, and a fully meshed, 600 Mb/s intra-cluster interconnect. Such a system is illustrated in Figure 6.
In the illustrated architecture, the cluster will communicate with the outside world via multiple ATM interfaces, and with large-scale, network-distributed tertiary storage in the same way. Within the cluster nodes, multiple CPUs are structured in shared memory systems. The key parameter for a NOW / cluster implementation is the balance between the number of processors per node that share memory over internal busses / switches and the number of nodes that interconnect via high speed, external network switches to form clusters. The spectrum runs from the IBM SP-2, which is effectively 1 CPU per node, to the 4-12 CPU DEC Alpha cluster, to the 16-18 CPU nodes of the SGI Power Challenge. Some of the variations are compared in Table 4, which uses the N=1000 Linpack benchmark as the metric for comparison.
While the architecture of Figure 6 is the objective, what can be purchased today is a configuration like that of the SGI POWER CHALLENGEarray [Figure 6]. An actual configuration would, of course, be determined by competitive bidding and negotiation with vendors for the most favorable cost and future upgrade conditions.
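The trade-off described above, between CPUs that share memory inside a node and nodes that must communicate over external switches, can be illustrated with a short Python sketch. The target CPU count and the function names are hypothetical and chosen only for illustration; the count of node pairs is used as a rough proxy for the demand placed on the external interconnect, not as a model of any particular ATM switch configuration.

```python
import math


def nodes_needed(total_cpus, cpus_per_node):
    """Number of nodes required to reach a target CPU count."""
    return math.ceil(total_cpus / cpus_per_node)


def node_pairs(nodes):
    """Distinct node pairs that a fully meshed interconnect must serve."""
    return nodes * (nodes - 1) // 2


TARGET_CPUS = 288  # hypothetical target, not a figure from the paper

# CPUs per node spanning the spectrum named in the text.
for label, cpus_per_node in [
    ("1 CPU per node", 1),
    ("12 CPUs per node", 12),
    ("18 CPUs per node", 18),
]:
    n = nodes_needed(TARGET_CPUS, cpus_per_node)
    print(f"{label:>17}: {n:4d} nodes, {node_pairs(n):6d} node pairs")
```

Because the number of node pairs grows roughly with the square of the node count, placing more CPUs behind each shared memory sharply reduces the amount of traffic that has to cross the external network for a fixed total CPU count.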
An Implementation Model
In order to present a plan that we know can be implemented using technology available today, we will present a specific scenario based on the SGI POWER CHALLENGEarray SMP cluster. This specific example should in no way prejudice consideration of similar architectures from other vendors; rather, this concrete example is, again, used as an existence proof.
The model for building up an SMP cluster environment to reach a teraflop in 5-6 years is based on the following assumptions and assertions:
1) Funding profile as indicated in Table 5 ($5M, $7M, and $10M for the first three years, followed by eight years at $10M/yr + 3% inflation);
2) A configuration and cost (including currently available discounts) as follows:
- Each "node" of the cluster is an SGI Power Challenge SMP with 16, 90 MHz, R8000 processors, 4 GBytes memory, 500 GBytes disk local to the node
- Each such node delivers a peak rate of 3.24 GFlops and costs about $1.13M (assuming typical public institution / education discounts)
3) The projection model (Table 5) is formulated as follows:
- "funding": Funds available are used for leasing in three year payout units
- "purchase pwr.": The annual average funding is asserted to buy 2.5 times that amount in discounted hardware over three years (based on leasing)
- "salvage": We assume that the equipment will be sold (or otherwise salvaged) for 1/2 the purchase cost (about 1/3 of the list cost) after three years
- "num. nodes": The number of SMP nodes that can be purchased with "purchase pwr."
- "CPU factor": a multiplicative factor reflecting the processing power growth rate (see Figure 5)
- "peak GFlop": The aggregate computing power of the cluster constructed with "num. nodes"
- "Cray 1S eq (scalar)": the scalar Cray 1S metric for "peak GFlop"
- "Cray 1S eq (vector)": the Cray 1S metric for "peak GFlop" deflated to account for the vector-scalar code mix (see Figure 3)
4) We assert that the current rate of progress in network and network I/O technology will enable effective interconnection of clusters and larger numbers of nodes in a time frame needed to support the capacity projection.
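The projection model above can be sketched in a few lines of Python. This is one possible reading of the model, not a reproduction of Table 5: it assumes that "purchase pwr." for a three-year lease window is 2.5 times the average annual funding over that window, that the 3% inflation compounds starting with the first of the eight later years, and that node cost and node peak rate stay fixed apart from the "CPU factor". All function and constant names are illustrative.

```python
import math

NODE_COST_MUSD = 1.13    # from the text: discounted cost of one node, in $M
NODE_PEAK_GFLOPS = 3.24  # from the text: peak rate of one node
LEASE_MULTIPLIER = 2.5   # from the text: hardware bought per unit of average funding
SALVAGE_FRACTION = 0.5   # from the text: resale at 1/2 the purchase cost


def funding_profile(ramp=(5.0, 7.0, 10.0), steady_years=8, base=10.0, inflation=0.03):
    """Annual funding in $M: a three-year ramp, then base funding plus inflation.

    Assumption: inflation compounds from the first post-ramp year onward.
    """
    profile = list(ramp)
    for k in range(1, steady_years + 1):
        profile.append(base * (1.0 + inflation) ** k)
    return profile


def project_window(window_funding, cpu_factor=1.0):
    """Project cluster size and peak rate for one three-year lease window."""
    average = sum(window_funding) / len(window_funding)
    purchase_power = LEASE_MULTIPLIER * average  # assumed reading of "purchase pwr."
    num_nodes = math.floor(purchase_power / NODE_COST_MUSD)
    return {
        "purchase_power_musd": purchase_power,
        "num_nodes": num_nodes,
        "peak_gflops": num_nodes * NODE_PEAK_GFLOPS * cpu_factor,
        "salvage_musd": SALVAGE_FRACTION * num_nodes * NODE_COST_MUSD,
    }


profile = funding_profile()
first_window = project_window(profile[:3])
for key, value in first_window.items():
    print(f"{key:>20}: {value:.2f}")
```

Under this reading the first lease window buys 16 nodes, which falls within the 16-20 node initial configuration described earlier. That agreement is a consistency check on the assumptions rather than a confirmation of them.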
This scenario gives rise to the prediction of Based on workload studies at NCSA (noted above) and similar studies in progress on the NERSC environment [Note 6], we assume a 30/70 job mix. That is, 30% of the users use 70% of the available CPU resources (call these Grand Challenge users "G"), and the other 70% of users use 30% of the CPU resources (call these medium-scale users "M"). Under these circumstances (and assuming 95% availability of the computing facility) we arrive at the following for the first year of operation of the cluster implementation described in the model above:
Available Cray 1S CPU hours x 1000 ("kC1") in the first year (from Table 5, Cray 1S, scalar equivalent): 3970
70% of kC1 to class "G": 2779
30% of kC1 to class "M": 1191
A Cray C90 supplies about 820 kC1hrs/yr. This gives a net increase of about 3000 kC1 for the first year.
It is difficult to predict the impact of highly vectorized code on the above numbers, but if we pessimistically deflate the 3970 kC1 number to 1/4 of its value for the entire job mix (about 990 kC1), this still provides about the same capacity as a 16-processor C90 in the first year.
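The first-year capacity figures above follow from a few lines of arithmetic, sketched below in Python. The input numbers are taken from the text. Two points are assumptions: a year is taken as 24 x 365 hours, and the pessimistic deflation is read as a reduction to one quarter of the scalar figure, which is the reading under which the result is comparable to the 820 kC1 supplied by a C90.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year
AVAILABILITY = 0.95        # from the text

first_year_kc1 = 3970      # thousand Cray 1S CPU hours, from the text
c90_kc1 = 820              # thousand Cray 1S CPU hours per year, from the text

# 30/70 job mix: class "G" gets 70% of the hours, class "M" gets 30%.
class_g_kc1 = first_year_kc1 * 70 // 100
class_m_kc1 = first_year_kc1 * 30 // 100

# Net gain over a single C90 (the text rounds this to "about 3000").
net_increase_kc1 = first_year_kc1 - c90_kc1

# Pessimistic case: assumed reading is a reduction to 1/4 of the scalar figure.
deflated_kc1 = first_year_kc1 / 4
deflated_vs_c90 = deflated_kc1 / c90_kc1

# Number of Cray 1S equivalents implied by the first-year hours.
implied_cray1s_equivalents = first_year_kc1 * 1000 / (HOURS_PER_YEAR * AVAILABILITY)

print(f"Class G: {class_g_kc1} kC1, class M: {class_m_kc1} kC1")
print(f"Net increase over one C90: {net_increase_kc1} kC1")
print(f"Deflated capacity: {deflated_kc1:.1f} kC1 ({deflated_vs_c90:.2f} x C90)")
print(f"Implied Cray 1S equivalents: {implied_cray1s_equivalents:.0f}")
```

The integer arithmetic for the two job classes reproduces the 2779 and 1191 kC1 figures exactly, and the deflated figure comes out somewhat above one C90, consistent with the statement that the cluster matches a C90 even in the pessimistic case.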
