Introduction
The Department of Energy selected Oak Ridge National Laboratory (ORNL) as one of its high performance computing centers as part of the government's High Performance Computing and Communications (HPCC) initiative. The initiative provided ORNL with funds to procure a massively parallel computer and to support various Grand Challenge applications. ORNL selected Intel to provide the massively parallel computer for the HPCC project. The agreement with Intel specified the staging of increasingly powerful versions of its new Paragon multiprocessor. As part of the agreement, ORNL would receive pre-production models of the Paragon and assist in beta testing and product development.
This report summarizes our early experiences with the Intel Paragon. Our evaluation and testing of the Paragon involved testing end-user UNIX services (editors, compilers, file management, etc.), system administration services (account management, partition management, batch queuing support, network services, backup/restore, etc.), and porting various parallel applications onto the new software platform. The initial testing was done with test suites and small parallel applications that ran on Intel's iPSC/860 and Delta multiprocessors. As soon as the hardware and software had stabilized, the Grand Challenge applications were ported to the Paragon. Bugs and problems were reported to on-site Intel staff, and the Intel design team consulted with ORNL in setting design directions and priorities for the evolving Paragon system.
This report also provides initial performance characteristics of the Paragon. Computational and communication performance were measured with synthetic benchmarks, application kernels, and a few parallel applications. The Paragon's performance is compared with the performance of other currently available parallel processors.
In the following section, the Paragon architecture is summarized, and the configuration of the ORNL Paragons is detailed. Section 3 describes the Paragon operating systems. In section 4, the performance of the i860XP is compared with the i860 processor. The message-passing performance of the Paragon mesh is reported in section 5, and the preliminary performance of the Paragon's file system and local area network interfaces is analyzed in section 6. Parallel application performance of the Paragon is examined in section 7, and section 8 summarizes our initial experiences with the Paragon.
Paragon architecture and configuration
The Intel Paragon system is a mesh-connected parallel processor. Each Paragon node consists of two 50 MHz i860XP processors, memory, and communication hardware (Figure 2.1). One processor is used for computation, and the second processor is for communication. (The communication processor became operational in May 1994.) The bus interconnecting the processors and memory operates at 400 MB/second. Each compute node is presently configured with 32 million bytes of memory. The initial configuration had only 16 million bytes of memory, but that proved inadequate. The nodes are logically subdivided into service nodes, compute nodes, and I/O nodes. The service nodes appear as a single host and support time-sharing through the OSF operating system. The compute nodes also run OSF. The I/O nodes are connected to local networks and arrays of disks (RAID) and provide a UNIX file system, swap/paging space, and a Parallel File System (PFS). Since the service nodes are used for time-sharing and loading the compute nodes, a "host" is not required as in the earlier Intel architectures (Delta and the iPSC series).
The nodes are interconnected by a mesh. The speed of a single mesh channel was designed to be 200 MB/second, but the delivered Paragons provided a maximum of only 175 MB/second. Per-hop delay through the mesh is only 40 nanoseconds. Based on analytical studies and simulations, Intel chose the mesh architecture because it provides the most efficient use of available wires. Given the same number of wires, a mesh will outperform any hypercube, toroidal, or tree-structured network for uniformly distributed communications traffic [12].
Two Paragons were delivered to ORNL in September, 1992. A 66-node system with 14 Gigabytes of disk was provided under a Cooperative Research and Development Agreement (CRADA) and was used primarily as the program development machine. A 512-node system with 150 Gigabytes of disk was the interim production machine, eventually to be replaced with a 2048-processor machine. The i860XP's were running at only 40 MHz in these initial machines. Each Paragon was connected to the local Ethernet, and later each was also attached to HiPPI.
For comparison, the following sections include performance data from the Intel Delta and Intel iPSC/860. The Delta is the one-of-a-kind predecessor to the Paragon. The Delta is a mesh-based multiprocessor based on the 40 MHz i860 processor and the NX node operating system. The peak bandwidth of a channel in the Delta mesh is 22 MB/second. The iPSC/860 is a hypercube multiprocessor based on the same processor and OS as the Delta. The peak bandwidth of one of the hypercube's channels is 2.8 MB/second. Both the Delta and iPSC/860 support a parallel file system (CFS) similar to the Paragon's PFS. The iPSC/860 and Delta configurations and details of the benchmarks are described in [5] and [7].
The Paragon is Intel's first production-oriented mesh multiprocessor. The Intel iPSC series were all based on a hypercube topology. The mesh has some potential advantages over a hypercube topology. Though both topologies are extensible, in practice, commercial hypercubes have a fixed maximum dimension. For example, the largest iPSC/860 is seven dimensions or 128 processors. Hypercubes must be expanded in powers of two, which is often prohibitively expensive. Meshes can be expanded at linear costs by adding an additional row or column. Of course, the hypercube topology has advantages as well. The maximum distance between two processors in an $n$ processor system is only $\log_2 n$ for a hypercube, compared with $\sqrt{n}$ for the mesh. The lower connectivity of the mesh may lead to communication "hot spots" in the mesh or to slower aggregate communication operations such as barriers. Our tests and analyses in the following sections will attempt to identify the strengths and weaknesses of the Paragon's mesh topology.
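The distance comparison above can be checked with a short calculation. The following is a minimal sketch, not code from the evaluation; the function names are illustrative. It uses the exact worst-case hop count of a mesh without wraparound links, which for a square mesh is $2(\sqrt{n} - 1)$ and so grows in proportion to $\sqrt{n}$.

```python
def hypercube_diameter(n_nodes: int) -> int:
    """Maximum hop count between two nodes of a hypercube (n_nodes = 2**d)."""
    if n_nodes < 1 or n_nodes & (n_nodes - 1):
        raise ValueError("hypercube size must be a power of two")
    return n_nodes.bit_length() - 1  # d = log2(n_nodes)


def mesh_diameter(rows: int, cols: int) -> int:
    """Maximum hop count between two nodes of a rows x cols mesh
    without wraparound links (corner to opposite corner)."""
    if rows < 1 or cols < 1:
        raise ValueError("mesh dimensions must be positive")
    return (rows - 1) + (cols - 1)


# A 128-node hypercube (the largest iPSC/860) versus a 16 x 32 mesh.
cube_hops = hypercube_diameter(128)
mesh_hops = mesh_diameter(16, 32)
```

For the 128-node hypercube the worst case is 7 hops, while a 16 x 32 mesh has a worst case of 46 hops.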
OSF and SUNMOS
The Paragon operating system support differs from that of both the Delta and the iPSC series of hypercubes. For the older Intel parallel processors, compilers and editors were provided on a small host processor or by cross compilers on the user's workstation. For the Delta and the Intel hypercubes, a small kernel OS (NX) on the nodes provided message passing, memory management, and a UNIX I/O library. For the Paragon, time-sharing services are provided by OSF running on a set of mesh nodes (service nodes). In the later releases of the Paragon OSF, the service nodes provided some limited parallel processing of user services, so that different users would likely be running on different service nodes. Our beta testing included exercising the user services (editors, file system, accounting, compilers, linker, etc.) on the OSF service nodes. The OSF services worked, though initially performance was slow. Performance has improved with each software release, but overall UNIX performance as measured by a set of UNIX benchmarks is still slow in comparison with current workstations.
Instead of a tiny kernel like NX on the nodes, the Paragon provides the OSF micro-kernel on each node. OSF on each node provides a more comprehensive set of services to the node programmer, but at a cost of memory and some additional overhead. The OSF kernel provides virtual memory, permitting larger node programs to be run on the nodes than might fit in physical memory. However, the paging of memory to the I/O nodes has proven a bottleneck to date, and the benefits of virtual memory have diminished. The OSF kernel on the compute nodes provides OS services through the OSF interprocess communications facility (NORMA IPC), which in turn sits on top of inter-node message passing services. The present implementation of the OSF IPC has limited the performance of file I/O and network I/O.
The software overhead of OSF and the inability to use the message coprocessor initially prevented parallel applications on the Paragon from matching the performance of its predecessor, the Delta. To provide an alternative node operating system, Sandia National Laboratories and the University of New Mexico developed a small (256K byte) compute node kernel called SUNMOS [14]. SUNMOS runs in the compute partition, supporting the same message-passing primitives as OSF and NX. OSF is still used on the service nodes. SUNMOS does not provide virtual memory, and its I/O support is not fully developed, but SUNMOS provides higher bandwidth for large messages than OSF.
Computational performance
The CPU for the Paragon is the 50 MHz i860XP, an enhanced version of the 40 MHz i860 CPU in the Delta and iPSC/860. The i860XP has the same instruction set as the i860 and so is software compatible. The i860XP has a 16KB instruction and data cache, twice that of the i860. In addition, the speed of the memory bus has been increased from 160 MB/second to 400 MB/second. The super-scalar architecture is capable of 75 Mflops (double precision).
Our CRADA with Intel allowed us to evaluate early releases of the hardware and software. Our initial Paragon configurations had 40 MHz i860XPs until March 1993. Single-node performance from these 40 MHz chips and early software was disappointing. For example, single-node Linpack performance was actually slower than on the 40 MHz i860. Of course, evaluation and development in these early months was concentrated on OSF reliability and stability issues and not on absolute performance.
With the 50 MHz i860XPs installed, single-node performance improved to roughly 20% faster than the i860 processor over the set of benchmarks described in [2]. For example, the 100 x 100 double-precision FORTRAN Linpack ([1]) ran at 10.9 Mflops on the 50 MHz i860XP versus 9.7 Mflops on the i860. A FORTRAN radiosity code ([10]) that includes some I/O ran at 4.8 Mflops on the i860XP versus 2.8 Mflops on the i860. A C Cholesky factorization and a C numeric integration ran 25% and 31% faster on the i860XP than on the i860.
Finally, application performance on a single node is affected by the amount of memory available. The OSF kernel consumes about 6 megabytes. By contrast, the SUNMOS kernel takes less than 1 megabyte. NX on the Delta consumes about 4 megabytes, and NX on the iPSC/860 consumes about 1 megabyte. Memory consumption varies based on message buffer allocations. Memory consumption was measured with a simple malloc() loop on NX and SUNMOS. For the virtual-memory OSF, a vector-touch loop was run over larger and larger vectors until performance dropped, indicating that paging had begun. The larger memory consumption of OSF has to be balanced against the additional features (e.g., virtual memory) it provides the application programmer. In general, message buffer requirements grow with the number of nodes, so the amount of memory available to an application will diminish as more nodes are used. The need for larger-memory nodes in large (greater than 512 nodes) configurations is a system design issue that was identified in our early evaluation process.
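The individual benchmark figures quoted above can be turned into relative speedups for comparison with the roughly 20% average. This is a small illustrative sketch; the helper name `percent_faster` is not from the report.

```python
def percent_faster(new_mflops: float, old_mflops: float) -> float:
    """Relative improvement of new over old, as a percentage."""
    if old_mflops <= 0:
        raise ValueError("baseline rate must be positive")
    return (new_mflops / old_mflops - 1.0) * 100.0


# Rates quoted in the text: 50 MHz i860XP versus 40 MHz i860.
linpack_gain = percent_faster(10.9, 9.7)    # 100 x 100 Linpack
radiosity_gain = percent_faster(4.8, 2.8)   # radiosity code with some I/O
```

On these figures Linpack improves by about 12% and the radiosity code by about 71%, which brackets the 25% and 31% gains reported for the two C codes.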
Communication Performance
In this section, we analyze the communication performance of the Paragon mesh, first looking at adjacent node performance, then at communication to more distant nodes. The communication tests were performed under OSF 1.2 with the communication processor enabled. Various communication patterns are analyzed to determine how much concurrency the Paragon mesh can support and when contention degrades performance.
Node-to-node communication
In the first test, a simple echo test is used, where a message is sent and echoed back by the receiver. The sender measures the round-trip time for 1000 iterations. Figure 5.1 shows the data rate for two adjacent nodes echoing messages of various message lengths. The data rate increases with message sizes from 8 to 8,192 bytes. The Paragon using SUNMOS reaches a data rate of about 65 MB/s for a message size of 8,192 bytes. By contrast, the Paragon with OSF achieves 45 MB/s, though, as the figure illustrates, OSF's data rate exceeds that of SUNMOS for smaller messages. The cross-over point occurs roughly where OSF segments messages into 1792-byte packets. SUNMOS does not segment messages. Earlier generation message-passing machines (Intel iPSC/2 and iPSC/860) exhibited slower data rates if communication was not with the nearest neighbor ([7]). The Paragon, like its predecessor the Intel Delta (and to a lesser extent the Ncube 6400), communicates nearly as fast with the most distant node in the network as it does with its nearest neighbor. In particular, at the moment, differences in communication speed across the Paragon mesh are hidden in the measurement error of the experiments. The specifications for the Paragon mesh suggest that only 40 ns are required for each hop [12]. Thus for a 16 x 32 mesh less than 2 µs are added to the communication times between most distant nodes. As noted below, the minimum nearest-neighbor communication times are currently about 45 µs, so multi-hop overhead is at most a few percent.
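The multi-hop estimate above follows from simple arithmetic on the quoted figures. The sketch below is illustrative only: it takes the 40 ns per-hop delay and the 45 microsecond nearest-neighbor time from the text and, as a simplifying assumption, charges every hop of a corner-to-corner path as extra delay, which gives an upper bound.

```python
HOP_DELAY_NS = 40.0           # per-hop delay quoted in the text
NEAREST_NEIGHBOR_US = 45.0    # minimum nearest-neighbor time quoted in the text


def corner_to_corner_delay_us(rows: int, cols: int,
                              hop_delay_ns: float = HOP_DELAY_NS) -> float:
    """Upper bound on the added hop delay, in microseconds, for a message
    crossing a rows x cols mesh from one corner to the opposite corner."""
    hops = (rows - 1) + (cols - 1)
    return hops * hop_delay_ns / 1000.0


added_us = corner_to_corner_delay_us(16, 32)   # 46 hops at 40 ns each
fraction = added_us / NEAREST_NEIGHBOR_US      # share of the base latency
```

For the 16 x 32 mesh this gives about 1.84 microseconds of added delay, roughly 4% of the nearest-neighbor time, which is small enough to be hidden by measurement variation.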
In our earlier analyses of message-passing systems ([7], [2], [3]), we modeled the message-passing time, $T$, as a linear function of start-up time, $\alpha$, a per-byte cost, $\beta$, and a per-hop delay, $\gamma$.
We used a linear least-squares fit of our experimental results to calculate the startup and per-byte parameters. However, the experimental data from the Paragon and other new architectures are not as well supported by a linear fit, and the calculated parameters are very sensitive to the set of data points used in the fit. For purposes of comparison, and to the extent that one can characterize communication with one or two numbers, we now prefer to use the time to send a zero-length message as one metric, and the data rate for a one million byte message as another metric. The extra time required for a multi-hop message is more clearly seen if we look at the time for sending a zero-length message (Figure 5.2). Though the bandwidth between nodes has increased on the Paragon in comparison to the Delta and iPSC/860, the zero-length message time (latency) has improved only marginally, even though the 50 MHz i860XP is a faster processor. The latency is dominated by house-keeping chores (argument checking, context switch on interrupt, etc.) on both the sending and receiving nodes. In a separate study ([4]), the time to handle the time-slice interrupt on the iPSC/860 was about 50 microseconds, which suggests that interrupt context switch overhead could be the dominant factor in message latency. With the communication processor disabled, latency on the Paragon climbs to 85 µs and bandwidth is reduced by a factor of two. Intel hopes that the latency on the Paragon can be reduced to 25 µs. Figure 5.3 further illustrates the difference in performance and variability of message passing with and without the message processor under OSF. The figure shows the distribution of round-trip times for 2,000 samples using an 8-byte message. The echo test was run both with a nearest neighbor and from corner-to-corner in the 512-node mesh using both OSF and SUNMOS. Notice that the variance is such that it is possible to observe round-trip times that are faster corner-to-corner than to nearest neighbor.
Even though the communication performance of the Paragon and Delta is generally better than the iPSC/860, the hypercube topology performs some communication primitives faster than the mesh. For example, using Intel's gsync(), barrier synchronization time grows with the number of nodes for the mesh, but only as the log of the number of nodes for the hypercube (Figure 5.4).
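The linear least-squares fit mentioned above can be sketched as follows. This is a minimal illustration, not the fitting code used in the study: it fits only the start-up time (called `alpha` here) and the per-byte cost (`beta`), ignores the per-hop term, and the sample timings are hypothetical values generated from known parameters rather than measurements.

```python
def fit_linear(sizes, times):
    """Ordinary least-squares fit of times = alpha + beta * sizes.

    Returns (alpha, beta): the intercept (start-up time) and the slope
    (per-byte cost), in the units of the inputs.
    """
    n = len(sizes)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (size, time) pairs")
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("message sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Hypothetical, noise-free one-way times (microseconds) generated from
# alpha = 50 us and beta = 0.02 us/byte; not measured data.
sizes = [0, 1024, 4096, 16384]
times = [50.0 + 0.02 * n for n in sizes]
alpha, beta = fit_linear(sizes, times)
```

With real measurements the recovered parameters depend on which message sizes are included in the fit, which is the sensitivity noted above and the reason for preferring the zero-length time and the one-million-byte data rate as metrics.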
Contention
All of the communication data rates that we have reported have been measured on idle systems. In actual applications, other message traffic may compete for the communication channels, either from the application itself or from applications in other partitions. One partition may need to use another partition's communication channels to reach the I/O processors or other service nodes. The Paragon, iPSC/860, Delta, and Ncube 6400 use circuit-switching to manage the communication channels. When a message is to be sent, a header packet is sent to reserve the channels required. When this "circuit" is established, the message is transmitted, and an end-of-message indicator releases the channels. SUNMOS reserves the channel for the entire message. Paragon/OSF breaks a message up into packets (usually 1792 bytes), and the circuit is only reserved for the packet. This packetizing can add to the overhead of a message, but permits multiplexing the links of the circuit with other nodes. A program was developed to measure the effect of contention on the data rate of a communication channel and to measure the capacity of a given physical link. The link-contention program developed for the hypercube [7] proved inadequate for the higher speed meshes of the Delta and Paragon. Link contention was measured on a row of the mesh with varying numbers of pairs doing synchronous sends of one megabyte messages in one direction. It was observed that the interior pair completed first, followed by the next innermost pair, and so on. The outermost pair finished last. (Note the inner pairs continued to send data after the timed portion of their transmission completed.) For both OSF and SUNMOS, contention occurs when the aggregate data rate exceeds about 160 MB/second (Figure 5.5). (Recall, the peak channel bandwidth is 175 MB/second.) The slower data rate of the OSF nodes means that more OSF nodes can be sending before contention occurs.
For the Delta, aggregate channel throughput under contention is about 11 MB/second. The effect of contention can vary from run to run and can slow down an application. Since a mesh has fewer channels between nodes than the hypercube architecture, one would expect increased contention for the mesh channels. But contention will occur on both mesh and hypercube channels when the aggregate sending rate of nodes on the channel exceeds the channel bandwidth. A more detailed analysis of channel contention on the Paragon is reported in [13].
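Two of the quantities in this discussion lend themselves to a quick calculation: the number of 1792-byte packets OSF needs for a message, and the number of concurrent senders a link can carry before the aggregate rate passes the observed 160 MB/second threshold. The sketch below is illustrative only. It assumes a one megabyte message of 1,048,576 bytes, and the per-sender rates of 45 and 65 MB/s are borrowed from the 8,192-byte echo test as stand-ins, not the rates measured in the contention test.

```python
OSF_PACKET_BYTES = 1792        # usual OSF packet size quoted in the text
CONTENTION_MB_S = 160.0        # aggregate rate at which contention appeared


def packets_needed(message_bytes: int,
                   packet_bytes: int = OSF_PACKET_BYTES) -> int:
    """Number of packets required to carry message_bytes (ceiling division)."""
    if message_bytes < 0 or packet_bytes <= 0:
        raise ValueError("sizes must be non-negative and packet size positive")
    return -(-message_bytes // packet_bytes)


def senders_before_contention(per_sender_mb_s: float,
                              threshold_mb_s: float = CONTENTION_MB_S) -> int:
    """Largest number of senders whose combined rate stays within threshold."""
    if per_sender_mb_s <= 0:
        raise ValueError("per-sender rate must be positive")
    return int(threshold_mb_s // per_sender_mb_s)


packets = packets_needed(1_048_576)           # assumed size of a 1 MB message
osf_senders = senders_before_contention(45.0)     # illustrative OSF rate
sunmos_senders = senders_before_contention(65.0)  # illustrative SUNMOS rate
```

Under these assumptions a one megabyte message is split into 586 packets, and three OSF senders or two SUNMOS senders fit under the threshold, consistent with the observation that more OSF nodes can send before contention occurs.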
Concurrent Communication
The message-passing performance of a node may be improved by utilizing more than one of its communication channels at the same time. A fan-in test was used in our earlier tests on hypercubes [6], but only the Ncube was able to show a higher aggregate receive data rate. The Intel machines (including the Paragon) have a single receive FIFO and a single transmit FIFO, so it is only possible to receive from one channel at a time. However, for the iPSC/860, it is possible to nearly double the aggregate data rate of a node by doing an exchange using FORCE TYPE, that is, a node concurrently sends and receives with another node. So far, we have not been able to achieve the same result on the Paragon.
File and Network Performance
Paragon file I/O and access to local area networks are provided through one or more I/O or service nodes. These nodes usually reside on the outer columns of the mesh. Communication to the I/O or network nodes uses OSF interprocess communication (NORMA IPC) layered on top of underlying mesh communication primitives. The OSF IPC is presently limiting performance.
Parallel File System
The Paragon OSF provides both a standard UNIX file system and a larger, high performance parallel file system (PFS). The system manager can configure the I/O nodes and disks into combinations of UNIX and PFS file systems. A typical configuration would be to allocate the disks of an I/O node as a mountable partition in the UNIX file system. PFS is typically configured across a set of I/O nodes and disks. The PFS is striped across one or more I/O nodes using the disk RAID arrays and appears to the UNIX system as a separate mountable file system (e.g., /pfs). Normal C and FORTRAN I/O operations can be used on PFS, but optimum performance is achieved using special open calls. Several I/O benchmarks were used to characterize the performance of PFS. The benchmarks measured I/O throughput from a single compute node and from many compute nodes doing I/O concurrently. The tests were run with a varying number of I/O nodes in the PFS configuration.
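To illustrate what striping a file across I/O nodes means, the sketch below maps a byte offset in a file to an I/O node and an offset within that node's share of the data. It assumes simple round-robin striping with a fixed stripe unit; the text does not give the PFS layout or stripe size, so the function, the 64 KB stripe unit, and the four-node configuration are all hypothetical.

```python
def stripe_location(offset: int, stripe_bytes: int, n_io_nodes: int):
    """Map a file byte offset to (I/O node index, offset in that node's data),
    assuming round-robin striping with a fixed stripe unit."""
    if offset < 0 or stripe_bytes <= 0 or n_io_nodes <= 0:
        raise ValueError("offset must be >= 0; stripe size and node count > 0")
    stripe_index = offset // stripe_bytes       # which stripe unit overall
    node = stripe_index % n_io_nodes            # round-robin node choice
    local_stripe = stripe_index // n_io_nodes   # stripe units already on node
    local_offset = local_stripe * stripe_bytes + offset % stripe_bytes
    return node, local_offset


STRIPE = 64 * 1024   # hypothetical stripe unit, not a documented PFS value
first = stripe_location(0, STRIPE, 4)
second = stripe_location(STRIPE, STRIPE, 4)
wrapped = stripe_location(4 * STRIPE + 10, STRIPE, 4)
```

Under this layout, consecutive stripe units land on different I/O nodes, so a large sequential transfer or many compute nodes reading different regions can keep several I/O nodes busy at once.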
Experiences
We have evaluated several serial-number-one parallel processors at ORNL, beginning in 1985 with the Intel iPSC/1 hypercube. These early machines were used for algorithm development, performance analysis, and, to the degree possible, porting existing applications or developing new applications. Our initial testing of the Paragon was through a Cooperative Research And Development Agreement (CRADA) between ORNL and Intel. This first phase of the CRADA provided us with an iPSC/860 running OSF. Our testing involved evaluating the new OSF as well as porting some of our iPSC/860/NX hypercube applications to the new OSF environment. We were pleasantly surprised at the quality of the OSF implementation.
The second phase of the CRADA included the delivery of a 66-node, i860XP-based Paragon mesh. Initially, this unit was to have preceded the 512-node machine by several months, but schedules slipped, and the 66-node machine and 512-node machine arrived within a week of each other. As expected in a beta test, the hardware and software had bugs, and our initial efforts were directed at identifying the critical problems and working out solutions with the Intel staff. As part of the contract, Intel provided hardware and software personnel on site, so feedback was fast and effective. The developers at Intel Corporate would often have new software releases the day following a critical bug report.
Our testing on the Paragon consisted of program development of benchmark codes, porting working iPSC/860 codes, and performing system administration functions. A number of UNIX applications that run on single-processor UNIX systems were compiled and run under the OSF beta system. These applications included UNIX commands, benchmarks, various network servers, simulators, PVM [9], PICL [8], and component tests. These applications exercised system services such as file I/O, shared memory, semaphores, process creation, pipes, signals, network sockets, and shell scripts. In addition, POSIX and UNIX test suites were run. Though performance was not an issue during this early development and testing, the results from the various benchmarks did reveal various component inefficiencies that were relayed to the Intel team. The time-sharing services of OSF remained reasonably stable, though compile times were slow initially. File and network I/O were unusually slow, due to inefficiencies in the interprocess communication facilities of the OSF implementation. File and network I/O performance have improved, but still are not competitive with typical workstations.
The porting of parallel applications (working iPSC/860 codes) was successful for smaller applications. Those codes that depended on a host (SRM) had to be recoded to be hostless. Some applications did not port because of the limited application memory space on the compute nodes. Initially, the OSF kernel was taking nearly 8 megabytes of memory on each of the compute nodes. Virtual memory was supported, but if an application started swapping, performance was very poor and often was the cause of crashes. Though problems were fixed quickly, patches and new releases required re-running all of our tests. Occasionally, features that had been working would fail in a new release.
The 9-cabinet, 512-node system had early problems with grounding and noise on the communication channels. Many scaling problems with OSF were uncovered. Operating system tables were not properly sized for hundreds of processors. For several months, the maximum number of nodes in a single application was limited to 256. Multiprocessing of the service nodes was not initially supported, and memory bottlenecks hurt service node performance. Eventually, both the service nodes and compute nodes were upgraded to 32 megabytes of memory.
Although the evaluation of early systems and the CRADAs were partial justification for procuring the Paragons, the primary purpose of the ORNL Paragon was to provide a tool for computational science. Much of the testing and evaluation centered around porting the three Grand Challenge applications to the Paragon. The material science application was already running an early version on the 128-node iPSC/860. Porting that code to the Paragon was successful. The application uses PFS and dynamic memory allocation, and is achieving near linear speedups. On a per node basis, the Paragon version performs 1.7 times faster than the iPSC/860 version, and delivers 17 Gigaflops on the 512-node Paragon. Porting the global climate modeling application to the Paragon from the Delta has been more difficult than anticipated, primarily because of file I/O bugs and inefficiencies. However, the Paragon version is running about 1.5 times faster than the Delta version. The contaminant transport application did not have a fully developed parallel implementation, so progress on the Paragon has been difficult to measure.
The 66-node and 512-node Paragon systems are providing parallel computing cycles to a nationwide community primarily working on the three Grand Challenge projects. Performance and reliability continue to improve with each release, but performance still remains below expectations. The usefulness of OSF on the compute nodes is still a matter of debate in view of its performance and memory liabilities.
