Introduction
The Department of Energy selected Oak Ridge National Laboratory (ORNL) as one of its high performance computing centers as part of the government's High Performance Computing and Communications (HPCC) initiative. The initiative provided ORNL with funds to procure a massively parallel computer and to support various Grand Challenge applications. ORNL selected Intel to provide the massively parallel computer for the HPCC project. Intel has developed a family of distributed-memory multiprocessors, starting with the iPSC/1 hypercube in 1986. The Intel multiprocessors are members of a growing market of parallel processing systems that are being used by researchers and commercial organizations to tackle increasingly complex computational tasks. A Cooperative Research and Development Agreement (CRADA) between ORNL and Intel specified the staged delivery of increasingly powerful versions of Intel's new Paragon multiprocessor. As part of the agreement, ORNL would receive pre-production models of the Paragon and assist in beta testing and product development. This report summarizes the results of the third and final phase of the CRADA, the test and evaluation of the Paragon MP.
In 1994, the first Paragon MP was delivered to ORNL. The Paragon MP extended the computational power of the Paragon by providing three i860XP processors with a shared memory on each node of the communication mesh. Appendix A provides a time-line of the events during the test and evaluation of the Paragon MP system. This report details the performance of the shared-memory node board and evaluates the performance of several parallel applications on the Paragon MP. Computational, shared-memory, and communication performance were measured with synthetic benchmarks, application kernels, and a few parallel applications. The Paragon's shared-memory performance is compared with the performance of other shared-memory parallel processors.
The following section summarizes the Paragon MP architecture. Section 3 describes the performance of the shared-memory MP node and compares it with other shared-memory architectures. Section 4 reports recent improvements in message passing and I/O.
Paragon MP Architecture
The Intel Paragon system is a mesh-connected parallel processor. In the first member of the Paragon family, the GP system, each node on the mesh consists of two 50 MHz i860XP processors, memory, and communication hardware. One processor is used for computation, and the second processor is used for communication.
A Paragon MP node consists of three 50 MHz i860XP processors, memory, and communication hardware (Figure 2.1). The nodes are interconnected by a 2-D mesh with 175 MB/second communication channels and a per-hop latency of only 40 ns. The nodes are logically subdivided into service nodes, compute nodes, and I/O nodes (Figure 2.1). The service nodes appear as a single host and support time-sharing through the OSF operating system. The compute nodes run OSF or SUNMOS. The I/O nodes are connected to local networks and arrays of disks (RAID) and provide a UNIX file system, swap/paging space, and a Parallel File System (PFS). Each i860XP has its own 16 KB data and instruction cache, and each node has at least 64 MB of memory. The bus interconnecting the processors, mesh interface, and memory operates at 400 MB/second. The 50 MHz i860XP is a super-scalar architecture capable of a peak 75 Mflops (double precision). Typical FORTRAN performance is only 11 Mflops [2].
Early designs of the MP proposed five CPUs with L2 cache, but cost-performance analyses dictated the three-CPU configuration with no secondary cache. Intel's analyses showed that the bus bandwidth to the local memory would not support five CPUs efficiently. The three-CPU configuration also provided more board real estate for memory than the five-CPU design, and the five-CPU design would have required using only every other backplane slot.
Message-passing libraries (NX, PVM, MPI, SUNMOS) are provided for internode communication. A node is the smallest addressable unit in the message-passing architecture. (Conceptually, each processor might be addressable in the message-passing software, but Intel's early design analyses favored node addressing.) Typically, one processor on each node is designated as the communication processor, leaving two processors for computational work. The compilers can automatically parallelize code across the processors on a node, or the programmer can parallelize explicitly using a threads library and compiler directives.
Performance
In this section, we look at the performance of the shared-memory node. We measure CPU memory bandwidth and the effects of bus contention when multiple CPUs and message passing compete for the limited bus bandwidth. We compare the shared-memory performance and multi-threading primitives with those of other shared-memory multiprocessors, and finally compare the performance of some parallel benchmark kernels.
An MP node has three CPUs, memory, and a mesh controller sharing a 400 MB/second bus (Figure 2.1). A 50 MHz i860XP is specified as being able to generate 400 MB/second of memory traffic, so the MP node architecture is likely to be bus limited. Large caches on each processor could mitigate the limited bandwidth, but the data cache is not large (16 KB). With 90% to 95% cache hit rates and typical cache write-back rates, one can expect that only 15% to 25% of a CPU's memory requests actually generate a memory operation. In principle, the 400 MB/second bus could support four or five i860XPs. Programs can be contrived to demonstrate either extreme: one where all data requests are satisfied from cache and linear speed-up is possible, or one where few or no cache hits occur and the node runs at the speed of one processor (or less).
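The arithmetic behind the estimate above can be sketched as follows. This is a back-of-the-envelope calculation, not the authors' analysis: it assumes each CPU issues requests at the full 400 MB/second rate and that only the stated 15% to 25% of those requests reach the bus. The function name `cpus_supported` is hypothetical.

```python
BUS_MB_PER_S = 400.0   # shared node bus bandwidth (from the text)
CPU_MB_PER_S = 400.0   # peak memory traffic one i860XP can generate (from the text)


def cpus_supported(bus_fraction):
    """Number of CPUs whose cache-filtered traffic fits on the bus.

    bus_fraction is the assumed share of a CPU's memory requests that
    miss the cache (or are write-backs) and therefore reach the bus.
    """
    per_cpu_bus_traffic = CPU_MB_PER_S * bus_fraction
    return BUS_MB_PER_S / per_cpu_bus_traffic


for fraction in (0.15, 0.20, 0.25):
    print(f"{fraction:.0%} of requests reach the bus -> "
          f"{cpus_supported(fraction):.1f} CPUs")
```

Under these assumptions a 25% bus fraction supports four CPUs and a 20% fraction supports five. The sketch ignores traffic from the mesh controller, which shares the same bus.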
To measure actual memory bandwidth performance, we used a small unrolled assembler loop that did quad loads (pfld.q) from memory. The memory locations were "pre-touched" to eliminate any virtual memory effects. A single CPU sustained a memory access rate of 251 MB/second, considerably less than the 400 MB/second specification. When we ran our test concurrently on two CPUs, the aggregate rate was still only 237 MB/second. When all three CPUs on the node concurrently accessed memory, the aggregate rate was 246 MB/second (Table 3.1).
We compared the MP shared-memory node board with other shared-memory multiprocessors. We compared thread and fork creation, lock and unlock, barriers, and concurrent updates of a shared variable (no locks). The i860XP has no hardware "atomic" operations, so locks are implemented in software. Table 3.2 compares single processor performance of the 50 MHz i860XP with single processors on the KSR, BBN, and Sequent. The KSR is a ring-based shared-memory multiprocessor using a 20 MHz custom processor. The Sequent Symmetry is a bus-based shared-memory multiprocessor using 16 MHz 386 processors. The BBN TC2000 is a cascaded-switch based shared-memory multiprocessor using 20 MHz M88000 processors (see Appendix B).
Table 3.3: Three processor performance.
Table 3.4 compares the MP node performance over a set of application kernels in C and FORTRAN. The same copy of the code was run on each multiprocessor, and the codes have not been tuned. The numeric integration kernels effectively operate from cache, so near linear speed-up is achieved. The Cholesky code is a little more memory intensive, and the slower MP performance results from lock and bus contention. The HiTC kernel is based on a double-precision complex ZAXPY. Using Intel's ZAXPY from the kmath library, the serial code runs at 36 Mflops. The vectors in the ZAXPY exceed the i860XP cache size, and bus contention prevents the parallel HiTC kernel from achieving any speedup on an MP node. Another version of the HiTC kernel, modeling only one atom per cell, has small enough vectors that near linear speedups can be attained.
                             Speedup on three CPUs
Kernel                       MP     KSR    Sequent
Integration (C)              2.9    2.9    3.0
Jacobi iteration (C)         2.7    2.8    3.0
Cholesky (1K × 1K) (C)       2.1    3.0    2.9
Integration (F)              2.9    2.9    3.0
HiTC kernel (F)              1.0    2.9    2.9
Table 3.4: Speedup on three CPUs for various application kernels.
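The speedups in Table 3.4 can also be read as parallel efficiency (speedup divided by the number of CPUs). The short sketch below does that conversion. The numbers are copied from the table; the efficiency metric and the helper name `efficiency` are additions for illustration and do not appear in the report.

```python
CPUS = 3

# Speedups on three CPUs, copied from Table 3.4: (MP, KSR, Sequent).
SPEEDUPS = {
    "Integration (C)": (2.9, 2.9, 3.0),
    "Jacobi iteration (C)": (2.7, 2.8, 3.0),
    "Cholesky (C)": (2.1, 3.0, 2.9),
    "Integration (F)": (2.9, 2.9, 3.0),
    "HiTC kernel (F)": (1.0, 2.9, 2.9),
}


def efficiency(speedup, cpus=CPUS):
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / cpus


print(f"{'Kernel':<22}{'MP':>6}{'KSR':>6}{'Seq':>6}")
for kernel, row in SPEEDUPS.items():
    cells = "".join(f"{efficiency(s):>6.0%}" for s in row)
    print(f"{kernel:<22}{cells}")
```

Read this way, the MP node runs the memory-intensive kernels at about 70% (Cholesky) and 33% (HiTC) efficiency, while the cache-resident kernels stay at or above 90%.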
To this point, we have considered only a single node board. In a parallel application, each Paragon node communicates with other nodes in the mesh. The expected configuration is to use one CPU on each node board as a communication processor. To see the effect of communication and computation competing for the bus, we added a communication thread to our pfld.q test. In the absence of computational activity, the communication thread ran at 119 MB/second, using an echo test to an adjacent node. In the absence of communication, the pfld.q loop ran at 252 MB/second. With one pfld.q thread and one communication thread running concurrently for identical durations, the aggregate data rate was 177 MB/second. The communication thread garnered 31 MB/second, and the pfld.q thread achieved about 146 MB/second. So the limited bus speed can slow both computation and communication.
Table 3
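The contention figures above can be restated as the fraction of its stand-alone rate that each thread keeps when both run together. The sketch below uses only the rates quoted in the text and treats the "about 146 MB/second" figure as exact. The helper name `retained` is hypothetical.

```python
# Measured rates in MB/second, taken from the text.
COMM_ALONE = 119.0    # communication thread, no computation
LOAD_ALONE = 252.0    # quad-load loop, no communication
COMM_SHARED = 31.0    # communication thread while the load loop runs
LOAD_SHARED = 146.0   # load loop while the communication thread runs


def retained(shared, alone):
    """Fraction of the stand-alone rate that survives bus contention."""
    return shared / alone


aggregate = COMM_SHARED + LOAD_SHARED
print(f"aggregate bus traffic: {aggregate:.0f} MB/second")
print(f"communication retains {retained(COMM_SHARED, COMM_ALONE):.0%}")
print(f"load loop retains     {retained(LOAD_SHARED, LOAD_ALONE):.0%}")
```

On these numbers the communication thread keeps roughly a quarter of its stand-alone rate and the load loop a little under 60%, so communication is hit harder in this particular test.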
Message passing
Our beta testing concentrated primarily on the shared-memory features of the MP, but we also re-evaluated message-passing performance and I/O. Most of our production research is conducted using Intel's OSF on the compute nodes, but we also continue to evaluate SUNMOS. Our communication tests uncovered several performance anomalies. Data rates were poor if message sizes were not a multiple of 32 bytes, and data rates of one-to-n communication degraded as n increased. Intel corrected the anomalies in subsequent software releases. Message-passing performance (latency and bandwidth) improved with each release of software. For nearest neighbor communication, we are currently measuring latencies of 25 to 30 μs for zero-length messages, and data rates of nearly 171 MB/second for one MB messages. These numbers were measured under OSF 1.0.4 R1 3 and SUNMOS 1.6.2 and are much faster than those we reported just last year [4]. Per-hop delay is nearly negligible. The additional delay for going corner to corner on the 1024-node MP Paragon (16 × 64) is less than 3 μs.
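A common first-order way to read latency and bandwidth figures like these is the linear cost model t(n) = latency + n / bandwidth. The sketch below applies that model to the upper latency figure (30 microseconds) and the large-message data rate quoted above. The model is an assumption of this example: the report does not state that the measurements follow it. The sketch also treats the 171 MB/second rate as the asymptotic bandwidth and takes 1 MB as 10**6 bytes. The function names are hypothetical.

```python
LATENCY_S = 30e-6            # zero-length message latency, upper figure from the text
BANDWIDTH_B_PER_S = 171e6    # large-message data rate, assuming 1 MB = 10**6 bytes


def transfer_time(nbytes):
    """Estimated one-way time under the linear model: latency + size/bandwidth."""
    return LATENCY_S + nbytes / BANDWIDTH_B_PER_S


def effective_rate(nbytes):
    """Delivered bytes per second for a message of nbytes under the same model."""
    return nbytes / transfer_time(nbytes)


# Message size at which the model delivers half the asymptotic bandwidth.
half_rate_size = LATENCY_S * BANDWIDTH_B_PER_S

for size in (64, 1024, round(half_rate_size), 1_000_000):
    print(f"{size:>9} bytes: {effective_rate(size) / 1e6:6.1f} MB/second")
```

Under this model, messages need to be about 5 KB before half of the large-message rate is delivered, which is one way to see why short-message latency matters as much as peak bandwidth.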
The compute nodes can be configured in a "turbo" mode, where all three processors are used as computation processors. Communication tasks are handled with context switches. Our early tests showed that communication performance was several orders of magnitude slower in turbo mode. However, recent software releases have greatly improved turbo mode communication. In turbo mode, latency increases to 75 μs, and bandwidth is reduced to 109 MB/second.
Our CRADA analysis also re-evaluated I/O performance. The Paragon OSF provides both a standard UNIX file system and a larger, high performance parallel file system (PFS). The PFS is striped across one or more I/O nodes and their RAID disk arrays and appears to the UNIX system as a separate mountable file system (e.g., /pfs
Summary
The final phase of the ORNL/Intel CRADA provided value to both parties: Intel received feedback and performance analyses from early users, and ORNL gained an opportunity to do leading-edge computer science and computational science. The shared-memory performance of the Paragon MP is competitive with other shared-memory multiprocessors. The limited bus bandwidth of each MP node requires that the application programmer exploit data locality to garner noticeable speedups. The automatic parallelization compilers help in utilizing the multiple-processor nodes, but in complex programs, the application programmer usually needs to assist in the parallelization process. Message-passing performance and I/O continue to improve. The 96-node MP Paragon CRADA machine continues to be a valuable computational resource for ORNL, even after delivery of the production 1024-node MP Paragon in January 1995.
than two microseconds, and a single channel of the switch has a bandwidth of 40 MB/second [6]. The architecture could be used with other memory management policies [5]. Compiles on the BBN were done with -O -lus. LINPACK 100 × 100 double-precision on a single processor was 1.0 Mflops using -OLM -autoinline. Dhrystone (v1.0) was 19.4 Mips.
Kendall Square
The Kendall Square uses custom-designed 20 MHz processors that share memory on a one gigabyte per second ring. Each processor has a 256 KB cache, and the global memory is managed as a cache. A single processor generates a maximum of 40 MB/second against the ring. LINPACK 100 × 100 double-precision on a single processor was 15 Mflops [3].
Sequent Symmetry
The 26-processor Sequent Symmetry located at ANL is based on 80386/387 processors (16 MHz) with a Weitek 3167 floating-point co-processor. Each processor has a 64 KB cache, and 32 MB of memory is shared by all processors on a 54 MB/second bus. The maximum configuration is 30 processors. The processors run Dynix 3.1.2, and compiles were done using -O. LINPACK 100 × 100 double-precision on a single processor was 0.37 Mflops [1]. Dhrystone (v1.0) was 3.6 Mips. Processor 4.8 MB/second versus a 26 MB/second bus.
