Introduction
In September of 1991, a Kendall Square Research (KSR) multiprocessor was installed at Oak Ridge National Laboratory (ORNL). This report describes the results of this initial field test. The performance of the KSR shared-memory multiprocessor is compared with other shared-memory and distributed-memory multiprocessors, using synthetic benchmarks and real applications. Performance figures must be considered preliminary, since the KSR system was in its first field test.
The KSR multiprocessor runs a modified version of OSF/1 (Mach). To the user, the KSR system looks like a typical UNIX(TM) system, but it provides performance advantages similar to those of the Sequent Symmetry and BBN TC2000 multiprocessors and scalability similar to that of the Intel iPSC/860 and DELTA. Piped processes and background jobs can utilize the multiprocessor architecture to provide improved throughput and response time.
A programmer on the KSR system is provided with a parallel make and with automatic parallelization for FORTRAN. The programmer can assist the automatic parallelization (a FORTRAN pre-processor from Kuck and Associates) with compiler directives, or can do explicit parallelization using the pthread subroutine library. The pthread library is provided to the C programmer along with language extensions to manage shared variables.
Shared Memory
The distinguishing feature of the KSR multiprocessor is its shared-memory architecture. Each processor has 32 megabytes of memory. Up to 32 processors are connected to a slotted, pipelined ring, called a Ring:0. Larger systems are formed by connecting Ring:0's to an interconnecting Ring:1, providing up to 1,088 processors. The memory of all of the processors is part of a 40-bit virtual address space managed as a cache, where the ring is used to transport cache lines to satisfy "cache faults." Custom CMOS chips manage the cache, ring, and ring-to-ring routing. The KSR architecture and chip set are designed specifically to support a shared-memory multiprocessor. Section 2 and [18] provide more detail on the actual implementation.
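The capacity figures quoted above can be checked with a few lines of arithmetic. The following Python sketch is only a worked example: the variable names are illustrative, and reading 1,088 as a whole number of 32-processor Ring:0's is an inference from the quoted numbers rather than something stated in this report.

```python
# Figures quoted in the text; the names are illustrative only.
CELLS_PER_RING0 = 32     # processors on one Ring:0
MAX_CELLS = 1088         # largest configuration quoted
MB_PER_CELL = 32         # megabytes of memory per processor
ADDRESS_BITS = 40        # width of the virtual address space

# Number of full Ring:0's implied by the largest configuration.
ring0_count = MAX_CELLS // CELLS_PER_RING0

# Memory on one fully populated Ring:0, in megabytes.
ring0_memory_mb = CELLS_PER_RING0 * MB_PER_CELL

# Memory of the largest configuration, in gigabytes (1 GB = 1024 MB here).
max_memory_gb = MAX_CELLS * MB_PER_CELL / 1024

# Size of the 40-bit address space, in gigabytes.
address_space_gb = 2 ** ADDRESS_BITS / 2 ** 30

print(ring0_count, ring0_memory_mb, max_memory_gb, address_space_gb)
```

Under these assumptions the 40-bit address space is far larger than the physical memory of even the largest configuration, which is consistent with managing all memory as a cache of that address space.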
The KSR shared-memory architecture is similar to the bus-based Sequent systems in that there is one cached address space, but it differs from the Sequent in that the Sequent does not have a notion of "local cache," and the KSR architecture is extensible beyond 30 processors. The BBN shared-memory multiprocessors share KSR's extensibility, but under the BBN's Uniform system there is no caching; rather, a reference to a "remote" shared location will always be remote, and replication is under software control. KSR differs from the mesh-based distributed shared-memory systems DASH [14] and PLUS [1] in that these systems do not provide strongly ordered read/write memory operations. DASH and PLUS must use explicit synchronization operations when a specific ordering is required in accessing a shared location. The KSR memory system is both sequentially consistent [12] and strongly ordered [4], so ordinary read/write memory operations can be used to implement synchronizations. The KSR's ring-based memory system is quite similar to MEMNET [2], except that MEMNET still has a local memory for each processor independent of the ring-based shared memory. Also, a shared memory location on MEMNET has a "home" location, a feature not required on the KSR. Delp [2] notes that the ring topology supports broadcast and provides an ordering of memory accesses, so a coherency protocol is easy to implement. Both KSR and MEMNET pipeline the ring, so that more than one memory transaction may be on the ring at the same time.
Additional details of the implementation of the shared-memory architecture are provided in Section 2 along with a summary of the processor architecture and implementation. Section 3 compares the computational performance of a single KSR processor to other superscalar processors and compares KSR's UNIX performance to other UNIX systems. Section 4 measures the parallel performance of the KSR multiprocessor and compares it to other shared-memory and distributed-memory multiprocessors. Section 5 relates our early experiences in porting various applications to the KSR.
Implementation
The KSR Ring:0 consists of a 34-slot backplane, populated with 32 processor boards, or cells. The remaining two slots are used for Ring:1 interconnect boards. Each cell consists of 12 custom CMOS chips. The shared memory is managed by 4 Cell Interconnect Units (CIU) and 4 Cache Control Units (CCU). The remaining chips comprise the four functional units: the Cell Execution Unit (CEU), the 30 Megabytes/second (MBs) external I/O unit (XIU), the integer unit (IPU), and the floating point unit (FPU). An instruction pair is executed on each cycle, with one member of the pair coming from either the CEU or XIU and the other member being either an FPU or IPU instruction. Thus an address calculation, load/store, or branch can be executed concurrently with either an integer or floating point instruction.
Each cell runs at 20 MHz, and the floating point unit supports a pipelined adder and multiplier for a peak performance rate of 40 Megaflops per cell. Thus the KSR processor is very similar to other superscalar processors such as the Intel i860 and the IBM RS/6000 (see Appendix A). The floating point unit uses 64 64-bit registers, and the integer unit has 32 64-bit registers. The CEU uses an additional set of 32 40-bit address registers. Each cell holds a 256KB data cache and a 256KB instruction cache, and a 32 Megabyte daughter board is attached to the back of each processor board. KSR calls the local memory on each processor the cache and refers to the 256KB data cache as the sub-cache.
The memory of every cell is part of a single 40-bit virtual address space managed as a hierarchy of caches. If a processor requests a location that is not in the local data cache, then the data is fetched from the on-cell memory. If the data is not in the on-cell memory, then the data is fetched from the memory of one of the other cells on the ring(s). In each case the processor is stalled until the data arrives. The latencies and capacity of each level of the cache hierarchy are listed in an accompanying table. The programmer or compiler can use a non-blocking pre-fetch instruction (up to four may be in progress from each processor) and a post-store instruction to reduce the latency. Synchronization, or locking, is provided by instructions to lock and unlock a 128-byte subpage.
The KSR configuration at ORNL is a 32-cell system. An Ethernet and Exabyte 8mm tape drive are connected to the I/O port of cell 1. A Multi-channel Disk (MCD) controller is attached to cell 3. The MCD has 5 SCSI controllers, each with two 1-gigabyte drives. These drives are presently mounted as independent UNIX disk partitions. In the future, the drives can be configured as RAID arrays and as one logical volume with the files striped across the drives. Appendix A summarizes the configurations of other machines (BBN TC2000, IBM RS/6000-530, Intel iPSC/860, and Sequent 80386-Symmetry) used for comparison in the following sections.
For the tests described in the following sections, the KSR software release used was PR1.14. Unless otherwise noted, -O2 optimization was used. Timings were provided by either the UNIX time command, or by timer calls within the application. The KSR supports a "global" time-of-day clock with a 10 millisecond resolution and two sub-microsecond timers on each cell. One timer provides user time, and the other is a free-running timer. The timers all run at the same frequency, but the free-running timer is initialized as each cell is started. Each cell is started serially after cell 1, so all of the free-running timers are offset from each other. Thus if a process/thread migrates to another cell, timings reported by the free-running timer cannot be trusted. We used the free-running timer for many of our tests, but we always bound the thread to the cell for the test, preventing the scheduler from moving it.
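The effect of the per-cell timer offsets can be illustrated with a small model. The Python sketch below is not KSR code: the function name, the cell start times, and the sample timestamps are all hypothetical, chosen only to show why an interval measured across a migration is meaningless while an interval measured on one cell is not.

```python
# Hypothetical cell start times in microseconds. Cells start serially,
# so each free-running timer begins counting at a different moment.
CELL_START_US = {1: 0.0, 2: 1500.0}

def free_running_reading(true_time_us, cell):
    """Model of a cell's free-running timer, which counts from that cell's start."""
    return true_time_us - CELL_START_US[cell]

# A thread reads the timer on cell 1, then works for 250 microseconds.
start = free_running_reading(10_000.0, cell=1)

# If the thread stays bound to cell 1, the difference is the true interval.
elapsed_bound = free_running_reading(10_250.0, cell=1) - start

# If the thread has migrated to cell 2, the offset corrupts the interval.
elapsed_migrated = free_running_reading(10_250.0, cell=2) - start

print(elapsed_bound, elapsed_migrated)
```

In this model the bound measurement recovers the 250 microsecond interval, while the migrated measurement is off by exactly the difference in cell start times, which is the reason for binding threads during the tests.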
Single Cell Performance
The single-processor performance of the KSR functional units was measured with several widely used benchmarks. Floating point performance was measured with the FORTRAN Livermore Loops, SLALOM (version 2) [11], and the 100 × 100 double-precision LINPACK. As of this writing, KSR FORTRAN codes performed somewhat faster than the equivalent C programs. As a rough measure of integer performance the C Dhrystone (version 1) was used. Figure 3.1 shows the results of these benchmarks. For comparison, results from the Intel i860 and IBM RS/6000-530 processor are displayed as well (see Appendix A for configurations and compiler options). The 20 MHz KSR is competitive with the faster clocked i860 and 530. The KSR compiles were done with -O2 optimization, except "auto-inline" was used for LINPACK. Unfortunately, with "auto-inline" the LINPACK compile takes more than an hour. Without "auto-inline", the compile still takes several minutes and performance slows from 15 Mflops to 11 Mflops.
The KSR compilers have not yet been optimized for compile-time speed. The KSR takes over 6 minutes to compile the 3000-line Livermore Loops FORTRAN code with -O2 optimization. Compile times for the i860 (a Sun 4/390 cross-compiler) and the IBM RS/6000-530 are under one minute. A similar disparity in performance is exhibited by the BYTE benchmark suite, a set of C programs and shell scripts that exercise various UNIX features including multiple processes, pipes, and compiles. The time for a BYTE run on the KSR was more than five minutes, compared with under one minute for the IBM 530. (The BBN TC2000 ran the BYTE suite in 113 seconds, the Sequent Symmetry in 117 seconds.) Some of the slowness can be attributed to the development stage of the OS and I/O subsystem. The disk subsystem will eventually support a RAID organization with striping, but at present each disk is a separate UNIX partition. Basic I/O data rates from the disk subsystem, measured with a file system exerciser (FSX) and simple write/read tests, are competitive with data rates from the IBM 530. There was a measurable performance difference depending on whether the I/O test was performed on the cell attached to the disk subsystem. Write rates dropped from nearly 1 Megabyte/second on the I/O cell to 0.31 MBs on other cells. Read rates were about 1 MBs and showed little variation from cell to cell, presumably due to disk buffer caching. (A 16 Megabyte file was written/read using 16 KB blocks.) Concurrent I/O tests, with multiple processes writing/reading independent files on separate disks, showed promising results with a 2.4 MBs aggregate read rate on four cells, results competitive with concurrent I/O rates on the Intel hypercube file system (CFS) [7]. More extensive I/O tests will be performed when the disk system is more optimally configured.
The performance of a single KSR processor in executing some simple process control primitives is given in an accompanying table.
The computational performance of the KSR depends on the effectiveness of the user's program in utilizing the memory hierarchy. The large number of registers and dual instruction streams permit the compiler to generate code to do computations in one instruction stream while loading and storing data in the other. The large register set makes it feasible to unroll loops to a greater depth. A hand-unrolled FORTRAN double-precision (64-bit) matrix multiply achieved 33.3 Mflops.
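To put the measured rates in context, they can be expressed as fractions of the peak rate implied by the clock. The Python sketch below is a worked example only: it assumes the peak corresponds to one add and one multiply completing per 20 MHz cycle, and the variable names are illustrative.

```python
CLOCK_MHZ = 20          # cell clock rate quoted in the text
FLOPS_PER_CYCLE = 2     # assumed: one pipelined add plus one multiply per cycle

# Peak rate in Mflops under the assumption above.
peak_mflops = CLOCK_MHZ * FLOPS_PER_CYCLE

def fraction_of_peak(measured_mflops):
    """Measured rate as a fraction of the assumed peak rate."""
    return measured_mflops / peak_mflops

matmul_fraction = fraction_of_peak(33.3)        # hand-unrolled matrix multiply
linpack_fraction = fraction_of_peak(15)         # LINPACK with auto-inline
linpack_plain_fraction = fraction_of_peak(11)   # LINPACK without auto-inline

print(peak_mflops, matmul_fraction, linpack_fraction, linpack_plain_fraction)
```

On these figures the hand-unrolled matrix multiply runs at a little over 80% of peak, while the unblocked LINPACK runs at well under half of it.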
Data for the registers are fetched from a 256 KB data cache (sub-cache). This large cache sustains high performance over larger vector sizes. Figure 3.2 illustrates the performance of a repeated double-precision complex zaxpy vector computation for various vector sizes. The zaxpy is repeated 10,000 times on the same two vectors for each vector size. Although this test is not representative of any application, it does serve to illustrate cache behavior. When the cache can no longer contain all of the data, performance drops as data has to be fetched from the slower main memory. The advantage of the larger cache is evident when compared with the smaller caches of the i860 (8 KB data cache) and 530 (64 KB data cache). The 256 KB cache actually will hold all of the data for the 100 × 100 LINPACK. Performance for a 128 × 128 matrix drops to 5 Megaflops for the unmodified FORTRAN code. However, by using a blocked algorithm as KSR has done for the 1000 × 1000 LINPACK, performance reaches 31 Megaflops.
If the KSR processor fails to find a data item in the local memory, it must issue a request to the ring to fetch the data from one of the other processors. In the absence of other activity on the ring, we measured this latency to average about 6.7 microseconds (μs). For the BBN TC2000, a remote access takes less than 2 μs, but on the BBN the remote access is not cached to the requesting processor. By contrast, on the KSR, subsequent references will be local (in the absence of other exclusive requests for that location from other processors). A remote access on the iPSC/860 or DELTA would require a send/recv and would take roughly 150 μs. Faulting a large vector from one KSR cell to another, using a 128-byte stride, resulted in a data rate of 19.5 MBs. Using the prefetch instruction (up to four may be in progress at once), the measured data rate increases to 34 MBs. By comparison, the peak data rate for iPSC/860 is 2.8 MBs, and the measured peak for the DELTA is about 17 MBs [8].
In the following section we run these memory tests concurrently on multiple processors and measure both single processor and aggregate data rates.
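The zaxpy kernel and two of the numbers in this section can be related with a short calculation. The Python sketch below is illustrative rather than the benchmark code used here: the function and variable names are hypothetical, it assumes each remote fault moves one 128-byte unit with no overlap between faults, and it takes 1 MB as 10^6 bytes.

```python
SUBCACHE_BYTES = 256 * 1024    # data sub-cache size quoted in the text
COMPLEX_DOUBLE_BYTES = 16      # two 64-bit floating point values
TRANSFER_BYTES = 128           # assumed unit moved per remote fault
REMOTE_LATENCY_S = 6.7e-6      # measured average remote latency

def zaxpy(a, x, y):
    """One pass of y <- a*x + y on complex vectors, returned as a new list."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def working_set_bytes(n):
    """Bytes touched by zaxpy on two double-precision complex vectors of length n."""
    return 2 * n * COMPLEX_DOUBLE_BYTES

# Largest vector length for which both operands fit in the sub-cache together.
max_resident_n = SUBCACHE_BYTES // working_set_bytes(1)

# Data rate in MBs if every transfer pays one full remote latency.
unoverlapped_mbs = TRANSFER_BYTES / REMOTE_LATENCY_S / 1e6

print(zaxpy(2 + 0j, [1 + 1j], [1j]), max_resident_n, round(unoverlapped_mbs, 1))
```

Under these assumptions the unoverlapped rate comes out close to the 19.5 MBs measured without prefetch, which suggests that the unprefetched test is latency-bound and explains why overlapping up to four prefetches raises the rate.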
Parallel Performance
To measure the parallel performance of the KSR system, we ran a number of the tests in the previous section concurrently on multiple processors. In addition, we measured parallel performance of the memory system under various loads. Parallel tests of various synchronization primitives were conducted as well. The parallel tests were conducted using the pthread library and "binding" each thread to a separate processor.
Concurrent Memory Tests
The prefetch test was run concurrently on independent pairs of processors. The linear response and aggregate data rate are quite good, but these tests were not able to achieve the vendor-stated peak of 1 GBs.
To stress the memory subsystem, we measured the average time for doing an unrestricted update of a shared variable with varying number of processors. The unrestricted update is unrealistic, since in a real application such an update would be coordinated with a lock. However, the test is adequate for our intent of measuring the response of the memory subsystem to a very hot spot. For comparison, the same test was performed on the BBN and Sequent systems. For all three machines, the compute time is comparable and increases linearly with the number of processors (Figure 4.2). Though the Sequent has a slower CPU, its memory latency is better than either the BBN or KSR, so the compute times for this test are comparable. For both the BBN and KSR, the memory subsystem does not reach saturation until more than four processors are contending for the shared location. For further comparison, we conducted the hot-spot test on the distributed memory iPSC/860 and DELTA. Multiple processors send a message to the owning processor requesting the current value, followed by a message updating the value. For 32 processors, the average update time was 5.7 ms for iPSC/860 and 5.9 ms for the DELTA compared to 63 μs for the KSR.
To further study the effects of a hot spot in a shared memory, we used the workload generator described in [16]. An input file to the generator describes the various workload characteristics for exercising a shared-memory system. One can specify the number of shared locations, the percentage of shared references to local references, and whether locking is required. We ran the workload using a single shared memory location and no locking for various percentages of shared-to-local references.
The occurrence of the shared reference within the workload can be deterministic or probabilistic [16]. The tests were run on the KSR, BBN, and Sequent systems. Figure 4.3 shows the efficiency of each system for a 1% and 10% shared access ratio using the probabilistic model. Efficiency is measured as the average time for executing the workload on a single processor (the "shared" location is local in this case) divided by the average time for executing the workload concurrently on p processors, T_1/T_p. Although the three systems performed comparably when the memory subsystems were saturated (Figure 4.2), their behavior under lighter loads is markedly different. The Sequent shared bus can easily keep up with the demand from the workloads. The efficiency for the KSR falls off faster than for the BBN, but the responses of the memory subsystems (shape of the curve) are roughly the same. The KSR has a faster processor and longer remote memory latency than the BBN, which accounts for most of the performance difference.
[...] workload. The KSR is noticeably slowed in relation to the other two systems, suggesting the need for coarser-grained applications for the KSR shared-memory system.
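The efficiency measure used for the workload tests, single-processor time divided by p-processor time, differs from the speedup-based efficiency used later for the applications. The Python sketch below states both definitions side by side; the function names and the sample timings are hypothetical and are not measurements from this report.

```python
def workload_efficiency(t1, tp):
    """Workload-test efficiency: single-processor time over p-processor time.

    Each processor runs the same workload, so a value of 1.0 means that
    sharing the hot spot cost nothing.
    """
    if t1 <= 0 or tp <= 0:
        raise ValueError("times must be positive")
    return t1 / tp

def speedup_efficiency(t_serial, t_parallel, p):
    """Application efficiency: speedup divided by the number of processors."""
    if t_serial <= 0 or t_parallel <= 0 or p <= 0:
        raise ValueError("times and processor count must be positive")
    return (t_serial / t_parallel) / p

# Hypothetical timings in seconds, for illustration only.
print(workload_efficiency(2.0, 2.5))
print(speedup_efficiency(100.0, 5.0, 32))
```

The first form suits a test in which the per-processor work is fixed and only contention changes; the second suits a fixed total problem divided among processors.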
Locks and Barriers
Access to a shared location is usually controlled by an atomic locking operation. A synthetic lock/unlock test was run on the three shared-memory systems to measure the performance of locking operations on a single lock (Figure 4.5). The performance of the KSR hardware lock instruction, gsp (blocking version), is better than the mutex library routine for a few processors, but gsp performance degrades rapidly for more than 15 processors. The mutex version is thus preferred and performs well compared to the BBN and Sequent.
A lock controls data access; a barrier controls synchronization of processes or threads. [...] tree to implement the barrier, so performance goes as the log2 of the number of processors. The KSR barrier function also provides an option for a spanning-tree-like implementation. The dashed line in Figure 4.6 shows the improved KSR performance using a tree of width four. (Presumably a similar implementation for the BBN would improve its barrier performance as well.) The bus-based Sequent shared-memory system provides the best performance, but the architecture is not extensible beyond 30 processors. Memory (or message-passing) latency, bandwidth, and contention account for most of the difference in barrier performance for the different machines. Since we are using wall-clock time, the barrier times may also be affected by the OS overhead on one or more processors on each system. OS timer interrupts typically occur every 10 ms. The timer-interrupt overhead on the Intel nodes is only about 50 μs, but for the UNIX-based systems (KSR, BBN, and Sequent) the overhead is on the order of 500 μs.
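The dependence of a tree barrier on the logarithm of the processor count, and the benefit of a wider tree, can be seen by counting tree levels. The Python sketch below is a simple model, not the KSR or Intel barrier code: the function name is hypothetical, and it counts levels only, ignoring the per-level cost that contention and latency add in practice.

```python
def tree_depth(processors, width=2):
    """Number of levels a tree of the given width needs to span the processors."""
    if processors < 1 or width < 2:
        raise ValueError("need at least one processor and a width of two or more")
    depth, reach = 0, 1
    # Each additional level multiplies the number of reachable nodes by width.
    while reach < processors:
        reach *= width
        depth += 1
    return depth

for p in (2, 8, 32):
    print(p, tree_depth(p, width=2), tree_depth(p, width=4))
```

For 32 processors a binary tree needs five levels and a width-four tree needs three, so fewer synchronization steps lie on the critical path, at the price of more processors meeting at each node.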
Parallel Applications
The next class of benchmarks we used in comparing the KSR with other architectures consisted of small C applications that utilize shared memory, threads, barriers, and locks. The applications are simple numeric integration using spatial decomposition (static allocation), matrix multiply using spatial decomposition (static allocation), finite difference using a chaotic Jacobi iterative method with static spatial decomposition, a parallel quicksort using a queue-of-tasks model (dynamic allocation), and the solution of a linear system using Cholesky factorization (dynamic allocation). The codes use explicit parallelization and were easily ported to the KSR from the Sequent version. The main objective was to compare the shared-memory architectures running identical source programs (except for the translation of the calls that manage the parallelism). Figure 4.7 illustrates the Cholesky performance for the shared-memory multiprocessors and for the Intel distributed-memory multiprocessors. The shared-memory code could not be run on the Intel multiprocessors, so the Intel performance includes the effects of a different algorithm: the program must explicitly move portions of the matrix among the various processors. The performance of the serial code is represented as processor 0 in the figure. The BBN outperformed the KSR in the parallel (and serial) quicksort and numerical integration. The quicksort is integer work, and the BBN also ran the Dhrystone benchmark faster than the KSR (Appendix A). The numerical integration is dominated by floating-point divides, which the KSR does in software and the BBN does in hardware.
The performance of these tests was consistent with the underlying speed of the individual processors and memory subsystem. In general, the Sequent was slower in absolute time, but maintained a higher efficiency (speedup divided by the number of processors) with increasing processors. The Sequent's low-latency bus architecture accounts for the high efficiency, but the architecture is not extensible to more than 30 processors. The KSR was faster than the BBN in most tests and maintained a higher efficiency. Even though the BBN memory hierarchy has a lower latency than the KSR ring, the KSR's ability to "fault" a remote reference into a local reference results in higher performance for these tests. (A BBN-tuned application would see that work was assigned to a processor that "owned" the distributed portions of the global data structures; such tuning is not required for the KSR, though it too can profit from such tuning.)
To compare the architectures with a larger problem, optimized for each architecture, we used the 1000 × 1000 double-precision LINPACK [3]. The KSR implementation was based on a block algorithm implemented by KSR's Nick Camp in FORTRAN with some assembly language. The matrix is manipulated in groups of columns to optimize the use of the 256KB cache, and post-stores are used to reduce ring latency. [...] few data points of the KSR performance curve. The two Intel machines share the same processor and roughly the same message latency, thus the difference in their performance is due to the higher bandwidth of the DELTA mesh. The KSR outperforms the Intel multiprocessors because it has both higher bandwidth and lower latency. (Performance figures are not available for the BBN and Symmetry, but since their single-processor performance is low, their parallel performance would not be competitive for this test.)
Early Experiences
As the various benchmark kernels were being developed and tested, other users were working on porting applications to the KSR. The KSR multiprocessor is designed to make porting applications easy, and that has been our initial experience, both for serial and parallel codes. The first parallel application to be ported was a 19,000-line FORTRAN code that calculates energy densities for high-temperature superconducting materials [9]. The code already contained explicit Cray parallel micro-tasking directives, so porting to the KSR merely required changing the names and arguments for thread creation and joining and for lock management. The parallel version exhibited near-linear speedup and achieved 243 Mflops on 32 processors. Serial and parallel versions of a sparse-matrix library (SPARSPAK [17]) and a large FORTRAN global climate modeling code are also being ported to the KSR. Each of these large FORTRAN applications has usually uncovered one or more bugs in the -O2 optimization of the FORTRAN compiler. These bugs were usually fixed quickly. SPARSPAK includes implicit parallel directives for the Cray and Sequent, and those directives map nicely into corresponding KSR directives. The climate modeling code also has Cray parallel directives.
A number of UNIX C codes were ported as well, including the Network Time Protocol (NTP) [15], a variety of hypercube simulators [5], and PVM [10]. Some of the C codes had to be modified to account for 64-bit longs. The hypercube simulators use fork() to create sub-tasks and then use pipes, sockets, or System V shared memory to communicate among the sub-tasks. Performance for these simulators was poor, since the scheduler presently runs only one sub-task at a time.
Hardware reliability has been very good, with only two board failures during the first four months. The compilers and operating system have improved with each release, and KSR support has been very responsive. The OS still lacks several features for full multi-user support, but those features will be available in the first production release of the OS.
We will continue tracking KSR performance with the new releases and hope to expand the system to include a second ring. A second ring would permit us to better understand the extensibility of the architecture. We would like to develop analytical models of the performance of the memory hierarchy in terms of latency, hit ratio, and contention. A hardware memory event monitor will be installed on each cell in early summer. Data from the event monitor will permit us to better measure architecture and application performance. Finally, the user community will be expanded, providing more applications and a better understanding of the ease of use of the KSR multiprocessor.
A. Comparative Architectures
The KSR is compared with a number of other processors. This appendix summarizes the architectures and configurations used in this report. The processor architecture of the IBM RS/6000 and the Intel i860 share several common characteristics with the KSR processor: independent integer and floating point units and pipelined independent adder/multipliers in the floating point units. The Sequent and BBN parallel processors provide contrasting shared-memory architectures. Finally, the Intel distributed-memory parallel processors provide contrast to KSR's shared-memory model.
BBN TC2000
The BBN TC2000 at Argonne National Laboratory (ANL) is a 45-processor shared-memory parallel processor. Each processor is a Motorola 88000 running at 20 MHz with 16 MB of memory fronted by a 16KB data cache and a 16KB instruction cache. All of the memories are interconnected by a 2-stage 8-way switch. The system can be expanded up to 512 processors. The Uniform programming environment (under nX 2.0.6) provides the program with both local and explicitly allocated shared memory. The shared memory may be allocated in another processor's memory, and thus a non-uniform memory access (NUMA) model is supported. In the absence of contention, a remote reference typically takes less than two microseconds, and a single channel of the switch has a bandwidth of 40 MBs [19]. The architecture could be used with other memory management policies [13]. Compiles on the BBN were done with -O -lus. LINPACK 100 × 100 double-precision was 1.0 Mflops using -OLM -autoinline. Dhrystone (v1.0) was 19.4 Mips.
IBM RS/6000-530
The IBM RS/6000-530 uses a 25 MHz processor with a 64 KB data cache and a 400 MBs memory bandwidth. The processor has independent integer and floating point units, and the floating point unit has an independent adder and multiplier. The peak performance is thus 50 Mflops. The workstation used in the tests was running AIX 3.1 in 16 MB of memory. Compiles used -O optimizations. LINPACK 100 × 100 double-precision was 11 Mflops [3]. Dhrystone (v1.0) was 23.7 Mips.
Intel iPSC/860 and DELTA
The Intel iPSC/860 hypercube and DELTA mesh distributed-memory parallel processors both use the 40 MHz i860 processor. The i860 has an 8KB data cache and 8 MB of memory (16 MB on the DELTA) with a memory bandwidth of 160 MBs. The processor has independent integer and floating point units, and the floating point unit has an independent pipelined adder and multiplier for a peak rate of 64 Mflops. The iPSC/860 has a maximum configuration of 128 processors. The processors are interconnected with a hypercube network with a latency of about 60 microseconds and a bandwidth of 2.8 MBs per channel [7]. The DELTA is a mesh-connected parallel processor located at Caltech with a maximum configuration of 512 processors. The mesh has a latency of about 50 microseconds and a measured bandwidth of about 17 MBs/channel [6]. The processors run NX 3.3 and compiles were done with -O3 -Knoieee on a separate "host" processor. LINPACK 100 × 100 double-precision was 6.5 Mflops [3]. Dhrystone (v1.0) was 29.4 Mips.
Sequent Symmetry
The 26-processor Sequent Symmetry located at ANL is based on 80386/387 processors (16 MHz) with a Weitek 3167 floating point co-processor. Each processor has a 64KB cache, and 32 MB of memory is shared by all processors on a 54 MBs bus. The maximum configuration is 30 processors. The processors run Dynix 3.1.2, and compiles were done using -O. LINPACK 100 × 100 double-precision was 0.37 Mflops [3]. Dhrystone (v1.0) was 3. 
