Moving paragon™ operating system to a new hardware platform  by Pfiffer, A.K.
Pergamon 
Computers Math. Applic. Vol. 35, No. 7, pp. 33-43, 1998 
Copyright © 1998 Elsevier Science Ltd
Printed in Great Britain. All rights reserved 
0898-1221/98 $19.00 + 0.00 
PII: S0898-1221(98)00030-3
Moving Paragon TM Operating System 
to a New Hardware Platform 
A. K. PFIFFER 
Intel Corporation, 15201 NW Greenbrier Parkway CO1-01 
Beaverton, OR, 97006, U.S.A. 
andyp@ssd.intel.com
Abstract--This paper discusses some of the problems, solutions, and strategies discovered while
moving features of Paragon 1 Operating System (OS) to a new hardware platform. The new hardware
platform differs from the Paragon System in several areas, including processor architecture,
multiprocessor capabilities, I/O support, and interconnect technology. Performance metrics for
critical-path message-passing software and other OS microfunctions are compared to the existing
Paragon platform.
Keywords--Supercomputers, Operating systems, Paragon, Single system image.
1. INTRODUCTION 
In September of 1995, the United States Department of Energy (DOE) selected Intel Corporation
to deliver to Sandia National Labs the first computer system 2 capable of sustained performance
of 1 TFLOPS on applications critical to the stewardship of the U.S. nuclear stockpile. The scale
of the system is immense; it will be the largest system yet produced at Intel and is among
the world's largest multicomputers in terms of processor count. The system will contain over
4,144 nodes, each node containing two Intel Pentium® Pro Processors. More than 540 gigabytes
(540 GB) of total memory will be present in the system. Two sets of online disk storage with
roughly 1,000 gigabytes (1 TB) of capacity each will be provided with the system. Sustained data
transfer rates for the disk systems are nominally rated at 1 GB/second each. Peak computational
performance of the system is expected to be on the order of 1.8 TFLOPS.
This paper takes a brief look at the challenges involved in moving Paragon OS in use today on
Paragon systems into the service and I/O partitions of the 1+ TFLOPS system.
2. A NEW SYSTEM ARCHITECTURE 
A portion of the 1+ TFLOPS system will be running Paragon OS, leveraging scalable operating
system technology that has been maturing over the last five years. The remainder of the
production system will be running Cougar, a light-weight environment derived from PUMA (a
Special thanks are warranted for L. Brown and C. Fleckenstein for their contributions to this project. I would also
like to express my gratitude to the many Giants back at the shop for encouraging me to stand on their shoulders.
The author has worked with UNIX TM and a variety of multicomputer operating systems since being humbled by
the FPS T-Series in 1986. The author has spent the last five years working in, on, under, and around Paragon TM.
1 Trademark Information: A number of trademarks and registered trademarks appear within this document.
Intel386 TM, Intel486 TM, i860 TM, Paragon TM, and EtherExpress TM are trademarks of Intel Corporation. Intel®,
i®, iPSC®, and Pentium® are registered trademarks of Intel Corporation. All other products and company
names are trademarks of their respective owners.
2 Additional material regarding the system is online under http://www.ssd.intel.com/
descendant of SUNMOS [1]). During the course of development, we expect to run Paragon OS
on large and small numbers of nodes in development systems to continue refining performance,
stability, and scalability. At a gross level, the new system architecture remains Paragon-like:
a mesh-connected multicomputer. At least one node, perhaps all, 3 run Paragon OS with full
single-system-image and a complete UNIX TM programming environment. Support for high speed
user-to-user NX message passing [2] and large, multinode parallel applications is provided. The
node architecture, the interconnect, the backplanes, and the remaining aspects of the hardware
differ from Paragon.
Paragon OS is composed of several major software components. At the bottom, a Mach3 
kernel [3] provides VM, thread scheduling, IPC, and other basic services. Single system image [4] 
UNIX TM services are implemented by user-mode Mach3 applications [5] with TNC enhancements 
from Locus Computing (e.g., [6]). OS-level communication between nodes is provided by an 
Intel developed variant of Mach network IPC [7]. In addition, Paragon OS contains an Intel 
developed protocol framework internally known as MCMSG designed primarily to support rapid 
deployment of high performance user-to-user message passing [8,9] applications with or without 
using an additional on-board processor for protocol processing. 
2.1. A New Interconnect
As with Paragon, the interconnect of the system remains a mesh. However, peak point-to-point
communications bandwidth has been improved by roughly a factor of two unidirectionally,
resource contention under heavy message load has been reduced, fault detection and isolation
capabilities have been improved, and new router-to-router signaling mechanisms have been
developed that are well beyond the scope of this paper.
Router-to-router flow control, in-order delivery, and worm-hole routing of flits have been
retained in the new interconnect. The high-level functional division used to implement the
interconnect also remains: both the Paragon system and the new system contain components on the
node card that connect to the routing plane (a Network Interface Chip, or NIC), while a different
component is used to implement the routing fabric itself (a Mesh Routing Chip, or MRC).
New fault isolation and detection features not present in the Paragon interconnect have
complicated the once simple definition of "reliable" with respect to the mesh. 4 Enumerating the fault
detection and isolation features is beyond the scope of this paper, but some are covered briefly
in the System Support Services section later.
2.1.1. New routers (MRCs)
The 2D nature of the Paragon mesh has been extended to a 3D mesh on the new system.
The 3D mesh is restricted in one axis to a small number of relative hops. Each new router
has X, Y, and Z ports for the exchange of data. On Paragon, message routes are provided to
the interconnect during low-level message injection by specifying signed relative hop counts
[ΔX ΔY] when beginning a send. On the new system, routes are still specified in relative hop
counts during message injection, but in terms of [ΔZ ΔX ΔY ΔZ].
The maximum number of Z-hops is limited to ±7. The first Z-hop is intended
to route a message from the source node to one of the two planes, with the second Z-hop to
route from the plane to the destination node. The maximum X- and Y-hops are ±255 and ±63,
respectively. When a hop count within the message reaches ±0, routing of the message continues on
the next axis with a nonzero hop count. When all hop counts within the message have reached ±0,
the message will be delivered. There exists an additional feature for concatenating routes within
3 Not in the production system, however.
4 Routers never silently drop messages, although some component failures could deliver a variety of malformed
messages.
5 Separate sign bits are used rather than 2's complement representation, so +0 and -0 are equivalent.
one message (for routing around failed components), but that is also beyond the scope of this 
paper. 
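The hop-count routing rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: a route is modelled as four sign/magnitude pairs in the order [ΔZ ΔX ΔY ΔZ], the limits are the ±7, ±255, and ±63 quoted in the text, and the names `make_route` and `walk` are invented here. The real header layout and the route-concatenation feature are not modelled.

```python
from typing import List, Tuple

# Axis order and magnitude limits follow the text: [dZ dX dY dZ],
# |Z| <= 7, |X| <= 255, |Y| <= 63. Everything else is illustrative.
AXES = ("Z", "X", "Y", "Z")
LIMITS = {"Z": 7, "X": 255, "Y": 63}

Hop = Tuple[int, int]  # (sign, magnitude); a separate sign means +0 == -0


def make_route(dz1: int, dx: int, dy: int, dz2: int) -> List[Hop]:
    """Encode signed hop counts as sign/magnitude pairs, checking limits."""
    route = []
    for axis, value in zip(AXES, (dz1, dx, dy, dz2)):
        if abs(value) > LIMITS[axis]:
            raise ValueError(f"{axis} hop count {value} exceeds +/-{LIMITS[axis]}")
        route.append((-1 if value < 0 else 1, abs(value)))
    return route


def walk(route: List[Hop]) -> List[str]:
    """Return the single-hop moves a message makes, axis by axis."""
    moves = []
    for axis, (sign, magnitude) in zip(AXES, route):
        # An axis whose count is already zero is skipped entirely.
        for _ in range(magnitude):
            moves.append(("+" if sign > 0 else "-") + axis)
    return moves  # an empty list means the message is delivered at once
```

Under this model, a route of all zeros produces no moves at all, which corresponds to immediate delivery at the injecting node.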
Peak communications bandwidth of the new routing devices has been nominally rated at 
roughly 400 MB/s one-way, 800 MB/s bidirectional per port. Aggregate throughput of the new 
MRC component is nominally rated at 6 x 400 MB/s = 2.4 GB/s. 
Contention has been reduced by exposing to the low-level programmer four virtual lanes. Each 
lane specifies a logically disjoint network that is multiplexed over one physical interconnect. When 
independent messages are sent on different virtual lanes over coincident routing paths, flits of the 
messages are time-multiplexed by the routers over the path. 
If, for example, four messages each on a unique lane need to traverse the same path, each 
message will realize 25% of the peak unidirectional bandwidth. Similarly, two messages will each 
realize 50% of the peak unidirectional bandwidth. In the Paragon routers, a message stall due 
to contention for coincident paths or sluggish receivers will block access to the coincident path. 
The new routers stall messages granular to the lane, allowing other messages on different lanes 
to make progress. Messages sent on the same virtual lane over coincident paths stall in a similar
fashion to the Paragon router.
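The lane behaviour described above can be pictured with a toy model. It assumes a simple round-robin arbiter, which the text does not specify, and the function name `multiplex` is invented for illustration. Each lane holds a number of flits waiting for the same path; a stalled lane is skipped while the others keep moving.

```python
from typing import Dict, FrozenSet, List


def multiplex(flits: Dict[int, int],
              stalled: FrozenSet[int] = frozenset()) -> List[int]:
    """Return the lane id that owns each successive flit slot on the path.

    `flits` maps lane id -> flits waiting; lanes in `stalled` cannot send.
    Round-robin arbitration is an assumption made for this sketch.
    """
    remaining = dict(flits)
    order: List[int] = []
    while True:
        ready = [lane for lane in sorted(remaining)
                 if remaining[lane] > 0 and lane not in stalled]
        if not ready:
            return order
        for lane in ready:          # one flit per ready lane per pass
            order.append(lane)
            remaining[lane] -= 1
```

With four lanes active and equal message lengths, each lane owns one slot in four, matching the 25% figure above; with two lanes, one slot in two. Stalling one lane leaves every slot to the others.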
Routing planes (aka "backplanes") are constructed in both the Paragon system and in the 
new system by interconnecting routing components, with power and mesh connections supplied
to nodes through connectors. Paragon backplanes utilize 16 routing components arranged in a 
4 x 4 mesh. Four sets of cables are used to interconnect backplanes to form larger meshes. 
Backplanes in the new system utilize two groups of four routing components, with each group 
of four interconnected in a 2 x 2 mesh forming an XY routing plane. The two 2 x 2 planes are 
interconnected together on the Z axis at the vertices of the 2 x 2 meshes to form a binary 3-cube. 
The MRC design, however, does not require this arrangement (i.e., multiplanes, stars, tori, pipes, 
etc., are also possible). 
The production 1+ TFLOPS system will contain in excess of 2,688 MRCs. 
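The backplane wiring of the new system can be checked with a short sketch. It assumes each router is labelled by an (x, y, z) coordinate with values 0 or 1, so that the two 2 x 2 planes are z = 0 and z = 1; the labelling and the function name `backplane_links` are illustrative only.

```python
from itertools import product


def backplane_links():
    """Routers and links of one backplane: two 2 x 2 XY planes (z = 0, 1)
    joined router-to-router along Z, which is a binary 3-cube."""
    routers = list(product((0, 1), repeat=3))   # (x, y, z)
    links = set()
    for a in routers:
        for axis in range(3):
            b = list(a)
            b[axis] ^= 1        # a neighbour differs in exactly one coordinate
            links.add(frozenset((a, tuple(b))))
    return routers, links


routers, links = backplane_links()
backplanes = 2688 // len(routers)   # MRC count quoted in the text
```

The model gives 8 routers and 12 links per backplane (4 in each plane plus 4 along Z). If every MRC sits on a backplane of this kind, 2,688 MRCs correspond to 336 backplanes.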
2.1.2. New NICs
The new NIC is unlike the Paragon NIC in nearly all aspects: bus interfaces, programming
model, and support services. Whereas the Paragon NIC is purely passive with respect to routing,
the new NIC contains a subset of the routing resources identical to those within the new MRC.
These resources include ±Z ports as well as an internal port connecting the communications
agent within the NIC to the bus interface agent. In principle, it is possible to construct a
small Pentium® Pro Processor multicomputer (a "line" as opposed to a "mesh") using only
NIC components chained together through the ±Z ports. Injecting a message through the NIC
with hop counts of [±0 ±0 ±0 ±0] in effect causes a loop-back of the message.
The new NIC is a Pentium ® Pro Processor bus master. The NIC has been nominally rated 
to support transfer of in- or out-bound data to and from the processor bus at a maximum 
of about 400MB/s. The Pentium ® Pro Processor bus is nominally rated to support slightly 
over 500 MB/s of transfer bandwidth. 
The Paragon NIC contains separate transmit and receive 2 KB FIFOs and logic to support
cache-line sized transfers of data, as well as support for programmed I/O (the CPU may fill or
drain the FIFO "by hand" with direct loads and stores). The new NIC does not contain generally
accessible FIFOs, nor does it support programmed I/O by the CPU. The new NIC contains logic
to support byte-wide DMA transfers, with optimizations for cache-line sized operations.
The new NIC exposes to the low-level programmer command queues to initiate message trans- 
fers. Each command queue is eight elements deep, with each element containing a physical 
memory address, a byte count, and control bits to specify end-of-message, error checking, and 
the disposition policy to be used when the command completes. Separate command queues for 
transmit and receive are defined, with a pair of queues for each of the four virtual lanes. 
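The command-queue arrangement can be modelled as a small data structure. This is a sketch, not the NIC's register layout: the field names, the boolean encoding of the control bits, and the disposition value are assumptions made for illustration. Only the queue depth of eight, the three kinds of element content, and the transmit/receive pair per lane come from the text.

```python
from collections import deque
from dataclasses import dataclass

QUEUE_DEPTH = 8   # elements per command queue (from the text)
NUM_LANES = 4     # one transmit/receive queue pair per virtual lane


@dataclass
class Command:
    phys_addr: int                  # physical memory address of the buffer
    byte_count: int                 # number of bytes to transfer
    end_of_message: bool            # control bit: last command of a message
    check_errors: bool = True       # control bit: error checking (assumed boolean)
    disposition: str = "interrupt"  # completion policy; value name is invented


class CommandQueue:
    def __init__(self) -> None:
        self._slots = deque()

    def has_room(self) -> bool:
        return len(self._slots) < QUEUE_DEPTH

    def post(self, cmd: Command) -> bool:
        """Queue a command; return False when all eight slots are in use."""
        if not self.has_room():
            return False
        self._slots.append(cmd)
        return True

    def complete_head(self) -> Command:
        """Model the NIC finishing the oldest command, freeing a slot."""
        return self._slots.popleft()


# Separate transmit and receive queues for each of the four lanes.
queues = {(direction, lane): CommandQueue()
          for direction in ("tx", "rx") for lane in range(NUM_LANES)}
```

A message longer than one buffer would be posted as several commands on the same lane, with `end_of_message` set only on the last one.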
The Paragon NIC can be programmed to generate interrupts on several different conditions,
supplying status important for managing the on-chip FIFOs. The new NIC makes visible to the
low-level programmer a more general mechanism termed events. Events are collectively a set of
state changes that can occur within the NIC. Most of the state changes reported as events supply
status important for managing the state of the on-chip command queues. For example, the
low-level programmer can request notification via an interrupt when the byte count of a command in
the queue reaches zero, or when room for another command in the queue has become available.
Event status bits can be read from a memory-mapped register inside the NIC. The new NIC can
also be configured to update an in-memory copy of the status register directly, allowing software
to spin on a cached copy without generating bus traffic.
There is also an experimental feature present within the new NIC for "remote bus operations", 
allowing cache-incoherent shared memory operations. At present, we do not expect Paragon OS 
to exploit this feature. 
The production 1+ TFLOPS system will contain in excess of 4,144 NICs. 
2.2. New Nodes
As with the Paragon node (first 2x and later 3x i860 TM XP-50 processors), each node of the new
system is a shared memory multiprocessor. All nodes of the new system contain two Pentium®
Pro Processors sharing one memory. All nodes, by adding sufficient industry standard add-on
devices (disks, video cards, etc.), are PC compatible and can run shrink-wrapped operating
systems. The new NIC is directly attached to the Pentium® Pro Processor bus. Two types of
nodes, compute and I/O, will be present in the new system.
I/O nodes are designed for increased I/O and memory capacity. Two on-board PCI bus
interfaces allow industry standard PCI cards to be installed for access to large capacity disk RAIDs
and other external devices (e.g., ATM). In addition to the PCI busses, each I/O node contains an
on-board EtherExpress TM PRO-10 compatible Ethernet interface, as well as an NCR SCSI disk
interface. Sufficient SIMM slots are present to allow up to 1 GB of memory to be installed across
two memory controllers. Each I/O node has a 9U physical form factor.
Compute nodes are designed to improve the MFLOPS/in³ ratio of the system, utilizing a
feature of the new NIC. Two compute nodes are present on each 9U sized card. The inner and 
outer halves of the card are identical, with the inner node connecting directly to a routing plane, 
and the outer node indirectly connected to the routing plane via a chained connection in the 
Z-axis to the inner node's NIC. A maximum of eight SIMM slots for memory are present on each 
node. A single memory controller and a single PCI bus interface (to support the manufacture 
and testing of the card) are present on each node. 
2.3. System Support Services
Unlike the single diagnostic station and single JTAG scan string woven throughout the Paragon 
system, the new system employs a more scalable solution. A Patch Support Board (PSB) per 
backplane is utilized. The PSB is built around an embedded Intel386 TM running a real-time 
executive and numerous external interfaces. 
Each PSB has control and status interfaces to the local backplane resources (routers, fans,
power supplies, nodes, etc.). In addition to JTAG scan interfaces, 8 UART interfaces are
connected to one of the RS-232 ports on each node. Software within the PSB, combined with control
signals present on the backplane, is used to switch RS-232 connections between the inner and
outer halves of compute nodes. An Ethernet interface is also present on each PSB.
Some of the multiple functions the PSB performs are configuration of the routers, reset of 
nodes, and monitoring backplane-local resources for component failures. Monitored components 
range from power supplies and fans to faulting nodes. 
Through Ethernet, all PSBs are connected to a Scalable Platform Services (SPS) station.
Software on the SPS station communicates with the many PSBs within the system to monitor
for faults, bring groups of components online and offline, log errors, track serial numbers, and
identify faulted components. The SPS station can be instructed, for example, to inform groups of
PSBs to reset nodes, or to select out pieces of the system for subsequent routing around failures.
The production 1+ TFLOPS system will contain over 330 PSBs. 
3. DUSTING OFF PARAGON OS
FOR THE INTEL ARCHITECTURE
We began this project by dusting off the Intel Architecture support for PCs already present in 
our source tree. Well, at least some of what we needed was in the source tree. 
3.1. Bootstrapping 
In Paragon prehistory, our organization had been running what eventually grew into Paragon
OS on the i386 TM-based iPSC® hypercube. Our collaborators also ran similar software on
desktop PCs. Those efforts were largely focused on small clusters [4] of nodes, each complete with
local I/O devices and low-speed, unreliable networks [7].
Over time, our source base for the Paragon became heavily biased towards the i860. Source 
that was present in the code base for the Intel Architecture family was in some cases years old 
(predating i486 TM support for proper copy-on-write semantics). In one case, source code needed 
for this project had never been installed into our master source tree. It had to be retrieved 
from a "storage receptacle" in a forgotten corner of the building. Internal staffing constraints 
over the years had forced us to be less than diligent with respect to non-i860 processors and 
machine-independent code. 
To begin moving Paragon OS to the new system, we pulled together discarded pieces of i386 TM 
and i486 TM systems, building our own systems until procuring more modern equipment. 6 We 
discovered, like many others that have worked on operating systems for PC compatibles, that 
some PCs are simply more compatible than others. We eventually settled on a few common
devices that we would actively support, and chose to ignore the vast array of other available cards.
As nodes for the new system became available (I/O nodes were the first off the line), several
were pressed into duty as stand-alone development systems. These systems are internally known 
as "dots" (a single-node mesh). 
As of this writing, several meshes are now heavily used, containing a mix of I/O and compute
nodes. Our internal evaluation staff regularly runs many of the Paragon evaluation test
suites developed over the last five years, both on desktop PCs running Paragon OS and on
mesh-based systems.
3.2. Porting MCMSG, NIC, and Libnx Code
In order to pass messages and support Paragon OS on multiple nodes, the existing Paragon- 
specific message passing software had to be ported or reimplemented. There was a strong desire 
to minimize perturbations within the code base and to support both platforms from a common 
source tree. We identified three layers of code with impure boundaries: the MCMSG protocol 
framework, low-level Paragon NIC code, and user-mode access to lower levels, largely through 
Libnx. 
A large fraction of the Paragon MCMSG layer, protocol modules, and Libnx were rearranged 
into generic code to be shared between the two platforms. In most cases, the task was largely 
"build" mechanics: moving files around in the source tree. In other cases, poorly placed, missing, 
or otherwise bad #ifdefs needed attention.
6 133 MHz Pentium® systems are now common among developers.
Very low-level Paragon-specific code generally had to be reimplemented. Either the code in 
question had become the victim of ruthless performance tuning, or it had become intertwined 
over time with the unique aspects of the Paragon interconnect. The general approach to solving
these problems was to identify a sufficiently high-level cut point and recode from there down.
The programming interface to the new NIC made many low-level routines straightforward to 
implement. Others were more difficult. 
3.3. ASMP and APIC
Paragon OS for the Intel Architecture now supports multiprocessor systems that are compliant 
with the Intel Multiprocessor Spec. As on Paragon, we have implemented "mostly symmetric" 
support: any thread can get into the kernel on any CPU, but only one CPU is allowed in the 
kernel at a time. In practice, this works well for Paragon OS as the bulk of UNIX support is 
multithreaded and running in user-mode. Fine grained locking for only two CPUs may often 
incur unforeseen overheads that offset any advantages gained by multithreading. 
Some support has also been added for using "partial" APIC mode for cross-processor interrupts. 
Full APIC support in Paragon OS was deemed to have a high implementation cost with few 
performance benefits for the expected usage model of the system. It is expected that in production 
mode, and when using a CPU for a message coprocessor, interrupt overheads will be avoided 
simply by not taking interrupts, rather than making some interrupt service incrementally faster. 
3.4. Drivers 
New Mach3 drivers were also developed. Among them are an EtherExpress TM PRO-10 driver and
new NCR SCSI2 support for fast/wide; an FDDI driver is nearly ready.
3.5. Installation Issues
Early on in the project, it became apparent that the combined sizes of the Mach3 kernel,
the OSF/1AD server, and other tools needed for a cold installation would simply not fit on a 
single 1.4 MB PC bootable floppy disk. The PC support we had been using supported a "two 
floppy" installation: one disk for the operating system, and one disk for a root file system. 
This was sufficient for bootstrapping among developers, but cumbersome for others within our 
organization. 
We resolved the issue by developing a "single floppy" installation. The bootable floppy disk
contains a Mach3 kernel and a stand-alone Mach3 application capable of interacting with the user
via menus to download additional software components over Ethernet via the TFTP protocol.
One of the components downloaded is a multimegabyte disk image containing conventional
UNIX applications (as on the Paragon). After the disk image is downloaded, it is written into a
pseudo-device driver resident in the Mach3 kernel for later use as a root file system. The second
component downloaded is an OSF/1AD server. Control is passed from the TFTP loader to the
OSF/1AD server after downloading, as if the Mach3 kernel had originally loaded the server from
disk.
3.6. Headless Kernel Debugging
On nodes within systems (as opposed to desktop Intel Architecture systems used for portions 
of the development), no PC compatible video cards or keyboards are present. We resolved the 
issue of using the kernel debugger and Paragon OS's notion of /dev/console by allowing selection
of the console device at boot time to be either of the PC compatible RS-232 ports, or a video 
adapter and keyboard if present. The second RS-232 port on nodes within systems may be used 
remotely over Ethernet by interacting with the PSB. 
4. SOME EARLY MICROBENCHMARKS
The project is far enough along to make some initial comparisons of specific functions between 
the two platforms. Very little tuning has been done, and several performance factors (clock rates, 
bus speeds, L2 cache, etc.) have not yet been fully characterized. 
Differing page sizes are also an important factor. Paragon OS on the i860 uses two contiguous 
4 KB physical pages for a single logical page size of 8 KB. For the Intel Architecture, we are still 
running with a 4 KB page size. We expect to move to 8 KB logical pages soon. 
The Pentium ® Pro Processor systems in house have differing clock rates, differing L2 cache 
sizes, and a large variety of options for the memory controller (in-order memory queue depth 
or QD, interleave, page mode policy), and are invariably set inappropriately when needed for 
benchmarking. 
It should be noted that the MCMSG layer used on the new platform does not have the 
Paragon-style message-coprocessor software enabled (or implemented). All Paragon data were
gathered with nondebug Paragon OS R1.4 kernels and with the message-coprocessor enabled with 
BOOT_CPU_MODE=ama. 
4.1. NX Message Passing 
NX programs for Paragon OS on the new platform are compiled in a similar fashion [10] to 
Paragon, although the current compiler drivers are not "-nx" aware and additional object modules 
must be specified during linking. 
Table 1 details the results reported from the standard NX speed-of-light program lat.c.
Bandwidth is reported in millions of bytes/second.
Table 1. NX speed-of-light.

Nodes                            0-byte Latency (us)    Max Bandwidth
i860XP-50 MP3, NICB                     28.4                155.9
Pentium® Pro 200 MHz, QD = 1            16.1                132.0
Pentium® Pro 200 MHz, QD = 8            15.7                153.0
Latency is nearly half that of the Paragon, and NX bandwidth is nearly equal. Bandwidth is 
expected to improve after switching to 8 KB logical page size. This level of performance without 
using MCP mode on the Pentium ® Pro nodes is encouraging. 
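The "nearly half" and "nearly equal" claims can be checked directly against the numbers in Table 1. The snippet below is plain arithmetic on the published values; the dictionary layout and the helper name `relative_to_paragon` are illustrative.

```python
# Values copied from Table 1 (latency in microseconds, bandwidth in
# millions of bytes/second).
PARAGON = "i860XP-50 MP3, NICB"
table1 = {
    PARAGON:                    {"latency_us": 28.4, "bandwidth": 155.9},
    "Pentium Pro 200 MHz QD=1": {"latency_us": 16.1, "bandwidth": 132.0},
    "Pentium Pro 200 MHz QD=8": {"latency_us": 15.7, "bandwidth": 153.0},
}


def relative_to_paragon(row: str, field: str) -> float:
    """Ratio of a row's value to the Paragon value for the same field."""
    return table1[row][field] / table1[PARAGON][field]


latency_ratio = relative_to_paragon("Pentium Pro 200 MHz QD=8", "latency_us")
bandwidth_ratio = relative_to_paragon("Pentium Pro 200 MHz QD=8", "bandwidth")
```

With QD = 8, the Pentium Pro latency is about 55% of the Paragon figure and its bandwidth about 98%. With QD = 1, bandwidth falls to about 85% of the Paragon figure, which suggests that the memory queue depth setting matters for this benchmark.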
4.2. RDMA Protocol
The RDMA protocol is one of two lightweight protocols utilized by the Mach3 kernel to imple- 
ment the Intel developed fast remote Mach3 IPC. It is a useful indicator of peak message passing 
performance for OS-to-OS communications spanning node boundaries. 
Table 2 details the RDMA bandwidth measurements (user-to-user application). Bandwidth is 
reported in millions of bytes/second. 
At 32 K message lengths, the Pentium® Pro node attains 138.923 MB/s of bandwidth, a
promising result considering that the maximum packet size in use is 4 K and that the message
coprocessor feature is not yet enabled. This benchmark has historically tracked the NX
speed-of-light bandwidth.
Table 2. RDMA speed-of-light.

Nodes                            0-byte Latency (us)    4 KB Len (bw)    8 KB Len (bw)
i860XP-50 MP3, NICB                    47.21                63.79            91.56
Pentium® Pro 200 MHz, QD = 8           23.65               104.81           122.01
4.3. Mach3 
There are several direct and indirect measures of Mach3 that are somewhat useful indicators
for system performance. Data gathered in this section includes results from some Pentium®
desktop systems.
4.3.1. Interrupt service 
With some I/O devices in the production system, it will be important to minimize interrupt
service times. A rough approximation of pure overhead for interrupt service can be computed
by a user-mode program that can rapidly sample a high resolution clock (e.g., as returned by
dclock()). The ΔT between adjacent time samples can be examined for periodic disturbances
in the data coinciding with clock interrupts.
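The sampling technique can be sketched as follows. Python's `time.perf_counter_ns()` stands in for `dclock()`, and the threshold `factor` is an arbitrary choice for illustration. The sketch only isolates gaps that stand out from the typical gap. Deciding whether they recur at the clock-tick period is left to the reader, and interpreter overhead makes the absolute numbers far coarser than a C program would give.

```python
import statistics
import time
from typing import List


def sample_deltas(n: int) -> List[int]:
    """Sample a high-resolution clock back to back; return the gaps in ns."""
    stamps = [time.perf_counter_ns() for _ in range(n)]
    return [b - a for a, b in zip(stamps, stamps[1:])]


def disturbance_overhead(deltas: List[int], factor: float = 5.0) -> List[float]:
    """Excess time of each gap that stands well above the typical gap.

    The median gap is taken as the undisturbed sampling cost; `factor`
    is an illustrative threshold, not a calibrated value.
    """
    baseline = statistics.median(deltas)
    return [d - baseline for d in deltas if d > factor * baseline]
```

For example, a run of gaps near 100 ns with one gap of 20,400 ns would report a single disturbance of about 20,300 ns, attributable to whatever interrupted the sampling loop.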
Table 3 reports approximate clock interrupt service times for three platforms.
Table 3. Clock interrupt service time.

System                                       Time (us)
i860XP-50 MP3                                  64.8
Pentium® Pro 133 MHz, L2 = 256K, QD = 8        50.7
Pentium® 133 MHz, L2 = s256K, EDO              20.3
4.3.2. Null Mach3 trap time
The null Mach3 trap traverses the path from user-mode to the bottom of a Mach3 kernel
routine and back to user-mode. The entire path is timed and averaged over tens of thousands of calls.
Table 4 reports the average times in microseconds for the null Mach3 trap on three platforms.
Table 4. Null Mach3 trap time.

System                                       Time (us)
i860XP-50 MP3                                  5.535
Pentium® Pro 133 MHz, L2 = 256K, QD = 8        3.858
Pentium® Pro 200 MHz, L2 = 256K, QD = 8        2.622
Pentium® 133 MHz, L2 = s256K, EDO              2.143
4.3.3. Zero-fill VM fault time
The time required to service a demand-zero-fill fault can be derived by allocating
demand-zero-fill address space, sampling a high-resolution clock, reading from the new address space to
trigger a zero-fill fault, then sampling the clock again when the fault is satisfied. Accumulating
and averaging the results over thousands of iterations can yield a reasonably good measure.
It should be noted that the Pentium® and Pentium® Pro systems are operating with a 4 K page
size, and the i860 is using a logical 8 K page size.
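The measurement loop can be imitated on a modern system with an anonymous memory mapping. This is a sketch under assumptions: Python's `mmap.mmap(-1, ...)` stands in for the Mach3 allocation call, which the text does not name, and the function name is invented. Interpreter overhead dominates the result, and how the host kernel satisfies a read of an untouched page varies by platform, so the figure is only indicative.

```python
import mmap
import time


def zero_fill_fault_ns(pages: int = 1024) -> float:
    """Average time in ns to first-touch each page of a fresh mapping."""
    page = mmap.PAGESIZE
    region = mmap.mmap(-1, pages * page)    # anonymous, demand-zero memory
    try:
        total = 0
        for i in range(pages):
            start = time.perf_counter_ns()
            value = region[i * page]        # first read of this page
            total += time.perf_counter_ns() - start
            assert value == 0               # fresh pages read back as zero
        return total / pages
    finally:
        region.close()
```

As in the text, the page size in use affects the result, so figures from systems with different page sizes are not directly comparable.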
Table 5 reports the average times in microseconds to service zero-fill faults on three different 
platforms. 
Table 5. Demand zero-fill time.

System                                       Time (us)
i860XP-50 MP3 (8 K)                           266.86
Pentium® 133 MHz, L2 = s256K, EDO              73.79
Pentium® Pro 133 MHz, L2 = 256K, QD = 8        39.80
4.3.4. Local Mach3 IPC
Local Mach3 IPC is generally less interesting for the Paragon OS portions of the production 
system as most I/O and process control operations will span node boundaries. For desktop
systems, the round-trip time for RPCs is important. 
Table 6 reports the average times in microseconds for local RPCs: one with no data, and one 
with a single page of data being remapped. 
Table 6. Local Mach RPC.

System                                       0-Length RPC (us)    1 Page RPC (us)
i860XP-50 (8 K)                                  263.016              614.065
Pentium® 133 MHz, L2 = s256K, EDO                 41.360              111.860
Pentium® Pro 133 MHz, L2 = 256K, QD = 8           37.055               80.670
4.3.5. Remote Mach3 IPC
Mach3 IPC can exchange message data in either of two ways: by value ("inline") or by
reference ("out-of-line" or "ool"). Small messages are usually sent inline; large messages are
generally sent out-of-line. Large messages are used for paging, mapped file I/O, and device I/O.
Large messages contribute heavily to overall system bandwidth, while small inline messages
contribute heavily to the system's responsiveness (does it "feel" quick?).
Message data passed by reference utilizes the virtual memory system of Mach3 to make a
"logical copy" without physically copying the data. 7 Message data that is passed by value
requires the Mach3 kernel to make a copy of the data when the sender enters the kernel, and
the data is also copied when the receiver of the message exits from the kernel. Neither method
applies directly to our implementation of remote Mach3 IPC as NO Remote Memory Access (aka
"NORMA") is possible.
For remote Mach3 IPC, our implementation behaves as if the interconnect were a valid receiver 
of the message. A logical copy is made for out-of-line data using the VM system, and the logical 
copy is streamed across the interconnect when a receiving thread begins receiving the message. 
7 Copy-on-write semantics supply a physical copy when and if the data is modified.
For inline message data, a physical copy is made when the sender enters the kernel, and the data 
is transmitted directly from that location.
On the receiving side of our NORMA implementation, out-of-line message data is delivered 
directly into the user's address space with exactly 0 copies. Small inline messages use a different
optimization, often carrying the data of the message in a single low-level control message, but
Mach3 semantics still require 1 copy into the receiver's buffer.
Good performance in both the inline and out-of-line cases requires fast address-space switching 
(which is poor on the Paragon i860 due to the lack of an L2 cache), fast interrupt handling, and 
high-performance message passing. Tables 7 and 8 compare the rates for inline and out-of-line 
remote Mach3 message data, user-to-user, with full Mach3 IPC semantics across node boundaries. 
Latency is reported in units of microseconds, bandwidth in units of millions of bytes/second. 
One "K" is 1024 bytes. 
Table 7. Remote Mach3 IPC inline latency (us).

Message Length    Pentium® Pro        i860XP-50
(bytes)           200 MHz QD = 8      R1.4 MP3 NICB
      0              110.09              604.71
      4              110.35              603.50
     64              111.65              622.72
    128              190.54              767.80
    256              192.29              784.42
   1024              205.73              833.75
Table 8. Remote Mach3 IPC ool bandwidth (MB/s).

Message Length    Pentium® Pro        i860XP-50
(bytes)           200 MHz QD = 8      R1.4 MP3 NICB
     4 K              15.72                3.58
     8 K              28.10                7.02
    16 K              45.99               13.10
    64 K              89.27               38.92
   128 K             104.79               57.15
   256 K             114.67               66.72
   512 K             118.78               77.03
  1024 K             121.39               83.41
Performance levels for NORMA on the new platform are encouraging, considering the dif- 
ferences in logical page size, and that the message-coprocessor mode is not yet enabled on the 
Pentium ® Pro nodes. While the i860XP-50 is still climbing the curve to reach the RDMA speed- 
of-light bandwidth from Table 2, the Pentium ® Pro node has started to be throttled back by 
protocol overheads in RDMA. 
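The comparison drawn from Tables 7 and 8 can be made concrete with a little arithmetic. The sketch below copies the published values, assuming that the first column of figures in each table belongs to the Pentium Pro and the second to the i860XP-50. It uses the 8 KB RDMA bandwidths from Table 2 as the ceiling each platform approaches. The variable names are illustrative.

```python
# (Pentium Pro 200 MHz QD=8, i860XP-50) pairs copied from Tables 7 and 8.
inline_latency_us = {      # message length in bytes -> latency pair
    0: (110.09, 604.71), 4: (110.35, 603.50), 64: (111.65, 622.72),
    128: (190.54, 767.80), 256: (192.29, 784.42), 1024: (205.73, 833.75),
}
ool_bandwidth = {          # message length in K bytes -> bandwidth pair
    4: (15.72, 3.58), 8: (28.10, 7.02), 16: (45.99, 13.10),
    64: (89.27, 38.92), 128: (104.79, 57.15), 256: (114.67, 66.72),
    512: (118.78, 77.03), 1024: (121.39, 83.41),
}
RDMA_8K = {"Pentium Pro": 122.01, "i860": 91.56}   # Table 2, 8 KB column

# How many times faster the Pentium Pro node is at each message length.
latency_speedup = {n: i860 / ppro
                   for n, (ppro, i860) in inline_latency_us.items()}
bandwidth_ratio = {k: ppro / i860
                   for k, (ppro, i860) in ool_bandwidth.items()}

# Fraction of the RDMA speed-of-light reached at the largest message size.
fraction_of_rdma = {
    "Pentium Pro": ool_bandwidth[1024][0] / RDMA_8K["Pentium Pro"],
    "i860": ool_bandwidth[1024][1] / RDMA_8K["i860"],
}
```

On these numbers the Pentium Pro node is roughly 4 to 5.5 times faster for inline messages. At 1024 K it reaches about 99% of its 8 KB RDMA bandwidth, against about 91% for the i860XP-50, which fits the observation that one platform is near its protocol ceiling while the other is still climbing.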
5. SUMMARY 
The project is on track and progressing well towards delivery. Functionality and stability of
Paragon OS on the new system are sufficient for use within our organization.
The MCMSG layer has proven both malleable and robust in the new environment, with current 
performance adequate for our immediate needs. Supporting message-coprocessor mode on the 
new platform will give a significant boost to performance. 
High performance access to PFS [11] functionality from applications running on Cougar rather 
than Paragon OS will require attention. Work in this area is underway. 
We have benefited from seasoned Mach3 kernel (with post-R1.2 NORMA IPC) and OSF/1AD
server technology from the Paragon environment. We have been fortunate in that we have not
had to spend significant amounts of time debugging immature OS technology. Much of our time
has been spent in the proper areas: uncovering poor #ifdef practices in the code base, and
focusing in on the differences between the systems rather than on their similarities.
We have also benefited from previous Intel Architecture work in both the Mach3 kernel and the
OSF/1AD server. Much of the CPU-specific code simply works, although pulling five-year-old
code out of mothballs was somewhat of a task.
Significant amounts of work remain for Paragon OS to exploit the fault detection and isolation
capabilities of the new platform. Given the sheer size of the 1+ TFLOPS system, unexpected
hardware and software failures will need attention.
We are still looking to find someone that has the time and energy to get an X11R6 server
compiled for our system...
Several kinks and quirks are still present due in part to differences in the compiler chain as well
as in processor and node environments. Continuing the discovery of Paragon-specific code that
is not currently generic to Intel's multicomputers still requires some attention. However, those
differences do not appear to be seriously hampering members of the evaluation staff from throwing
five years of accumulated Paragon regression and feature-coverage tests at the system...
REFERENCES 
1. A.B. Maccabe, K.S. McCurley, R. Riesen and S.R. Wheat, SUNMOS for the Intel Paragon: A brief user's
guide, In Intel Supercomputer Users Group '94 Proceedings, June, 1994, pp. 245-251.
2. P. Pierce, The NX message passing interface, Parallel Computing 20 (4) (April, 1994).
3. K. Loepere, Editor, Mach3 Kernel Interfaces, Open Software Foundation and Carnegie Mellon University,
(1992).
4. G.F. Pfister, In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Prentice Hall PTR,
(1995).
5. R. Zajcew, P. Roy and D. Black et al., An OSF/1 UNIX for massively parallel multicomputers, In 1993
Winter USENIX Proceedings, January, 1993, pp. 449-468.
6. C. Peak, The San Diego Zoo: A multicomputer stress test suite, In 1993 Winter USENIX Proceedings,
January, 1993, pp. 119-130.
7. J.S. Barrera III, A fast Mach network IPC implementation, In Proceedings of the USENIX Mach Symposium,
November, 1991, pp. 1-11.
8. P. Pierce and G. Regnier, The Paragon Implementation of the NX Message Passing Interface, SHPCC94,
(1994).
9. G. Regnier, NX message buffering on the Paragon, In Intel Supercomputer Users Group '94 Proceedings,
June, 1994, pp. 309-313.
10. Intel Corporation, Paragon TM User's Guide, Document 312489, (1993).
11. M. Arunachalam, A. Choudhary and B. Rullman, Implementation and evaluation of prefetching in the Intel
Paragon parallel file system, In Proceedings of the International Parallel Processing Symposium, April, 1996.
