The Hyperswitch Communication Network (HCN) is a large-scale parallel computer prototype being developed at the Jet Propulsion Laboratory (JPL) in collaboration with several large computer companies. These companies plan to build commercial versions of the HCN computer. The HCN computer being designed is a message-passing multiple instruction multiple data (MIMD) computer and offers significant advantages in price-performance ratio, reliability/availability, and manufacturing over traditional uniprocessors and bus-based multiprocessors. The HCN operating system is designed as a uniquely flexible environment that combines both parallel processing and distributed processing. This programming paradigm can achieve a balance among the following competing factors: performance in processing and communications, user-friendliness, and fault tolerance. This communication architecture extends the application of parallel systems to supercomputer problems that place heavy demand on the system for high bandwidth, low latency, and non-local communication. Hence, the HCN concept is classified as distributed supercomputing. This paper describes the HCN system and reviews the performance/cost analysis and other competing factors within the system design.
1. Applications
The need to solve more complex problems is outpacing the ability of the world's fastest computers to solve them within acceptable time periods. At the same time, even with the continuing advances in microelectronics technology, it is becoming increasingly difficult to design and build more powerful computers. Demands for increasing capability, higher performance, and fault tolerance are continually being placed on computers. Several application examples are given below.
1.1 Space Flight Operations
The ground-based command and control operation system has played a crucial role in the success of JPL and NASA. Traditional mainframe computers have been employed for science and engineering applications in the past, providing centralized processing and data management resources for each project. The demand on computation and data-handling capabilities multiplies as the complexity of spacecraft and their operation grows; the computational demands are exacerbated when operations involve a concurrent multiple-mission environment. The addition of more computers appears to be the best solution today, and the present Space Flight Operations Center (SFOC) system comprises about one hundred conventional workstations networked together.
The flight missions Magellan, Galileo, Ulysses, Mars Observer, TOPEX, and CRAF/Cassini are expected to be operating together in the next decade; EOS, Lunar Observer, Mars Rover, and another half dozen missions are proposed to be supported by JPL. The current Flight Operations network will eventually become "unsteerable," even assuming only a fourfold increase in the physical count of workstations. What is needed is a ground data system that will provide for more accurate and timely data processing that is both informative to the user and cost effective to the project.
1.2 Consolidated Command Center
Military applications include information management systems (a consolidated command center) to support needs in terms of planning, decision making, and fault diagnosis. As with the JPL needs, ground data systems are required to help cope with the increasingly complex problems in the logistics of supply and support, as well as with strategic and tactical planning. Although the precise functional and performance requirements of such a consolidated command center are of necessity evolving, certain basic, generic command center requirements include: survivability, which places a geographical distribution requirement on the system; a high level of fault tolerance; multi-level security; flexible and high-bandwidth communications and networking; interfaces with a wide variety of machines; scalability, to match performance requirements; dynamic reconfigurability, to respond to variable workloads; high-performance database management; and supercomputer-class floating-point computation speed. The JPL Flight Operations applications share many of these same requirements, so these generic requirements are the driving force behind the Hyperswitch Communication Network (HCN) system described herein.
2. System Overview
The HCN computer is a loosely coupled MIMD system with distributed local memories attached to multiple processor nodes (see Figure 1). The interconnect topology is a hypercube network used with a hyperswitch [1] [9] message routing element in each node. Message passing is the major communication method [10] among the computer nodes in the HCN computer shown in Figure 1. The HCN message-routing method demonstrates a highly fault-tolerant capability, while providing an adaptive, hardware-based routing algorithm with very low message latency. With very low communication overhead, fine-grained parallelism becomes potentially profitable. This matters because the programmer seeking maximum performance is strongly tempted to partition a problem into the finest possible granularity to create the maximum amount of parallelism. Therefore, in a message-based parallel computer, the performance of fine-granularity computation depends crucially on the rate of message exchange. A fine-granularity system is one that effectively supports processes transmitting short messages between code blocks that are less than several hundred instructions in length.
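As a minimal illustration of the hypercube interconnect, the sketch below assumes the conventional binary node numbering, in which two nodes share a link exactly when their addresses differ in one bit. The paper does not state its addressing scheme, so this numbering and the function names `neighbors` and `hop_distance` are assumptions made for illustration only.

```python
def neighbors(node, dim):
    """Nodes directly linked to `node` in a `dim`-dimensional hypercube.

    Assumes binary addressing: flipping one address bit crosses one link.
    """
    return [node ^ (1 << d) for d in range(dim)]


def hop_distance(src, dst):
    """Minimum number of links between two nodes (their Hamming distance)."""
    return bin(src ^ dst).count("1")


DIM = 5  # 2**5 = 32 nodes, the size of the prototype machine
print(neighbors(0, DIM))        # [1, 2, 4, 8, 16]
print(hop_distance(0, 31))      # 5
```

Under this numbering, each of the 32 prototype nodes would have five links, and no message would need more than five hops.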
2.1 Hardware
Each node comprises one or more state-of-the-art Motorola microprocessors and is expected to provide from 50 to 300 MIPS per node with comparable floating-point performance. The prototype machine as a whole will include a total of 32 nodes, with at least two microprocessors per node.
The system of links, communication, and application processors thus supplies a homogeneous, MIMD supercomputing resource, while allowing for expansibility and attachment of special-purpose processors. The partitioning of the system is reflected in the software: the communication system is completely hidden by the operating system software, which presents a distributed object-oriented view. Applications run on the application processors and utilize the attached special-purpose processors in a transparent, fault-tolerant fashion. The class/object view which hides the communication system also hides the attached devices without limiting their use.
2.2 Software
The class/object paradigm provides for intra-process (intra-program) communications. Global, named communications links, such as might be needed for a distributed DBMS, are easily implemented. Figure 2 compares the distributed operating system with a current workstation-class networking view. We also expect fault-tolerance facilities to be supplied through the object system. By refining an object's simple send method to be an atomic multicast, we set the stage for shadowed processes that are virtually transparent to the user.
The programming environment will evolve noticeably in ten years. We foresee the development of powerful debugging and monitoring tools that will provide facilities equal to or better than those provided by the tools currently available for developing sequential programs on workstations.
[Figure caption, partially recovered: "... System with Distributed Multicomputer Operating System Support"]
3. Architecture
The architecture is designed to support applications that require both fault tolerance and high performance. The overall performance of systems that are composed of many tightly coupled processes, such as data searching, sorting, graphics, and information processing, depends largely on both the efficiency of communication between nodes and the efficiency of fault recovery. All node functions support this structure.
3.1 Computer Node
The node processor structure shown in Figure 3 consists of two M88K units on one system bus (Mbus). Each unit contains one M88K microprocessor and four M88K cache and memory management devices. These devices include high-speed memory caching, two-level demand-paged memory management, and support for shared-memory multiprocessing. The M88K is a high-speed reduced instruction set computer (RISC) microprocessor. One M88K unit can be configured as the master CPU and the other M88K unit as a checker. This master/checker configuration contains comparator circuits that examine the internal and external states of all active output signals. If a mismatch occurs on any output, then an error signal is asserted, the node hardware recognizes the fault, and the operating system software reorganizes around the faulting node by logically enabling a redundant node. In addition to being applied in the master/checker mode for fault-tolerant applications, the two M88K units can be used as independent microprocessors for increased performance (shared node memory multiprocessing).
Message processing latency is reduced by directly executing messages with the custom message processor using special microcode and hardwired logic. The message processor provides support for macro primitive loading and execution, message error checking, message transmission, message broadcasting, process-to-process synchronization, and message receive buffering. The message processor is also configured in a master/checker configuration for fault recognition and recovery. In addition, processor registers are used to save the message transfer control contents when other communication interrupts occur. The custom message processor is designed to field the communication interrupts and handle all the ordinary communication events.
In early hypercube systems, message routing latency was in the hundreds of microseconds, but with the development of the hyperswitch communication chips and the message processor, latency has been reduced to a few microseconds. With a hypercube interconnection network, any two nodes that are not directly connected by a link must have their messages relayed by intervening nodes. But within a hyperswitch communication network, there is virtually no performance degradation when messages are sent from one end of the network to the other end. The tested hyperswitch chips use an informed heuristic search algorithm, which can automatically avoid congested or faulty links based on its previous experience of congestion. Therefore, a message does not wait for a busy link, because the hyperswitch network tries to route the message through uncongested and fault-free links. The I/O links are two 200 Mbits/s bit-serial channels, one for data input and one for data output. One of the log n I/O node links can be selected as a fiber optic channel for long-haul communications. Multiple fault detection and recognition is built into the hyperswitch communication chips. This allows dynamic recovery software to reorganize around a faulting channel or node to restore normal operation.
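The idea of routing around congested or faulty links can be sketched in software as a depth-first search with backtracking. This is an illustration only, not the hyperswitch algorithm itself: the function name `find_route`, the `bad_links` representation, and the binary node addressing are all assumptions, and unlike the hardware this sketch searches exhaustively.

```python
def find_route(src, dst, dim, bad_links):
    """Depth-first search with backtracking for a path from src to dst.

    bad_links is a set of frozenset({a, b}) pairs marking links that are
    faulty or busy (a hypothetical representation for this sketch).
    Returns the list of nodes on the path, or None if no path exists.
    """
    history = [src]      # stack of nodes on the current partial path
    visited = {src}
    while history:
        node = history[-1]
        if node == dst:
            return list(history)
        # Try dimensions in which node and dst still differ first,
        # because crossing such a link shortens the remaining distance.
        dims = sorted(range(dim), key=lambda d: ((node ^ dst) >> d) & 1 == 0)
        for d in dims:
            nxt = node ^ (1 << d)
            if nxt not in visited and frozenset((node, nxt)) not in bad_links:
                visited.add(nxt)
                history.append(nxt)
                break
        else:
            history.pop()  # dead end: back up one hop and try another link
    return None


print(find_route(0, 7, 3, set()))                  # [0, 1, 3, 7]
print(find_route(0, 7, 3, {frozenset((0, 1))}))    # [0, 2, 3, 7]
```

In the second call the link between nodes 0 and 1 is marked bad, so the search leaves node 0 through another dimension and still reaches node 7 in three hops.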
4. Operating System
The HCN-based operating system (OS) will strike a balance among the following competing factors: performance in processing and communications, user-friendliness, and fault tolerance. We can make the best use of our resources by adopting existing operating system code wherever possible, and by building a system that supports modern programming paradigms. Figure 4 shows the user environment and concurrency support available from the OS. Alongside the primary programming view, a second view will be supported for applications: processes and messages. The processes referred to here are either UNIX-style processes or lightweight tasks. While this view is closer to the view given by the JPL Mark III [7] [8], a principal difference will be a marked reduction in the "hypercube" view. The user will be less aware of the cube-based communications than was the case with the Mark III.
The HCN operating system supports an object-oriented concurrent programming paradigm. Concurrent object-oriented programming is a methodology in which the system to be constructed is modelled as a collection of concurrently executable program modules called objects. This powerful paradigm, in which the HCN software is to be written, exploits parallelism both in the architecture and in the application.
C++ has been chosen as the primary programming language for the HCN because it serves two purposes. (1) Its compatibility with C makes it a language close to the machine, so that all important aspects of a machine are handled simply and efficiently in a way that is reasonably obvious to the programmer. For example, the user creates an object and specifies where the object is to be placed. All subsequent manipulations of that object are done in the usual C++ fashion with no reference to the object's location. (2) The object-oriented features in C++ make it a language close to the problem to be solved, so that the concepts of a solution can be expressed directly and concisely. C++ provides constructs to express class/subclass hierarchies, type abstraction, and inheritance. Extensions are added to C++ to support parallel processing, such as remote object creation and concurrent message passing using futures.
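The two extensions named above, remote object creation and message passing with futures, can be illustrated with a small sketch. Python's standard `concurrent.futures` module stands in here for the extended C++ runtime, which the paper does not specify. The classes `Node`, `RemoteRef`, and `Accumulator` are hypothetical, and each "node" is simply a single worker thread.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Stand-in for one HCN node: a single worker that runs method calls."""

    def __init__(self, node_id):
        self.node_id = node_id
        self._worker = ThreadPoolExecutor(max_workers=1)

    def create(self, cls, *args):
        """Create an object on this node and return a location-free handle."""
        obj = self._worker.submit(cls, *args).result()
        return RemoteRef(obj, self._worker)


class RemoteRef:
    """Handle whose method calls return futures instead of blocking."""

    def __init__(self, obj, worker):
        self._obj = obj
        self._worker = worker

    def __getattr__(self, name):
        # Only reached for names not found on the handle itself.
        method = getattr(self._obj, name)
        return lambda *args: self._worker.submit(method, *args)


class Accumulator:
    def __init__(self, start):
        self.total = start

    def add(self, x):
        self.total += x
        return self.total


nodes = [Node(i) for i in range(2)]
acc = nodes[1].create(Accumulator, 10)  # placement is specified once, here
f1 = acc.add(5)                         # returns at once with a future
f2 = acc.add(7)
print(f1.result(), f2.result())         # 15 22
```

After creation, the caller uses `acc` without naming its node, which mirrors the placement transparency described in the text. The caller blocks only when it asks a future for its result.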
4.2 Concurrent Process-Oriented Programming
The process-oriented programming paradigm is a more traditional way to program an application on parallel machines: a set of sequential processes cooperatively solve a problem by exchanging information via message passing. The process model has been shown to be a powerful programming model for a distributed-memory multiprocessor. Therefore, the HCN OS will support the process model in addition to the object-oriented model. A UNIX-compatible distributed operating system, such as Mach [2] or Chorus [3], is under consideration. The advantage of UNIX compatibility is that much existing software can be ported to the HCN easily; therefore, the application development effort will be significantly reduced. Mach is a multiprocessor operating system kernel developed at Carnegie-Mellon University. In addition to having binary compatibility with Berkeley's UNIX 4.3, it provides facilities for supporting shared-memory or distributed-memory multiprocessors, a new robust virtual memory design, and a capability-based interprocess communication facility. The capability-based design and the virtual memory design in Mach can be enhanced to support mandatory access control. By supporting both object-oriented and process-oriented programming paradigms with UNIX compatibility, fault tolerance, and multi-level security, the HCN OS is capable of serving many different types of applications.
4.3 Process Management
Multitasking will be fundamental to the operation of each node, as will memory management and memory protection. The advantages of multitasking include better performance with an asynchronous communications system, multi-user timesharing of the HCN, and more flexible programming for the user.
The HCN OS process management will be similar to that of Mach. The kernel will support the concept of threads (lightweight tasks), which allows the construction of multi-threaded tasks. Such tasks can contain multiple execution paths, all of which can be active concurrently. Threads of a single task can execute concurrently, each in a separate physical processing element. Threads may be created, terminated, suspended, and resumed with HCN OS primitives that are much faster than the corresponding fork/exec calls of UNIX.
4.4 Memory Management
The HCN OS will provide memory management, including virtual memory, similar to that of Mach. The kernel performs memory management at a node where physical memory is treated as a cache for the contents of virtual-memory objects. In Mach, each virtual-memory object is managed by a pager. Such pagers could be used to allow memory sharing across a loosely coupled or distributed configuration.
4.5 Message System
Two HCN OS message management systems will be supported. One is similar to that of ES-Kit. ES-Kit [4] is an operating system kernel developed at MCC to support distributed, object-oriented execution in extended C++. The other is Express [5], which is a message management system that provides a portable platform on which parallel programs and applications can be built. Therefore, applications built on other concurrent computers using Express can be easily ported to the HCN computer. Both systems will be employed as the bases of the HCN Operating System. Efforts will be devoted to evaluating the feasibility of adding other parallel constructs, such as Distributed Objects and Multiple Threads. All communication between nodes is by messages. Therefore, both systems must provide communication services without sacrificing performance. Express offers a well-understood programming model and is backwards compatible with existing applications, whereas ES-Kit offers a much more sophisticated development tool and a strong basis for experimental developments, such as fault-tolerant parallel extensions.
4.6 Programming Tools
The tools available will be a C++ source-level debugger that is aware of tasks and remote objects, and performance-monitoring tools for visualizing program behavior. The emphasis will be on graphic tools and a simulation environment for the debugging of application code. Graphic tools, such as a hierarchical diagram of classes and instances, greatly increase program reliability and programmer productivity.
4.7 Distributed File System
HCN OS supports transparent remote file access whether the file resides on a disk attached to a remote HCN node or on a workstation connected to the HCN network. The file system and directory structure are UNIX compatible. For example, facilities are provided to mount/dismount file systems, and to transfer files between different disk drives.
5. Fault Tolerance
The HCN is a set of homogeneous processing elements interconnected by a high-bandwidth network. These processing elements are connected to a heterogeneous set of data sources and sinks, including workstations, graphic displays, disks, and special-purpose processors. To make the entire HCN a fault-tolerant system, we need to assure fault-tolerant operation in the following three components:
1. The homogeneous processing element.
2. The communication network.
3. The interface to the heterogeneous external sources or sinks.
The fault-tolerant design of the HCN should be able to survive one or more failures occurring in any of the above three categories. In the prototype effort, the HCN is used as the homogeneous processor network. In the following paragraphs, we will discuss the fault-tolerance design in the HCN with respect to the above.
5.1 The Homogeneous Processing Element
Each HCN node has built-in self-checking hardware consisting of dual M88K CPUs for error detection, and error detection circuitry for node memory and system buses. An exception is signalled to the operating system when the hardware detects an inconsistency. The OS then initializes the damage assessment program and error recovery program to identify the type and location of the fault, and restores the whole system to a safe state. In a distributed system, a single node failure cannot be isolated from the rest of the system. Therefore, a global recovery mechanism has to be employed to synchronize and reconfigure the system.
The common checkpoint/rollback recovery technique has been identified as being inadequate for a message-passing distributed system due to the so-called "domino effect." Instead, user processes are duplicated in two different nodes, and processes are synchronized actively by messaging calls. When the primary process is faulty, the backup process will resume the primary's position with minimum recovery delay. This fault-tolerance capability is transparent to the user.
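The duplicated-process scheme can be sketched as follows. The paper gives no implementation details, so the classes `Replica` and `ShadowedProcess`, the use of a running sum as process state, and the single-failure limit are assumptions made for illustration. Active synchronization is modelled by delivering every message to both copies.

```python
class Replica:
    """One copy of a user process; its state is a running sum of messages."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.state = 0
        self.alive = True

    def deliver(self, value):
        self.state += value
        return self.state


class ShadowedProcess:
    """A primary/backup pair placed on two different nodes."""

    def __init__(self, primary_node, backup_node):
        self.primary = Replica(primary_node)
        self.backup = Replica(backup_node)

    def send(self, value):
        """Deliver one message to every live copy, failing over if needed."""
        if not self.primary.alive:
            if self.backup is None:
                raise RuntimeError("both copies have failed")
            # The backup already holds the same state, so it takes over
            # without replaying any earlier messages.
            self.primary, self.backup = self.backup, None
        result = self.primary.deliver(value)
        if self.backup is not None and self.backup.alive:
            self.backup.deliver(value)
        return result


proc = ShadowedProcess(primary_node=3, backup_node=12)
proc.send(4)
proc.send(6)
proc.primary.alive = False   # simulate a fault on node 3
print(proc.send(1))          # 11
print(proc.primary.node_id)  # 12
```

The caller keeps using `send` after the fault, which mirrors the transparency to the user described above. A fuller design would also start a new backup on a spare node after failover.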
5.2 The Communication Network
The Hyperswitch can detect channel errors by two levels of parity check. The adaptive routing algorithm built into hardware can then bypass faulty links and route a message through. Using automatic retry, it can also recover from transient errors occurring in data transmission. In addition, self-timing hyperswitch channels can determine the data rate locally, so no system-wide clock is needed. However, the hyperswitch has limited hardware support for message broadcasting, not to mention atomic broadcasting. Moreover, it does not exhaustively search all the possible routes and thus may fail to find a route in the presence of faulty links. The HCN operating system has to perform the following functions to augment the fault-tolerance abilities of the hardware:
- Send point-to-point messages despite faulty links or nodes in the hyperswitch network.
- Tolerate up to n/2 link faults, where n is the dimension of the cube. In other words, a point-to-point message can be routed between any pair of nodes if there are at most n/2 faulty links.
- Find a feasible minimal path when the hyperswitch fails to find an optimal route in the presence of faulty links.
- Broadcast a message to the entire cube or to an arbitrary set of nodes.
- Ensure that the broadcast message is received either by all the non-faulty nodes in the recipient group or by none of them (i.e., atomic broadcasting).
- Use a fault-tolerant broadcasting algorithm to bypass faulty nodes and links in building minimal spanning trees.
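One way to picture the last function in the list is a breadth-first spanning tree built over the fault-free part of the cube. The paper does not describe its broadcasting algorithm, so the sketch below is an assumption: the function name `broadcast_tree`, the fault sets, and the binary node addressing are hypothetical, and atomicity is not addressed.

```python
from collections import deque


def broadcast_tree(root, dim, bad_nodes=frozenset(), bad_links=frozenset()):
    """Breadth-first spanning tree over the fault-free part of a hypercube.

    Returns {node: parent}; the root maps to None. Nodes that cannot be
    reached without crossing a fault are absent from the result.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for d in range(dim):
            nxt = node ^ (1 << d)
            if nxt in parent or nxt in bad_nodes:
                continue
            if frozenset((node, nxt)) in bad_links:
                continue
            parent[nxt] = node
            queue.append(nxt)
    return parent


tree = broadcast_tree(0, 3, bad_nodes={3}, bad_links={frozenset((0, 4))})
print(sorted(tree))  # [0, 1, 2, 4, 5, 6, 7]
print(tree[4])       # 5
```

Because the search is breadth-first, each reachable node is attached by a shortest fault-free path from the root. In the example, node 4 is reached through node 5 because its direct link to the root is marked faulty.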
5.3 The Interface to the Heterogeneous External Sources or Sinks
When an HCN node is connected to an external device via a channel, say, a VME interface, either the node, the channel, or the external device may cause a single-point failure of the entire system. Therefore, all three components have to be protected from failure with redundancy. In addition, a fault-tolerant interface between the HCN node and the attached devices/processes has to be built to perform error recovery for the external device. For a device without self error-checking capability, triple modular redundancy may be adopted and a voting mechanism must be included in the interface software/hardware. For a device with self-checking capability, a duplicate device is necessary for backup purposes. The physical channels between the node and the external device should also be duplicated to protect the system from single-point failure on the node that is attached to the external device.
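The voting mechanism for triple modular redundancy can be illustrated with a simple majority voter. This is a generic sketch, not the HCN interface design. The function name `vote` and the choice to raise an error when no two modules agree are assumptions.

```python
from collections import Counter


def vote(readings):
    """Return the majority value of three redundant readings.

    Raises RuntimeError if no two modules agree, since the voter then
    has no basis for choosing a value.
    """
    if len(readings) != 3:
        raise ValueError("triple modular redundancy expects three readings")
    value, count = Counter(readings).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two modules agree")
    return value


print(vote([7, 7, 9]))  # 7
```

A single faulty module is outvoted by the other two, which is why three copies are needed for a device that cannot check itself.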
6. Performance Simulation Modeling
The modeling work described in this paper was performed using an object-oriented modeling tool called SES/Workbench [6]. A model is composed of one or more submodels, where each submodel is represented by an extended directed graph consisting of nodes, arcs, transactions, and resources. Transactions are entities that flow from node to node along the arcs, representing a process to be executed, data to be processed or transferred, or a control signal to be acted upon. An SES/Workbench model typically contains many transactions executing in parallel; in fact, a model can be thought of as a parallel program with multiple execution threads, where each transaction in the model represents a separate execution thread.
The user may specify different "categories" of transactions, each of which behaves differently. Each transaction may carry with it an arbitrary user-defined data structure such as the history stack for adaptive routing used in the hyperswitch. Performance statistics are collected and reported for each transaction category.
Each node of a model represents the manipulation (e.g., the allocation or release of a hypercube channel) of a physical or logical resource, or some other processing step in a transaction's life. In a computer system model, a node might represent the scheduling of a microprocessor, a system bus, or a disk drive. In a software model, a node might represent a software module, a subprogram, or a process control action such as the forking of a process into a subprocess. In a digital electronics model, a node might represent the manipulation of a chip, buffer, bus, clock, register, or gate.
6.1 Hypercube Model
The model of each hypercube, for example the adaptive routing hypercube (HCN) and the Mark III, was developed as a single module consisting of five main submodels: transaction_gen, process_flow, process_msg, gen_packet, and send_packet. Resources are also defined at the module level and include:
channel: hypercube channel token pool
cp: array of message processors
ap: array of node processors
sh_mem: addressed shared memory resource
In addition, the model parameterization includes declarations for: cube dimension, cp and ap service times (based on MIPS rating, for example), number of channels, operating system overhead for communications, packet size, link speed, routing time, number of communicating processes, traffic distribution, etc.
The transaction_gen submodel generates the workload for the application investigated according to a probabilistic distribution. Each transaction consists of a sequence of local processing requests followed by the sending of messages over the links of the hypercube. The process_msg submodel models the message processing (source node to destination node) overhead in the node (application) processors. The ap at the source first gets the common bus shared with the cp (the cp has priority), and then gets shared memory for the message and moves the message to this memory. The cp will then take over the processing of the message. When the message has arrived at the destination node, it is removed from shared memory and the memory is released for use.
The gen_packet submodel models the packetizing and sending of each packet. The message transaction generates packet transactions by looping through a fork node. Model execution time is reduced in this way because packets are only created when needed. This is done by using a block node to hold the generating parent transaction until interrupted by the child packet.
The send_packet submodel handles the setup of the path (circuit) from source to destination node for both the adaptive routing algorithm [1] [9] and the Mark III [7] [8].
For the adaptive routing algorithm, the first packet is processed as a header that establishes a path by requesting channels along the route. Once the header reaches its destination, the remainder of the packets are streamed down the path and experience only a wire delay. Once the message has reached its destination, the path channels are released using a loop node with a release node.
The workload was obtained from an actual program running on the Mark III hypercubes at JPL. This program [11] is an emulation of a portion of a constellation of missile sensors, trackers, battle managers, and weapons platforms (see Figure 5). It is composed of the following major tasks, each of which is a separate C program: SWIR (short-wave infrared) sensor; tracker of SWIR sensor data capable of stereo processing; LWIR (long-wave infrared) sensor; tracker of LWIR sensor data capable of stereo processing; a global engagement manager, which allocates weapons in the arsenal based on ability to engage and the probability of kill; a fire control module, which schedules weapon release and performs guidance; an environment generator, which launches the threat, flies the SDI platforms, and generally takes care of functions performed by the enemy or by nature; and a simulation monitor, which doubles as the null task when not running on node 0 of the hypercube.
6.3 Architecture Comparisons
To quantify the performance of the above simulation program for the HCN and to verify the measured performance of the Mark III, we simulated both architectures. The goal of the simulations was to determine the message saturation point for each hypercube architecture based on the measured simulation program workload. We simulated the complete communication system, including the overhead of message management and network contention. Table 1 shows the relative performance of the simulated systems for different message arrival rates. The table shows the minimum, mean, and maximum for message response time and the mean number of sensor tracks per second. The poor performance of the Mark III can be traced to the communication processor overhead at each hop in the transmission path. The HCN is more resistant to saturation because the routing algorithm searches for alternate paths, leading to a higher probability of path establishment. Figure 6 illustrates this by plotting both the mean message latency and mean number of tracks per second for each hypercube. As can be seen, the 32-node Mark III has reached saturation at about 20K messages/sec, whereas the HCN is just starting to saturate at 1 million messages/sec.
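The effect of per-hop overhead can be made concrete with a back-of-the-envelope latency model. This is a sketch under stated assumptions, not the SES/Workbench model. It assumes the per-hop design handles the whole message at every intermediate node, while the path-setup design pays only a small header routing time per hop and then streams the data once. The parameter values are illustrative rather than measured, and wire delay and contention are ignored.

```python
def per_hop_latency_us(hops, msg_bits, link_mbps, hop_overhead_us):
    """Latency in microseconds when every hop handles the whole message."""
    return hops * (hop_overhead_us + msg_bits / link_mbps)


def circuit_latency_us(hops, msg_bits, link_mbps, header_route_us):
    """Latency in microseconds when a header sets up the path first."""
    return hops * header_route_us + msg_bits / link_mbps


# Illustrative values only: a 1000-byte message crossing 5 hops of a
# 200 Mbit/s link, so one transmission takes 8000 / 200 = 40 microseconds.
HOPS, BITS, MBPS = 5, 8000, 200
print(per_hop_latency_us(HOPS, BITS, MBPS, hop_overhead_us=100))  # 700.0
print(circuit_latency_us(HOPS, BITS, MBPS, header_route_us=1))    # 45.0
```

With these assumed numbers the per-hop overhead dominates the total, which is consistent in direction with the saturation behavior reported above.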
7. Cost/Performance Tradeoffs
Figure 6. Saturation Study on 32-Node Hypercubes
With today's microelectronic devices, the cost of fast devices tends to grow faster than the performance benefit of the increased device speed. Hence, the cost per unit of computing power tends to be greater for high-end machines than for low-end machines, although this trend is technology-dependent and could change over time. The relative performances and cost ranges of four classes of commercial computers [12] [13] are plotted in Figure 7.
The estimated performance and cost ranges of the HCN are also shown. As can be seen, the low-cost technology of the HCN provides an opportunity to create a cost-effective high-performance system by combining slow-speed microprocessors. As stated in Section 2, the cost advantage of using low-cost technology is balanced by the degradation in efficiency that inevitably occurs as the number of processors increases. Therefore, communication efficiency and hardware connectivity are the major concerns in the choice of a cost-effective message-passing computer architecture. As seen in Section 6, system modeling was used to explore the cost/performance tradeoffs of the HCN-based system [14].
8. Conclusion
An important part of this work included establishing a strategy for how long-lived systems should be designed and constructed. In particular, the HCN should be viewed as an ongoing and continuing design: the HCN is never complete, in the sense that newer and better technology continues to appear, and the HCN must be able to take advantage of that technology. Furthermore, the users of the HCN systems are also changing and improving its utilization; that is, new applications are encountered and new responses to those applications must be devised and implemented. Therefore, as work gets under way on the HCN, it is important to be designing the next step. Of course, the next step is not a complete replacement, but is an evolution of the HCN's fundamental components. By the time several steps of that evolution have occurred, the system may be quite different from its original form, but better and more adapted to the problems that it solves.
Given the rapid pace of technology, a design that reaches ten years into the future can serve best by being flexible --able to incorporate modification, to incorporate 
