Parallel simulations executing on a closely coupled network of processors
require synchronization of logical processes across all processors.
This synchronization overhead requires coordination of host processors and additional message traffic in the host network. We introduce specialized hardware to offload all parallel simulation synchronization overhead from host processors and the host network in a closely coupled network.
In this hardware design, critical synchronization information is disseminated in the form of reductions (binary, associative operations) performed on state vectors of values provided by the logical processes. Our hardware description is based on the completed design of a four-node prototype, designed and built at the University of Virginia. (GVT) network (Filoque, Gautrin, and Pottier 1991) and our own framework (Reynolds 1992). Recent simulation results (Srinivasan 1992) show beyond a doubt that specialized hardware such as the framework we describe here can yield significant benefits to parallel simulations.
We describe the details of a hardware realization of our framework (Reynolds 1992). This description is based on a completed design of a four-node prototype expected to be operational in Summer 1992. This prototype is designed to interface with a Sparc Cluster, a set of Sparc-1e engines connected through a VME backplane. The interface to the Sparcs is through the Sun SBUS.
Goals of the design include speed, scalability, adaptability and generality. Simulations project that our framework hardware will be able to compute GVT asynchronously in 10 to 20 microseconds on a 32-node system, easily two orders of magnitude faster than on conventional parallel processor host networks. While the prototype design includes a synchronous reduction network for speed, a scalable asynchronous design has been completed.
We chose the synchronous design for the prototype to reduce potential problems such as race conditions and to extract speed reliably. To achieve adaptability (the potential to interface to any parallel processor) we have isolated the SBUS interface completely.
Interconnection to other parallel processors or through alternate channels has been kept simple. We have kept the network general by incorporating high speed general purpose ALU's in the reduction network. All operations, including 32-bit fixed-point arithmetic, can be performed in less than 40 nanoseconds. The importance of attaining our goals will become clearer in the remainder of this paper.
In the sections that follow we present a brief overview of the framework first described in Reynolds (1991).
The ease and speed with which important synchronization values such as Time Warp's GVT can be computed will be made evident. Also, we present a set of correctness criteria for the hardware portion of the framework. Race conditions were a persistent problem that had to be addressed from the outset. Following that we describe the hardware prototype in detail, and close with a discussion of the future directions of our effort.
MOTIVATION
Three components integral to the design of a low-level framework supporting parallel discrete event simulation (PDES) are (1) One such set of globally reduced values and related synchronization algorithms have been presented in detail in Reynolds (1992) and in Pancerella (1992). The values computed are the minimum next event time and the minimum logical timestamp of messages that have been sent but not acknowledged.
These values, along with synchronization algorithms to correctly maintain them, are sufficient to eliminate causality errors and support deadlock-free parallel simulation even when message traffic is always present. The elimination of causality errors allows an LP to recognize when it can commit to processing an irreversible act such as I/O.
Simulations (Srinivasan 1992) show that messages can be acknowledged efficiently in a high speed reduction network (as proposed by Pancerella (1992), Jefferson and Sowizral 1985, Samadi 1985, Lin and Lazowska 1989, Bellenot 1990, and Concepcion and Kelly 1991 Chandy and Lamport (1985), a framework consisting of synchronization values and related algorithms can be used to evaluate termination conditions even when there are outstanding messages in the parallel simulation.
As mentioned above, computing minimum unreceived message times and acknowledging messages in a reduction network can be used in order to detect outstanding messages in a system. Moreover, a sum of the number of all messages sent minus messages received at all LP's can be computed in a reduction network to detect outstanding messages in the system. If this value is maintained correctly, a sum of zero indicates that there are no outstanding messages in the host communication network.
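The sent-minus-received sum described above can be sketched in software as a pairwise (tree-shaped) reduction. This is only an illustration of the idea, not the hardware: the function name `tree_reduce` and the per-LP counter values are assumptions made for the example, and the number of inputs is assumed to be a power of two.

```python
def tree_reduce(values, op):
    """Pairwise (binary-tree) reduction of a power-of-two number of inputs."""
    level = list(values)
    if not level or len(level) & (len(level) - 1):
        raise ValueError("number of inputs must be a power of two")
    while len(level) > 1:
        # Each pass models one stage of the tree: adjacent pairs are combined.
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def outstanding_messages(counters):
    """Global sum of (sent - received) over all LPs; zero means none in flight."""
    return tree_reduce([sent - received for sent, received in counters],
                       lambda a, b: a + b)


# Hypothetical per-LP counters: (messages sent, messages received).
quiet = [(5, 3), (2, 4), (7, 6), (1, 2)]
busy = [(5, 3), (2, 3), (7, 6), (1, 2)]
print(outstanding_messages(quiet))  # 0
print(outstanding_messages(busy))   # 1
```

A result of zero is meaningful only if every LP's counters are maintained correctly, as the text notes.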
Our hardware-based framework supports the computation and dissemination of all of the values just discussed, across all processors without coordination of host processors, i.e., without barrier synchronization. The reduction network interfaces with dedicated auxiliary processors that manage the high speed I/O from the network.
Employing auxiliary processors provides a separation of the synchronization activity (performed on auxiliary processors) and the application being simulated (performed on host processors). In sum, our framework offloads all parallel simulation synchronization overhead from host processors and the host network.
A TOP-LEVEL VIEW
Often sets of values such as those mentioned above have temporal relationships that must be preserved. In this section we discuss steps to ensure that necessary temporal relations can be maintained.
Correctness Criteria
There are two properties that must be ensured in the design of the framework hardware in order to maintain temporal relations among globally reduced values and, thus, to guarantee correctness of synchronization algorithms:
(1) atomicity of read accesses to an instance of globally reduced values and (2) order preservation of inputs from a given LP. Meeting these requirements in hardware is challenging; hence, interfaces into and out of the reduction network must be designed with care.
When the parallel reduction network (PRN) computes multiple reductions representing a state of a simulation, it is crucial that each LP can access a set of globally reduced values atomically to guarantee that the values represent a consistent global state. In addition to atomicity, some synchronization algorithms (Reynolds 1992) require that the order in which an LP changes values input to the network be preserved in their global counterparts by the underlying hardware. This is necessary, for example, if an LP sends a message, updates a minimum unreceived message time, and then changes its current next event time: other LP's must see any effect that the unreceived message time has on its global minimum no later than they see the effect of the new next event time on the global minimum next event time. This ordering constraint prevents race conditions that can cause global reduction values to reflect an incorrect global state.
The "no later than" ordering property suggests that if two reduction input values, v_{1i} and v_{2i}, are updated in order, this ordering must be guaranteed at two times: (1) when values enter the reduction network and (2) when their globally reduced counterparts leave it. There is no simple way to guarantee that an LP will see globally reduced values in exactly the same order in which they had local changes submitted to them on an input side of the network.
Even if the network is designed to input local values from an LP_i in the order in which LP_i changes them, and therefore emit any changes to globally reduced values in the same order, when the network emits these values, there is no simple way to guarantee that they will be processed in the same order. This is compounded further by the fact that a "well-intentioned We now discuss a hardware design for a parallel reduction network which supports PDES in a manner consistent with our established correctness criteria.
A Functional View of the Framework Hardware
A top-level view of the system including the framework hardware appears in Figure 1. The host system for the framework hardware is a closely coupled network of high speed processors with its own network for interprocess communication.
This host network is independent of the framework hardware.
In our prototype the host system is a Sparc Cluster in which Sparcs can communicate through a VME backplane.
Each host processor (HP) is paired with a corresponding auxiliary processor (AP) which interfaces directly to the reduction network.
The general-purpose auxiliary processors, one processor per host processor, provide the interface between host processors and the reduction network.
There is a high speed bidirectional communication channel between a host processor and its corresponding auxiliary processor. The prototype interface between a host, a Sparc-1e, and the 32-bit general purpose auxiliary processor, a 25 MHz Motorola 68020, is a dual-ported RAM connecting the Sun SBUS and the auxiliary processor.
The SBUS has a bandwidth of about 100 megabytes per second for 32-bit words (Sun Microsystems 1990). We expect a potential throughput of 25 megabytes per second from host to auxiliary processor.
Each AP has 256 Kbytes of RAM, expandable up to 1 Mbyte, to store synchronization programs and related data structures (See Reynolds (1992), Pancerella (1992), and Srinivasan (1992) A further advantage of a dedicated processor interfacing with the host processor and the reduction network is that an AP can compute the input reduction values based on multiple LP's executing on one host processor and coordinate the synchronization activity of multiple LP's.
The specific details of the interfaces, both between a host processor and its auxiliary processor and between an auxiliary processor and the PRN (both input and output), are discussed in later sections.
DETAILS OF THE HARDWARE DESIGN
In the sections that follow we discuss the specifics of the hardware in our prototype design.
In this discussion we focus on how we ensure the correctness criteria established earlier.
Setup
Each auxiliary processor boots up in a "listening" state in which it monitors its host processor interface. A host processor sends tagged data to its auxiliary processor representing a program to be loaded and executed by the AP. The physical interface between a host processor and its auxiliary processor is described in the next section.
One of the host processors in the system and its corresponding auxiliary processor are designated as a master pair of processors. The master pair communicates PRN programming information to the state machine controlling the PRN. Critical information to be passed to the state machine includes the number of components in a state vector and the operations to be performed on components. For example, it can be specified that all first components are to be summed, all second components OR'ed, and the minimum is to be taken on all third components in a three component state vector. The master host processor can send tagged data representing new PRN programming information to its auxiliary processor at any time.
Similarly, host processors can send data to their respective auxiliary processors indicating they are to receive new programs to execute. This will permit dynamic reprogramming of the AP's and the PRN. We assume that applications running on the HP's and programs running on the AP's are sufficiently robust to support this dynamic reprogramming.
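The three-component example above (sum, OR, minimum) can be illustrated with a small software model. This is a sketch under stated assumptions: the list `PRN_PROGRAM`, the function `reduce_state_vectors`, and the input values are hypothetical, and the model folds the vectors sequentially rather than through a tree, which gives the same result because the operations are associative.

```python
import operator

# Hypothetical encoding of the PRN program: one binary, associative
# operation per state-vector component (sum, bitwise OR, minimum).
PRN_PROGRAM = [operator.add, operator.or_, min]


def reduce_state_vectors(vectors, program):
    """Reduce n state vectors component-wise with the programmed operations."""
    if not vectors or any(len(v) != len(program) for v in vectors):
        raise ValueError("every state vector must match the program length")
    result = list(vectors[0])
    for vec in vectors[1:]:
        # Apply the i-th programmed operation to the i-th components.
        result = [op(a, b) for op, a, b in zip(program, result, vec)]
    return result


# Hypothetical state vectors from four auxiliary processors.
vectors = [
    [1, 0b0001, 40],
    [2, 0b0100, 25],
    [3, 0b0000, 31],
    [4, 0b0010, 27],
]
print(reduce_state_vectors(vectors, PRN_PROGRAM))  # [10, 7, 25]
```

Reprogramming the network then corresponds to passing a different operation list of the same length.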
Host Processor-Auxiliary Processor Interface
Functionally, there are two data paths between a host processor and auxiliary processor: one from the HP to the AP and the other from the AP to the HP. Each processor is a reader on one data path and a writer on the other path. The host occasionally writes tagged information to the interface, which the AP processes, based on the tag, and generates values to input into the PRN. Similarly, the AP writes globally reduced values to the interface, which is read by the HP. Framework algorithms require that (1) no information sent by the HP is lost and (2) the AP processes the data in the order in which it is sent by the HP. Under the established correctness criteria, an application executing on the HP does not need to see all globally reduced values; a recent version of globally reduced values, however, is expected to be available to the HP. Hence, the implementation requires at least a FIFO queue from HP to AP and a single set of registers from AP to HP.
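The two requirements above, a lossless ordered path in one direction and a "latest value only" path in the other, can be modeled as follows. This is a minimal sketch: the class `HostAuxInterface`, its method names, and the tag strings are hypothetical, and it ignores the dual-ported RAM and semaphore details of the prototype.

```python
from collections import deque


class HostAuxInterface:
    """Software model of the two HP/AP data paths (names are hypothetical)."""

    def __init__(self):
        self._hp_to_ap = deque()  # FIFO: nothing is lost, order is preserved
        self._ap_to_hp = None     # single register set: only the latest vector

    def hp_write(self, tag, data):
        """Host enqueues tagged information for the auxiliary processor."""
        self._hp_to_ap.append((tag, data))

    def ap_read(self):
        """AP dequeues the oldest pending item, or None if the queue is empty."""
        return self._hp_to_ap.popleft() if self._hp_to_ap else None

    def ap_write(self, reduced_vector):
        """AP publishes a globally reduced vector, replacing the previous one."""
        self._ap_to_hp = tuple(reduced_vector)

    def hp_read(self):
        """Host reads the most recent globally reduced vector (non-destructive)."""
        return self._ap_to_hp


iface = HostAuxInterface()
iface.hp_write("MSG_SENT", 120)          # hypothetical tags and values
iface.hp_write("NEXT_EVENT_TIME", 135)
print(iface.ap_read())                   # ('MSG_SENT', 120)
print(iface.ap_read())                   # ('NEXT_EVENT_TIME', 135)

iface.ap_write([100, 3])
iface.ap_write([110, 0])                 # overwrites the earlier vector
print(iface.hp_read())                   # (110, 0)
```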
The prototype interface between HP and AP is implemented by a dual-ported RAM, such that the host processor is connected to one port and the auxiliary processor is connected to the other. Each of these ports is memory-mapped into the respective processor's address space. The two data paths are managed in the dual-ported RAM by software resident on the host and auxiliary processors; soft semaphores rely on the exclusive-write support provided by the dual-ported RAM.
The host processor accesses the dual-ported RAM via SBUS. This HP interface isolates the particular host processor (a Sparc-1e in the prototype) from the rest of the system. If the host system changes, this HP interface is the only thing that will need to be redesigned. Isolating the HP interface provides adaptability to other parallel processors or closely coupled networks. For example, the SBUS interface could be changed to a SCSI or VME interface, and all that would be required is the logic to respond to requests by the HP on the dual-ported memory.
The Parallel Reduction Network
As seen in Figure 1, the PRN is a binary tree of depth log2 n, where n is the number of host (and auxiliary) processors. Each stage of the PRN consists of half as many ALU's as the stage above it, with the first stage having n/2 ALU's.
The PRN's binary tree properties allow a global reduction operation to be computed and disseminated in O(log n) time.
A single ALU node is shown in Figure 2. The ALU's in the prototype parallel reduction network require 40 nanoseconds to perform a 32-bit fixed-point addition. Each 32-bit input register is paired with a 32-bit tag register. The PRN propagates the tag of the input that "wins" a selective operation, a minimum or maximum operation, so that the tag of the smallest or largest element emerges from the bottom of the PRN for a minimum or maximum operation.
In the case where there is no single choice in a selective operation (i.e., both operands are equal), the PRN selects deterministically the tag which is propagated.
A selective operation requires two operations in the ALU: a compare and a select; hence, this requires 80 nanoseconds in our prototype.
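A tagged selective operation can be illustrated with (value, tag) pairs reduced through a binary tree. This is a sketch, not the prototype's logic: the function names and tag strings are hypothetical, and the tie-breaking rule (the left operand wins) is an assumption, since the text says only that the choice is deterministic.

```python
def tagged_min(a, b):
    """Compare-and-select on (value, tag) pairs; ties go to the left operand."""
    return a if a[0] <= b[0] else b


def prn_reduce(inputs, op):
    """Reduce a power-of-two number of inputs; return (result, stages used)."""
    stage = list(inputs)
    if not stage or len(stage) & (len(stage) - 1):
        raise ValueError("number of inputs must be a power of two")
    stages = 0
    while len(stage) > 1:
        # One tree stage: half as many ALU operations as inputs to the stage.
        stage = [op(stage[i], stage[i + 1]) for i in range(0, len(stage), 2)]
        stages += 1
    return stage[0], stages


# Hypothetical inputs: (next event time, tag identifying the contributing LP).
inputs = [(17, "LP0"), (9, "LP1"), (23, "LP2"), (9, "LP3")]
result, depth = prn_reduce(inputs, tagged_min)
print(result, depth)  # (9, 'LP1') 2
```

With four inputs the result emerges after two stages, matching the log2 n depth of the tree, and the tag identifies which input supplied the minimum.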
Pipelining is employed in order to use this network efficiently: partial results are pipelined through the log2 n stages of the PRN such that each stage of ALU's is always busy. The PRN can pipeline binary, associative operations at a rate equal to the delay time of a stage. The time for a value to pass from one level of the PRN to the next is a minor cycle.
Currently, this delay is projected to be no more than 150 nanoseconds. Thus, the time to produce a sequence of values for state vectors of length m is 150 * m nanoseconds (plus the time to fill the pipe, 150 * log2 n nanoseconds).
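The timing estimate above can be worked through numerically. The sketch below simply evaluates 150 * m + 150 * log2 n using the projected 150 nanosecond minor cycle; the function names are hypothetical and n is assumed to be a power of two.

```python
MINOR_CYCLE_NS = 150  # projected upper bound on the minor cycle, from the text


def tree_depth(n):
    """Return log2(n) for a power-of-two processor count n."""
    depth = n.bit_length() - 1
    if n < 2 or (1 << depth) != n:
        raise ValueError("n must be a power of two, at least 2")
    return depth


def pipeline_time_ns(m, n):
    """Time to stream one m-element state vector, plus the pipe fill time."""
    return MINOR_CYCLE_NS * m + MINOR_CYCLE_NS * tree_depth(n)


print(pipeline_time_ns(8, 4))   # 1500 (four-node prototype, full state vector)
print(pipeline_time_ns(8, 32))  # 1950 (32-node system, full state vector)
```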
The pipelining in the prototype is performed synchronously.
An asynchronous design for the reduction network has been completed. In the asynchronous design, each ALU node of the PRN computes and outputs a result once it has completed an operation and two input values are available from the preceding stage. Each PRN node operates in a demanddriven manner, where operations are performed as both inputs become available.
This asynchronous design is desirable for later versions of the reduction network for two reasons. First, a PRN operating asynchronously is scalable since a hardware handshake can be used to control communication between nodes; this eliminates both a central clock in the PRN and the potential problem of clock skew in a large network. Second, this facilitates the addition of floating point processors at each ALU node.
A long operation, such as a floating point operation, forms a one-time "bubble" in the pipeline. With a synchronous network, the minor cycle must allow for the longest operation.
Thus, a synchronous design creates wasted time when a shorter binary, associative operation is performed, and an asynchronous design alleviates this problem.
We note that the synchronous design is simpler, and it is faster when only operations with uniform execution times are performed.
As seen in Figure 1, the interface to the PRN from each processor is identical:
each AP has sets of memory-mapped input registers and memory-mapped output registers.
A processor can write to the input registers and read from the output registers; the PRN will read values from the input registers and write the corresponding globally reduced results into the output registers. This memory-mapped interface is a possible source of memory contention if both the PRN and the auxiliary processor attempt to access the input or output registers simultaneously.
We discuss next how the interface between the processor and the PRN is constructed in order to minimize the memory contention, to facilitate atomic writes with and without overwrite capabilities, and to preserve state vectors.
Auxiliary Processor-PRN Interface
The AP-PRN interface is designed to operate on state vectors in order to support both atomic accesses of globally reduced values and order preservation of input values to the reduction network. From an LP's point of view, it feeds a valid state vector to the PRN, where "valid" is defined by the application using the framework hardware. Furthermore, the hardware provides an atomic read access to a single output state vector so that an AP can read an entire state vector. The application software, however, must access whole state vectors, not individual elements.
The prototype hardware limits state vectors to size eight; each of the eight elements is a register pair, one 32-bit data register and one 32-bit tag register. The data register can be a 32-bit integer, a 32-bit fixed point number, or any 32-bit logical value, depending on the reduction operation to be applied. All numeric values are two's complement.
The tag register can contain any 32-bit value. The PRN can be programmed to operate on state vectors of size two to eight, depending on the application.
The PRN reads the state vectors, processes them by performing the corresponding reduction on each element, and writes a globally reduced state vector at each AP.
An auxiliary processor and the reduction network operate asynchronously with respect to one another. As shown in Figure 3 , three banks of eight input and output register pairs provide an interface of isolation, such that both can access the register banks with minimal interference.
This interface is designed to guarantee that the PRN never blocks while waiting to read a value or write a value. The PRN is expected to read and process state vectors at a rate much faster than an AP produces them; the PRN, therefore, may read and process the same state vector repeatedly. Similarly, on the output side, the PRN will produce globally reduced state vectors faster than an AP can read and process them, and as a result the AP's may lose some state vectors. All reads to registers from the PRN or an AP are nondestructive.
We now discuss the input and output interfaces in greater detail. When an auxiliary processor has completed writing a new state vector, it sets two single-bit control flags: the overwrite bit (OW) and the owner bit (O). The owner bit is always set when the AP has finished writing a valid state vector into the AP input registers; this indicates that the interface controller now owns the top level of registers. When the interface state machine transfers this state vector to the Intermediate input registers, it resets the owner bit, indicating that the AP once again owns the AP input registers. If the AP attempts to write to the AP input registers while the owner bit is still set, it will be blocked. However, given the relative speeds of the PRN and the AP, this is not expected to happen often. and it is transferred to the Intermediate registers, the overwrite bit will prevent the transfer of a newly written AP level state vector until the Intermediate input registers are ultimately transferred to the PRN input registers. The control logic guarantees that AP input registers are only moved to the Intermediate input registers when this process does not cause the PRN to block or when it does not lead to a loss of integrity of a state vector. Finally, we note that due to the relative speeds of an AP and the PRN, it is very unlikely that an overwritable state vector will be overwritten prior to being read by the PRN; however, we have designed the network to provide the guarantee anyway, for future use.
The PRN reads state vectors of a specified size cyclically, starting with the mth element and proceeding to the first element.
Thus, the PRN reduces the mth element, followed by the (m-1)st, and so on. The PRN is pipelined; thus the processing of the (i-1)st elements commences as soon as the top level of ALU's completes processing the ith elements.
The PRN reads the ith register pair from each of the n input banks
The time for the PRN to read an entire state vector is an input cycle. An input cycle finishes when the first elements of the state vector are consumed. At the end of an input cycle, the controller transfers the Intermediate input registers to the PRN input registers. The transfer is overlapped with the last PRN read in the input cycle; for this reason, our hardware requires a minimum state vector size of two. The transfer from the Intermediate registers to the PRN registers has a higher priority than the transfer from the AP registers to the Intermediate registers so that the PRN never blocks.
We note that log2 n and m are not necessarily equal. Therefore, while the PRN is reading the ith input register pair from all n processors, it is not necessarily writing the ith output register pairs. That is, the PRN may complete reading state vectors from each of n input register banks at a different time than when it completes writing new reduced state vectors. The writing of a reduced state vector for a set of input state vectors will lag by ((m-1) + log2 n) * 150 nanoseconds, where the minor cycle time is 150 nanoseconds and there are n processors.
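One way to see where the lag comes from is to tabulate, for each element, the minor cycle in which it is read and the minor cycle in which its reduced value emerges. The schedule below is an illustrative reading of the text, assuming one element enters per minor cycle (mth element first) and each takes log2 n further cycles to traverse the tree; the function names are hypothetical.

```python
MINOR_CYCLE_NS = 150  # minor cycle time used in the text


def output_schedule(m, n):
    """List (element index, read cycle, write cycle) for one state vector."""
    depth = n.bit_length() - 1
    if n < 2 or (1 << depth) != n:
        raise ValueError("n must be a power of two, at least 2")
    if m < 1:
        raise ValueError("state vector must have at least one element")
    # Elements are read from the mth down to the first, one per minor cycle.
    return [(element, cycle, cycle + depth)
            for cycle, element in enumerate(range(m, 0, -1))]


def write_lag_ns(m, n):
    """Delay until the last (first-indexed) reduced element is written."""
    _, _, write_cycle = output_schedule(m, n)[-1]
    return write_cycle * MINOR_CYCLE_NS


print(output_schedule(3, 4))  # [(3, 0, 2), (2, 1, 3), (1, 2, 4)]
print(write_lag_ns(3, 4))     # 600
print(write_lag_ns(8, 4))     # 1350
```

The last write cycle equals (m-1) + log2 n, so the computed lag agrees with the expression in the text.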
Auxiliary Processor-PRN Interface: Output
As shown in Figure 3 , the three banks of output registers are constructed to preserve state vectors and to minimize AP-PRN interference in a similar fashion to the input register banks.
Once every m minor cycles (assuming a full pipe in the PRN), the PRN generates a globally reduced state vector, which is written to the PRN output registers. This state vector is transferred to the Intermediate output registers and finally to the AP output registers, which are readable by the AP. Once again the interface controller guarantees that the PRN never blocks, and transfers between output register levels are prioritized to prevent this.
Each time the PRN completes writing a state vector into the bottom row of registers, the values are shifted into the Intermediate output registers. When the bottom row is shifted, the values in the intermediate row are concurrently shifted into the AP output registers unless the AP has locked the top row because it is reading the AP output registers.
In that event, the Intermediate output registers are overwritten by the PRN output registers, and the contents in the intermediate registers are lost forever. The AP output registers have a control bit, an owner bit (O), that is set and reset by the auxiliary processor.
The owner bit determines whether Intermediate output registers can be written to the AP output registers or are lost; it also ensures an atomic read of a state vector by the AP.
The AP may block momentarily if it attempts to set the owner bit to itself while the intermediate values are written in parallel to the registers readable by the AP. Applications using the framework hardware must be robust enough to tolerate the loss of state vectors emerging from the PRN. We note that an AP never sees a partial state vector. State vectors are either lost in their entirety or not at all.
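The behavior of the three output register levels can be modeled in a few lines. This is a simplified software sketch, not the controller's actual logic: the class `OutputBanks`, its method names, and the `lost` counter are hypothetical, each state vector is stored as an immutable tuple so it moves or is dropped as a whole, and the momentary blocking of the AP is not modeled.

```python
class OutputBanks:
    """Model of the PRN, Intermediate, and AP output register levels."""

    def __init__(self):
        self.prn = None            # bottom row, written by the PRN
        self.intermediate = None   # middle row
        self.ap = None             # top row, readable by the AP
        self.ap_owns = False       # owner bit: True while the AP is reading
        self.lost = 0              # vectors dropped at the intermediate level

    def prn_complete(self, vector):
        """The PRN has finished writing a whole reduced state vector."""
        self.prn = tuple(vector)
        if self.intermediate is not None:
            if self.ap_owns:
                self.lost += 1                # top row locked: vector is lost
            else:
                self.ap = self.intermediate   # shift intermediate row upward
        self.intermediate = self.prn          # the PRN never blocks

    def lock(self):
        """AP sets the owner bit before reading."""
        self.ap_owns = True

    def unlock(self):
        """AP resets the owner bit after reading."""
        self.ap_owns = False


banks = OutputBanks()
banks.prn_complete((10, 0))
banks.prn_complete((11, 0))   # (10, 0) moves up to the AP row
banks.lock()
banks.prn_complete((12, 1))   # (11, 0) is lost while the AP reads
banks.prn_complete((13, 0))   # (12, 1) is lost as well
seen = banks.ap               # atomic read of a whole vector
banks.unlock()
banks.prn_complete((14, 0))   # (13, 0) moves up to the AP row
print(seen, banks.ap, banks.lost)  # (10, 0) (13, 0) 2
```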
If an application cannot tolerate state vector loss of globally reduced states, an alternative is to use two extra input registers and compute tagged selective operations to perform a double handshake (Pancerella 1992). In this scheme, all updated globally reduced state vectors can be read by the auxiliary processors even though physical state vectors may still be lost. We note, however, that it is expensive (in terms of computation time) to implement this property in the framework hardware and it should be avoided when possible.
RELATED WORK
Using a separate synchronization network for improving system performance is not a new idea. The IBM RP3 (Pfister et al. 1985) was designed as a shared memory multiprocessor that houses both a combining network for synchronization traffic and a low latency network for regular message traffic. Our reduction network is not as complex or expensive as a combining network, yet it performs global synchronization operations very efficiently.
We claim no novelty with respect to reduction networks. Lubachevsky (1988) suggests using a binary tree implemented in hardware in order to support synchronization barriers and to compute and broadcast a minimum next event time in a bounded lag PDES. His control synchronization network is presented strictly in support of this PDES protocol.
The Finite Element Machine (FEM) (Jordan, Scalabrin, and Calvert 1979 and Crockett and Knott 1985) , a NASA prototype, utilizes a binary tree-structured max/summation network to perform the global sum and maximum calculations necessary to support structural analysis algorithms.
Like the hardware we propose, the sum and max calculations in the FEM are calculated alternately without processor synchronization.
Our hardware design, however, employs a set of input and output registers which are treated as a single state vector, whereas the FEM uses a single input and a single output register. At about the same time that we introduced our framework, Filoque, Gautrin, and Pottier (1991) proposed the use of a processor network with programmable logic for efficient global computations, such as the computation of GVT in a Time Warp simulation.
This hardware is not a single network like the PRN; it is, however, a distributed system of sockets, one per processor. The reprogrammable sockets are connected in a pipelined ring, forming the computation engine. A token is inserted into the ring by a designated control socket. It travels around the ring, performing partial computations at each socket. When the token returns to the controller, the global computation is complete. Therefore, their proposed hardware performs global computations in O(n) time whereas the PRN performs the same computations in O(log n) time. Furthermore, the proposed synchronization algorithms for computing GVT in Filoque, Gautrin, and Pottier (1991) rely on the host communication network for message acknowledgements, whereas our framework uses the framework hardware for this purpose. The goals of both approaches are similar, but our framework is more efficient, more flexible, and more scalable.
Several researchers have proposed the use of hardware to implement barrier synchronization. Hoshino (1985) has an efficient barrier synchronization in the PAX computer. Stone (1990) suggests the use of global busses to compute maximum values and to implement fetch-and-increment.
The hardware that we propose, on the other hand, provides support for a larger class of algorithms than barrier synchronization algorithms. 
