Synchronization is often the dominant cost in conservative parallel simulation, particularly in simulations of parallel computers, in which low-latency simulated communication requires frequent synchronization. We present and evaluate local barriers and predictive barrier scheduling, two techniques for reducing synchronization overhead in the simulation of message-passing multicomputers. Local barriers use nearest-neighbor synchronization to reduce waiting time at synchronization points. Predictive barrier scheduling, a novel technique that schedules synchronizations using both compile-time and runtime analysis, reduces the frequency of synchronization operations. In contrast to other work in this area, both techniques reduce synchronization overhead without decreasing the accuracy of network simulation. These techniques were evaluated by comparing their performance to that of periodic global synchronization. Experiments show that local barriers improve performance by up to 24% for communication-bound applications, while predictive barrier scheduling improves performance by up to 65% for applications with long local computation phases. Because the two techniques are complementary, we advocate a combined approach. This work was done in the context of Parallel Proteus, a new parallel simulator of message-passing multicomputers.
Introduction
Software simulators are used in parallel computing for architecture design and application development. They are necessary because hardware prototypes of parallel computer architectures are time-consuming to build and difficult to modify. While sequential simulators have traditionally been used for this purpose, they tend to be slow and inadequate for detailed simulation of large systems.
Consequently, parallelism is often used in an attempt to speed up simulations, as well as to accommodate the large memory requirements of simulated applications. Despite its benefits, parallel simulation introduces the need to periodically synchronize the simulating processors, for correctness. This synchronization introduces overhead that often dominates simulator execution time.
Synchronization is a particular problem in multicomputer simulators because accurate simulation of the fast networks found in parallel systems requires frequent synchronization. Less frequent synchronization leads to less accurate simulation, which is often undesirable. For example, accuracy is needed when modeling network contention, which, in turn, is useful in algorithm development. Also, accuracy in network simulation is necessary for testing and evaluating a network or network interface.
This paper describes local barriers and predictive barrier scheduling, two techniques for reducing synchronization overhead in simulations of message-passing multicomputers. Local barriers use nearest-neighbor synchronization to reduce waiting time at synchronization points. Predictive barrier scheduling, a novel technique that schedules synchronizations using both compile-time and runtime analysis, reduces the frequency of synchronization operations. Both techniques reduce synchronization overhead without decreasing network simulation accuracy. We implemented local barriers and predictive barrier scheduling as part of Parallel Proteus, an execution-driven conservative parallel discrete-event simulator of message-passing multicomputers that is based on Proteus [6]. Experiments with Parallel Proteus on a CM-5 show that local barriers are effective at reducing overhead in simulations involving frequent communication, while predictive barrier scheduling is most effective for simulations in which communication is infrequent. Because the two approaches are complementary, we advocate a hybrid adaptive approach that dynamically chooses between the two techniques.
The following section provides background information on the types of multicomputers we are interested in simulating and on parallel simulation in general. Later sections discuss the problem of synchronization overhead in more detail, present and evaluate the local barrier and predictive barrier scheduling techniques, and discuss related work.
Background
This section describes the types of machines we want to simulate (the target architecture), overviews parallel simulation concepts and terminology, and explains the need for synchronization.
Target Architecture
Our research focuses on the simulation of large-scale, message-passing MIMD multiprocessors made up of independent processor nodes connected by an interconnection network (see Figure 1). The network consists of wires and switches. Each processor runs one or more application threads, which communicate with threads on other processors via messages. When a message travels through the network, it incurs a network delay proportional to the number of links (hops) traversed, plus any additional delay caused by channel contention. We want to simulate the network accurately enough to reflect the additional network delay caused by network contention (hot spots). This entails simulating the progress of each packet through the interconnection network hop by hop.
Definitions
Our research deals with parallel discrete event simulation (PDES). This model assumes that the entities making up the simulated system change state only as a result of discrete, timestamped events. An event's timestamp corresponds to the time at which it occurs in the simulated system (simulated time). Each of a parallel simulator's host processors simulates one or more entities by processing relevant events in non-decreasing timestamp order.
A host processor is at simulated time t when it is about to process an event with timestamp t. Because of communication between components of the simulated system, host processors can generate events that need to be executed by other host processors. These events are transferred from one host processor to another via timestamped event transfer messages.
In Parallel Proteus, each target processor and network switch is an entity. Each host (CM-5) processor simulates one or more target processors or switches by maintaining a queue containing events of the following types: thread execution, message send/receive, and packet routing. When one target processor or switch sends a message to another, the host processor simulating the first processor or switch sends an event to the host processor simulating the second.
Synchronization
The chief difficulty in PDES is handling causality errors, which occur when event transfer messages arrive late. For example, one host processor can simulate a processor that sends a message at simulated time t to another target processor (simulated by a second host processor). If, in the simulated system, this message is scheduled to arrive at the destination target processor at time t + q, an event with timestamp t + q is sent from the first host processor to the second. A causality error occurs if simulated time on the second host processor has exceeded t + q when it receives this event.
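The late-arrival condition described above can be illustrated with a minimal sketch. The class and method names (`HostProcessor`, `receive`, `process_next`, `CausalityError`) are hypothetical and are not taken from Parallel Proteus; the sketch only assumes that each host keeps a local clock and a timestamp-ordered event queue.

```python
import heapq


class CausalityError(Exception):
    """Raised when an event arrives with a timestamp earlier than the local clock."""


class HostProcessor:
    def __init__(self):
        self.clock = 0   # simulated time of the last processed event
        self.queue = []  # min-heap of (timestamp, description)

    def receive(self, timestamp, description):
        # An event transfer message is late if local simulated time has
        # already exceeded its timestamp.
        if timestamp < self.clock:
            raise CausalityError(f"event at {timestamp} arrived at time {self.clock}")
        heapq.heappush(self.queue, (timestamp, description))

    def process_next(self):
        # Events are processed in non-decreasing timestamp order.
        timestamp, description = heapq.heappop(self.queue)
        self.clock = timestamp
        return timestamp, description


host = HostProcessor()
host.receive(10, "thread execution")
host.process_next()  # local clock advances to 10
try:
    host.receive(7, "message receive")  # t + q = 7 is already in the past
    late = False
except CausalityError:
    late = True
```

A conservative simulator never lets the second call happen: synchronization keeps the receiving host from advancing past the timestamp of any event it may still be sent.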
Our work focuses on conservative PDES, in which synchronization is used to avoid causality errors. Proper synchronization ensures that when a host processor processes an event with timestamp t, it will never receive an event with timestamp smaller than t from another host processor.
How often synchronization is needed depends on the simulated system. If the minimum time for a message to travel between two simulated entities A and B is equal to q cycles, then the two host processors simulating them (X and Y, respectively) must synchronize at least every q cycles of simulated time. Thus, the simulated time on X will always stay within q cycles of simulated time on Y and no causality errors will occur. The minimum value of q for the entire system is called the synchronization time quantum Q.
In the case of multiprocessor target architectures, the synchronization time quantum Q is equal to the minimum delay incurred by a message traveling between any two target processor nodes or network switches. (The term quantum also signifies the period of time between synchronizations during which simulation work is done.)
The simplest way to perform the necessary synchronization is to execute a global barrier every Q simulated cycles. We refer to this technique as periodic global barriers. Hardware support for global synchronization (such as on the CM-5 or Cray T3D) makes each global barrier relatively efficient and easy to implement. However, simulator performance suffers if the barriers must be performed very frequently, or if load imbalance causes long waiting times at synchronization points.
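As a small illustration of how Q and the periodic barrier schedule follow from the target system, the sketch below takes Q as the smallest delay between directly connected entities and lists the simulated times at which a global barrier would run. The function names and the link delays are hypothetical values chosen only for illustration.

```python
def synchronization_quantum(link_delays):
    """Q is the smallest message delay between any two directly connected entities."""
    return min(link_delays.values())


def barrier_times(q, end_time):
    """Simulated times at which a periodic global barrier would be executed."""
    return list(range(q, end_time + 1, q))


# Hypothetical delays (in simulated cycles) between pairs of target entities.
link_delays = {("P0", "S0"): 2, ("S0", "S1"): 2, ("S1", "P1"): 3}
Q = synchronization_quantum(link_delays)
schedule = barrier_times(Q, 10)
```

With these delays the quantum is 2 cycles, so simulating 10 cycles requires five barriers regardless of whether any messages are actually in flight.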
Performance Issues
While the primary goal of using parallelism in simulation is to improve performance, synchronization overhead can limit speedup. Synchronization overhead depends on four factors: frequency, duration, level of detail, and number of simulated entities. The frequency of synchronizations is controlled by the synchronization time quantum Q. The duration of a synchronization depends on the time it takes to execute the synchronization operation and on the time spent waiting at the synchronization point. Clearly, the lengthier and/or more frequent the synchronizations, the larger the synchronization overhead. However, in very detailed simulations, simulation work outweighs synchronization overhead, even if the synchronizations are very frequent. The same happens in simulations of large target systems, where each host processor is responsible for a large number of entities. Because more work is done between synchronizations, the synchronization overhead is amortized over many target processors.
In simulators of parallel computers with fast networks, accurate network simulation usually requires the value of Q to be very small and, consequently, synchronization to be very frequent. Low-overhead techniques for simulating the behavior of each target processor (direct execution as in Proteus and Tango [14] or threaded code [3]) cause synchronization overhead to outweigh the time spent doing simulation work. Large memory requirements (for simulator state or application data) for each target processor prevent the simulation of a large number of target processors by each host processor. Consequently, synchronization overhead dominates simulation time.
For the target architectures we consider, Q equals 2 cycles of simulated time. This is due to the 2-cycle minimum delay through each hop of the target network (1 cycle for wire delay, 1 cycle for switch delay). With such a small Q, we were not surprised to find that, in our experiments with Parallel Proteus (using periodic global barriers), synchronization overhead accounts for 70% to 90% of total simulation runtime and, therefore, severely limits speedup. In some simulations, the high synchronization overhead was also due to long waiting times at synchronization points. Some researchers have explored methods of reducing synchronization overhead by decreasing target network simulation accuracy. (This increases Q, decreasing synchronization frequency.) Because decreasing accuracy is not always desirable, our work focuses on reducing synchronization overhead while maintaining a high degree of accuracy in network simulation. The local barrier and predictive barrier scheduling techniques described in the following two sections accomplish this goal.

Local Barriers
Long waiting times at barrier synchronizations are a chief component of the problematic synchronization overhead. They are caused by load imbalance during a quantum. The local barrier approach, like the local synchronization techniques commonly used in scientific computing and in other types of PDES, reduces waiting time by reducing the number of other processors for which each host processor must wait at each synchronization point. This section describes the local barrier technique and explains how it was implemented on the CM-5 for Parallel Proteus.
Technique
In most simulations, a host processor must keep within Q cycles of the simulated time of only several other processors: its nearest neighbors. The nearest neighbors of host processor X are those host processors that simulate entities that, in the simulated system, are directly connected to any of the entities simulated on processor X. In a local barrier, each host processor participates in a barrier synchronization with these nearest neighbors. Once a host processor and its nearest neighbors have reached the barrier, the host processor can move on to simulate the next quantum.¹ However, its nearest neighbors may still be left waiting for their own nearest neighbors to reach the synchronization point. Therefore, some host processors may be done with a local barrier while others are still waiting.
Loosely synchronizing the processors in this way (as opposed to keeping them tightly synchronized as in periodic global barriers) shortens barrier waiting times when the simulation work is not equally distributed among the host processors during a quantum. Only a few processors wait for each slow (heavily loaded) processor. Therefore, unless the same processor is heavily loaded during every quantum, the simulation is not forced to be as slow as the most heavily loaded processor in each quantum. Instead, some lightly loaded processors are allowed to go ahead to the next quantum while heavily loaded processors are still working on the previous quantum; the heavily loaded processors will catch up in a future quantum when they have less simulation work to do.
While this technique alleviates the problem of waiting, it introduces a new problem of software and communication overhead involved in performing local barriers in the absence of hardware support.
Implementation
In Parallel Proteus, the nearest neighbors of each host processor are determined at the beginning of each simulation. Each host processor, using information about the target network, determines which target processors and network switches are adjacent (one hop away on the simulated network) to the target processors and switches it is responsible for simulating. The host processors that are responsible for simulating these adjacent target processors and switches are the nearest neighbors.
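The neighbor computation described above can be sketched as follows. This is an illustrative reconstruction, not the Parallel Proteus code: the function name, the `adjacency` and `host_of` dictionaries, and the six-entity chain are all assumptions made for the example.

```python
def nearest_neighbors(adjacency, host_of):
    """Map each host processor to the set of hosts it must synchronize with.

    adjacency: target entity -> iterable of entities one hop away
    host_of:   target entity -> host processor that simulates it
    """
    neighbors = {host: set() for host in host_of.values()}
    for entity, adjacent in adjacency.items():
        for other in adjacent:
            a, b = host_of[entity], host_of[other]
            if a != b:  # entities on the same host need no synchronization
                neighbors[a].add(b)
                neighbors[b].add(a)
    return neighbors


# Hypothetical chain of six target entities, two per host processor.
adjacency = {
    "E0": ["E1"], "E1": ["E0", "E2"], "E2": ["E1", "E3"],
    "E3": ["E2", "E4"], "E4": ["E3", "E5"], "E5": ["E4"],
}
host_of = {"E0": "H0", "E1": "H0", "E2": "H1", "E3": "H1", "E4": "H2", "E5": "H2"}
neighbor_sets = nearest_neighbors(adjacency, host_of)
```

In this layout H0 and H2 each synchronize only with H1, so a slow H2 delays H1 directly but does not hold up H0 within the same quantum.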
As in the periodic global barrier technique, host processors synchronize every Q cycles. Between synchronizations, each host processor keeps track of how many event transfer messages it sends to (and how many it receives from) each of its nearest neighbors during the current quantum. Upon reaching a barrier, a host processor X sends to each of its nearest neighbors Y_i a count of the event messages that were sent from X to Y_i in the previous quantum. Processor X then waits until it receives similar information from each of its nearest neighbors. It polls the network until it receives and enqueues all of the incoming events it was told to expect, then goes on to simulate the next quantum.
In this approach, a processor may receive event transfer messages generated in the next quantum while still waiting on some from the last quantum. In order to distinguish event messages from different quanta, successive quanta are labeled RED and BLACK and each event transfer message is sent with a quantum identifier (RED or BLACK). Since two nearest neighbors are never more than one quantum apart, only two distinct identifiers are needed.
¹ A processor's nearest neighbors are not necessarily the processors closest to it on the host machine; they depend on the system being simulated and its layout on the host machine.
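The count exchange and the RED/BLACK labeling can be sketched in a single-process model. Everything here is a simplifying assumption for illustration: the `Host` class and its methods are hypothetical, message delivery is modeled as an immediate dictionary update rather than network polling, and the CM-5 messaging layer is not represented.

```python
from collections import defaultdict

RED, BLACK = 0, 1


class Host:
    def __init__(self, name, neighbors):
        self.name = name
        self.neighbors = set(neighbors)
        self.color = RED                   # identifier of the current quantum
        self.sent = defaultdict(int)       # neighbor -> events sent this quantum
        self.received = defaultdict(int)   # (neighbor, color) -> events received
        self.expected = {}                 # (neighbor, color) -> announced count

    def send_event(self, other):
        # Each event transfer message carries the sender's quantum identifier.
        self.sent[other.name] += 1
        other.received[(self.name, self.color)] += 1

    def announce_counts(self, hosts):
        # On reaching the barrier, tell each neighbor how many events to expect.
        for n in self.neighbors:
            hosts[n].expected[(self.name, self.color)] = self.sent[n]

    def barrier_done(self):
        # Done once every neighbor has announced its count for this quantum
        # and all announced events have arrived.
        return all(
            (n, self.color) in self.expected
            and self.received[(n, self.color)] == self.expected[(n, self.color)]
            for n in self.neighbors
        )

    def next_quantum(self):
        for n in self.neighbors:
            self.expected.pop((n, self.color), None)
            self.received.pop((n, self.color), None)
        self.sent.clear()
        self.color = BLACK if self.color == RED else RED


hosts = {"H0": Host("H0", ["H1"]), "H1": Host("H1", ["H0", "H2"]), "H2": Host("H2", ["H1"])}
h0, h1, h2 = hosts["H0"], hosts["H1"], hosts["H2"]

h0.send_event(h1)             # one RED event from H0 to H1
h0.announce_counts(hosts)     # H0 reaches the barrier
h1.announce_counts(hosts)     # H1 reaches the barrier; H2 is still working
h0_done = h0.barrier_done()   # H0 has heard from its only neighbor
h1_done = h1.barrier_done()   # H1 still waits for H2
h0.next_quantum()
h0.send_event(h1)             # a BLACK event arrives while H1 is still in RED
h1_red_count = h1.received[("H0", RED)]
```

Because the early BLACK event is counted separately, H1's RED tally from H0 stays at one and its barrier completes as soon as H2 announces its count.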
Predictive Barrier Scheduling
In contrast to the local barrier technique, predictive barrier scheduling improves performance by reducing the frequency of synchronizations rather than by making each synchronization faster. Predictive barrier scheduling takes advantage of the fact that, during periods of the simulation when the simulated entities do not communicate, synchronization is not necessary. It is only necessary to synchronize when communication is taking place, to make sure it is simulated correctly. Therefore, performance may be improved by eliminating the unnecessary periodic synchronizations performed during computation phases. Predictive barrier scheduling accomplishes this by predicting when communication is going to occur in the target system and scheduling synchronizations only during communication phases.
In effect improving the lookahead [13] of simulations, this approach seems especially promising for applications with long local computation phases. The following sections overview the general structure of predictive barrier scheduling and describe in detail the compile-time and runtime prediction mechanisms it uses.
Overview
While synchronizations are scheduled statically in periodic global barriers (every Q simulated cycles), synchronizations in predictive barrier scheduling are scheduled dynamically, at runtime, based on the current communication behavior in the target system. Synchronizations are scheduled to occur frequently during communication phases, and infrequently during computation phases.
After each global synchronization (which ensures that all host processors are at the same simulated time and that all pending event transfer messages have been received), the host processors agree on a time at which to perform the next synchronization. Each host processor i first determines s_i, the earliest time that any of its entities will next communicate. In order to calculate s_i, each host processor examines the state of each of its entities, i.e., processing nodes and network switches. If a network switch has a packet in a buffer or on incoming or outgoing wires, then it will be communicating soon, otherwise not. Predicting when target processor nodes will communicate is more difficult, since the application threads they run may send messages at any time. Predictive barrier scheduling uses a combination of compile-time and runtime analysis to predict when application threads will communicate. These are described in the next two sections.
Once each host processor has determined s_i, computing S, the minimum time at which any target entity next issues a communication operation, is straightforward. Each host processor contributes its s_i to a global minimum reduction operation. Then, since all processors are certain that no event transfer messages will be generated until simulated time S, they go ahead to simulate each of their entities up to simulated time S. (The global synchronization and global minimum cannot be combined into a single global operation because all pending event transfer messages must be received before computing each s_i.)
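A minimal sketch of this scheduling step is given below. The function names are hypothetical, the global minimum reduction is replaced by a plain `min` over a list, and a switch holding a packet is assumed to force the estimate down to the current time; the source says only that such a switch "will be communicating soon".

```python
import math


def local_estimate(now, thread_estimates, switch_has_packet):
    """s_i: earliest simulated time at which any entity on this host may communicate."""
    if any(switch_has_packet):
        return now  # assumption: a busy switch may communicate immediately
    # A host with no threads places no constraint on the next barrier.
    return min(thread_estimates, default=math.inf)


def next_barrier_time(local_estimates):
    # Stand-in for the global minimum reduction across host processors.
    return min(local_estimates)


now = 100
s = [
    local_estimate(now, [now + 500, now + 1000], [False, False]),
    local_estimate(now, [now + 320], [False]),
    local_estimate(now, [now + 1000], []),
]
S = next_barrier_time(s)
```

Here the three hosts report 600, 420, and 1100, so the next barrier is scheduled at simulated time 420 instead of at 102 as it would be with a 2-cycle periodic barrier.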
Runtime Analysis
The future behavior of an application thread can easily be determined by simply allowing the thread to execute further and observing its actions. Doing this without advancing the simulation time clock provides a very accurate mechanism for predicting how long threads will run before communicating. It can be used by host processors at runtime to determine when application threads will communicate and to schedule synchronizations accordingly. For example, if, at time t, all application threads are allowed to run ahead, and none communicate for 500 cycles, then the host processors know that synchronization is not needed until t + 500. With the help of the Proteus simulator quantum mechanism, predictive barrier scheduling runtime analysis predicts the future communication of threads in just this manner.
Like Proteus, Parallel Proteus simulates a thread executing on a simulated processor at time t by allowing the thread to execute up to 1000 local instructions² at once without interruption. This 1000-cycle limit is called the simulator quantum. While this technique allows target processors to drift apart in time by more than the value of the synchronization time quantum Q, it does not necessarily introduce any inaccuracy into the simulation. The reason is that local instructions can be performed at any time and in any order as long as non-local operations are performed at the correct time and in the correct order. To ensure the correct timing and ordering of non-local operations, a thread is interrupted as soon as it encounters a non-local operation. Control then returns to the simulator and the non-local operation is delayed until the simulator has processed all earlier non-local events. This technique allows the simulator to achieve good performance without sacrificing accuracy (see appendix A in [7]).
Predictive barrier scheduling uses the 1000-cycle window into the future provided by the simulator quantum mechanism to determine the future communication behavior of application threads. If, at time t, all application threads simulated by host processor i have executed 1000 cycles into the future without hitting a non-local or communication operation (and the network switches it simulates have no pending messages), then s_i = t + 1000, because the host processor knows that none of its entities will communicate before t + 1000.
² Local instructions are those that affect only data local to the target processor on which they are executed. Non-local operations are those that potentially interact with other parts of the simulated system. Examples of non-local operations are message sends, checks for interrupts, and access to data shared by application threads and message handlers.
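The run-ahead prediction can be sketched as follows. The representation of a thread as a list of `(is_local, cycles)` pairs and the function name `run_ahead` are assumptions for illustration; a real direct-execution simulator runs the application code itself rather than walking a list.

```python
SIMULATOR_QUANTUM = 1000  # cycles a thread may run ahead, as in the text


def run_ahead(now, operations, limit=SIMULATOR_QUANTUM):
    """Return a simulated time before which this thread cannot communicate.

    operations: iterable of (is_local, cycles) pairs in program order.
    """
    elapsed = 0
    for is_local, cycles in operations:
        if not is_local:
            return now + elapsed  # interrupted at a non-local operation
        elapsed += cycles
        if elapsed >= limit:
            return now + limit    # reached the end of the run-ahead window
    # Conservative choice: claim nothing beyond what was actually observed.
    return now + elapsed


t = 200
stops_early = run_ahead(t, [(True, 400), (True, 300), (False, 10), (True, 500)])
hits_limit = run_ahead(t, [(True, 600), (True, 600)])
```

The first thread reaches a non-local operation after 700 local cycles, so its estimate is 900; the second exhausts the window and yields t + 1000 = 1200.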
Compile-time Analysis
Because the predictive barrier scheduling runtime analysis involves stopping at communication operations, those host processors simulating frequently-communicating threads will have less work to do at some synchronization points than those simulating threads that communicate infrequently. In order to keep the runtime analysis workload reasonably balanced among host processors, a limit is placed (1000 cycles) on the amount of time a thread is allowed to run ahead. Therefore, while runtime analysis is a powerful prediction mechanism, it cannot be used to look arbitrarily far ahead into the future. For this reason, predictive barrier scheduling augments the prediction power of the runtime analysis with information generated at compile-time.
Dataflow analysis of the application code is done at compile time to determine the minimum distance (in cycles) to a non-local or communication operation from each point in the application code. Then, the code is instrumented with instructions which make these minimum distances accessible to the simulator at runtime. When the application thread is executing, the code added in the instrumentation phase updates a variable which holds the value of the current minimum time (in simulated cycles) until the thread executes a non-local instruction. When computing s_i, each host processor i examines these variables (one for each thread) in order to increase the prediction values generated by the runtime analysis.
Dataflow Analysis
At compile-time, a basic block (control flow) graph of the application code to be simulated is constructed, and dataflow analysis is performed on it. Each basic block consists of a sequence of local, non-branching instructions followed by a branch instruction or a procedure call. Each basic block is labeled with its execution length in cycles, and has one or two pointers leaving it which indicate where control flows after leaving the block (see Figure 2). This graph is used to calculate a conservative estimate of the minimum distance (in cycles) to a non-local operation from the beginning of each basic block. (In Parallel Proteus, threads are never interrupted in the middle of basic blocks; therefore, estimates are only needed for the beginning of basic blocks.) Because we did not implement inter-procedural analysis, every procedure call is assumed to lead to a non-local operation. All communication operations are found at the end of basic blocks. Therefore, a first estimate of the minimum distance from the beginning of a basic block to a non-local operation is the size of the basic block. If the block ends with a procedure call (which is either a non-local operation or assumed to immediately lead to one), then the estimate is done. However, if the block ends with a branch statement (a local operation), the following is added to its current estimate: the estimate for the basic block following it or, if it ends in a conditional branch, the minimum of the two estimates for the two blocks following the first block. The minimum distances to non-local operations for all the basic blocks in the graph are computed iteratively using the algorithm of Figure 3 (based on algorithm 10.2 in [2]).
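One way to compute these minimum distances is a simple iterative relaxation, sketched below. This is not the algorithm of Figure 3, which is not reproduced here; it is an assumed fixed-point formulation of the same recurrence. The block representation, the function name, and the treatment of blocks without successors are all hypothetical.

```python
import math


def min_distance_to_nonlocal(blocks):
    """blocks: name -> (length_in_cycles, ends_in_call, successor_names)."""
    dist = {}
    for name, (length, ends_in_call, succs) in blocks.items():
        # A call is assumed to lead immediately to a non-local operation.
        # Assumption: a block with no successors is treated the same way.
        dist[name] = length if ends_in_call or not succs else math.inf
    changed = True
    while changed:
        changed = False
        for name, (length, ends_in_call, succs) in blocks.items():
            if ends_in_call or not succs:
                continue
            # Block length plus the smaller of the successors' estimates.
            candidate = length + min(dist[s] for s in succs)
            if candidate < dist[name]:
                dist[name] = candidate
                changed = True
    return dist


blocks = {
    "entry": (10, False, ["loop"]),
    "loop": (50, False, ["loop", "send"]),  # conditional branch: repeat or exit
    "send": (5, True, []),                  # ends with a procedure call
}
distances = min_distance_to_nonlocal(blocks)
```

The loop block's estimate is 55 cycles because the analysis must assume the loop exits on its first iteration; this is what makes the estimate conservative.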
Code Instrumentation
After the minimum distance to a non-local operation has been determined for each block, each block is instrumented with code that records this value. When a thread gets interrupted or makes a procedure call which causes control to go back to the simulator, the thread's minimum time until non-local operation can be accessed in the variable where it was stored immediately before the thread was suspended. This value is used by the host processor when calculating s_i.
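How the recorded value might extend the runtime prediction is sketched below. The combination rule is an assumption inferred from the description: a thread suspended at a non-local operation contributes the time it stopped at, while a thread that exhausted its run-ahead window contributes that time plus the recorded minimum distance. The function name and the numbers are hypothetical.

```python
def thread_estimate(stopped_at, hit_nonlocal, min_distance):
    """Earliest simulated time at which this thread can next communicate.

    stopped_at:   simulated time the thread had reached when it was suspended
    hit_nonlocal: True if it was suspended at a non-local operation
    min_distance: value last stored by the instrumentation code (in cycles)
    """
    if hit_nonlocal:
        return stopped_at
    return stopped_at + min_distance


now = 0
threads = [
    thread_estimate(now + 1000, False, 250),  # ran the full window, 250 more cycles guaranteed
    thread_estimate(now + 640, True, 0),      # stopped at a message send
]
s_i = min(threads)
runtime_only = now + 1000
with_compile_time = thread_estimate(now + 1000, False, 250)
```

For a host whose threads all exhaust the window, the compile-time value pushes the estimate past the 1000-cycle limit (1250 rather than 1000 here), which is where the additional eliminated barriers would come from.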
Summary
In predictive barrier scheduling, compile-time analysis is used to instrument code with instructions which store information about each thread's future communication behavior. At runtime, this information is combined with runtime analysis to generate an estimate of how long each thread will run before next communicating. During each barrier synchronization, each host processor uses this estimate, along with information about the state of the network switches it is simulating, to determine the minimum time until any one of its entities next communicates. Each processor contributes its local minimum to a global minimum operation to determine a global lower bound on the time any entity will next communicate; the next barrier synchronization is scheduled for this time.
This technique adds some overhead to each synchronization operation. The hope is that it will eliminate so many of the synchronization operations that it will improve performance over periodic global barriers even with a higher overhead cost per synchronization. It will clearly be most beneficial for simulations of applications that have local computation phases of considerable length.
Experiments
We incorporated local barriers and predictive barrier scheduling into Parallel Proteus and compared their performance with that of periodic global barriers. This section describes the experiments performed, and presents and discusses performance results.
Target Architecture
We simulated two-dimensional mesh architectures with 16 to 1024 processor nodes and switches. Packets are routed through the network using virtual cut-through routing with infinite buffers.
Applications
The applications considered in this study are SOR (successive over-relaxation) and parallel radix sort. SOR was chosen to represent parallel applications with very regular communication patterns and long computation phases. In contrast, radix represents applications in which communication is more frequent and irregular. By varying parameters of both applications, we were able to experiment with a wide range of computation-to-communication ratios.
SOR
SOR is a stencil computation that iteratively solves PDEs [4].
Each iteration of the algorithm performs a relaxation function on a grid of points. This function requires only the value of the point to be updated and the values of adjacent points. Subblocks of this grid are distributed among the processors of the parallel machine. During each iteration, each processor performs the relaxation function on the grid points assigned to it and communicates with a fixed set of grid neighbors to obtain border values.
We used grid sizes of 2" to 2', entries, and up to 625 target processors. This application alternates between communication phases and long computation phases.
Radix Sort
Parallel radix sort sorts d-digit radix-r numbers in d passes of counting sort [5]. The numbers to be sorted are distributed evenly among the processors. Each pass consists of 3 phases: counting, scanning, and routing. The counting phase involves local computation only. The scanning phase consists of r parallel prefix scans. In the routing phase, each processor sends its data points to new locations.
This application has a short local computation phase in the count phase, and much communication in the scanning and routing phases.
We used 2-digit radix-64 numbers, requiring 2 passes and 64 scans per pass. 8192 numbers were assigned to each target processor and 64 to 1024 processors were simulated.
Experimental Methodology
We ran Parallel Proteus on a 32-node CM-5 partition. Each CM-5 node is a 33 MHz Sparc processor with 32 megabytes of RAM and no virtual memory. In order to get accurate timings and decrease variability, experiments were run in "dedicated" mode. This means that timesharing is turned off, and each application runs to completion without being interrupted to run other applications.
The only source of variability, therefore, is in the routing of messages through the host network. In practice, this variability was very small.
The data presented are averages (x̄) of three timings, with σ/x̄ ≤ 4%.
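Reading the variability bound as the ratio of standard deviation to mean, it can be checked for a set of timings as sketched below. The timings are made-up numbers, not measurements from the experiments, and the use of the sample standard deviation is an assumption; the source does not say which estimator was used.

```python
import statistics


def coefficient_of_variation(timings):
    """Ratio of the sample standard deviation to the mean."""
    return statistics.stdev(timings) / statistics.mean(timings)


timings = [101.0, 99.0, 100.0]  # hypothetical runtimes in seconds, not measured data
mean = statistics.mean(timings)
cv = coefficient_of_variation(timings)
within_bound = cv <= 0.04
```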
Predictive Barrier Scheduling
We evaluated predictive barrier scheduling against the baseline periodic global barrier approach by examining three criteria: overhead, percentage of barriers eliminated, and overall performance improvement.
As expected, experiments show that predictive barrier scheduling with both runtime and compile-time analysis always performs better than runtime analysis alone. However, the runtime analysis is in fact responsible for the overwhelming majority of the barriers eliminated. The compile-time analysis improves upon runtime analysis alone by eliminating an additional 4% to 26% of the barriers executed under the baseline periodic global barrier approach. We focus on the results for the complete predictive barrier scheduling approach.
Overhead
Predictive barrier scheduling adds overhead in four areas: compilation, application code execution, barrier scheduling, and barrier waiting time.
Some overhead is incurred at compile-time to perform dataflow analysis. This overhead is negligible in our implementation, because the necessary data structures are constructed and provided by the program that adds cycle-counting instructions to the application code.
During code instrumentation, predictive barrier scheduling adds 6 instructions to each basic block of the application, but only a small percent of total runtime is spent executing them. Their overhead is negligible compared to synchronization overhead, which accounts for 80% of simulation time. During each barrier synchronization, host processors schedule the next barrier. This entails 5 to 50 cycles of additional work, and an additional global reduction operation (150 cycles, 4.5 microseconds).
As a result of less frequent synchronization, more simulation work is done between barriers, and load imbalance during quanta increases. This lengthens each barrier by increasing waiting times. In our experiments, each barrier (typically 20 to 600 microseconds long) becomes 20% to 1400% longer. The less frequent the synchronizations, the longer they are.
Barriers Eliminated
The purpose of predictive barrier scheduling is to eliminate unnecessary barriers which occur during computation phases. Our experiments show that this method does, in fact, reduce the number of barriers executed by 12% to 95%. As expected, the number of barriers eliminated is greater in the simulation of applications with long periods of local computation between communication phases. To illustrate this feature, we graph the number of barriers executed versus granularity. Granularity is a measure of the length of local phases of computation relative to the length of communication phases. Applications with large granularity have long local computation phases. Those with small granularity are dominated by communication and short computation phases. Granularity is computed differently for each application because the length of local computation phases depends on application-specific parameters. We define granularity for SOR to be equal to the number of data points assigned to each target processor. In radix, granularity is defined to be equal to the inverse of the number of target processors. Figures 4, 5, and 6 show that for both SOR and radix, the number of barriers eliminated increases with granularity. In general, radix has shorter computation phases than SOR, so fewer barriers are eliminated in radix than in SOR.
Performance Improvement
The performance improvement achieved by predictive barrier scheduling over the periodic global barrier approach ranges from -5% to +65%. Because predictive barrier scheduling lengthens each barrier, no improvement is seen when few barriers are eliminated. In general, about 30% to 40% of the barriers (otherwise executed under the periodic global barrier approach) have to be eliminated in order to improve performance. As we expected, performance improvement increases with granularity, just as demonstrated in the previous section. This is evident in the graphs of normalized simulator runtime for SOR and radix, figures 7, 8, and 9. Little improvement (only up to 10%) is seen in radix because its computation phases are short. SOR with large data sizes has large granularity and, therefore, achieves large performance improvements (up to 65%). Clearly, the predictive barrier scheduling method is only useful for applications with reasonably long local computation phases.
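The existence of a break-even point can be illustrated with a deliberately simple cost model. The sketch below is an assumption-laden illustration, not the analysis used in the experiments: it treats total synchronization time as the number of barriers times a fixed per-barrier cost, treats the lengthening of each barrier as independent of how many barriers are eliminated (in practice the two are coupled), and ignores instrumentation overhead. The function name and parameters are hypothetical.

```python
def break_even_fraction(barrier_us, lengthening, sched_overhead_us):
    """Fraction of barriers that must be eliminated before predictive
    barrier scheduling stops losing time, under a simple linear model.

    barrier_us: average barrier duration under periodic global barriers.
    lengthening: fractional increase in barrier duration (0.5 = 50%).
    sched_overhead_us: extra per-barrier scheduling cost, in microseconds.
    """
    # Cost of one barrier once predictive scheduling is enabled.
    new_barrier_us = barrier_us * (1.0 + lengthening) + sched_overhead_us
    # Solve (1 - f) * new_barrier_us = barrier_us for f.
    return 1.0 - barrier_us / new_barrier_us
```

For instance, with the illustrative (not measured) inputs `break_even_fraction(100.0, 0.5, 0.0)`, the model requires one third of the barriers to be eliminated before any gain appears; larger lengthening pushes the threshold higher.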
Local Barriers
The purpose of the local barrier approach is to improve the performance of simulators like Parallel Proteus by reducing the waiting time at synchronization points. Results show that this approach is most effective in simulations of applications with small granularity.
Overhead
The local barrier approach adds overhead to each synchronization because it requires each host processor to send, receive, and await the individual messages of each of its nearest neighbors. In contrast to hardware-supported global barriers, there is no hardware support on the CM-5 for the pairwise synchronizations required for local barriers. In the best case, when all processors are already synchronized, a global barrier (on the CM-5) takes 150 cycles (4.5 microseconds). Under the same conditions, a local barrier takes anywhere from 800 to 1300 cycles (24 to 39 microseconds), depending on the number of nearest neighbors each host processor has. Because of this high overhead, the local barrier method is expected to do better only when the periodic global barrier approach has long waiting times at each synchronization. In that case, local barriers improve performance by decreasing the waiting times.
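The best-case figures above imply a simple break-even condition for local barriers, which the following Python sketch makes explicit. The conversion factor is derived from the quoted 150 cycles = 4.5 microseconds; the function names and the idea of comparing a per-barrier "waiting time saved" against the added overhead are assumptions introduced for illustration.

```python
# Best-case global barrier figures quoted in the text for the CM-5 host.
GLOBAL_BARRIER_CYCLES = 150
GLOBAL_BARRIER_US = 4.5

def cycles_to_us(cycles):
    """Convert host cycles to microseconds (150 cycles = 4.5 us)."""
    return cycles * GLOBAL_BARRIER_US / GLOBAL_BARRIER_CYCLES

def extra_local_cost_us(local_barrier_cycles):
    """Best-case extra cost of one local barrier over one global barrier."""
    return cycles_to_us(local_barrier_cycles - GLOBAL_BARRIER_CYCLES)

def local_barrier_pays_off(waiting_saved_us, local_barrier_cycles):
    """True if the waiting time saved per barrier exceeds the added overhead."""
    return waiting_saved_us > extra_local_cost_us(local_barrier_cycles)
```

Under this model, a local barrier costing 800 to 1300 cycles must save roughly 19.5 to 34.5 microseconds of waiting per synchronization before it breaks even with a global barrier.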
Performance Improvement
As expected, experiments show that the local barrier approach improves performance by decreasing barrier waiting time in situations where the periodic global barrier approach suffers from long waiting times. This occurs in applications of small granularity, where communication is frequent and computation phases are short. In large granularity applications, on the other hand, Parallel Proteus (using periodic global barriers) executes the long computation phases 1000 cycles at a time (because of the simulator quantum approach, described in 5.2). This leads to the execution of many unnecessary barriers in a row, with no simulation work in between. The waiting time at most of the barriers, then, is very short. Local barriers shorten synchronizations in small granularity applications by up to 25%. In large granularity applications where barriers are already short, the local barrier approach lengthens barrier time by up to 192%.
Graphs of normalized simulator runtime (figures 10, 11, and 12) show that the behavior of the local barrier approach complements that of predictive barrier scheduling. Predictive barrier scheduling improves performance for applications with large granularity, while local barriers improve performance for applications with small granularity. Local barriers improve performance most in radix (up to 24%) and in small granularity SOR.
Discussion
The complementary relationship between predictive barrier scheduling and the local barrier approach suggests that one might combine the two techniques to achieve more consistent performance improvement. One way to do this is to use each technique in the situation where it does better: the local barrier approach during communication phases and predictive barrier scheduling during computation phases. In this way, unnecessary barriers in the computation phases will be eliminated by predictive barrier scheduling. However, the extra overhead required will not be incurred in the communication phases. For its part, the local barrier approach will shorten the barriers in the communication phases.
The difficulty lies in determining when to switch from one technique to the other. One approach is to switch adaptively at runtime. Under predictive barrier scheduling, the simulator can monitor application communication by counting event transfer messages at runtime. If the simulator detects frequent communication, it will switch to the local barrier approach. However, switching from local barriers to predictive barrier scheduling is more complicated, because a global decision must be made to switch, but processors are doing local synchronization. To remedy this, each processor could continue to monitor event transfer traffic while synchronizing locally every Q cycles, but global synchronizations could be performed every 100 * Q cycles to facilitate a global decision. If there has been little communication traffic, signifying a computation phase, the simulator would switch back to predictive barrier scheduling. We have not yet experimented with this adaptive technique, but it is clearly worth exploring.
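The adaptive switching policy described above can be sketched as a small state machine. This Python sketch is hypothetical: it has not been implemented or evaluated in Parallel Proteus, the class and method names are invented for illustration, the traffic threshold is an unspecified tuning parameter, and the message count is assumed to be already combined across hosts at each global decision point.

```python
class AdaptiveSync:
    """Hypothetical controller that chooses the synchronization mode."""

    PREDICTIVE = "predictive"
    LOCAL = "local"

    def __init__(self, threshold, global_period=100):
        self.mode = self.PREDICTIVE
        self.threshold = threshold          # messages per decision window (assumed)
        self.global_period = global_period  # local syncs between global syncs
        self.local_syncs = 0
        self.messages = 0

    def record_messages(self, count):
        """Count event transfer messages observed since the last decision."""
        self.messages += count

    def on_barrier(self):
        """Call at each synchronization; returns the mode to use next."""
        if self.mode == self.PREDICTIVE:
            # Every barrier is global here, so a decision is always possible.
            if self.messages > self.threshold:
                self.mode = self.LOCAL
                self.local_syncs = 0
            self.messages = 0
        else:
            self.local_syncs += 1
            # Only every global_period-th synchronization is global,
            # and only then can all hosts agree to switch back.
            if self.local_syncs == self.global_period:
                if self.messages <= self.threshold:
                    self.mode = self.PREDICTIVE
                self.local_syncs = 0
                self.messages = 0
        return self.mode
```

The asymmetry in the text shows up directly in the code: leaving predictive mode can happen at any barrier, whereas leaving local mode must wait for the next periodic global synchronization.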
While our results are useful for observing the complementary behavior of predictive barrier scheduling and local barriers, the exact performance data is very specific to the particular host architecture (CM-5) we used in our experiments. Our experiments show that, in general, predictive barrier scheduling achieves greater performance improvements than the local barrier approach. However, the success of predictive barrier scheduling is due in part to the CM-5's fast global minimum operation; without it, this technique would be much slower. Local barriers do not perform as well because of high messaging overhead. On different host architectures, the relative performance of the two techniques will be different. Some multicomputers have support for fast fine-grain synchronization, but little or no support for global synchronization (e.g., Alewife [1]). Local barriers are likely to perform much better on such host machines, while predictive barrier scheduling will suffer from high overheads for its global operations. Therefore, the two techniques are also complementary with regard to the host architectures for which they are suited.
While the local barrier approach and predictive barrier scheduling have been evaluated in the context of a direct-execution simulator, they are certainly also applicable to simulators which simulate application code execution in a different way (e.g., threaded code [3]). First, local barriers do not depend on direct execution in any way.
Second, while our implementation of predictive barrier scheduling does rely on Parallel Proteus's support for direct execution, this is not essential to the technique. All that is needed to implement predictive barrier scheduling is a way of associating with each value of an application's program counter, an estimate of the time until the application will next communicate. The necessary analysis can be done at compile-time as in Parallel Proteus, but instead of instrumenting code, the information generated at compile-time can be stored in a table. As each instruction is simulated, information about future communication behavior can be looked up in the table. Therefore, the applicability of local barriers and predictive barrier scheduling extends to more simulators than just the one described here.
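The table-based alternative can be sketched as a sorted lookup keyed by program counter. The Python example below is a minimal illustration under stated assumptions: entries are recorded per basic block (keyed by the block's starting address), each estimate is a conservative lower bound in cycles, and an address below every known block yields zero so that no barrier is skipped. The function names and table layout are hypothetical and do not describe an existing simulator.

```python
import bisect

def build_table(entries):
    """Build a lookup table from (block_start_pc, cycles_until_comm) pairs.

    The pairs stand in for the output of compile-time analysis.
    """
    ordered = sorted(entries)
    starts = [pc for pc, _ in ordered]
    estimates = [cycles for _, cycles in ordered]
    return starts, estimates

def time_until_comm(table, pc):
    """Return the estimated cycles until the next communication at pc."""
    starts, estimates = table
    # Find the last block whose starting address is <= pc.
    index = bisect.bisect_right(starts, pc) - 1
    if index < 0:
        # Unknown code: conservatively assume communication is imminent.
        return 0
    return estimates[index]
```

A simulator that interprets or threads application code could consult such a table as each instruction is simulated, in place of the instrumentation inserted into each basic block under direct execution.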
Related Work
The methods that other researchers have developed to reduce synchronization overhead in parallel simulators of parallel computers include reducing network simulation accuracy, balancing load, and improving lookahead. Several of these approaches will be described here as they relate to our work.
Some simulators of parallel computers are extremely detailed (e.g., cycle-by-cycle simulators such as NWO-P, a detailed parallel simulator of the MIT Alewife machine [15]), with large overheads for simulating each cycle that far outweigh synchronization overhead. In contrast, our work deals with the synchronization overhead that dominates when low overhead techniques are used to simulate processor behavior.
The Wisconsin Wind Tunnel and Parallel Tango Lite are two direct execution-based parallel simulators of shared memory multicomputers [21] [14]. They synchronize using periodic global barriers. They achieve good performance by decreasing network simulation accuracy, which allows them to synchronize infrequently. The WWT researchers have explored a range of network simulation models (from very accurate to not very accurate) [7]. All of these could easily be incorporated into Parallel Proteus, but would not solve the problem of reducing synchronization overhead while maintaining accurate network simulation. However, WWT and PTL cannot exploit lookahead in the same way as Parallel Proteus can, because they simulate shared memory architectures. Communication can potentially happen on every memory reference in shared memory systems, so it is difficult to identify ahead of time periods of computation during which it is certain that no communication will take place. Also, all accesses to shared data (whether requiring communication or not) potentially affect other target processors, so it is not correct to allow application threads to run ahead until a communication operation is encountered.
LAPSE is a conservative, direct execution-based parallel simulator of the message-passing Intel Paragon [10] [11] [12]. It achieves good performance by exploiting two sources of lookahead. First, like the runtime analysis in predictive barrier scheduling, LAPSE lets some application code execute in advance of the simulation of its timing. However, unlike Parallel Proteus, LAPSE does not augment this technique with compile-time analysis. Second, a large amount of lookahead results from the target and level of detail of LAPSE's network simulation. LAPSE simulates only store-and-forward networks, and only at the packet-switching level of detail. As a result, LAPSE's synchronization time quantum Q is fairly large (4750 cycles = packet-switching time) compared to that of Parallel Proteus (2 cycles = flit-switching time). Parallel Proteus's synchronization time quantum is so small because it simulates modern, high-speed interconnection networks that use cut-through routing; these require a very fine-grained clock for accurate (flit-level) simulation. Therefore, exploiting large amounts of lookahead in Parallel Proteus is much more difficult than it is in LAPSE.
SPaDES is a conservative parallel simulation approach used for simulating symmetric multiprocessors on shared memory multicomputer hosts [16]. It uses load balancing and split-phase (or fuzzy) barriers to avoid synchronization overhead. While load balancing effectively reduces waiting time for SPaDES simulations, in Parallel Proteus it would involve too much overhead for transferring simulation state between host processors. In "aggressive" mode, SPaDES allows the host processors to be only loosely synchronized. This technique is similar in effect to local barriers (or nearest-neighbor synchronization).
Conclusions
Parallelism is necessary for fast, detailed simulation of large multicomputers. Synchronization overhead, however, often severely limits the performance of conservative parallel simulators. This occurs because low-latency communication in simulated networks requires frequent synchronization of simulator processes. One way to reduce synchronization overhead is to sacrifice accuracy in the simulation of the network. Focusing on local barriers and predictive barrier scheduling, our research demonstrates that for simulations of message-passing parallel computers, nearest neighbor synchronization and application-specific optimization can improve performance by reducing synchronization overhead without sacrificing accuracy in network simulation.
In our experiments, local barriers improved the performance (by up to 24%) of communication-bound simulations by reducing waiting time at synchronization points. Predictive barrier scheduling, on the other hand, improved the performance (by up to 65%) of computation-bound simulations by eliminating unnecessary synchronizations. Because of their complementary behavior, we advocate a hybrid adaptive approach that dynamically chooses between the two techniques.
