Abstract. Designing and implementing a large-scale parallel system can be time-consuming and costly. It is therefore desirable to enable system developers to predict the performance of a parallel system at its design phase so that they can evaluate design alternatives to better meet performance requirements. Before the target machine is completely built, the developers can always build a symmetric multi-processor (SMP) for evaluation purposes. In this paper, we introduce an SMP-based discrete-event execution-driven performance simulation method for message passing interface (MPI) programs and describe the design and implementation of a simulator called SMP-SIM. As the processes share the same memory space in an SMP, SMP-SIM manages the events globally at the granularity of central processing units (CPUs). Furthermore, by re-implementing core MPI point-to-point communication primitives, SMP-SIM handles communication virtually (through a communication model) and sequential computation actually (through direct execution). Our experimental results show that SMP-SIM is highly accurate and scalable, with prediction errors of at most 7.60% for both SMP and SMP-Cluster target machines.
Introduction
Nowadays, large-scale parallel computers that comprise thousands of processors cost millions of dollars and take years to design and build. For system developers, it is highly desirable that the performance of a parallel system can be predicted efficiently and accurately at its design phase. This can help them evaluate different design alternatives to better meet performance requirements [25].
There are two popular performance prediction methods: model-based and discrete event simulation. The former [3, 11] builds a parameterized model based on the signatures extracted from the given system, such as the number of cores, the number of instructions executed, and memory access patterns. Then the model is used to estimate the system performance. However, a model built for an application cannot be applied to another one. Moreover, as the system complexity increases, the factors that affect the system performance become too complex to be extracted thoroughly. So it can be difficult to obtain accurate predictions for large-scale parallel computers with the model-based method.
Discrete event simulation is capable of simulating large-scale parallel systems. The simulation procedure jumps from one state to another upon the occurrence of an event [5]. A simulator updates the simulation time and state information by processing the events. Events are generated either by the pre-extracted trace (trace-driven) or by the execution of the program (execution-driven). For trace-driven techniques, such as Dimemas [12], PERC [19] and SIM-MPI [18], the event trace is obtained with instrumentation tools by running the program before the simulator runs. For execution-driven techniques, such as WWT [17], WWT II [14], MPI-SIM [15] and BigSim [26], the events are generated when the program is running.
In discrete event simulation, the large-scale parallel machine to be simulated is called the target machine, the machine on which the simulator executes is called the host machine, and the program whose performance is to be predicted is called the target program. We observe that before the target machine is completely constructed, the system developers can always build a symmetric multi-processor (SMP) whose central processing units (CPUs) are the same as those in the target machine. If the target machine is an SMP, such as NEC SX-9 [20], CRAY CX1000-S [4] or IBM SP-SMP [8], the developers can use some of the target machine's CPUs to build a smaller SMP. If the target machine is an SMP-Cluster, i.e. a cluster whose nodes are SMPs, such as Roadrunner [2], Tianhe-1A [24] or K Computer [1], the developers can have at least one of the SMP nodes at the design phase. Therefore, we have opted to use the smaller SMP (or the SMP node) as the host machine to predict the performance of the target program on the target machine.
SMP-SIM

Related Work
Discrete event simulation has been researched extensively in academia and industry. A number of simulators have been developed, including WWT [17] and WWT II [14] of University of Wisconsin, Dimemas [12] of CEPBA (European Center for Parallelism in Barcelona), MPI-SIM [15] of University of California, PERC [19] of San Diego Supercomputer Center, and BigSim [26] of University of Illinois at Urbana-Champaign. Among these simulators, MPI-SIM and BigSim are the most similar to SMP-SIM. All three predict the performance while executing the target MPI applications.
In MPI-SIM, each process of the target program is simulated by a thread. MPI-SIM uses a portable library, MPI-LITE, to translate the target program into a multithreaded program, and measures the execution time of sequential computation code while the multithreaded program is executing. However, threads differ from processes in program behaviours such as cache replacement policy and execution pattern. As a result, the sequential computation time cannot be measured accurately. Moreover, because MPI-SIM is not based on an SMP host machine, messages must carry the simulation timestamps that are used to calculate communication overheads.
BigSim is a simulator developed for BlueGene/C. The simulator contains two parts, a parallel-function emulator and a parallel-network simulator, BigNetSim [27]. BigSim defines a set of application interfaces, such as addMessage and sendPacket, which are used to implement the MPI interfaces. All the application interfaces are executed by the simulator, and the other code is directly executed on the host machine. In BigSim, the sequential computation time is calculated by heuristic approaches, which limits the achievable accuracy. Several parallel programming languages are implemented on BigSim, including MPI, CHARM++ [10] and Adaptive MPI [7].
Zhai et al. [25] proposed a new method, called Phantom, to estimate the sequential computation time, which is used in the trace-driven simulator SIM-MPI [18]. Phantom needs to execute the target program twice. During the first execution, the communication traces of the parallel application are generated by intercepting all communication operations for each process, and the computation between communication operations is marked as a sequential computation unit. During the second execution, the real sequential computation time is measured on a target processing node for each process, one by one. However, executing the target program twice is time-consuming. Moreover, the method cannot deal with non-deterministic programs, because the trace generated for them is usually inaccurate.
The above simulation approaches are not based on SMP. To the best of our knowledge, there has not been an SMP-based performance simulator for MPI programs.
Basic Ideas of SMP-SIM
To design a discrete event simulator, three key questions need to be answered: what events exist in the simulator, how the events are generated, and how the events are processed for performance prediction. In this section, we introduce the basic idea behind SMP-SIM by addressing these questions. Without loss of generality, we assume that the core count of the target machine is equal to the process count of the target program. It should be noted that, as mentioned in Section 1, the CPUs in the host and target machines are the same.
Event Definition
For an MPI program, at any point during execution, a process is in either the sequential computation state or the communication state. So there are four types of events during the simulation of an MPI program: simulation start events, simulation end events, communication start events and sequential computation start events. A simulation start event means that the simulation starts, i.e. a process enters the sequential computation state. A simulation end event means that the simulation ends, i.e. a process ends. A communication start event means that the process enters the communication state. A sequential computation start event means that the process enters the sequential computation state.
Clearly, the simulation start event and the simulation end event of an MPI process are the first event and the last event of that process, respectively. Communication start events and sequential computation start events are generated alternately, as shown in Fig. 1.
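The ordering constraints on one process's events can be made concrete with a small sketch. The names `EventType` and `is_valid_event_sequence` are hypothetical and chosen only for illustration; the check encodes nothing beyond the three properties stated above (start event first, end event last, alternation in between).

```python
from enum import Enum, auto


class EventType(Enum):
    SIM_START = auto()   # simulation start event
    SIM_END = auto()     # simulation end event
    COMM_START = auto()  # communication start event
    SEQ_START = auto()   # sequential computation start event


def is_valid_event_sequence(events):
    """Check the per-process event ordering described in the text."""
    if len(events) < 2:
        return False
    # The first and last events must be the simulation start and end events.
    if events[0] is not EventType.SIM_START or events[-1] is not EventType.SIM_END:
        return False
    middle = events[1:-1]
    # Only communication / sequential computation start events may occur in between.
    if any(e not in (EventType.COMM_START, EventType.SEQ_START) for e in middle):
        return False
    # The two kinds of start events must alternate.
    return all(a is not b for a, b in zip(middle, middle[1:]))
```

For example, the sequence start, communication start, sequential computation start, end passes the check, whereas two consecutive communication start events do not.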
Event Generation
Events are generated either by the pre-extracted trace (trace-driven) or by the execution of the program (execution-driven). For a trace-driven method, the event trace is extracted with instrumentation tools while running the program. Then the simulator uses the trace to drive the events and predict the execution time of the program. However, when predicting the performance of a large-scale parallel computing system, the time to extract the trace and the space to store it become intolerable. Moreover, due to some uncertain factors (e.g. branches, dynamic instruction generation and non-deterministic communications in the MPI program), the acquired trace may be incorrect, which makes trace-driven simulation inaccurate.
For an execution-driven method, events are generated during program execution. Thus, program behaviours such as branch prediction and dynamic instruction generation can all be simulated. Moreover, an execution-driven method can make use of available system resources to directly execute portions of the application code and simulate the features that are of specific interest or unavailable [16]. We prefer an execution-driven method to a trace-driven one, because it is closer to real program execution and more efficient. Therefore, in SMP-SIM, the events are generated while the target program is running on the host machine: when a process invokes a communication function (i.e. an MPI communication primitive), the process enters the communication state and a communication start event is generated; when a process returns from a communication function (i.e. an MPI communication primitive), the process enters the sequential computation state and a sequential computation start event is generated.
Processing the Events
The aim of processing the events is to estimate the execution time of the target program running on the target machine. For this purpose, SMP-SIM maintains a simulation time for each process of the target program, denoted as t_s. When an event is generated while the target program is executing on the host machine, the simulation time of the corresponding process is updated to the generation time of the event. The generation time of an event is the time at which the event would be generated if the target program were executing on the target machine. Therefore, when a simulation end event is processed, the simulation time of its corresponding process (i.e. the generation time of the simulation end event) represents the execution time of the process on the target machine.
As a parallel simulator that can be run on a small-scale SMP, SMP-SIM utilizes the characteristics of SMP to deal with the events efficiently and accurately:
- Manage the events globally at the granularity of CPUs. Because the processes allocated to a CPU on the host machine may outnumber the cores in the CPU and all the processes share the memory in an SMP, SMP-SIM globally manages the events that are generated by the processes allocated to the same CPU by using a shared segment. The event with the smallest generation time is always selected to be processed first.
- Use a virtual-actual combined method to process the selected event. If it is a communication start event, it is processed virtually, i.e. the communication overhead is estimated by the communication model; if it is a sequential computation start event, it is processed actually, i.e. the sequential computation time is measured by direct execution.
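A minimal sketch of the virtual-actual combination follows, assuming a per-process simulation time t_s held as a plain float. The function names are hypothetical, and `modeled_comm_overhead` is only a placeholder latency-plus-bandwidth formula with made-up default parameters; it is not SMP-SIM's communication model.

```python
import time


def modeled_comm_overhead(nbytes, latency_s=1e-6, bandwidth_bps=1e9):
    """Placeholder model (assumption): fixed latency plus size over bandwidth."""
    return latency_s + nbytes / bandwidth_bps


def process_comm_event(t_s, nbytes):
    """Virtual processing: advance t_s by an estimated overhead, send nothing."""
    return t_s + modeled_comm_overhead(nbytes)


def process_seq_event(t_s, compute):
    """Actual processing: execute the computation and add its measured duration."""
    begin = time.perf_counter()
    compute()
    return t_s + (time.perf_counter() - begin)


# Usage: one computation segment followed by one modeled 1 MB message.
t_s = 0.0
t_s = process_seq_event(t_s, lambda: sum(range(100_000)))
t_s = process_comm_event(t_s, nbytes=1_000_000)
```

The point of the split is that only the communication side relies on a model; the computation side inherits whatever accuracy direct execution on identical CPUs provides.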
It should be mentioned that for MPI programs with non-deterministic communications (e.g. a receive request that specifies MPI_ANY_SOURCE as the source), the simulator needs a synchronization mechanism to make sure that the right messages (i.e. the messages that would be received if the program executed on the target machine) are accepted during the simulation on the host machine. The synchronization mechanism used in SMP-SIM is the optimistic mechanism [9] [21], which allows the earliest available event to be processed with no regard to safety. When an older message arrives, a rollback mechanism is needed to undo the earlier out-of-order execution and re-execute the events to guarantee the correct sequence of event processing. However, the synchronization mechanism is not the focus of this work. Interested readers may refer to [9] for a detailed description.
Framework of SMP-SIM
Due to its good configurability, high performance and portability, MPICH [6] has become one of the most popular MPI libraries. SMP-SIM is designed based on MPICH. We have modified the MPICH library and integrated all the functionalities of SMP-SIM into the modified library. MPICH consists of two layers, a machine-independent layer and a machine-dependent layer, separated by an abstract device interface (ADI), as shown in Fig. 2. The machine-independent layer consists of an MPI application programmer interface (API) layer and an MPIR runtime library layer. The MPI API provides the user with programming interfaces and handles the MPI structures that are independent of the environment. The MPIR runtime library translates the complex MPI primitives in the MPI API into point-to-point communication operations. The main functionalities of SMP-SIM are implemented by adding a simulation API layer between the MPI API and MPIR runtime library layers. As shown in Fig. 2, the simulation API layer comprises three modules: primitive decomposer, communication model and event management.
In the primitive decomposer module, all the MPI communication primitives are reconstructed by using core point-to-point communication primitives. This is the base of the other modules. When a process encounters a communication primitive as the target program runs on the host machine, the primitive decomposer module will decompose it into several core point-to-point communication primitives and then these core point-to-point communication primitives will be invoked one by one. The invocation and the return of a core point-to-point communication primitive will generate the corresponding events. When an event is generated, the event management module globally schedules the events, processes the selected event and updates the simulation time of the event's corresponding process. When processing a communication start event, the event management module will interact with the communication model module to calculate the time overheads of the corresponding core point-to-point primitive.
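The decompose-then-invoke flow described above can be sketched as follows. The four core primitive names are real MPI calls, but the `DECOMPOSITION` table is an assumption based only on standard MPI semantics (a blocking call behaves like its non-blocking counterpart followed by `MPI_Wait`); it does not reproduce the paper's reconstruction table, and `decompose` and `run_primitive` are hypothetical helper names.

```python
CORE_PRIMITIVES = ("MPI_Ibsend", "MPI_Issend", "MPI_Irecv", "MPI_Wait")

# Assumed mapping, derived from standard MPI semantics rather than from the paper.
DECOMPOSITION = {
    "MPI_Bsend": ["MPI_Ibsend", "MPI_Wait"],
    "MPI_Ssend": ["MPI_Issend", "MPI_Wait"],
    "MPI_Recv": ["MPI_Irecv", "MPI_Wait"],
}


def decompose(primitive):
    """Return the core primitives that stand in for `primitive`."""
    if primitive in CORE_PRIMITIVES:
        return [primitive]
    return list(DECOMPOSITION[primitive])


def run_primitive(primitive, log):
    """Invoke the core primitives one by one, logging where events would arise."""
    for core in decompose(primitive):
        log.append(("invoke", core))  # invocation: a communication start event
        log.append(("return", core))  # return: a sequential computation start event
    return log
```

Calling `run_primitive("MPI_Recv", [])` yields an invoke/return pair for `MPI_Irecv` followed by one for `MPI_Wait`, which is where the event management module would be notified.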
Next, we will introduce these three modules in detail.
Primitive Decomposer
MPI provides users with many communication primitives, including collective communication primitives and point-to-point communication primitives. In MPICH, all collective communication operations are implemented in terms of point-to-point communication operations. So, we choose four core point-to-point communication primitives from the MPI library: MPI_Ibsend (non-blocking buffered send), MPI_Issend (non-blocking synchronous send), MPI_Irecv (non-blocking receive) and MPI_Wait [15]. Using these four core point-to-point communication primitives, we can reconstruct the other point-to-point and collective communication primitives. Table 1 lists the ways to reconstruct the other five point-to-point communication primitives in MPI. The collective communication primitives can be reconstructed from the four core point-to-point communication primitives and the five primitives listed in the first column of Table 1.
When a process of the target program encounters a communication primitive on the host machine, the primitive decomposer module decomposes it into several core point-to-point communication primitives, and these core point-to-point communication primitives are then invoked one by one. Consequently, the corresponding events are generated. Therefore, the communication start events mentioned in Section 3.1 can be further divided into four types: non-blocking buffered send start events, non-blocking synchronous send start events, non-blocking receive start events and wait start events, which are generated by the invocation of MPI_Ibsend, MPI_Issend, MPI_Irecv and MPI_Wait, respectively. All the events that appear in SMP-SIM are listed in Table 2, where the four items in the third row are all communication start events.
As shown in Fig. 3(a), if two communicating processes are physically distributed, MPI_Ibsend returns as soon as the buffer at the sender side is available.
According to the configurations of current machines, the data can be sent to the network while they are being copied to the buffer (generally speaking, B_mem > B_net). The data are safe when they have been completely copied to the buffer.
As shown in Fig. 3(b), if two communicating processes are physically centralized, the buffer will not be reserved and MPI_Ibsend returns as soon as it is invoked. Then, we have:
MPI_Issend. MPI_Issend starts a non-blocking synchronous send. A handshake, which is used to make sure that the matching receive has started, takes place between the sender and the receiver via REQ and ACK messages before the data message is sent to the receiver. MPI_Issend returns when the ACK from the receiver is received. Fig. 4(a)(b) shows the case where the two communicating processes are physically distributed in the target machine. The data can be sent to the network while they are being copied to the buffer, and the data are safe after they have been copied to the buffer. t_send_return, t_send_safe and t_arrive can be calculated as in Equations 8-10:
Fig. 5. It is worth mentioning that the case shown in Fig. 5(c) (13) where t_arrive can be calculated as shown in Equations 3, 6 and 10.
Event Management Module
SMP-SIM is a parallel simulator, and the processes are fixed to their own CPUs while the target program is running on the host machine. The target machine has more CPUs than the host machine, and the core count of the target machine is equal to the process count of the target program. So the processes allocated to a host machine's CPU outnumber the cores in it. The event management module maintains a waiting event queue for each CPU, i.e. the processes allocated to the same CPU share a waiting event queue. At the beginning of running a target program on the host machine, this module inserts an E_Start event for each process into its corresponding waiting event queue and sets the generation time of this event to -1. During the execution of the target program, the module manages the events at the granularity of CPUs using the following workflow:
Step 1. The event management module selects, from the waiting event queue, the un-processed event e with the smallest generation time.
Step 2. The event management module switches the process that is related to e onto an idle core and then processes e.
Step 3. After processing the selected event e, the event management module deletes e from the waiting event queue and may insert a new event e_new into the waiting event queue as follows:
After steps 2 and 3, if there is any un-processed event in the waiting event queue, the event management module returns to step 1 and handles the newly selected event as described in steps 2 and 3. Clearly, for each process, there is only one related event in the waiting event queue at any time during the simulation. Because only the processes whose events are selected to be processed can run, the unselected processes must be suspended. Processes allocated to the same CPU may be switched when a new event is selected to be processed; the detailed implementation is introduced in Section 5.
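The select, process, delete and insert cycle for one CPU's waiting event queue can be sketched as below. This is a simplified illustration under stated assumptions: each process is reduced to a list of segment durations, process switching is not modelled, the initial generation time is 0.0 rather than the -1 used for E_Start, and `simulate_cpu` is a hypothetical name.

```python
def simulate_cpu(processes):
    """Run the per-CPU workflow for `processes`, a dict {pid: [duration, ...]}.

    Returns {pid: final simulation time of that process}.
    """
    # Waiting event queue: exactly one pending event per live process,
    # stored as pid -> (generation_time, index_of_next_segment).
    waiting = {pid: (0.0, 0) for pid in processes}
    finished = {}
    while waiting:
        # Step 1: select the pending event with the smallest generation time.
        pid = min(waiting, key=lambda p: (waiting[p][0], p))
        t_generate, seg = waiting[pid]
        # Step 2: the owning process would be switched onto a core here.
        segments = processes[pid]
        # Step 3: delete the processed event, then possibly insert a new one.
        del waiting[pid]
        if seg < len(segments):
            waiting[pid] = (t_generate + segments[seg], seg + 1)
        else:
            finished[pid] = t_generate  # no further event: the process has ended
    return finished
```

With `simulate_cpu({0: [1.0, 2.0], 1: [0.5]})`, process 1 finishes at simulation time 0.5 and process 0 at 3.0, and the queue never holds more than one event per process.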
MPICH2-based Implementation of SMP-SIM
This section describes the MPICH2-based implementation of SMP-SIM. Exploiting the fact that the processes share the memory in SMPs, SMP-SIM creates event queues in the shared segment for all the processes at the granularity of CPUs. SMP-SIM re-implements the core MPI primitives and integrates the functionalities of event management into them. Fig. 6 shows the directory structure of the MPICH2 source code. Most of the modifications made to implement SMP-SIM are in the sub-directory /src/mpi. The implementations of MPI_Init and MPI_Finalize are in /src/mpi/init; the implementations of all point-to-point MPI primitives are in /src/mpi/pt2pt; and the implementations of all collective MPI primitives are in /src/mpi/coll.
Data Structure
The event is the core concept of SMP-SIM, and we implement the data structure of an event as shown in Fig. 7.
- Attributes type, processID, doing, executable and t_generate are used for all types of events. type, processID and t_generate stand for the type, process ID and generation time of the event, respectively; doing and executable indicate whether the event is being processed and whether it can be processed, respectively.
Based on the definition of the data structure of an event, we create two event queues, Wait_Event_Queue and Complete_Event_Queue, at the granularity of CPUs, and store them in the shared segment so that all the processes can access them conveniently. It should be noted that, in order to guarantee correctness, all the operations on the event queues must be protected by a lock mechanism, i.e. only one process is permitted to access a given event queue at a time. For brevity, all the operations on the event queues described in the rest of this section are implicitly enclosed by lock() and unlock().
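To illustrate the common event attributes and the lock discipline, here is a small sketch. It is not the structure of Fig. 7: only the five attributes named above are included, the default values are assumptions, and a `threading.Lock` inside one Python process stands in for a lock on a shared memory segment. `EventQueue` is a hypothetical class name.

```python
import threading
from dataclasses import dataclass


@dataclass
class Event:
    type: str                # event type, e.g. "E_Start" or "E_Seq"
    processID: int           # process the event belongs to
    t_generate: float        # generation time of the event
    doing: bool = False      # True while the event is being processed (assumed default)
    executable: bool = True  # True if the event can be processed (assumed default)


class EventQueue:
    """One queue per CPU; every operation holds the lock, mirroring lock()/unlock()."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = []

    def insert(self, event):
        with self._lock:
            self._events.append(event)

    def remove(self, event):
        with self._lock:
            self._events.remove(event)

    def snapshot(self):
        with self._lock:
            return list(self._events)
```

Wrapping each operation in the lock is what allows the surrounding description to omit explicit lock() and unlock() calls.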
Re-implementations of Core Primitives
This subsection introduces the re-implementations of six core MPI primitives: MPI_Init, MPI_Finalize, MPI_Ibsend, MPI_Issend, MPI_Irecv and MPI_Wait. These re-implementations realize the functionalities of event management described in Section 4.3.
MPI_Init
As shown in Fig. 8, besides the code of the original MPI_Init body, the re-implementation of MPI_Init does the following: firstly, each process inserts an E_Seq event into its corresponding Wait_Event_Queue; secondly, it makes sure that only one process is running on a given processing core and the other processes are suspended; thirdly, the process that has not been suspended starts to deal with the E_Seq event.
In the re-implementation of MPI_Init, Count is a global variable shared by all the processes allocated to the same CPU; it records the number of processes whose E_Seq events have started to be processed. lock() and unlock() are two functions used to protect the critical section between them. While the target program is running, process switching follows a priority scheduling mechanism instead of a time-sliced round-robin mechanism, so the line "set the priority of this process HIGH" in Fig. 8
The re-implementation of MPI_Ibsend is shown in Fig. 9, where the function findNextEvent() selects the un-processed executable event e with the minimum generation time and returns e.processID. The major difference between MPI_Issend (or MPI_Irecv, or MPI_Wait) and MPI_Ibsend lies in how the event related to the primitive is processed; the detailed processing methods can be found in Algorithm 1.
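The selection rule attributed to findNextEvent() can be sketched in a few lines. This is an illustration rather than the code of Fig. 9: it treats "un-processed" as "not currently being processed" (doing is false), and returning `None` for an empty candidate set is an assumption. The `Event` tuple and `find_next_event` are hypothetical names.

```python
from collections import namedtuple

# Minimal stand-in for the event structure; only the fields the rule needs.
Event = namedtuple("Event", "processID t_generate doing executable")


def find_next_event(wait_queue):
    """Return the processID of the executable, un-processed event with the
    minimum generation time, or None if no such event exists (assumption)."""
    candidates = [e for e in wait_queue if e.executable and not e.doing]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.t_generate).processID


# Usage: process 2 has the earliest event but is not executable, so 1 is chosen.
queue = [
    Event(processID=0, t_generate=4.0, doing=False, executable=True),
    Event(processID=1, t_generate=2.5, doing=False, executable=True),
    Event(processID=2, t_generate=1.0, doing=False, executable=False),
]
next_pid = find_next_event(queue)
```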
MPI_Finalize
As shown in Fig. 10, in the re-implementation of MPI_Finalize, before executing the code of the original MPI_Finalize body, we first remove the E_Seq event that has just been processed from the Wait_Event_Queue, and then update and print the current simulation time of this process.
Experiments
We demonstrate the validity and accuracy of SMP-SIM in this section. Section 6.1 introduces the benchmarks and experimental platform used. Section 6.2 describes the methodology used for evaluating our work. Section 6.3 presents and analyzes the experimental results.
Benchmarks and Platform
We select three parallel kernels, EP, CG and FT, from the NPB3.3-MPI benchmarks, together with Sweep3D-2.2d, to evaluate SMP-SIM. The problem sizes of EP, CG and FT are all Class C, and the problem size of Sweep3D is 100 × 100 × 100. All the benchmarks are executed on Red Hat Enterprise Linux Server Release 5.5, and MPICH2-1.3.1 is used.
Our platform is a cluster with eight nodes, and each node is an SMP equipped with two 2.93 GHz Intel Xeon X5670 CPUs and 24 GB of RAM. The interconnection is described in [23]. Notice that although the Xeon X5670 CPU has 6 cores, we use it only as a 4-core CPU because most of the benchmarks must be executed by 2^n processes.
Evaluation Methodology
In order to validate the accuracy of SMP-SIM for different target machines, our experiments include two parts:
- SMP target machine: for a single SMP node in our platform, we first use both CPUs as the target machine and only one CPU as the host machine, and test the accuracy of SMP-SIM; we then use only one core as the host machine, and test the scalability of SMP-SIM by varying the scale of the target machine.
- SMP-Cluster target machine: for the whole platform, we use all eight nodes as the target machine and only one node as the host machine, and test the accuracy of SMP-SIM.
Results and Analysis
SMP Target Machine
As shown in Fig. 11, when using two CPUs as the target machine and only one CPU as the host machine, the errors of SMP-SIM for all four benchmarks are between 2.56% and 6.14%. The accuracy of SMP-SIM is highest for the EP benchmark, because EP has the fewest communications. The sequential computation time is measured by direct execution, so the simulation of sequential computation is more accurate than that of communication. Consequently, the more communications a benchmark contains, the lower the accuracy of SMP-SIM is. In order to test the scalability of SMP-SIM, we fix the host machine as a single processing core, and vary the scale of the target machine as 2, 4, and
SMP-Cluster Target Machine
As shown in Fig. 13, when using all eight nodes as the target machine and only one node as the host machine, the accuracy of SMP-SIM is high for all four benchmarks, with errors between 2.67% and 7.60%.
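The text reports errors as percentages without spelling out the formula. A common convention, assumed here and not confirmed by the source, is the relative difference between the predicted and the measured execution time; `prediction_error_percent` is a hypothetical helper and the numbers in the usage line are made up purely for illustration.

```python
def prediction_error_percent(predicted_s, measured_s):
    """Relative prediction error in percent (assumed definition)."""
    if measured_s <= 0:
        raise ValueError("measured time must be positive")
    return abs(predicted_s - measured_s) / measured_s * 100.0


# Illustrative values only: a 104.0 s prediction against a 100.0 s measurement.
error = prediction_error_percent(104.0, 100.0)
```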
Conclusion
System developers have long desired to predict the performance of a parallel system at the design phase. Before the target machine is completely constructed, the developers can always build an SMP machine to be used as a host machine. In this paper, we introduce SMP-SIM, an SMP-based discrete-event execution-driven performance simulator that exploits the characteristics of SMP to achieve accurate and scalable performance prediction.
In SMP-SIM, we have integrated three modules, primitive decomposer, communication model and event management, by adding a simulation API layer to the MPICH library. By executing the target program on the host machine with the modified MPICH library, SMP-SIM manages the events globally, handles the communication start events virtually and handles the sequential computation start events actually. In our experimental evaluation, SMP-SIM shows high accuracy and good scalability, with prediction errors of at most 7.60%.
