Discrete event simulation provides a powerful mechanism for designing and testing new extremescale programming models for high-performance computing. Rather than debug, run, and wait for results on an actual system, design can first iterate through a simulator. This is particularly useful when test beds cannot be used, i.e. to explore hardware or scales that do not yet exist or are inaccessible. Here we detail the macroscale components of the structural simulation toolkit (SST). Instead of depending on trace replay or state machines, the simulator is architected to execute real code on real software stacks. Our particular user-space threading framework allows massive scales to be simulated even on small clusters. The link between the discrete event core and the threading framework allows interesting performance metrics like call graphs to be collected from a simulated run. Performance analysis via simulation can thus become an important phase in extreme-scale programming model and runtime system design via the SST macroscale components.
Issued by Sandia National Laboratories, operated for the United States Department of Energy by Sandia Corporation.
Telephone:
(800) 553-6847 Facsimile:
(703) 605-6900 E-Mail: orders@ntis.fedworld.gov Online ordering: http://www.ntis.gov/help/ordermethods.asp?loc=7-4-0#online Chapter 1
Introduction
A major branch of discrete event simulation explores application performance on high-performance computing platforms. Here we are concerned in particular with skeleton applications which replicate a full application's control flow. Only the minimal computation necessary to reproduce the communication pattern is executed rather than emulating instruction-by-instruction. Simulated (virtual time) therefore advances much faster than computation (wall clock) time for a single process. Ideally, several minutes of execution on a capability supercomputer could be executed within a few hours on a small cluster. Simulation can inform machine procurements, exploring how applications perform on speculative hardware before purchasing. While small test beds can be informative (particularly for simulator validation), many effects only manifest at extreme scales such as the rapidly shrinking mean-time-between failures (MTBF) as systems move towards exascale [1] . Conversely, simulation can inform software development. Before implementing and debugging every detail at extreme-scale, an application or runtime's performance can be evaluated via simulation.
Our particular HPC focus is science and engineering applications, most often partial differential equation (PDE) solvers. Notable examples include combustion (e.g. explicit Navier-Stokes solvers [2] ), astrophysics (e.g. CASTRO [3] ), or nuclear fusion (e.g. GTC [4] ). The vast majority of applications are written in the communicating sequential processes (CSP) model. The CSP model maps well onto the off-line simulation mode, collecting traces and replaying them step-by-step in the simulator. To handle resilience and massive parallelism, however, new runtime systems and programming models are being explored [5, 6] . The models are much more dynamic, moving work and data between processes. They therefore do not map well to trace replay, instead demanding on-line simulation where a real application runs and generates events. We present here the macroscale components of the Structural Simulation Toolkit (SST) [7, 8] with emphasis on programming and execution model exploration.
The discrete event simulation philosophy for the SST macroscale components emphasizes coarse-grained approximations. We are here not primarily concerned with high-accuracy simulation of a fixed communication pattern. Rather we want to evaluate how program execution evolves for varying inputs, machine parameters, or even component failures. The runtime layer that manages placement of processes, tasks, and data should be executed exactly instruction-byinstruction. In contrast, the compute-intensive math kernels are not executed, instead advancing the virtual simulation time via a coarse-grained model based on, e.g. number of flops or bytes read.
Of critical importance is the ratio of simulated virtual time (the time elapsed on the system being simulated) versus the wall-clock time elapsed on the machine actually executing the simulation. The SST macroscale components are aimed at simulating extreme-scale (e.g exascale) systems with 100K or more endpoints. To be useful we propose the general rule that one hour of wall-clock time should at least be capable of progressing one minute of virtual time to be useful, or roughly a 100:1 time dilation (at worst case 1000:1). The virtual time acceleration relative to physical wall-clock time relies primarily on coarse-grained approximations. If simulating 100K endpoints in serial on a single machine, coarse-grained approximations would have to accelerate virtual time by a factor of 1000:1 to achieve the time goal. In an extreme-case, a large matrix-vector product that requires 1 ms to actually execute might be simulated in 1 µs, leading to a 1000:1 time acceleration. However, even if leveraging coares-grained models for the compute-intensive kernels, we must still simulate the application runtime. If the runtime consumes 50 percent of active CPU cycles, our simulation strategy would at best achieve a 2:1 time acceleration. In contrast, as is often the case with MPI or even newer runtime systems like Legion [9] , the runtime is often 5 percent or less overhead.
If combined with parallel discrete event simulation (PDES) [10] , however, the coarse-grained requirements become much less severe. Even for 100-way PDES parallelism, coarse-grained approximations must now only accelerate application execution by a factor of 10:1. For PDES, the critical design choice now becomes a conservative or optimistic algorithm. While optimistic algorithms can be far more scalable, they require either checkpointing or reverse computation to account for time order violations. As described above, for our purposes, we wish to simulate real software stacks for programming model exploration. A great deal of application state exists, making checkpointing a serious computational bottleneck. Similarly, our goal here is to directly compile and run C/C++ codes, rather than building ad-hoc state machine representations of an application. Since arbitrary code can be executed, reverse computation becomes very difficult. A central question here is therefore whether conservative PDES is scalable enough to achieve our goals. This paper therefore describes our SST macroscale simulator as a programming model/runtime development tool. We first describe related work in Section 2. In Section 3, we describe the simulator components and basic examples of directly compiling and linking C/C++ code into the simulator. Section 2.3 details our choice of PDES algorithm and how it relates to other conservative and optimistic PDES algorithms. Section 5 describes some of the visualizations and performance metrics like call graphs and fixed-time quanta (FTQ) histograms made possible by our particular simulator structure. Finally, we describe a coarse-grained machine model constructed to maximize PDES performance along with scaling results for an example problem.
Chapter 2

Related Work
Simulators
We have previously outlined the large range of related HPC work [11] . Broadly simulators can be classified based on the following:
• How network traffic is generated, either replaying an application trace (i.e. off-line/postmortem) or running applications on-line.
• How network traffic is modeled: Simple analytic (i.e. latency/bandwidth) models or structural simulators with switches and queues to estimate congestion effects.
• How computation is modeled: simple constitutive models, performance counter convolution [12] , to even detailed cycle-accurate simulation [13] .
Features for some (but not all) literature examples are given in Tables 2.1 [7] Both PerfCtr/Model Packet/Flow Native The closest simulators to macroscale SST are BSIM, BigSim, and SIMGRID. BSIM also emphasizes coarse-grained modeling [15] . BigSim is structural, and large simulations incorporating congestion and statistical computation modeling have been performed [28] . Like SST, BigSim can be an on-line network emulator or on-line simulator estimating elapsed time without actually performing the computation [21] . SIMGRID is also designed primarily as an on-line simulator with various bindings (including MPI) [22] .
Programming Models
The simulation emphasis here is programming model exploration, in particular many-task runtimes that break from the MPI communicating sequential processes (CSP) model. In particular, we demonstrate runtime emulation for a many-task programming model developed in the last year within the lab. Related models include Charm++ [29, 30] , Concurrent Collections (CnC) [31] , Legion [9] , Uintah [32] , or DaGuE [33] . The runtimes control placement of data and tasks, with the majority of compute-intensive kernels and large memory allocations contained within tasks. The programmer focuses on the scientific problem being solved, leaving details of task-scheduling to the runtime. The runtimes are usually based on messaging or transport layers like Cray uGNI [34] , PAMI [35] , SHMEM [36] , or Gasnet [37] . For on-line simulation of task runtimes, simulators must provide API bindings matching these message layers.
Parallel Discrete Event Simulation
Parallel discrete event simulation (PDES) can be broadly split amongst conservative and optimistic strategies [10] , although a more exact taxonomy has been given [38] . We refer to each worker (thread) in a PDES run as a logical process (LP). In general, LP here is synonymous with MPI rank, which produces some confusion as we often have a real MPI program simulating a virtual MPI program. While a given LP may schedule most events to itself, at least some events are scheduled to other LPs. In order to exploit parallelism, each LP must have its own event queue and will thus have its own notion of time. The core problem in PDES is that events sent between LPs cannot violate time ordering, i.e. events cannot be scheduled in the past. When LPs exchange events (in the current case, when messages are sent between nodes or switches), the event has a virtual time latency, e.g. it may require 100ns for a network packet to hop from switch to switch. This latency creates a time window or lookahead δ =100ns in which the LP is free to run events with no risk of time violation. Once LPs have synchronized on a given time, t, they are safe to run events in the time window (t, t + δ ). Conservative strategies enforce strict time synchronizations amongst the LPs so there is no need for checkpointing or rollback. In many cases, the enforced synchronization can severely hamper parallelism as only a few (or one) LP may have events in the time window (t,t + δ ). The efficiency of conservative strategies is highly dependent on the lookahead δ as larger values will allow much greater parallelism.
Optimistic strategies allow LPs to run events with little synchronization, allowing greater parallelism at the risk of time violations. When violations occur, state must be rolled back, for which a popular method is Time Warp [39] . The most basic rollback is checkpointing, saving application state at regular intervals to provide a fallback point. In reverse computation, only an event list is needed which can be reversed to undo state changes. Checkpointing induces a significant overhead even if no time violations occur. Reverse computation induces most of the overhead when violations occur, but does require extra storage for the event list. Detailed packet-level network simulations have been performed at very large scales with the ROSS simulator [40, 41] .
Both checkpointing and reverse computation present special challenges for on-line simulation. There is a large amount (several TB) of application state such that checkpointing is likely not feasible. Reverse computation is also difficult since we want to allow arbitrary code to be executed on real software stacks. A simple, well-defined state machine is most amenable to reverse computation, e.g. a packet sent from a queue can be placed back in its original position to reverse the event. For arbitrary code, reverse computation is not trivial. Thus, despite potential for increased parallelism with optimistic PDES, we pursue a conservative strategy here.
Conservative approaches generally operate through event queues [42, 43] . Each LP maintains an event queue for every LP it communicates with which is associated with a unique virtual timestamp. As an LP receives events, the new events advance the queue timestamp. This scheme effectively exploits locality. Even if there are 100 LPs, each LP might only communicate with, e.g., 3 neighbors. Only local point-to-point communication is required to synchronize times rather than a global collective. This scheme requires an LP to regularly receive events on each queue, though. If no events are received, the queue time is never updated and an LP will block waiting for updates. If deadlock occurs, an LP must send so-called null messages to "request" a time update (although methods avoiding null messages have been proposed [44] ).
For our workloads, the event queue algorithm becomes highly inefficient. In many network topologies and models, each LP is connected to many other LPs, erasing the benefit of local communication. Furthermore, our simulator emphasizes coarse-grained compute and network models. Large time gaps often exist between consecutive events, often much larger than the lookahead. This leads to a pathological situation in which LPs are consumed sending null messages back and forth.
Here we instead use a global synchronization algorithm. While in some ways the simplest algorithm, it actually represents an optimization for workloads encountered with macroscale SST.
while The simulation begins at t = 0 with lookahead δ determined from the network parameters. We consider a simulation with N l p logical processes. Within the time window (0, δ ) each LP will call MPI Isend to deliver events to other LPs. At t = δ , every LP performs two blocking MPI collectives. First, they perform a reduce-scatter with a sum function. Each LP tracks the total number of events and the total number of bytes sent to every other LP. This array of size 2N l p is passed as input to the reduce-scatter. Every LP receives as input from the reduce-scatter two integers: the total number of messages and the total number of bytes received. The LP can then execute MPI Recv for every incoming message. The source of each message is not important since a wildcard MPI ANY can be specified. The size passed to MPI Recv need not be exactonly greater than or equal. After completing the receive, both the actual source and size can be determined. Once all messages are sent or received, the LPs perform an MPI Allreduce with a min function to determine the minimum virtual time. Once all LPs have agreed on a minimum virtual time, t 1 , the cycle repeats for the new time window (t 1 ,t 1 + δ ). Time synchronizations are pushed into highly-optimized collective operations, minimizing the communication overhead. Additionally there is no probing or polling for messages. Each LP determines exactly how many messages are pending and can execute receives without wasting cycles checking for new messages.
Chapter 3 SST Macroscale Simulator Components
Each virtual (simulated) process or kernel thread is modeled as an explicitly-managed user-space thread. It has its own stack and register variables, but is not scheduled as a full OS thread. For simulations involving many thousands of virtual processes, the OS scheduler would otherwise become quickly overwhelmed. Most events occur on the main simulation thread, which handles events and advances the virtual time. Any event which does not directly derive from the application (packets moving through the network, memory models) occurs here.
Code stack
The SST macroscale components and software stack are shown in Figure 3. 1. An application is written with function calls to standard software APIs such as MPI or pThreads. Those calls are intercepted and passed to an SST implementation rather than the standard implementation. Network and compute activity is now modeled with SST rather than being executed on real hardware In the header file sstmpi.h, we have # define MPI_Send _SSTMAC_MPI_Send # ifdef __cplusplus extern "C" # endif int _SSTMAC_MPI_Send (...) The SST header file links MPI calls to new symbols prefixed with SSTMAC to ensure that virtual (simulated) MPI calls do not conflict with real MPI calls in the parallel simulator core. The SST implementation of MPI looks like The C function maps the current thread to an MPI object. This allows multiple (i.e. 1K-10K) virtual MPI ranks to coexist within a process. Rather than a global function modifying global state, a member function modifies state local to the object. This corrects the global variable problem explained in Section 3. 3 The operating system class passes the message to a network interface (NIC) to inject. The operating system must then block the application thread to advance time. We get a context switch from the application thread to the main discrete event simulation (DES) thread. The DES thread now models the message being injected and any ACK notifications within the hardware. When the NIC generates an ACK, it passes the message back to the OS. The ACK now causes an unblock event to occur, returning back to the application context. 
Performance Analysis
The block/unblock mechanism is central to several performance metrics collected by SST. Every key gets extra state associated with it, identifying the type of event. The simulator logs the current time as a stack variable before switching to the DES thread. When then send is complete and the application thread resumes, it can compute the time elapsed between the block and unblock. The OS object can now log the time spent in the event, which has been identified as MPI. This becomes useful in the fixed-time quanta (FTQ) charts demonstrated in Section 5.2. A function stack and backtrace (not explicitly shown) is also maintained for the active user-space thread. The simulator logs the virtual time spent with a backtrace allowing an application call graph to be generated (Section 5.1), analogous to profiling or sampling tools like kcachegrind. Although the discussion here focuses on MPI, the same discussion applies to any event that progresses virtual time (compute, memcpy, disk write).
Global variables
Because user-space threads are central to simulating parallel programs (i.e. MPI ranks) as real software stacks, the use of global variables in application code becomes complicated. Consider: Here sst_global_int is a C++ class with syntax overloading matching an int type. We are also exploring compiler-based solutions, i.e. build the abstract syntax tree, identify global variables, replace them with SST lookup functions, and build the refactored code. Global variables are currently the biggest usability stumbling block.
PDES Results
Experimental Setup
A key step in any parallel-discrete event simulation is partitioning components amongst the logical processes (LPs). As alluded to before, the optimal partition should occur on high latency links to allow the largest lookahead. In most current systems, the injection latency (i.e. NIC into network) is largest at approximately 1 µs [45] . Links between switches within the network are much smaller, on the order of 100ns or less. Very fast programming model explorations can be achieved if network details are abstracted away. In many cases, the primary focus of the study is the intrinsic load-balancing or resilience capabilities of a programming model assuming a steady, reliable network. Thus, even though in some sense primitive, simple latency-bandwidth models can still be very useful as a first-phase study. In particular, LogGOPSim [14] operates primarily through simple analytic expressions and has been used successfully on a number of problems [46, 47] . If the exact details of network switches, buffers, and queues are abstracted away, the simulation can be partitioned on injection links, greatly increasing the lookahead. Here we explore a coarse-grained hardware system model with three bandwidth and three latency parameters.
• Memory bandwidth/latency
• Injection bandwidth/latency
• Network bandwidth/latency
Even though network congestion is essentially ignored, injection and memory congestion is accounted for -albeit in a coarse-grained way. Experiments were run on the Hopper Cray XE6 at the National Energy Research Scientific Computing Center (NERSC) using 16 cores per node with one logical process per core. Although Hopper at full scale is a petascale machine, here we use at most 16 nodes (256 cores).
Weak-Scaling Experiment
We ran a weak-scaling experiment, simulating increasing problem sizes from 32K-262K network endpoints (nodes) on an asynchronous many-task matrix kernel from Section 5.2. The runtime manages task and data through a distributed hash The curve in blue shows wall-clock time. The curve turns upward, indicating some scalability concerns. The "fixed constant" in the weak-scaling experiment is the total number of discrete events per logical process. Thus the number of events simulated per virtual process per second is decreasing for larger simulations. The application being simulated does not weak-scale, though, taking slightly longer for larger runs due to load-imbalance. Each simulation is therefore not covering the same amount of virtual time. If we normalize by plotting ratio of wall time to virtual time, we see a flatter curve. This demonstrates an expected feature of conservative PDES methods. If the same number of discrete events are spread over more virtual time, the simulation will be slower even though, nominally, it is doing the same amount of work. The ratio of wall/virtual time can be made to weak-scale, though. 
Call Graph
As discussed in Section 3, the block/unblock management of threads allows an application call graph to be generated. Here we demonstrate an example call graph for an experimental matrixmatrix multiplication code being explored with the simulator based on the Strassen algorithm [48] . Here the code is written in the standard MPI programming model. When SST runs, it creates a callgrind.out file readable by the KCachegrind/QCachegrind GUI ( Figure 5 We can zoom into other areas of the call graph ( Figure 5. 2) for more details. Here we see that 33 percent of the application is actually spent in a busy poll loop within MPI. Some of the contribution comes from an all-to-all collective while the rest comes from waiting on MPI requests in the gatherMatrices function. 4 show such an FTQ graph for a matrix-multiplication kernel, written in MPI and the asynchronous task runtime being explored within our group [49] . In both figures, green indicates communication overhead (i.e. bad) while blue indicates compute activity (i.e. good). In this simulation, we set one node to be degraded, running at half-speed (e.g. overheating, bad DIMM). As time progresses in the MPI code, communication overhead grows and grows. The MPI code is synchronous and does not rebalance. Thus all workers eventually go idle waiting on messages from the degraded node. In the many-task programming model, compute activity remains steady. Communication is a constant overhead, running on a dedicated thread. Because the many-task framework can rebalance and runs asynchronously, a constant stream of work is delivered to the nodes with only occasional patches of idle time. 
Chapter 6 Summary
Here we have introduced macroscale components for the structural simulation toolkit (SST). The tools enable on-line architecture simulation for HPC applications, running real software stacks and application code. While broadly applicable, the design is particularly suited for programming model exploration and runtime system development. The complications associated with parallel discrete event simulation (PDES) are outlined. Through explicit management of user-space threads, a notion of virtual time is maintained for individual software stacks allowing metrics like call graphs and fixed time quanta to be collected as if from real application runs. PDES results show approximate weak-scaling, demonstrating that simulations involving 1.5 M explicit threads or more are feasible. Although we have fallen short of our original goal of 100:1 wall time to virtual time, further performance improvements are possible. The Cray XE6 results used MPIonly parallelism. SMP-aware optimizations should greatly decrease parallel overheads in future versions.
