As supercomputers grow, understanding their behavior and performance has become increasingly challenging. New hurdles in scalability, programmability, power consumption, reliability, cost, and cooling are emerging, along with new technologies such as 3D integration, GP-GPUs, silicon-photonics, and other "game changers". Currently, the HPC community lacks a unified toolset to evaluate these technologies and design for these challenges.
INTRODUCTION
The HPC community has recognized that the development, procurement, and operation of large, capability-class supercomputers is necessary for the advancement of a range of scientific and technical challenges ranging from basic science to climate prediction to weapons design. As the HPC community reaches into the trans-petascale regime towards exascale, the task of building these computers is becoming increasingly difficult. In addition to the traditional challenges of raw performance and scaling, fundamental challenges in performance, power consumption [21] , cost, reliability [15] , programmability [32] , and cooling arise. Overcoming these challenges will require new approaches to how we fabricate, architect, program, and operate future computers. Major changes will have to be made to the memory, network, processor, and IO subsystems, along with concurrent changes in the programming model and applications.
It is seldom practical to construct hardware prototypes of sufficient size/number to explore this vast design space. Therefore, we have to rely on simulation to guide design and procurement decisions. Currently, the HPC architecture community lacks tools needed for such evaluations. A variety of simulators exist for individual system components, but no unified framework allows them to act in concert.
To address this problem, a number of institutions 1 have developed a framework for simulating large-scale HPC systems. This simulator allows parallel simulation of large (tens to hundreds of thousands of nodes or more) machines at multiple levels of detail (from cycle-accurate execution-driven instruction-based to abstract message-driven simulation). It couples multiple models for processors, memory, IO, and network subsystems. This simulator, the Structural Simulation Toolkit (SST), aims to become the standard simulator framework for designing and procuring HPC systems by helping industry, academia, and the national labs design and evaluate future architectures.
SST Requirements
Effective supercomputer design and evaluation requires a simulation environment for quickly simulating large HPC systems in a variety of ways. Some key requirements:
• Scalable Parallel Simulation: The simulation framework must allow very large parallel simulations of even larger parallel machines. This will allow us to use the supercomputers of today to design and optimize the supercomputers and applications of tomorrow. Efficient parallel simulation will require built-in support for automatic partitioning, checkpointing, and event serialization.
• Multiscale: Different simulation models must allow either abstract or detailed evaluation of system components. This will allow different system characteristics to be evaluated at the necessary level of detail and accuracy, while still allowing other parts of the simulation to be performed in a faster but more abstract manner.
• Holistic: Raw performance is only one of several challenges for codesigned systems. The simulator must provide a unified interface to technology models, allowing estimates of power, energy, area, cost, and reliability.
• Open: To be effective, the simulator must be accessible to as large an audience as possible, both from a legal and technical standpoint.
To meet these requirements, the SST comprises a simple simulation core that contains a parallel discrete event simulator and support services for simulation. Components, representing hardware systems such as processors, network switches, or memory devices, interface with the simulation core to communicate and operate with a common notion of timeframe. The simulation core also provides support services such as power and area estimation, checkpointing, configuration/initialization of the simulation, and statistics gathering. The SST's modular interface eases the integration of existing simulators into a common framework and is licensed under a BSD-like license.
Parallel Simulation
The SST uses a parallel component-based discrete event simulation (DES) model layered on top of MPI. To achieve better performance, the SST uses a conservative (i.e. no rollback) distance-based [16] optimization. At the start of the simulation, the system topology is represented by a graph with components as nodes and connections between them as edges, with each edge labeled with the minimum latency between the connected components. The Zoltan [14] library is then used to partition components across the MPI ranks with the goal of balancing the load and partitioning across the highest latency links. Tests [29] indicate that the algorithm is scalable and shows less than 25% overhead at 128 ranks (11,904 simulated components) compared to a single rank for detailed simulations.
Multi-Scale
The SST includes a variety of processor, network, and memory models at different scales and levels of accuracy. This diversity allows the simulation user to make tradeoffs between accuracy, complexity, and time to solution, enabling an efficient design space exploration. The SST includes high-level stochastic processor models (see Section 4.8), which take statistical representations of applications and run at faster than real-time speed, and cycle-based detailed processor models based on SimpleScalar [11] (see Section 4.1), which are driven by instruction execution. Network models range from detailed flit-level router models based on the Red Storm SeaStar router (see Section 4.6) to abstract message-based models (see Section 4.4). Memory models include the highly detailed DRAMSim2 (see Section 4.2) and simple fixed-latency models.
Holistic
Modern supercomputer design is a complex multi-objective optimization in which execution speed must be balanced against power, energy, reliability, cost, and other factors. For example, current estimates of power consumption for an exascale machine using today's architectures range from hundreds of megawatts [20] to over a gigawatt [13] . In either case, with the cost of a megawatt-year of electricity being roughly $1 million [3] , powering such a machine could cost hundreds of millions to billions of dollars a year.
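The arithmetic behind this estimate can be sketched in a few lines. The rate of $1 million per megawatt-year is the rough figure quoted above; the 100 MW and 1000 MW operating points are assumed here purely as illustrative ends of the quoted range.

```python
# Annual electricity cost = average power draw (MW) x cost per megawatt-year.
COST_PER_MW_YEAR_USD = 1_000_000  # rough figure quoted in the text


def annual_power_cost_usd(power_mw: float) -> float:
    """Return the yearly electricity cost in dollars for a given draw in MW."""
    return power_mw * COST_PER_MW_YEAR_USD


# Illustrative operating points (assumed, taken from the quoted range).
for power_mw in (100, 1000):
    cost = annual_power_cost_usd(power_mw)
    print(f"{power_mw} MW -> ${cost:,.0f} per year")
```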
To assist the designer with power and energy estimation, the SST provides a common interface to a variety of power estimation libraries including McPAT [23] , Orion [6] , and Sim-Panalyzer [39] . The interface also includes hooks to allow thermal modeling tools such as HotSpot [18] to be included. This basic interface can be extended to provide area, cost, and reliability estimates as well.
Open
The SST source code and most of its components are licensed under a BSD-like license allowing free commercial and noncommercial use. This license is non-viral, and the internal interfaces are designed so that component writers do not have to expose any of their component internals. This allows commercial vendors to provide components without revealing internal details of their implementation.
RELATED WORK
The SST simulation framework builds on a long tradition of architectural and network simulators such as M5 [8] , NS-3 [17] , and A-SIM [27] . In addition, it builds upon community [37] . The SST often seeks to directly include existing simulators to build a "best of breed" framework. The novel approach of the SST is to include these individual component models in a parallel, scalable, and open-source framework.
SST INTERNALS
The SST (Figure 1 ) consists of a simulator core, which provides simulation services, and pluggable components (see Section 4) which constitute individual simulation models.
The simulator core provides simulation configuration and startup (Section 3.1), the parallel model of computation (Section 3.2.1), checkpointing (Section 3.3), and a common interface to the technology models.
Configuration and Job flow
The SST is configured with an XML file which lists the components instantiated in the simulation, any component parameters which must be passed in, the links between the components, and the latency on the component links. This configuration is processed into a graph, with the component instances as nodes and the links between them as edges, which is then fed to the Zoltan library to find a partition which balances the number of components per host rank and which maximizes the simulated latency between components on different ranks. Partitioning along high-latency links means that ranks will have to exchange messages less frequently in our conservative optimization.
Model of Computation
The simulation is carried out in a component-based discrete event model of computation. Each component can assign a clock to itself, to be triggered at regular intervals. Components can also send events to other components along links, which have a minimum latency. When an event arrives at a component, it triggers an event handler function, in which the component can process and respond to the event. Alternatively, the component can poll the link to receive and process any outstanding messages.
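To make this model of computation concrete, the following is a minimal sequential sketch in Python. It is not the SST API: the class and method names (`Simulator`, `Link`, `register_clock`, `send`) are hypothetical, and only the handler-driven delivery path is shown, not link polling.

```python
import heapq
import itertools


class Simulator:
    """Minimal sequential discrete event core: a time-ordered event queue."""

    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = itertools.count()  # tie-breaker keeps ordering deterministic

    def schedule(self, delay, callback, *args):
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), callback, args))

    def register_clock(self, period, handler):
        """Call handler(now) every `period` timesteps while it returns True."""
        def tick():
            if handler(self.now):
                self.schedule(period, tick)
        self.schedule(period, tick)

    def run(self, until):
        while self._queue and self._queue[0][0] <= until:
            self.now, _, callback, args = heapq.heappop(self._queue)
            callback(*args)


class Link:
    """One-directional link that delivers events after a minimum latency."""

    def __init__(self, sim, latency, handler):
        self.sim, self.latency, self.handler = sim, latency, handler

    def send(self, event, extra_delay=0):
        self.sim.schedule(self.latency + extra_delay, self.handler, event)


sim = Simulator()
log = []
ticks = []


def on_event(event):
    """Receiver component: event handler triggered when an event arrives."""
    log.append((sim.now, event))


link = Link(sim, latency=5, handler=on_event)


def on_clock(now):
    """Sender component: clocked, sends one event per tick for three ticks."""
    ticks.append(now)
    link.send(f"msg{len(ticks)}")
    return len(ticks) < 3


sim.register_clock(period=10, handler=on_clock)
sim.run(until=100)
print(log)
```

Each event sent on a clock tick at time t is delivered at t plus the link latency, which is the property the parallel implementation relies on to bound how far ranks may run ahead of one another.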
Parallel Implementation
Parallelism is transparent to the component writer. Components interact by sending events to each other through link objects. All events inherit from a common base class, which also includes a time tag (see Section 3.2.2) to indicate when the event should be delivered. All events must be serializable (using the Boost [1] Serialization Library), which transforms the event structure into a compact binary representation.
Whenever an event is sent, the SST core determines if the destination of the event is local (i.e. on the same MPI rank) or remote. Remote events are queued up for delivery the next time the given ranks are due to synchronize. This occurs only as often as needed, based upon the latency partitioning given by the Zoltan library. For example, if the components on two ranks are connected by a link with a minimum latency of 1000 ns, those ranks only need to synchronize every 1000 ns of simulated time.
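The relationship between the partition and the synchronization interval can be sketched as follows. This is an illustration only, assuming that two ranks must synchronize at the minimum latency of any link that crosses between them; the function name, component names, and latencies are hypothetical and are not taken from the SST or Zoltan APIs.

```python
def sync_intervals(links, rank_of):
    """Map each pair of ranks to the minimum latency of any link between them.

    links:   iterable of (component_a, component_b, latency_ns)
    rank_of: dict mapping component name -> MPI rank
    """
    intervals = {}
    for a, b, latency in links:
        ra, rb = rank_of[a], rank_of[b]
        if ra == rb:
            continue  # local link: no inter-rank synchronization needed
        pair = (min(ra, rb), max(ra, rb))
        intervals[pair] = min(latency, intervals.get(pair, latency))
    return intervals


links = [
    ("cpu0", "mem0", 10),
    ("cpu1", "mem1", 10),
    ("cpu0", "router0", 100),
    ("cpu1", "router1", 100),
    ("router0", "router1", 1000),
]

# Partition that cuts only the high-latency router-to-router link.
good = {"cpu0": 0, "mem0": 0, "router0": 0, "cpu1": 1, "mem1": 1, "router1": 1}
# Partition that cuts a lower-latency processor-to-router link instead.
poor = {"cpu0": 0, "mem0": 0, "router0": 1, "cpu1": 1, "mem1": 1, "router1": 1}

print(sync_intervals(links, good))
print(sync_intervals(links, poor))
```

With the first partition the two ranks exchange events once per 1000 ns of simulated time; with the second they must do so ten times as often, which is why the partitioner favors cutting high-latency links.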
The Boost MPI library is then used to perform the actual communication. When two ranks synchronize, they serialize and send the pending events to each other. When the events are received, they are integrated with the local event queues, where they wait for delivery to their target components.
Time
Time in the simulator is represented using a single 64-bit unsigned integer to count the number of atomic timesteps that have passed since the beginning of simulation. The actual atomic timebase (time increment represented by each atomic timestep) is user programmable and has a default of 1 ps ($10^{-12}$ seconds), which provides for over 200 days of simulated time. All times used by components and links are specified using strings (for example, "1.5 ns" or "1.73 GHz"), and are resolved at build time into a TimeConverter object. The TimeConverter object essentially represents a component's view of time and provides functions for converting from the component's timebase to the atomic timebase. The TimeConverter simply stores the number of atomic timesteps (referred to as its factor) in the desired time interval. In the case of a specified clock frequency, the factor represents the number of atomic timesteps in the clock period. For example, a component with a 1 GHz clock would get a TimeConverter object with a factor of 1000 (assuming the default atomic timebase of 1 ps), which would also be equal to the factor for 1 ns.
A TimeConverter can be created in two ways. The first is to register a clock handler, in which case the handler is called once per clock period. The second is to simply register a timebase with the simulator, which can be used with the event-driven interface. In either case the returned TimeConverter object is registered with the component's links, where it is used to convert latencies from the component's view of time to the atomic timebase. The use of TimeConverters insulates the components both from the need to know the value of the atomic timestep and from knowing their own operating frequency. This allows a component to be written with a generic timebase, which can be set at runtime.
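The conversion from a time string to a factor can be sketched as below. This is a minimal illustration rather than the SST implementation: it assumes an atomic timebase of $10^{-12}$ seconds (the value consistent with the factor of 1000 quoted for a 1 GHz clock), and the names `factor_for`, `to_atomic`, and the prefix table are hypothetical.

```python
from fractions import Fraction

ATOMIC_TIMEBASE_S = Fraction(1, 10**12)  # assumed atomic timestep, in seconds

# SI prefixes accepted in front of "s" or "Hz" (hypothetical subset).
_PREFIX = {
    "": 1, "k": 10**3, "M": 10**6, "G": 10**9,
    "m": Fraction(1, 10**3), "u": Fraction(1, 10**6),
    "n": Fraction(1, 10**9), "p": Fraction(1, 10**12),
    "f": Fraction(1, 10**15),
}


def factor_for(spec: str) -> int:
    """Return the number of atomic timesteps in a period such as '1.5 ns'
    or in one clock cycle of a frequency such as '1.73 GHz'."""
    value_text, unit = spec.split()
    value = Fraction(value_text)
    if unit.endswith("Hz"):
        period_s = 1 / (value * _PREFIX[unit[:-2]])  # frequency -> period
    elif unit.endswith("s"):
        period_s = value * _PREFIX[unit[:-1]]
    else:
        raise ValueError(f"unrecognized unit in {spec!r}")
    return round(period_s / ATOMIC_TIMEBASE_S)


class TimeConverter:
    """Stores a factor and converts component time to atomic timesteps."""

    def __init__(self, factor: int):
        self.factor = factor

    def to_atomic(self, component_time: int) -> int:
        return component_time * self.factor


clock = TimeConverter(factor_for("1 GHz"))
print(clock.factor, clock.to_atomic(3))

# Range of a 64-bit unsigned counter at the assumed timebase, in days.
max_days = float(2**64 * ATOMIC_TIMEBASE_S / 86400)
print(round(max_days, 1))
```

A frequency whose period is not a whole number of atomic timesteps is rounded to the nearest timestep in this sketch; how the simulator itself treats that case is not described here.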
Checkpointing
Because simulations may run for an extended period of time over a number of nodes, the simulator needs the ability to checkpoint and recover its state. To accomplish this, the simulator core uses the Boost Serialization Library to convert the core's state and the state of each component into a binary format. At a user defined interval, this binary state is dumped to a file which can be used to restart the simulation.
COMPONENTS
genericProc
genericProc is a configurable multi-core processor simulator descended from the SimpleScalar [11] toolset. It couples multiple copies of the sim-out-order pipeline model with a front-end emulation engine executing the PowerPC [26] ISA.
SimpleScalar is widely used in the architecture community and we have extended it with a cache coherency model, a prefetcher (using n-block lookahead), and by refactoring the memory model to allow connection with more accurate memory models, such as DRAMSim2. We have also added in event counting to help provide data for power/energy modeling. genericProc can be easily extended to access or control special hardware such as advanced memories or NICs. From the programmer's perspective this access can be done through overloading unused system calls or by a memory mapped interface. This makes the component useful for prototyping advanced processor features.
DRAMSim2
DRAMSim2 [30] is a cycle accurate DDR2/3 memory system simulator developed at the University of Maryland. The simulator models a memory controller that receives memory transactions (read, write), converts them into DRAM device commands (RAS, CAS, PRE), and issues them to simulated ranks of DRAM devices. DRAMSim2 keeps track of the state of every bank and bus in the memory system and issues requests so that they do not violate DRAM timing constraints. The simulated memory controller can safely execute memory requests out of order while respecting potential dependences in the transaction stream. DRAM device timing and power consumption parameters along with system level parameters such as memory controller queue depths, queuing structures, address mapping scheme, and row buffer policy can be easily configured using a simple ini file. Device timing parameters can be obtained from manufacturer data sheets or can be tailored to reflect new or custom DDR devices. The output of the simulation includes bandwidth, latency, and power statistics both globally and per rank for each simulation epoch. Power computation is performed using an event counting methodology developed by Micron Technologies [19] . Additionally, a visualization tool that enables graphing and comparing DRAMSim2 simulation results is currently being developed.
One of the most important goals of DRAMSim2 is that it strives to be accurate. In addition to extensive testing by hand and manual analysis of simulation output, DRAMSim2 contains an HDL validation mode for automated testing. Any simulation can be configured to output a verification file which is turned into Verilog code that can be run using Micron's DDR2/3 behavioral Verilog models. These models do extensive checking for timing violations so one can be reasonably certain that if it passes this test, the simulation results are accurate. Many non-trivial DRAMSim2 simulations have been verified, which indicates that the memory system model does not violate the DDR timing constraints.
DRAMSim2 also has the goal of being simple to integrate into simulation frameworks, such as the SST. While making extensive use of the C++ STL data structures, DRAMSim2 requires no external library dependencies and has been successfully built on Linux, OSX, and Cygwin on Windows. The DRAMSim2 library has a straightforward interface which requires minimal wrapper code (less than 180 lines of code, including headers) to work with SST. The DRAMSimC SST component provides an interface to the DRAMSim2 library. Upon instantiation, the DRAMSimC component registers a callback function for completed requests and begins to issue a clock signal to DRAMSim2. DRAMSimC converts any incoming SST memory request messages into DRAMSim2 transactions. When the memory request completes, a certain number of clock ticks later, DRAMSim2 executes the response callback, and the response is turned back into an SST message and sent back. This encapsulation allows any component to drive the DRAMSim2 simulation, such as a processor model or the included DRAMSimTraceC component, which executes memory traces.
DiskSim
The long-term goal of our I/O simulation work is to develop a complete simulation framework to evaluate the scalability of experimental I/O systems and protocols. The first step toward that goal is to demonstrate accurate simulation of existing disk-based storage technologies. Our SST components for disk simulation extend the functionality of the DiskSim software [10] , a complex and well-proven simulation model capable of simulating a large variety of disks and storage topologies from all the major manufacturers. If a particular disk is not explicitly supported by DiskSim through an existing parameter file, an additional tool called DIXtrac [33] extracts the parameters from a disk compliant with the SCSI protocol.
Our SST components for I/O are effectively lightweight wrappers around the DiskSim software, using it as a black box to provide accurate latency and bandwidth timings for block-based requests; however, integrating DiskSim with SST required a number of engineering fixes. First, we modified DiskSim to be 64-bit compliant to support current hardware architectures. After significant testing and validation, we submitted these changes back to the original developers for general distribution. Second, we developed "bridge" software to convert SST requests to DiskSim requests. Finally, we are making modifications to DiskSim to make its simulator clock compatible with the one used by SST.
At all levels of our DiskSim integration, we validated components by comparing simulated results to real measured values from the "skippy" and "seeker" benchmark codes [36] . The benchmark codes measure disk bandwidth, latency, rotational latency, head switch time, and cylinder switch time. All tests validate within reasonable error limits.
Generic Router Models
The generic router model component can be used in situations where the simulation of a large network is required, but the emphasis is on simulation of the endpoints and the detailed inner workings of the network routers are not as relevant. The router is a model of the type found in machines such as the Intel Paragon, ASCI Red, and to some extent Cray's XT line of machines. Messages are wormhole routed 2 and use source-based routing.
When a message passes through a router, a configurable hop delay is added to simulate processing of the route information. The router components act like full-bandwidth-preserving crossbar switches. If a path from an input port to an output port is available, the message is forwarded without further delays. If the output port is busy, the router component computes at what time the blocked message will be able to proceed and delays forwarding the event for the blocked message by that amount of time.
Using output port delays and input port event rescheduling, the router model can model congestion in the network, even though there is no flow control protocol between routers in place, and the links have, in essence, infinite capacity.
Link bandwidth is a parameter passed to the router model which it uses to compute output port capacity and control the flow of outgoing data accordingly.
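The output port bookkeeping described above can be sketched as follows. This is an illustrative model under stated assumptions, not the SST router code: it assumes an output port stays occupied for the serialization time of each message (size divided by link bandwidth), and the class name, method name, and units are hypothetical.

```python
class OutputPort:
    """Tracks when an output port becomes free, given the link bandwidth."""

    def __init__(self, bandwidth_bytes_per_ns: float):
        self.bandwidth = bandwidth_bytes_per_ns
        self.busy_until = 0.0  # simulated time at which the port is next free

    def forward(self, arrival_ns: float, size_bytes: int, hop_delay_ns: float):
        """Return (departure time, congestion delay) for one message."""
        ready = arrival_ns + hop_delay_ns        # route processing finished
        start = max(ready, self.busy_until)      # wait if the port is busy
        self.busy_until = start + size_bytes / self.bandwidth
        return start, start - ready


port = OutputPort(bandwidth_bytes_per_ns=2.0)
first = port.forward(arrival_ns=0, size_bytes=200, hop_delay_ns=10)
second = port.forward(arrival_ns=50, size_bytes=100, hop_delay_ns=10)
third = port.forward(arrival_ns=200, size_bytes=100, hop_delay_ns=10)
print(first, second, third)
```

The second message is ready at 60 ns but finds the port occupied until 110 ns, so its event is rescheduled by 50 ns; the third finds the port free and incurs only the hop delay.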
The router model does not support virtual channels. However, message deadlock cannot occur because messages are sent across the links in the form of single events, which do not prevent other messages from using the same links.
The router model maintains a small number of counters to enable statistics on the number of messages coming in and going out of each port, how often congestion occurred, and how much delay it caused. To provide power/energy consumption information, the McPAT or ORION power models can be enabled in a router component. This generic router model component allows for a variety of topologies. Currently, there is support for two- and three-dimensional meshes with or without wraparounds in the x, y, and z dimensions, binary tree, binary fat tree, hypercube, a flattened two-dimensional butterfly, and a fully connected graph. An XML configuration file generation tool, genTopo, is included to make configuration of large networks easier.
Communication Pattern Component
For many simulation studies, it is important to have realistic network traffic, but computation at the endpoints is of limited relevance. The communication pattern component of SST allows traffic generation without incurring the processing / memory overhead of a full endpoint simulator.
The only communication pattern implemented at present is ghost which simulates ghost cell exchanges on a five-point stencil operator where each rank communicates only with its East, West, South, and North neighbor. Implementations of communication patterns for FFT, the NAS parallel benchmark integer sort (IS), and master/slave are under way.
Each pattern generator is implemented as a state machine. They simulate compute time by suspending operations until a future event indicates the passing time and the need to transition to another state in the state machine. The state machines contain states for waiting for messages, if the algorithm has a dependency on incoming data. The state machines have additional states to enable checkpoint/restart, fault handling, and recovery after a fault.
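A stripped-down version of such a state machine for the ghost pattern might look like the sketch below. It is illustrative only: it assumes a row-major rank layout on a periodic (wraparound) 2D grid, uses hypothetical event names, and omits the checkpoint/restart and fault-handling states.

```python
from enum import Enum, auto


class State(Enum):
    COMPUTE = auto()  # simulated compute phase, waiting for its completion event
    WAIT = auto()     # ghost cells sent, waiting for all four neighbors


def ghost_neighbors(rank, width, height):
    """Neighbors of `rank` on a width x height grid (row-major, periodic)."""
    x, y = rank % width, rank // width
    return {
        "west": y * width + (x - 1) % width,
        "east": y * width + (x + 1) % width,
        "north": ((y - 1) % height) * width + x,
        "south": ((y + 1) % height) * width + x,
    }


class GhostPattern:
    def __init__(self, rank, width, height):
        self.neighbors = ghost_neighbors(rank, width, height)
        self.state = State.COMPUTE
        self.received = 0
        self.iterations = 0

    def on_event(self, event):
        """Advance the state machine; return the ranks to send ghost cells to."""
        sends = []
        if event == "ghost_received":
            self.received += 1  # may arrive before our own compute finishes
        elif event == "compute_done" and self.state is State.COMPUTE:
            self.state = State.WAIT
            sends = sorted(self.neighbors.values())
        if self.state is State.WAIT and self.received >= len(self.neighbors):
            self.received -= len(self.neighbors)
            self.iterations += 1
            self.state = State.COMPUTE  # next compute phase can be scheduled
        return sends


pattern = GhostPattern(rank=5, width=4, height=4)
targets = pattern.on_event("compute_done")
for _ in targets:
    pattern.on_event("ghost_received")
print(targets, pattern.state, pattern.iterations)
```

In a full pattern generator, the transition back to the compute state would schedule a future event representing the next compute phase, and the returned rank list would be turned into messages injected into the network model.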
Red Storm Router Model
The Red Storm router model is a near-cycle-accurate model of the SeaStar router used in the Cray XT3 through XT6 line of supercomputers. The component primarily models the internal crossbar and input/output queues of the SeaStar. Flexibility is added by parameterizing queue depths, FLIT size, and number of FLITs in a packet. The model has been compared to actual runs on Red Storm, exhibiting errors of just 5% for long messages and 12% for short messages [38] .
QSim
QSim is a front-end for execute-at-fetch microarchitectural simulations that extends and instruments the QEMU [7] processor emulator. QSim adds the ability to arbitrarily control the advancement of execution of the emulated CPU cores and to register callbacks to examine instructions and memory accesses. For this reason, it can be regarded as a reimplementation of Shade [12] for the manycore era. However, unlike Shade, which provides only user mode emulation of the Sparc and MIPS architectures, QSim provides a paravirtualized full-system simulation of 32 and 64 bit x86 CPUs. Because the advancement of execution within QSim's emulated CPUs is controlled by an external timing model, achievement of accurate instruction timing is possible, although some features, like wrong-path execution for misspeculations, remain difficult to implement.
QSim is a library external to the simulation environment, only requiring calls to a timer interrupt function to convey the passage of time to the operating system running within it. While this disables certain CPU features like the Timestamp Counter (TSC) and High-Precision Event Timer (HPET), it simplifies the design of QSim and increases the freedom of the QSim user.
SST and QSim are combined by a set of simulator-independent components called Slide. Though Slide is still a work in progress, the QSim library has been successfully demonstrated with both a simple multi-cycle timing model based on the Intel 386 instruction timings and a cycle-level model of a uniprocessor nonblocking cache hierarchy using SST as the simulation back end.
SST Stochastic Processor Models
SST currently implements two stochastic processor models that can be used in system simulation. These include an AMD Opteron and a Sun Niagara 2 processor model. The Opteron is presently a single-core model; we are in the process of developing a multi-core model. We have both a single- and multi-core Niagara 2 model. These models are statistical and based on a Monte Carlo technique [5, 34, 35] .
The Monte Carlo processor modeling technique is based on the equation $\mathrm{CPI} = \mathrm{CPI}_i + \mathrm{CPI}_s$, where $\mathrm{CPI}_i$ is the ideal or intrinsic CPI based on the instruction issue width and $\mathrm{CPI}_s$ is the CPI due to stalls (CPI is cycles per instruction). $\mathrm{CPI}_i$ is obtained from processor manuals; the stall causes are determined from both processor manuals and from micro-benchmarks designed to stress a particular processor component. For many processors, general reasons for stalls include cache misses, branch mis-predictions, and issue stalls due to data dependencies.
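The decomposition into an intrinsic CPI and a stall CPI can be illustrated with a toy Monte Carlo estimator. This sketch is not the SST processor model: it assumes each stall cause strikes each instruction independently with a fixed probability and a fixed penalty, and every number in it is a made-up placeholder rather than a measured value.

```python
import random


def monte_carlo_cpi(cpi_ideal, stall_sources, instructions=100_000, seed=0):
    """Estimate CPI = CPI_i + CPI_s by sampling stalls per instruction.

    stall_sources: list of (probability per instruction, penalty in cycles)
    """
    rng = random.Random(seed)
    stall_cycles = 0
    for _ in range(instructions):
        for probability, penalty in stall_sources:
            if rng.random() < probability:
                stall_cycles += penalty
    cpi_stall = stall_cycles / instructions
    return cpi_ideal + cpi_stall


# Hypothetical application and micro-architecture characteristics.
stall_sources = [
    (0.02, 200),  # last-level cache miss
    (0.01, 15),   # branch misprediction
    (0.10, 2),    # issue stall from a data dependency
]

# An issue width of 2 gives an intrinsic CPI of 0.5.
estimate = monte_carlo_cpi(cpi_ideal=0.5, stall_sources=stall_sources)
expected = 0.5 + sum(p * c for p, c in stall_sources)
print(f"sampled CPI {estimate:.3f}, analytic CPI {expected:.3f}")
```

For independent stall causes the sampled value converges to the analytic sum; sampling becomes useful once stalls interact or depend on processor state, which this toy version does not attempt to model.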
Processor models comprise most of the major micro-architecture components, including caches, branch predictors, issue queues, and execution units. Parameters to a processor model include characteristics of the micro-architecture and application characteristics. Micro-architecture characteristics consist of component latencies, obtained from processor manuals or from micro-benchmarks. Application characteristics include dynamic instruction mix and statistics on stall causes. These are collected using hardware performance counters and dynamic binary instrumentation tools.
The current versions of these models within SST take the dynamic execution trace, push each instruction through the model, and essentially return the cycle at which each instruction is completed. This information is passed out of the model to any connecting component models. Models can be used as high-level processor components of a larger system simulation or they can be used as stand-alone models for performance prediction and design-space analysis.
EXAMPLE MEMORY STUDY
The SST has been used for a number of studies, including network and memory studies, power/thermal modeling studies, application analysis, and network protocol optimization.
For many applications, the main memory subsystem is the dominant factor in on-node performance. Understanding how applications stress the memory system is important in optimizing applications and designing future memory systems. Using the genericProc and DRAMSim2 components, an 8-core processor connected to a DDR3 memory system was simulated. Several applications were run to examine different memory usage patterns: GUPS (random access) [4] ; PageRank from the MTGL [31] (graph traversal); Mantevo's [2] MiniMD (molecular dynamics); and Mantevo's HPCCG (a sparse matrix-vector multiplication). The average load-store queue length and memory latency are shown in Figure 2 . This experiment quickly isolates which applications are most memory intensive (GUPS and HPCCG), highlighting a performance bottleneck in the memory system when running GUPS. The extraordinarily high latency of memory operations shows an overloaded memory controller and bandwidth limitation, indicating the need for a redesign.
Using McPAT and DRAMSim2's internal memory models, it is possible to determine the major power consumers for each application (Figure 3 ).
SUMMARY
The SST is an open, modular, parallel, multi-objective, multiscale simulation framework for HPC architectural exploration. It contains a number of components including processors, memory models, network components, and storage models, ranging from very detailed to very abstract. Interfaces to a number of power and thermal models allow multi-objective design space exploration. The SST has been used in a number of architectural studies.
The SST project is continuing to grow in a variety of ways. Current projects include: development of area, cost, and reliability technology models; improvement of the partitioning algorithm to include estimates of component compute and memory requirements; more complex storage topologies and RAID [28] configurations, simulation of file system software overheads, and simulation of evolving non-volatile storage architectures such as SSDs and phase-change memory; integration of stochastic processor models with execution-based front-end(s) and detailed memory models; addition of Monte Carlo processor models for the IBM Cell BE, the HP Itanium 2, and the Sun Niagara 1; and integration of the MacSim [22] GPU model, Zesto [24, 25] processor model, and M5 node model.
