We present an end-to-end simulation framework that is capable of simulating High-Performance Computing (HPC) systems with hundreds of thousands of interconnected processors. The tool applies discrete event simulation and is driven by real-world application traces. It provides a semantically correct replay of MPI application traces and maintains reasonable simulation details of both the processors in general and the interconnection network in particular. Among other things, it features several interconnection network topologies, flexible routing schemes, arbitrary application task placement, point-to-point statistics collection, and data visualization. With a few case studies, we demonstrate the usefulness of this tool for assisting high-level system design as well as for performance projection and application tuning of future HPC systems.
Introduction

Problem Statement
The next generation of High-Performance Computing (HPC) systems will be distributed systems with hundreds of thousands of processing nodes interconnected via large packet-switched interconnection networks. In the design and development of such new systems, accompanying performance modeling by simulation is indispensable to evaluate the system design options and to help optimize the performance of the processors, the interconnection network, and eventually the entire system, including system software and HPC applications. The unprecedented system scale makes simulation at a useful level of detail a challenging task and eventually demands new methods and tools such as those we present in this paper.
Established methods and tools applied in support of prior system developments either concentrate on partial aspects only or cannot scale sufficiently. For example, individual simulations are performed of complete processor systems, including the applications running on top of an operating system on the processors. Execution-driven full-system simulation is applied by modeling processors and applications very accurately at a degree of detail down to the instruction and clock cycle level (e.g. MAMBO [1, 2], SIMICS [3]). In this way, precise application execution times, IPCs (Instructions per Cycle), cache miss rates, TLB (Translation Lookaside Buffer) miss rates, etc. can be obtained, which are the preferred performance measures for computing systems. One can simulate individual processors or small clusters of processors in this way, but the very high degree of detail does not allow this kind of simulation to scale to HPC systems of many thousands of interconnected processors: this would be too time- and resource-consuming, if not practically impossible. High scalability is required because we cannot rely on extrapolation, and the impact of an interconnection network is not covered this way either.
On the other hand, there is a large field of switch and interconnection network simulation, in particular in the telecom area. In this field, discrete event-driven simulation is typically applied at a higher degree of abstraction by modeling switches or networks of switches as queuing systems at packet-level granularity. Hence such network simulations can scale well to thousands of network ports on reasonably-sized computers. However, there the main intention is to obtain throughput and delay statistics for synthetic statistical traffic models. While this might be sufficient for telecom applications, it is not suitable for the very different characteristics of HPC interconnect traffic or for obtaining application benchmarks. Obviously the workload generated by applications running on distributed processors is not covered this way.
In a first phase of system design, the individual simulation methods for processors and for switches or interconnection networks still are excellent means for optimizing the respective subsystems. However, separately optimized subsystems typically do not yield an overall system optimum from the performance and cost points of view, because the interaction between the subsystems can be very hard to estimate without an integrated approach. Furthermore, the impact of different network topologies, switch parameters, link and network parameters, and routing, flow control, congestion control and deadlock prevention algorithms on the run time of real-world HPC applications cannot be studied by separate simulations of processors and network. The same holds for different ways of placing application tasks onto system nodes. All these network-related aspects are becoming increasingly important so that the network performance can keep up with the processor performance under proliferating system size. In prior simulation work that focused on the processor side, little attention has been paid to the large-scale system integration and particularly these network-related aspects. Hence, an integrated approach of a highly scalable end-to-end simulation combining sufficient details on both the processor/applications side and on the network side is needed for a second phase of system design. This requires the right level of abstraction and an appropriate, efficient 'light-weight' simulation tool: a tool that can flexibly cope with the numerous design alternatives that may arise during system design, and a tool that allows distributed parallel simulation to cope with the very large scale.
Our Approach
The work presented in this paper is the result of our effort towards a full-system end-to-end simulation of versions of the PERCS (Productive, Easy-to-use, Reliable Computing System) HPC architecture for implementation by 2010. We decided to build on an existing, event-driven network simulation environment we had previously developed and used for switch and interconnection network simulations in telecom applications. Based on the efficient and flexible OMNEST framework (also known as OMNeT++ [4, 5] ), this simulator allowed the simulation of multistage fat-tree or mesh-type packet-switching networks driven by statistical traffic at the appropriate level of detail. We extended this tool to incorporate the required integration with the processor and application workload aspect by replacing the existing statistical packet generators with a new abstract computing node model that is driven by real-world application traces. To cope with the significantly larger system sizes, we newly exploited the OMNEST parallel simulation capability.
As the Message Passing Interface (MPI) standard [6] is pervasively used in HPC applications, our application traces are MPI traces, i.e. traces of the MPI calls and computing events in the application software. The trace files are recorded per task of the application on a real system that should be similar to the target system, e.g. a previous-generation system. Alternatively, trace files can be generated synthetically based on deep application knowledge. Our simulator allows arbitrary placement of the tasks onto the system nodes, i.e. the replay of a particular task trace file can be associated with any arbitrary computing node in the system. The nodes replay the associated trace files by generating the appropriate semantic actions, which eventually cause I/O messages to be sent or received via the interconnection network or computing time to be spent in the node itself. The computing time is determined by parameters that account for the processor differences between the system traced and the simulated target system. A precise determination of these parameters requires the expertise of the processor developers and/or comparisons with detailed full-system simulations (MAMBO) of a single target processor. Messages injected into the network experience the full effects of the network protocols under investigation. The network time is determined by the resulting queuing times and link delays in the simulated network. By replaying application task traces in a semantically correct way, as opposed to just pushing static traffic traces into the network, our model accounts for the impact of the network loop, i.e. responses and/or acknowledgements from peer tasks are waited for to unlock subsequent computations and/or transmissions reactively. It would be difficult, however, to precisely differentiate potential effects of waiting-time-dependent application behavior, which is a well-known issue for trace-driven simulation. Fortunately, this problem does not occur in classical MPI applications.
To be useful for both the high-level system design of future HPC systems and application tuning, our simulation framework maintains an appropriate level of trace-driven simulation details of both the processors in general and the interconnection network in particular. Among other things, it features several network topologies, flexible processor, switch and network adapter models, a rich set of routing schemes, arbitrary application task placement, point-to-point statistics collection, and data visualization. First applications of our tool allowed us to provide useful feedback to system designers. In the following, we demonstrate this with a few case studies. Thanks to the possibility to simulate full-size systems by parallel simulation, we were able to validate the correct full-system function and determine the full-system performance, which in turn enabled the validation of analytical performance estimates. We have simulated up to 65,536 nodes, each with eight processor cores, on a 32-way SMP cluster. We believe that even larger simulations are possible.
Our paper is organized as follows. In Section 2 we describe the simulation methodology and the underlying simulator framework in more detail. In Section 3 we present exemplary results from four case studies to illustrate the capabilities and usefulness of our simulation environment. Owing to the focus of this paper on simulation methodology, we skip detailed quantitative descriptions of the simulated system configurations and application characteristics. Instead we focus on qualitative results. In Section 4, we briefly discuss related work, followed by conclusions and a brief outlook in Section 5.
Simulation Methodology and Tool
Simulation Framework
Due to the scope of our simulation goals, our simulation framework coherently integrates a comprehensive set of proprietary and/or third-party components:
• The OMNEST discrete event simulation core with parallel simulation support.
• Network modules describing the overall HPC system model topology, i.e., its subcomponents and the way they are interconnected.
• Pluggable modules modeling the subcomponents of the HPC system, i.e., switches, network adapters and computing nodes.
• Modules for statistics collection and central control functions.
• Postprocessing tools for results visualization.
• Tools for application-trace recording on real HPC systems.
• Tools to process/translate application traces of different formats.
• Tools that assist manual trace synthesis.
Throughout our design and implementation, we carefully optimize each component in great detail and introduce parallel simulation support to achieve both high simulation accuracy and very large simulation scalability. In the following we describe the simulation framework, its components, and further aspects of the simulation methodology in more detail by using an underlying system model of an exemplary HPC system with a fat-tree interconnection network. Figure 1 illustrates a high-level overview of the exemplary HPC simulation system model.
For the network part, our simulation framework builds around two basic, flexibly configurable and replicable OMNEST modules, i.e. a switch module and a network adapter module. These two basic modules are designed in such a way that any arbitrary interconnection network topology can be flexibly arranged in an OMNEST network description by explicit or algorithmic specification of the connections between multiple instantiations of the switch and network adapter modules. Over time, we have created a set of network descriptions for the major network topologies under discussion in the HPC community and in our own projects. For illustration purposes, Figure 1 shows a particular three-level fat-tree topology. Our corresponding OMNEST network description is defined universally for fat trees up to a reasonable maximum number of fat-tree stages. Using conditional module array sizes and conditional connections allows us to parameterize the actually desired number of levels and the size of the individual levels.
Originally, our network simulator was limited to synthetic statistical traffic. For this purpose, each network adapter was fed by a statistical packet generator module.
Certainly, this continues to be useful for modeling random-access benchmarks, such as the GUPS (Giga updates per second) benchmark or the PTRANS (parallel matrix transpose) benchmark, which are useful tests for the full-scale communications capacity of the network [7] . These benchmarks assume randomly distributed all-to-all traffic of short (PTRANS) or very short (GUPS) messages, which can be modeled much more easily by synthetic generators than by traces.
Simulation Model
On the other hand, our new extension for end-to-end simulation replaces the original packet generator module by a new, generic computing-node module driven by application traces. To achieve simulation scalability, this node module is confined to a reasonable level of abstraction. As illustrated in Figure 1 , each node module is a compound OMNEST module that models multiple processor cores interconnected by a bus model. Furthermore, one or multiple application tasks can run concurrently on the node's processors. This is controlled by a system kernel. The corresponding task modules perform semantic actions driven by an application trace file. In the OMNEST configuration file, each task of an application, i.e. its associated trace file, can be assigned to a task module of an arbitrary node module. Thereby one or multiple tasks can be placed onto the same node. Optionally, each task of an application can be placed to multiple nodes. In this way, multiple simultaneously running partitions of the same application can be modeled.
Obeying the causality between messages, the task modules replay the associated trace files by generating the appropriate semantic actions that eventually cause computing time to be simulated in the node itself or MPI messages to be sent or received via a network adapter across the network of switches. MPI messages are sent to the ingress part of the source network adapter, where they are segmented into smaller network packets referred to as 'flits'. The flits traverse the switch modules in one or multiple hops before they eventually reach the egress part of the destination network adapter, where they are reassembled into the original MPI messages. These are forwarded to the receiving node module, where computations or MPI receive actions are scheduled to wait for them. At the same time acknowledgement messages are returned to the source the received message originated from.
Having generator-based node modules as well as trace-replaying node modules available, it also becomes possible to configure simulations with both types of nodes. Thereby a set of trace-replaying nodes may represent an MPI application partition that is running in the presence of synthetic random background traffic generated by a complementary set of nodes configured as generators. This is far more realistic than running an MPI application in an empty system and it is less complex than producing background traffic by running multiple different MPI applications simultaneously.
Model Subcomponents
Switch Module
The switch module has a flexible combined input- and output-buffered architecture as illustrated on the left-hand side in Figure 2. It can be configured into most popular switch architectures by parameterization. The size of the switch in terms of number of ports, the port speeds, the internal speed of the crossbar, internal packet processing and arbitration delays, the number and logical organization of input and output queues, the number of virtual circuits or priority classes and various scheduling options are examples of switch parameters. Further functions supported are credit-based flow control, numerous routing options, and deadlock-prevention mechanisms. In simulations accompanying the development of new switches, it frequently happens that new functions need to be added or others changed. In OMNEST this is possible in a flexible manner because the functions of the lowest-level modules (referred to as simple modules) are programmed in C++. For other specific cases we have also simplified switch models, for example, a generic InfiniBand switch module.
Network Adapter Module
The network adapter module consists of an ingress and egress part, as illustrated on the right-hand side in Figure 2 . Similar to the switch module, most realistic network adapter architectures can be configured by parameterization. The link speed at the node and network side, internal packet-processing and arbitration delays, the number and logical organization of the queues, the physical buffer sizes, and scheduling options are examples of network adapter parameters.
The ingress part of the network adapter performs the segmentation of MPI messages into network packets, the flits. Correspondingly, the egress part performs the reassembly of flits into MPI messages. Optionally, a resequencing function is provided to account for the fact that flows of packets can get out of sequence in multipath network topologies. Understanding the overhead of re-sequencing is important because this overhead may destroy expected gains of sophisticated multi-path routing schemes.
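The segmentation, reassembly and resequencing functions described above can be illustrated with a minimal Python sketch. It is not the simulator's implementation (which is built from OMNEST modules); the names `Flit`, `segment` and `Reassembler`, the fixed flit size, and the per-flit sequence number are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flit:
    msg_id: int   # hypothetical identifier of the MPI message
    seq: int      # position of this flit within the message
    total: int    # number of flits the message was split into
    payload: int  # payload bytes carried by this flit


def segment(msg_id, msg_bytes, flit_bytes):
    """Ingress side: split a message into flits of at most flit_bytes."""
    total = max(1, -(-msg_bytes // flit_bytes))  # ceiling division
    flits, remaining = [], msg_bytes
    for seq in range(total):
        size = min(flit_bytes, remaining)
        flits.append(Flit(msg_id, seq, total, size))
        remaining -= size
    return flits


class Reassembler:
    """Egress side: buffer flits by sequence number until a message is complete."""

    def __init__(self):
        self.pending = {}  # msg_id -> {seq: Flit}

    def receive(self, flit):
        """Return the reassembled message size, or None while flits are missing."""
        slots = self.pending.setdefault(flit.msg_id, {})
        slots[flit.seq] = flit
        if len(slots) < flit.total:
            return None
        del self.pending[flit.msg_id]
        # Reading the slots in sequence order is the resequencing step.
        return sum(slots[s].payload for s in range(flit.total))
```

Because flits are stored by sequence number, the arrival order does not matter, which mimics the optional resequencing function. The buffer occupancy in `pending` hints at where the resequencing overhead comes from in multipath topologies.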
Trace-replaying Node Module
As described above, the trace replaying node module provides a model for multiple processor cores and multiple trace-driven tasks running on the processor cores. There are a number of parameters that determine the processor speed, such as the system bus bandwidth, the number of memory controllers, HWMMIO (Hardware Memory Mapped IO) latency, and system call latencies relative to the parameters of the real system on which the application traces are recorded. One may also set these parameters to values to project a future system configuration.
An interesting option is to set the processor speed to infinite. In this way, runtime results are obtained that exclusively contain the portion of time spent for I/O and in the network. This allows the impact of the network to be isolated and enables meaningful comparisons between different network options. Furthermore, based on the comparison of the runtime with the real, finite processor speed, one can evaluate the balance between processing and networking, i.e. whether a system is processing-limited, network-limited, or balanced.
Task Module
The key submodule of the trace-replaying node module is the task module. Its high-level architecture is illustrated in Figure 3 . A trace reader function reads the lines from the trace file assigned to this particular task and node. As there might exist thousands of trace files that need to be accessed constantly, we apply a caching mechanism to this trace-reading function.
From a read trace line, a semantic action scheduler determines which kind of MPI-call-related module needs to be called or whether computation time needs to be simulated. For each MPI call type that is expected to occur in MPI traces, a corresponding module is provided. Point-to-point MPI calls directly involve a Send or Recv operation to be performed. A Send operation eventually causes an OMNEST message to be created that represents the MPI message. The message size is determined by the size indicated in the trace line. Because the target of an MPI message is another task rather than a network address as required for routing in the network, the destination network address for the message created has to be determined by a specific function that finds the network address of the node module the target task is placed to. On the other hand, a Recv operation waits for actual reception of a corresponding message. In the case of MPI collectives, the corresponding modules simulate what the MPI library functions supposedly do: for example, a Broadcast operation might be decomposed into multiple sequential point-to-point actions.
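As an illustration of the semantic action scheduler, the following Python sketch maps trace lines to simulator actions. The trace line format (`compute`, `send`, `recv`, `bcast` records), the action tuples, and the `speed_ratio` parameter are hypothetical and chosen only for this example; the sequential broadcast decomposition is one possible choice, not necessarily what a given MPI library does.

```python
# Hypothetical trace line format (an assumption, not the authors' format):
#   compute <seconds>
#   send  <peer_rank> <bytes>
#   recv  <peer_rank> <bytes>
#   bcast <root_rank> <bytes>


def parse_trace_line(line):
    """Turn one trace line into a (kind, *args) tuple."""
    fields = line.split()
    kind = fields[0].lower()
    if kind == "compute":
        return (kind, float(fields[1]))
    if kind in ("send", "recv", "bcast"):
        return (kind, int(fields[1]), int(fields[2]))
    raise ValueError(f"unsupported trace record: {line!r}")


def semantic_actions(record, my_rank, placement, speed_ratio=1.0):
    """Map a parsed record to simulator actions for task my_rank.

    placement maps a task rank to the network address of its hosting node.
    speed_ratio scales traced compute time to the simulated processor.
    """
    kind = record[0]
    if kind == "compute":
        return [("delay", record[1] / speed_ratio)]
    if kind == "send":
        _, peer, size = record
        # The target is a task, so its node address must be looked up.
        return [("inject", placement[peer], size)]
    if kind == "recv":
        _, peer, size = record
        return [("wait_for", peer, size)]
    if kind == "bcast":
        _, root, size = record
        if my_rank != root:
            return [("wait_for", root, size)]
        # Naive decomposition: the root sends to every other task in turn.
        return [("inject", placement[r], size)
                for r in sorted(placement) if r != root]
    raise ValueError(f"unsupported record kind: {kind}")
```

With a placement such as `{0: 10, 1: 14, 2: 18}`, a `send 2 4096` record on task 0 becomes an injection of 4096 bytes addressed to node 18.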
For the MPI application traces we have studied so far, we only had to support about a dozen different MPI calls. In general, however, there are two problems using this approach. One is that MPI defines many dozens of MPI call types, and it simply is a lot of work to write all the corresponding modules, some of which may never be needed. Second, the specific simulation module implementations for collectives might not be appropriate for other MPI library implementations. The efficiency of different MPI library implementations varies, and some are optimized for the specific network topology of the system they are provided for. Furthermore, collective library functions may even behave differently for different placements of the tasks onto the system nodes. Hence, as much as possible, we try to obtain traces that are not taken at the usual MPI call level but rather at a lower level, where eventual point-to-point messages can be recorded. This has the advantage that simulation modules for MPI collectives are no longer needed in our task module. Second, the properties of the MPI library implementation are now reflected in the low-level trace rather than hard-coded in the simulator. So the simulator implementation remains independent of the specifics of the individual MPI library implementations, although the simulation results still reflect the differences in the MPI level software implementations. In this way, the simulator becomes a helpful tool also for the software design of the MPI library for the target system. However, to precisely reflect even details of different code behavior for different task placements, it becomes necessary to record individual traces for different task placements. Using the same trace for different task placements would only be an approximation.
Routing Function
For routing, our network simulator framework supports either hop-by-hop or source routing. In the case of hop-by-hop routing, the route determination is performed step-by-step in the switches, i.e. each switch determines through which port it has to route a given packet. In the case of source routing, the route determination is performed in the network adapter once for the entire route from source to destination. The sequence of route hops, also referred to as source route, is then carried in the header of the network packets so that each switch can extract its route decision from the packet header and act accordingly.
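The difference between the two routing modes can be sketched in a few lines of Python. The data structures used here (`next_port` as a per-switch table and `links` as a map from a switch port to the attached module) are assumptions for illustration and do not reflect the simulator's internal representation.

```python
def hop_by_hop(first_switch, dest, next_port, links):
    """Each switch looks up its own output port for the destination.

    next_port[switch][dest] -> output port (hypothetical per-switch table)
    links[(switch, port)]   -> module attached to that port
    Returns the list of (switch, port) decisions taken along the way.
    """
    hops, node = [], first_switch
    while node != dest:
        port = next_port[node][dest]
        hops.append((node, port))
        node = links[(node, port)]
    return hops


def build_source_route(first_switch, dest, next_port, links):
    """Adapter side: determine the whole port sequence once, up front."""
    return [port for _, port in hop_by_hop(first_switch, dest, next_port, links)]


def forward_source_routed(first_switch, route, links):
    """Each switch only pops its decision from the packet header."""
    node, header, hops = first_switch, list(route), []
    while header:
        port = header.pop(0)
        hops.append((node, port))
        node = links[(node, port)]
    return node, hops
```

When both modes draw on the same table, they visit the same switches and ports; they differ only in where the lookup happens.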
In both cases, the actual route decision can be based on algorithms or on routing tables. For the network topologies of interest, we have written specific linear routing algorithms that can be plugged into the switch module (for hop-by-hop routing) or into the network adapter module (for source routing) when a particular topology is being studied. The advantage of such algorithms is that they can be made to reflect the exact specifics of the routing in the real system. The drawback is that they have to be written for every new topology or class of topologies. Alternatively, or generally, routing tables as are applied in many real systems can be used. For this case, we use a central control module that can read a routing table file and forward the corresponding table portions to all switch modules (for hop-by-hop routing) or to all adapter modules (for source routing) in the network.
If neither routing algorithms nor routing tables are available or before they become available, the central module of our simulator can also generate a routing table file itself for any arbitrary network topology by using the OMNEST topology exploration concepts. From each switch (for hop-by-hop routing) or each adapter (for source routing) to each destination, the unweighted single shortest-path function provided by OMNEST is applied to the topology object that incorporates all switches and network adapters. This function is based on the known shortest-path algorithm by Dijkstra [8]. For lack of a weighted shortest-path function in OMNEST, we optionally place one or multiple dummy modules onto network links and incorporate these in the topology object. These dummy nodes have no function other than counting as a node in the OMNEST topology object. In this way, we can model a limited weighted shortest-path algorithm by using the existing Dijkstra algorithm. This is useful if a higher number of short-length and/or high-bandwidth hops should be preferred over a lower number of long-length and/or low-bandwidth hops.
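The dummy-node technique can be demonstrated with a small, self-contained Python sketch. Here a plain breadth-first search stands in for the unweighted shortest-path function of OMNEST, and integer link weights are an assumption: a link of weight w is replaced by a chain containing w - 1 dummy nodes, so that hop count equals weighted path length.

```python
from collections import deque


def expand_with_dummies(weighted_links):
    """Build an unweighted adjacency map from (a, b, weight) links.

    A link of integer weight w >= 1 becomes a chain with w - 1 dummy nodes.
    Assumes at most one link per node pair (dummy names would clash otherwise).
    """
    adj = {}
    for a, b, w in weighted_links:
        chain = [a] + [("dummy", a, b, i) for i in range(w - 1)] + [b]
        for u, v in zip(chain, chain[1:]):
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


def hop_distance(adj, src, dst):
    """Unweighted shortest-path length by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None  # dst is unreachable from src
```

For example, with a direct link A-B of weight 4 and a detour A-C-D-B of three weight-1 links, the search on the expanded graph prefers the detour (length 3), whereas without dummy nodes it would pick the single direct hop.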
Furthermore, as there might exist multiple shortest paths, as e.g. in the fat-tree topology, we need to find all of them. For lack of a multiple shortest-path function in OMNEST, we do the following. We determine the length of the single shortest path found by the OMNEST single shortest-path function, starting from the desired specific source switch. Then we apply the same single-path algorithm iteratively, starting from all neighbor switches the ports of that specific switch lead to. Then, all shortest paths found that have a length that is shorter by one than the initially determined length are alternative shortest paths. Our routing tables finally contain all possible shortest paths for each source/destination pair.
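The iterative procedure for finding alternative shortest paths can be sketched as follows. Again, breadth-first search is used here as a stand-in for the single shortest-path function, and the adjacency-map representation and the function names are assumptions for illustration.

```python
from collections import deque


def hop_distance(adj, src, dst):
    """Single shortest-path length (breadth-first search, unweighted)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


def shortest_next_hops(adj, src, dst):
    """All neighbors of src that lie on some shortest path to dst.

    A neighbor qualifies if its own shortest path to dst is exactly one
    hop shorter than the shortest path from src.
    """
    best = hop_distance(adj, src, dst)
    if best is None:
        return []
    return sorted(n for n in adj[src]
                  if hop_distance(adj, n, dst) == best - 1)
```

In a two-level fat tree with leaf switches `L0`, `L1` and spine switches `S0`, `S1`, a packet at `L0` destined for a node below `L1` obtains both spines as equally short next hops. Repeating the step at every switch yields all shortest paths for a source/destination pair.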
As we encode the relevant system topology and configuration data in the file name of the routing table file, the system can automatically determine whether a routing table file for the actual system has been generated previously and, if so, can re-use it to save computation time. In addition to having our simulator generate routing tables as described above, we can use tables generated externally by other means or tables created or modified manually.
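A minimal sketch of this reuse mechanism is given below. The file name pattern and the parameter names (`topology`, `levels`, `ports`, `routing`) are purely hypothetical; the text only states that the relevant topology and configuration data are encoded in the file name.

```python
import os


def table_file_name(topology, levels, ports, routing):
    """Encode the configuration in the file name (hypothetical pattern)."""
    return f"rt_{topology}_L{levels}_P{ports}_{routing}.txt"


def load_or_generate(directory, params, generate):
    """Reuse a previously generated routing table file if one exists.

    generate is a callable that computes the table text when needed.
    Returns (table_text, reused_flag).
    """
    path = os.path.join(directory, table_file_name(**params))
    if os.path.exists(path):
        with open(path) as f:
            return f.read(), True
    table = generate()
    with open(path, "w") as f:
        f.write(table)
    return table, False
```

Any parameter that changes the resulting routes must be part of the file name; otherwise a stale table would be picked up silently.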
If multiple paths exist, independently of whether they are determined by a table or by a specific algorithm, we have to apply another step that selects one of the many possible paths. For this we have numerous predetermined options, ranging from static via random or round-robin path selection to state-dependent adaptive schemes based on shortest queues or highest number of available flow-control credits. Note that adaptive routing makes most sense with hop-by-hop routing, which was our motivation to provide hop-by-hop routing. Otherwise, source routing might be desirable because many actual systems apply source routing. However, in many cases, hop-by-hop and source routing are logically identical.
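The path selection options listed above can be sketched in Python as follows. These are simplified illustrations under stated assumptions: the static scheme is taken to be a destination-modulo choice, and queue lengths and credit counts are passed in as plain dictionaries keyed by port. None of the function names come from the simulator.

```python
import random


def select_static(ports, dest):
    """Static: the same destination always maps to the same port (assumed modulo rule)."""
    return ports[dest % len(ports)]


def select_random(ports, rng):
    """Random: pick any candidate port with equal probability."""
    return rng.choice(ports)


class RoundRobin:
    """Round-robin: cycle through the candidate ports packet by packet."""

    def __init__(self):
        self.next_index = 0

    def select(self, ports):
        port = ports[self.next_index % len(ports)]
        self.next_index += 1
        return port


def select_shortest_queue(ports, queue_len):
    """Adaptive: prefer the port whose output queue is currently shortest."""
    return min(ports, key=lambda p: queue_len[p])


def select_most_credits(ports, credits):
    """Adaptive: prefer the port with the most flow-control credits available."""
    return max(ports, key=lambda p: credits[p])
```

The two adaptive schemes need up-to-date local state, which is naturally available in a switch at forwarding time; this is why they pair well with hop-by-hop routing.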
If virtual channels are used, routing may also include a final step of selecting the right virtual channel for the next hop. Specific algorithms for this are needed in certain topologies (although not in fat-tree networks) to prevent cyclic dependency deadlocks [9] .
Task Placement
The task-to-node placement is principally provided by assigning the number identifying the task (the task rank) to a taskRank parameter of a particular node in the OMNEST configuration file. This could be done explicitly task-by-task, such as:
    **.node[0].taskRank = 0
    **.node[4].taskRank = 1
    **.node[8].taskRank = 2
    ...
While this allows a truly arbitrary assignment of tasks, it might become tedious for a regular assignment of a large number of tasks. To simplify this, we initially introduced an alternative function assignTask() (defined in the OMNEST distributions file) that automatically assigns all tasks of an application to nodes in a predetermined way. In this way, only one configuration line is necessary, such as:
where <parameter> identifies one of the predetermined schemes, such as sequential task-to-node assignment (task i to node i) or randomly shuffled task-to-node assignment. More recently we opted for a more flexible way of entering task placement information that is specified in XML format in a separate file. In this way, we can specify the task-to-node placement of more than one application partition of the same and/or of different applications. Note that XML also enables portability of file formats, which can facilitate the integration with other simulation tools that may have different file formats.
For the XML-based task placement, we added a specific task placement module that has an OMNEST XML-type parameter that refers to the XML document file containing the desired task placement information. At the beginning of a simulation, this task placement module reads and parses the XML document, and according to its content dynamically creates nodes of the appropriate type (i.e. trace-replaying nodes or generators), and correctly assigns the taskRank parameters for trace-replaying nodes. In the syntax of the XML document, we allow numerous notation options for explicit task-to-node placement or notations for assigning ranges of tasks to individual nodes or to ranges of nodes. Range-wise task-to-node assignments can optionally be specified as randomly shuffled by adding a random="true" attribute. Possible task placement notations in our XML format may look like:
    <application id="0" trace="tracefilename1" statistics="true">
      <placement task="0" node="0" />
      <placement task="1" node="4" />
      <placement task="2" node="8" />
      ...
    </application>
    <application id="1" trace="tracefilename2">
      <placement task="0..7" node="128" />
      <placement task="8..31" node="129..131" />
    </application>
    <application id="2" trace="tracefilename2">
      <placement task="0..31" node="133..164" random="true" />
    </application>
    <generators>
      <placement node="165..255" />
    </generators>
    <application id="4" like="2" offset="256">
    </application>
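To make the range notation concrete, the following Python sketch parses placement elements of the kind shown above with the standard `xml.etree.ElementTree` module. Several points are assumptions and not taken from the simulator: the fragment is wrapped in a hypothetical root element, task ranges are distributed over node ranges in equal contiguous blocks, `random="true"` is interpreted as shuffling the task order before distribution, and the `generators` element and the `like`/`offset` attributes are not handled.

```python
import random
import xml.etree.ElementTree as ET


def parse_range(text):
    """Parse '5' or '8..31' into an inclusive list of integers."""
    lo, _, hi = text.partition("..")
    return list(range(int(lo), int(hi or lo) + 1))


def expand_placement(elem, rng):
    """Expand one <placement> element into a {task: node} mapping."""
    tasks = parse_range(elem.get("task"))
    nodes = parse_range(elem.get("node"))
    if elem.get("random") == "true":
        rng.shuffle(tasks)  # assumed meaning of the random attribute
    if len(tasks) % len(nodes):
        raise ValueError("tasks do not divide evenly over nodes")
    per_node = len(tasks) // len(nodes)  # assumed equal-block distribution
    return {t: nodes[i // per_node] for i, t in enumerate(tasks)}


def read_placements(xml_text, seed=0):
    """Return {application id: {task: node}} for all <application> elements."""
    rng = random.Random(seed)
    root = ET.fromstring(xml_text)
    result = {}
    for app in root.iter("application"):
        mapping = {}
        for placement in app.findall("placement"):
            mapping.update(expand_placement(placement, rng))
        result[app.get("id")] = mapping
    return result
```

Under these assumptions, `task="8..31" node="129..131"` places eight consecutive tasks on each of the three nodes, and a fixed seed keeps shuffled placements reproducible between runs.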
Application Traces
Traces of MPI applications running on real systems are produced by 'instrumenting' the application code with additional tracing code that records MPI and computation events chronologically per task into trace files. It is important that there is as little disturbance of the application execution by the instrumentation code as possible. Various tool kits exist for this purpose including their associated trace file formats. Examples are the Sequoia tool kit [10], the TAU tracing tool of the Open Trace Format (OTF) tool kit [11], the tracing facility of the UPC Paraver tool [12], or the HPCT tool kit as applied for analyzing IBM's Blue Gene/L [13]. From various laboratories and supercomputer centers, we have obtained a number of popular HPC application traces of different sizes (number of tasks) from various scientific application fields, such as ocean modeling (HYCOM [14], POP [15]), weather research and forecast (WRF [16]), shockwave physics (CTH [17]), molecular dynamics (AMBER-HHAI and AMBER-JAC [18], CPMD [19], LAMMPS [20]), fusion physics (GYRO [21]), gas dynamics (SPPM [22]), and radiation transport physics (SWEEP3D [23], UMT2K [24]).
Because of their different origins, these traces have more or less comprehensive formats from different tracing tools. For the purpose of our simulator, we decided to use a simple trace format derived from one of the formats we obtained. Our trace reader can read this format directly. For other trace formats, we decided to write separate programs (Perl scripts) for translating the relevant subset of trace information from the original trace into the format understood by our trace reader.
In some cases, traces are very long and unwieldy although the really interesting phase of the trace is relatively short. Then it is necessary to negotiate with the trace provider for some way to identify the interesting phase in the program and only trace this phase. However, it is not trivial to start a trace in the middle of an application. Because of the causal relationship among different tasks, this must happen in a controlled way so that no task waits for messages that have not been traced before.
For another set of applications in the field of weather forecasting, such as DWD (German Weather Service) and ECMWF (European Centre for Medium-Range Weather Forecasts) versions T639 and 4DVar, we use synthesized artificial traces that were generated by application experts with the help of separate programs (Python scripts).
Common to all application traces we have considered so far are their very wide message size distributions and similar characteristics in the communication patterns. Message sizes may vary between one byte and many megabytes. Figure 4 shows a message-size histogram measured for two exemplary applications, i.e. the molecular dynamics application LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) and the photon transport application UMT2K (Unstructured Mesh Transport version 2000). Note that this diagram is double-logarithmic. Because of the wide span, the message size range is divided into logarithmically spaced cells.
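The logarithmically spaced cells can be illustrated with a minimal sketch. It assumes power-of-two cell boundaries, which the text does not specify; the function names `size_cell` and `message_size_histogram` are hypothetical and not part of the simulator.

```python
from collections import Counter


def size_cell(size_bytes):
    """Index k of the cell [2**k, 2**(k+1)) that contains size_bytes.

    Assumption: cells are spaced by powers of two. Integer bit_length is
    used instead of a floating-point logarithm to avoid rounding issues.
    """
    if size_bytes < 1:
        raise ValueError("message size must be at least one byte")
    return size_bytes.bit_length() - 1


def message_size_histogram(sizes):
    """Map (lowest size, highest size) of each non-empty cell to its message count."""
    counts = Counter(size_cell(s) for s in sizes)
    return {(1 << k, (1 << (k + 1)) - 1): n for k, n in sorted(counts.items())}
```

With such cells, a one-byte message and a one-megabyte message end up only 20 cells apart, which keeps a histogram over the full size range readable.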
Traffic patterns typically exhibit much more near-neighborhood than far-distance traffic and, when represented in a matrix, generally diagonal patterns. The diagram on the left in Figure 5 shows a typical example of such a traffic matrix for the LAMMPS application. The high diagonal patterns essentially are made by long messages, whereas the ground floor in this matrix is mainly formed by the short messages. From this characteristic, it becomes clear that interconnection network topologies that favor near-neighbor communication are preferable, which leads to the topologies we are considering in our studies. However, this only makes sense if the tasks are placed onto the system in sequential order as is done in that diagram. In contrast, the diagram on the right in Figure 5 shows how the traffic matrix would look if the tasks were placed in a random way onto the system. In this case, the network topology ideally would have to be equally fair to close and far-distance communication patterns. As sequential task placement is not always possible in a real system, where many nodes may be occupied by other applications, we need to be able to study the impact of different task placements.
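The effect of task placement on the traffic matrix can be sketched as follows. This is an illustration only: the message tuples, the helper names, and the byte-weighted index distance used as a locality measure are all assumptions and not taken from the simulator.

```python
import random


def traffic_matrix(messages, placement, num_nodes):
    """Accumulate bytes per (source node, destination node) pair.

    messages:  iterable of (src_task, dst_task, nbytes) tuples (hypothetical format)
    placement: placement[task] is the node that hosts the task
    """
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for src_task, dst_task, nbytes in messages:
        matrix[placement[src_task]][placement[dst_task]] += nbytes
    return matrix


def sequential_placement(num_tasks):
    """Task i is placed on node i."""
    return list(range(num_tasks))


def random_placement(num_tasks, seed=0):
    """Tasks are placed on a random permutation of the nodes."""
    nodes = list(range(num_tasks))
    random.Random(seed).shuffle(nodes)
    return nodes


def mean_distance(matrix):
    """Byte-weighted mean of |source index - destination index|."""
    total = sum(map(sum, matrix))
    weighted = sum(abs(i - j) * b
                   for i, row in enumerate(matrix)
                   for j, b in enumerate(row))
    return weighted / total if total else 0.0
```

For a ring of eight tasks that each send to their successor, sequential placement keeps the traffic next to the diagonal (mean distance 1.75), whereas a shuffled placement spreads it across the matrix.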
Parallel Simulation Aspects
OMNEST supports parallel distributed simulation [25] provided that some constraints are obeyed, as outlined in the OMNEST manual or in [26]. The network must be partitioned into segments that can be physically distributed.
For this it is useful to define the OMNEST network description as a structure of submodules rather than as one flat network of basic switch and network adapter modules. Our basic modules are designed in such a way that they can be arbitrarily nested. For example, relative references such as to the parent module are avoided. The highest-level submodules of the network or sets thereof can then serve as the parallel partitions in a distributed simulation. In some architectures, the partitioning is naturally given by the physical packaging of the system. An example of this is the hierarchical direct interconnect system described in Section 3.1. In this model, the connectivity in the first hierarchy level is essentially on-board connectivity, in the second board-to-board connectivity, and in the third rack-to-rack connectivity. In parallel simulations of this model, we used parallel partitions of small sets of racks.
The requirement for a look-ahead time by the OMNEST parallel simulation feature is sufficiently fulfilled by the fact that the only connections between physical partitions in our models are network links, which are modeled realistically with a finite delay in any case. One of the constraints to obey in parallel simulations is to avoid global objects. However, lookup tables of global data can hardly be avoided in practice. Hence, each parallel partition must have its own identical copy of global lookup tables. As lookup tables are normally read-only, synchronization of these table copies is not needed after they have been initialized with global information at simulation start time. An example for such a lookup table is the one underlying our function that returns the network address of the node on which a specific task is placed. Initially, the system-wide global task placement information is contained in a corresponding XML file. As mentioned above, we use a module responsible for reading and parsing this XML file at initialization time and eventually transforming the information into a lookup table. We provide at least one instance of this module in each parallel partition and ensure that its code is executed no more than once. In this way we create identical lookup tables in each parallel partition, with each containing the global task placement data. In sequential simulations, only one such table would be created.
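The per-partition lookup table described above can be sketched as follows. The XML element and attribute names (`placement`, `application`, `task`, `id`, `node`) are hypothetical, since the actual schema of the placement file is not given here, and the class `PlacementReader` merely stands in for the module that parses the file.

```python
import xml.etree.ElementTree as ET

# Hypothetical placement file content; the real schema is not specified.
EXAMPLE_XML = """
<placement>
  <application name="appA">
    <task id="0" node="12"/>
    <task id="1" node="13"/>
  </application>
  <application name="appB">
    <task id="0" node="40"/>
  </application>
</placement>
"""


def build_placement_table(xml_text):
    """Parse the placement XML into {(application, task): node address}."""
    table = {}
    root = ET.fromstring(xml_text.strip())
    for app in root.iter("application"):
        for task in app.iter("task"):
            table[(app.get("name"), int(task.get("id")))] = int(task.get("node"))
    return table


class PlacementReader:
    """One instance per parallel partition; the file is parsed at most once."""

    def __init__(self, xml_text):
        self._xml_text = xml_text
        self._table = None  # filled on first use, read-only afterwards

    def node_of(self, app, task):
        if self._table is None:
            self._table = build_placement_table(self._xml_text)
        return self._table[(app, task)]
```

Because every partition builds its table from the same input and never modifies it afterwards, the copies stay identical without any synchronization.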
In the process of preparing the simulator for parallel simulation, some caution was required regarding where module or gate objects nested in other submodules are accessed. Some OMNEST APIs return the intended object in a sequential simulation, but will return placeholder modules or proxy gates if the target object is in another partition of a parallel simulation. Similarly, when working with OMNEST topology objects, a topology object of switch modules, for example, contains all switch modules of the entire system in a sequential simulation. In a parallel simulation, it contains only the switch modules that are inside the current partition. Understanding and taking these facts into account were essential for successfully simulating in parallel.
Simulation Data Collection
While flits traverse the network, we record and accumulate in each traversed module various information, such as queuing times, time stamps or hop counts, and carry these data inside the OMNEST messages that represent the flits. Once the flits reach the egress side of the network adapter, all kinds of statistical results can be collected from the information carried in the flits. Overall system-wide statistics such as total throughput, queuing times etc. are collected by default. Point-to-point statistics (i.e. per source/destination pair) can be collected on demand for the measures of interest.
For simulations of more than one application partition and possibly in the presence of random background traffic, results are typically only of interest relative to one specific application partition. In our XML file for task placement, we can specify for which application partition statistics results are to be collected by adding an attribute statistics="true" to the particular application tag. The flits of this particular partition will then be marked accordingly at their generation time. Eventually only flits marked in this way will be considered for statistics collection.
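A minimal sketch of how a flit might carry its own measurement data and a statistics mark is shown below. The class `Flit`, its field names, and the helpers `traverse` and `egress_statistics` are illustrative assumptions, not the OMNEST message definition used in the simulator.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    src: int
    dst: int
    collect_stats: bool = False   # set at generation time for the partition of interest
    hops: int = 0                 # accumulated in each traversed module
    queuing_time: float = 0.0     # accumulated in each traversed module


def traverse(flit, queue_delay):
    """Called by a traversed module: update the data carried in the flit."""
    flit.hops += 1
    flit.queuing_time += queue_delay


def egress_statistics(flits):
    """At the egress side, evaluate only the flits marked for statistics."""
    marked = [f for f in flits if f.collect_stats]
    if not marked:
        return None
    n = len(marked)
    return {
        "flits": n,
        "mean_hops": sum(f.hops for f in marked) / n,
        "mean_queuing_time": sum(f.queuing_time for f in marked) / n,
    }
```

Flits of background traffic or of other application partitions still traverse the network and load it, but they do not contribute to the collected results.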
The statistics collection is performed in one or multiple statistical measurement modules. For the case of parallel simulation, we apply decentralized statistics collection by providing at least one instance of such a module in each parallel partition and ensure that each message traverses such a statistics measurement module exactly once. The results collected in this way in a decentralized manner are then sent periodically from each distributed statistics measurement module to a single central module that exists only once and in only one of all parallel partitions. In this central module the partial results are consolidated and printed as, optionally, periodic and eventually final results. Periodic results allow the observation of changes of certain measures during the run time of an application (see Section 2.9). The granularity of measurement periods is a parameter. In large systems, some caution is necessary in determining the period and the measures of interest to prevent the generation of impracticably large amounts of data.
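The consolidation of partial results in a central module can be sketched with mergeable accumulators. The class `PartialStats` is hypothetical; it assumes that each distributed module only needs to send a count, a sum, and a sum of squares per measure and period, which is one common way to make mean and standard deviation mergeable.

```python
from dataclasses import dataclass


@dataclass
class PartialStats:
    """Accumulator kept by one statistics module (or by the central module)."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def record(self, value):
        """Record one observation, e.g. the queuing time of a flit."""
        self.count += 1
        self.total += value
        self.total_sq += value * value

    def merge(self, other):
        """Consolidate a partial result received from another partition."""
        self.count += other.count
        self.total += other.total
        self.total_sq += other.total_sq

    def mean(self):
        return self.total / self.count if self.count else 0.0

    def stddev(self):
        """Population standard deviation of all recorded values."""
        if not self.count:
            return 0.0
        variance = self.total_sq / self.count - self.mean() ** 2
        return max(variance, 0.0) ** 0.5
```

Since only three numbers per measure are exchanged in each period, the amount of data sent to the central module does not depend on the number of flits observed in a partition.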
To expand the capabilities for analyzing the large amount of possible simulation results, we also implemented the possibility to write time-stamped traces of the MPI and computing events processed during the simulation of an MPI application. For this purpose we have chosen the format of the UPC Paraver tool [12] so that this powerful visualization tool can be directly used to analyze these results. This can be particularly useful for optimizing the application code and tuning it to the specific properties of the HPC system considered.
Finally, we have also implemented provisions for writing time-stamped traces of network events occurring at specified network components. These could be traces of all incoming and outgoing flits and credits at all ports of a particular switch in the system. Such traces are useful not only for debugging the simulator, but also as realistic traffic test cases for the hardware-level simulations being performed to verify the actual component hardware designs in VHDL (VHSIC hardware description language).
Simulation Data Visualization
Simulation data visualization helps understand the communication patterns of MPI applications at run time. Our simulator enables the periodic collection of point-to-point and collective operations among the nodes. For each simulation step of the application, we collect internode I/O operations (the number of send/receive operations), mean message size, standard deviation, and total bytes transferred. We also collect this information for collective operations and for a combination of point-to-point and collective operations. Furthermore, we track when load balancing is initiated. We also collect data for the case when multiple tasks are placed on a node. The amount of I/O versus data that leaves the node is a useful indication of how the network load changes when SMP nodes are used. Such periodically collected data can then be plotted into frames of pictures, which can be sequenced to animate the run-time communication operations in a movie-like manner. For this we use a Java script that allows the animation to be viewed in an Internet browser. Figure 6 shows a sample picture frame of the animation of I/O operation distribution in the CTH application, a popular shock-wave physics program developed by Sandia National Laboratories, USA. With data visualization, one can easily spot where the network hotspots are. Then optimal task placement and adaptive routing can be used to fine-tune the system.
Simulator Portability
The parallel version of our simulator runs on a cluster of SMP machines with the Parallel Operating Environment (POE) of the AIX operating system. The simulator also runs on x86 machines with either Linux forms. We believe the simulator can also be ported to other hardware and operating systems.
Case Studies
Interconnect Design
One of the main goals of our simulation environment is to support the design of the interconnection network for an HPC system under development. Simulations of all the options under discussion help decide on the appropriate network topology, switch architecture, routing protocol, and so on. In this context, we applied our tool to study HPC systems with fat-tree interconnection networks, as shown in Figure 1, and variants of this architecture. Examples of questions in this context are:
• What are the optimum size of switch modules and number of fat-tree levels?
• What is the required link bandwidth?
• How should the switch buffers be organized and dimensioned?
• How should the multi-path property of the fat tree be utilized in the routing protocol?
• Can the number of switch modules in higher fat-tree levels be reduced?
A case study related to the last question is presented in Section 3.2.
Figure 8. Slim fat-tree interconnection network
An alternative interconnect architecture studied is the hierarchical direct interconnect architecture as illustrated in Figure 7. In this architecture, each computing node is assumed to be a multi-core chip, i.e. a chip multiprocessor (CMP). A switch module is associated with one or a few computing nodes. A small set of switch modules are directly and fully interconnected via a first hierarchy level of interconnection links (level-1 links), forming first-level groups, e.g. boards. Multiple boards are directly and fully interconnected via a second hierarchy level of interconnection links (level-2 links), forming second-level groups, e.g. racks. Finally, multiple racks are directly and fully interconnected via a third hierarchy level of interconnection links (level-3 links), forming the third-level group, which is the full system. A property of this architecture is that routing between arbitrary nodes may involve multiple intermediate hops via links of the different hierarchy levels. For example, to reach a destination in another rack, a level-1 link hop may be required to reach the appropriate switch module that has a level-2 link to the right first-level group (board) that has a level-3 link to the right second-level group (destination rack) and vice versa on the destination rack. Nonetheless, such a hierarchical complete-graph topology tries to balance the number of hops to reach any other node and the total number of links required to build the network. Examples of questions regarding this architecture are:
• What is the required link bandwidth for each hierarchy level?
• How should the switch buffers be organized and dimensioned per link type?
• How can cyclic dependency deadlocks be prevented?
• What is the routing protocol? Should only direct/shortest routes be used or can the additional use of indirect/longer routes improve performance?
A case study related to the last question is exemplarily described in Section 3.3.
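To give a feeling for the link budget of such a hierarchical complete-graph topology, the following sketch counts the links per hierarchy level. It assumes that every pair of groups at a level is joined by the same number of parallel links (one by default); the function name and its parameters are hypothetical, and the sizes in the usage note are purely illustrative.

```python
from math import comb


def link_counts(switches_per_board, boards_per_rack, racks, links_per_pair=(1, 1, 1)):
    """Return the total number of (level-1, level-2, level-3) links.

    Assumption: groups at each level form a complete graph, with
    links_per_pair[i] parallel links between each pair at level i + 1.
    """
    boards = boards_per_rack * racks
    level1 = comb(switches_per_board, 2) * links_per_pair[0] * boards
    level2 = comb(boards_per_rack, 2) * links_per_pair[1] * racks
    level3 = comb(racks, 2) * links_per_pair[2]
    return level1, level2, level3
```

For example, a system of 5 racks with 16 boards of 8 switch modules each would need 2240 level-1 links, 600 level-2 links, and 10 level-3 links under these assumptions. The number of long rack-to-rack links stays small, but any two racks then share only a single direct link.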
Fat-tree Case Study
An ideal full fat-tree network as shown in Figure 1 provides the full bisectional bandwidth in all fat-tree levels.
In other words, the number of links and the aggregate link bandwidth between levels are constant, and the number of switch modules is the same in all levels except that in the highest level only half as many are needed (here all switch ports are used as downward ports). As the number of levels that must be traversed is a function of the source/destination distance, and HPC applications typically have more near-neighbor than far-distance communication, bandwidth and cost could eventually be saved by reducing the bandwidth or the number of switch modules in the higher levels as shown schematically in Figure 8 . The question is how far can this reduction go without significantly impacting performance?
To study this, we performed a series of end-to-end simulations of a fat-tree-based HPC system using some real HPC application traces. Gradually we reduced the bandwidth of each fat-tree level by 25%, 50%, and 75% relative to the preceding level. In addition, we studied the impact of different switch module sizes and task-to-node placements. Figure 9 shows some exemplary results obtained from simulations for the LAMMPS and the UMT2K applications with sequential task-to-node placement (task i to node i). In this case we used application traces for 128 tasks placed onto the nodes on the middle 128 ports of the network. For the network switches we used architecture and parameters typical for InfiniBand switches. Depending on the chosen switch module sizes of 8 × 8, 16 × 16 or 32 × 32, we used networks with 4, 3, and 2 fat-tree levels, respectively. For other applications we obtained results that are qualitatively similar. The diagrams show the measured mean system delay of flits as a function of the bandwidth reduction relative to the preceding fat-tree level. On the other hand, the diagrams in Figure 10 show the same kind of results for randomly shuffled task-to-node placement. The three different switch system sizes are considered in each case. The tables included in the diagrams list the percentages of traffic (number of flits) that traverse up to each fat-tree level.
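Because each reduction is applied relative to the preceding level, the reductions compound from level to level. A minimal sketch of this arithmetic, with the hypothetical helper `level_bandwidths` and the bandwidth of the lowest level normalized to 1:

```python
def level_bandwidths(base_bandwidth, levels, reduction):
    """Aggregate bandwidth per fat-tree level, lowest level first.

    reduction is the fraction removed relative to the preceding level,
    e.g. 0.5 for a 50% reduction, so level k keeps (1 - reduction)**k
    of the base bandwidth (k = 0 for the lowest level).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return [base_bandwidth * (1.0 - reduction) ** k for k in range(levels)]
```

With a 50% reduction and four levels, the highest level is left with one eighth of the bandwidth of the lowest level; with two levels, it still keeps one half. This is one reason why the number of levels matters when comparing the same relative reduction across switch sizes.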
As we can see, a bandwidth reduction of up to 50% appears feasible if the switch modules are larger than 8 × 8. The larger the switches are, the more bandwidth reduction is tolerable. This relative behavior appears to be independent of the task placement, although the absolute system delay is significantly worse for random task placement. The tables show that a large portion of the traffic travels up all levels for random task placement, whereas for sequential placement it is the opposite. In fact, sequential task placement has turned out to be the optimum on fat trees for about a dozen scientific HPC applications we have simulated so far.
Indirect Routing Study
In the hierarchical direct interconnection model shown in Figure 7, there is only one shortest path between any pair of nodes on different racks. The single highest-hierarchy link (rack-to-rack link or level-3 link) on this direct route could become a bandwidth bottleneck. More bandwidth could be made available between a pair of racks by allowing additional alternative 'detour' routes that lead indirectly via third racks. However, these indirect routes have a longer delay because they include more hops, particularly one more relatively longer rack-to-rack hop. Furthermore, alternative indirect routing involves a complexity and cost issue. The question is whether this is worth the effort and under which circumstances.
To study this, we performed a set of simulations based on the hierarchical direct interconnection model. We configured a system of five racks, each with 16 boards of 8 switch modules and only one compute node per switch module. We placed an HPC application across nodes of two of the racks: half of the application tasks were placed on nodes of one rack, and the other half on nodes of the other rack. In this case we took the same two applications of the same size as in the previous case study. When using only direct routes in a first simulation, half of the tasks have to talk to the other half of the tasks via a single rack-to-rack link. In three more simulation runs, we then allowed additional indirect routes via one, two or three other racks. So each message has the choice of two, three or four possible routes, one of which is the direct route and the others are indirect routes. A first set of simulations was performed by just placing the application in question onto the system and leaving the rest of the system empty. In this case, the indirect routes traverse links and racks that do not carry any other traffic. A second set of simulations was done with background traffic exchanged between all nodes that do not carry the application considered. In this case we modeled the background traffic by placing additional identical application partitions on all remaining nodes of the system. The additional background load on the entire system can be expected to increase the delay, particularly, the delay via the indirect routes.
Figure 10. Slim fat-tree results for LAMMPS and UMT2K applications with randomly shuffled task placement
Another consideration is that if unbalanced direct and indirect routes are used simultaneously, routing with static route selection would distribute the traffic equally among the different paths that might not be the optimum paths.
Hence, we also considered the option of adaptive routing, which selects the various possible paths based on switch-queue occupancies. Figure 11 shows application run-time results obtained from simulations with the two exemplary applications, which are again qualitatively similar to results for other applications we tested. The first diagram shows the results obtained for the LAMMPS application, whereas the second diagram shows the case for the UMT2K application. Each diagram shows results for the cases without and with background traffic as well as for the options of static (dashed lines) and adaptive routing (solid lines). The diagrams show the application run time normalized to the run time of the case when only direct routes are used without background traffic. Furthermore, we applied here the feature of setting the processor speed to infinite to see
We can see that allowing one additional indirect route to the direct route does not yet provide an improvement. On the contrary, for static routing, half of the traffic through the long indirect route is too much and causes a deterioration that can hardly be compensated if adaptive routing were applied. However, for more than one additional indirect route, significant improvements can be achieved in any case, in particular when combined with adaptive routing. While the comparison with and without background traffic does not show significant qualitative differences, the quantitative comparison clearly shows the relevance of simulating application partitions with background traffic. The background traffic of other applications negatively impacts the considered application, and vice versa the performance of other applications might be negatively impacted as well by the indirectly routed traffic of the considered application. Considering nonempty systems is important since in typical HPC centers multiple customers' applications are running simultaneously.
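The route choices compared in this study can be sketched at rack level as follows. This is a simplified illustration under stated assumptions: routes are represented as tuples of rack indices, static selection is modeled as round-robin (one way to distribute traffic equally), and adaptive selection looks only at the occupancy of the first rack-to-rack link of each route. None of these names or simplifications are taken from the simulator itself.

```python
import itertools


def candidate_routes(src_rack, dst_rack, racks, max_indirect):
    """Direct route plus up to max_indirect detours via one other rack."""
    routes = [(src_rack, dst_rack)]
    others = [r for r in racks if r not in (src_rack, dst_rack)]
    routes += [(src_rack, via, dst_rack) for via in others[:max_indirect]]
    return routes


def static_selector(routes):
    """Return a function that cycles through the routes in equal shares."""
    cycle = itertools.cycle(routes)
    return lambda: next(cycle)


def adaptive_select(routes, queue_occupancy):
    """Pick the route whose first rack-to-rack link has the shortest queue.

    queue_occupancy maps (from_rack, to_rack) to the current queue length;
    ties are resolved in favor of the earlier route, i.e. the direct one.
    """
    return min(routes, key=lambda route: queue_occupancy[(route[0], route[1])])
```

In a five-rack system, `candidate_routes(0, 1, range(5), 3)` yields the direct route and three detours, matching the choice of up to four routes per message. The static selector sends half of the traffic over the detour when only one indirect route is allowed, regardless of its load, which is the situation that deteriorated in the results above.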
The comparison of the two applications shows that the impact of indirect routing can vary. The use of indirect routing turned out to be quite significant in the case of the LAMMPS application, even though the total amount of traffic exchanged between the two racks was only measured as 14% of the total traffic in our experiment. On the other hand, for the UMT2K application the use of indirect routing does not cause much improvement. This can be explained by the fact that the UMT2K application essentially has just near-neighbor traffic and very little other traffic. In our experiment, only 1.2% of the total traffic happened to be exchanged between the two racks. Hence the impact of indirect routing is low in this case.
Another observation made from these results is that the impact of indirect routing tends to saturate soon with increasing number of indirect routes. Hence we can expect that providing the capability for many additional indirect routes will probably not be worth the effort.
GUPS Benchmark Study
The random access GUPS (Giga updates per second) benchmark from the HPC Challenge suite [7] is a very important performance measure for HPC systems. Although GUPS does not really require trace-driven simulation, and its performance can eventually be obtained analytically, GUPS simulations turn out to be very useful for several reasons. For lack of large enough application traces, a GUPS simulation with synthetically generated traffic allows the simulation of the full-scale system. Thereby the parallel simulation scalability can be tested without a need for application traces. Furthermore, effects can be studied and validated that only occur at very large system sizes. Finally, GUPS simulation results can validate the correctness of analytical predictions as well as the correctness of large parts of the simulator.
We have performed parallel GUPS simulations of the hierarchical direct interconnection model (Figure 7). Thereby we were able to validate predicted analytical results for various system configurations up to a system scale of 65,536 nodes. This included the validation of a predicted discontinuity effect at a very large system size above which the second-level connectivity becomes the system bottleneck, whereas below that size the third-level connectivity was the system bottleneck. Furthermore, we observed that the overall GUPS throughput tends to stabilize sooner than individual port buffer and link utilizations across the interconnection network for the uniform GUPS traffic patterns in a large-scale system.
On the one hand, we observed a surprisingly good speedup by parallel simulation. We distributed a small system that originally ran on a single processor onto a 16-node SMP cluster of the same processors. Thereby we observed a superlinear speedup on the order of 17. We suspect that this is due to the higher efficiency of parallel small simulations each with small event lists as opposed to a single large simulation with a very long event list. Furthermore, note that the 16-node cluster exhibits a 16-fold aggregate cache capacity, and our model for the hierarchical direct interconnection system allows a relatively efficient model partitioning. On the other hand, GUPS simulation of a large-scale system of 65,536 nodes certainly requires considerably more computing resources than small systems would. For example, we used a 32-node cluster with 512 GByte aggregate memory for the full-system simulation. It takes about one hour wall-clock time to simulate 10 microseconds at high GUPS injection rate (i.e. in system saturation). Simulations with low injection rates or large packet sizes run considerably faster. 
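The figures quoted above translate into two simple ratios, sketched here with hypothetical helper names. Parallel efficiency is the speedup divided by the number of processors, and the slowdown factor is wall-clock time divided by simulated time.

```python
def parallel_efficiency(speedup, processors):
    """Speedup per processor; values above 1.0 indicate superlinear speedup."""
    return speedup / processors


def slowdown_factor(wall_clock_seconds, simulated_seconds):
    """How many seconds of wall-clock time one simulated second costs."""
    return wall_clock_seconds / simulated_seconds
```

A speedup of 17 on 16 nodes corresponds to an efficiency of about 1.06. One hour of wall-clock time for 10 microseconds of simulated time corresponds to a slowdown factor of roughly 3.6e8 at saturation.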
MPI Application Performance Projection
We have used our simulator to project the performance of some mission-partner MPI applications running on a projected future supercomputer system. One of the applications is ECMWF (European Centre for Medium-range Weather Forecasts). ECMWF is a complicated application with various MPI communication patterns. It has proved to be network-intensive on some existing systems.
We break down ECMWF into its representative MPI communication patterns and simulate them on the three-level hierarchically fully connected interconnect (Figure 7). We compare the execution time and throughput of these MPI patterns for configurations with 1, 4 and 16 level-3 links per pair of level-2 groups (racks). Figure 12 shows the results.
We observe a significant improvement in execution time and throughput when we increase from one to four level-3 links per pair of level-2 groups. However, we see only negligible improvement when we further increase to 16 level-3 links per pair of level-2 groups. The experiment shows that a configuration of four level-3 links per pair of level-2 groups is sufficient to meet ECMWF's network requirement. Hence, we can downsize the network cost of this application without performance loss. Alternatively, the extra level-3 links could be shut down to reduce power consumption.
Related Work
A rich set of prior art exists on simulation technologies for parallel and distributed computer systems [27] [28] [29] [30] . In this paper, we only discuss related work that most closely resembles our research. What makes our approach unique is its particular combination of methods with varied levels of abstraction and detail that are adjusted to the needs of the supported product development, and definitely with its capability to cope with an unprecedented system scale.
IBM's Blue Gene/L team used an interconnection simulator [31] to model Blue Gene/L's three-dimensional torus interconnection network during the design phase of the project. They selected a shared-memory parallel simulation approach to interconnect modeling. Message-passing calls of applications are passed to the simulator via traces, which are collected using an IBM unified trace environment trace-capture utility that runs on IBM supercomputer machines. Traces of up to several hundreds of nodes were collected. The team used time-driven parallel simulation. In our work, we use the MPI interface to parallelize our discrete-event interconnect simulator. We expect the availability of commodity clusters and the scalability of MPI libraries to make our approach more cost-effective. Our trace synthesizer is able to generate traces for virtually any number of MPI threads. In addition, our simulator features the capability of placing simulated tasks arbitrarily, which can be very useful for tuning MPI applications on future systems.
BigSim [32] is an interconnect simulator to help optimize applications for larger-scale HPC systems. It uses optimistic synchronization to parallelize the simulation. In our work, we use detailed interconnect models to help also the interconnection network design. We use conservative synchronization to improve development productivity. Furthermore, as conservative synchronization minimizes load imbalance in the simulation, our approach can simulate very-large-scale systems. In our work, we also collect MPI traces from real machines or synthesize them based on detailed application knowledge.
Dimemas from UPC [33] is an MPI-trace-driven simulator primarily intended to help users develop and tune parallel MPI applications. For that purpose it models the nodes of a parallel machine and the applications running on them at a similar abstraction level as we do. However the network is modeled at a much higher abstraction level in the form of a set of buses. Hence Dimemas is not suited for studying the impact of different network topologies, switch architectures or network protocols, nor would it be useful for helping the design and optimization of the interconnection network, the switches and network protocols. A Dimemas simulation produces results in the form of another trace file that serves as input for separate performance analysis tools such as their powerful visualizing tool Paraver.
Sharapov et al. [34] present a performance estimation methodology for an HPC application running on a future HPC hardware architecture. It applies a hierarchical modeling method by combining queuing theory, such as the mean-value analysis model, with cycle-accurate simulation. Their interconnect model is an analytical model, whereas the processor micro-architecture model is cycle-accurate. They analyzed the application behavior with another workload characterization process.
Prakash and Bagrodia [35] study a few parallel protocols and application-specific techniques to improve the parallel speedup of simulating MPI programs. MPI programs are annotated and modified to run on the simulator. Their simulation is a function of the number of processors and MPI message communication latencies for the target systems; therefore, neither the processor side nor the network side of the target system is modeled in sufficient detail. Simulations of up to 16 processors are studied.
Riesen [36] presents a prototype of a hybrid simulator in which an application runs unchanged on some compute nodes and its MPI communication is re-directed to a network simulator that runs on an additional compute node. The network simulator keeps track of the MPI traffic and feeds back simulated network delay information to the application. The application then uses such network information to update a virtual clock on its nodes. In this work, the target network can be arbitrary; however, the target processor-side system has to be the same as the existing computing nodes. As only one compute node is the designated network simulator, it can become the performance bottleneck when a large-scale system is simulated.
Ricciulli et al. [37] introduce a process-oriented event-driven simulator based on CSIM [38]. Each simulated device of a compute node is a separate process that interfaces with other processes through synchronization events and hardware queues. The simulator synchronizes the local clocks of the various local processes only when they execute synchronization operations. A central controller arbitrates the clock synchronization operations to simplify the implementation, although in theory it can also be distributed. The processor side is simulated in detail at register level. The network aspect of the simulation is rough, in that for network resource acquisitions no synchronization is modeled.
Bagrodia et al. [39] present a simulation environment that focuses on I/O and network simulation. It consists of a simulation kernel, an MPI communication library simulator [35] , a parallel I/O simulator, and a parallel file system simulator. As it uses direct execution to simulate the target local nodes, it requires that the host and target machines be similar.
Conclusions and Future Work
In this paper, we have demonstrated, using our simulation tool and methodology, that end-to-end simulation is feasible for HPC systems with hundreds of thousands of processors. The tool provides a semantically correct replay of MPI application traces and maintains reasonable simulation details of both the processors and the interconnection network. It features several network topologies, flexible routing schemes, arbitrary application task placement, point-to-point statistics collection, and data visualization. With a few case studies, we showed that this tool is very useful for high-level system design, performance projection, and application tuning of future HPC systems.
Nevertheless, balancing modeling detail against simulation speed is challenging. Our plans include more refinements of the tool and its methodology as well as additional features to further improve its efficiency and usefulness.
Currently our models incorporate feedback between the network and the MPI-trace execution through the MPI semantics themselves and through flow control. Saturation trees resulting from heavy congestion in the network would benefit from congestion management that throttles the injection rate based on feedback from the network. Hence it is on our list of future work to implement and study congestion control with our tool to help understand this difficult control problem under discussion in the HPC community.
Furthermore, as HPC-like large-scale computer systems are increasingly being applied also in commercial computing where applications do not necessarily use MPI, there is a need for generalizing the methodology of our simulation framework to cope also with non-MPI workloads in order to support the design of such systems.
