Abstract-Exascale networks are expected to comprise a significant part of the total monetary cost and 10-20% of the power budget allocated to exascale systems. Yet, our understanding of current and emerging workloads on these networks is limited. Left ignored, this knowledge gap likely will translate into missed opportunities for (1) improved application performance and (2) decreased power and monetary costs in next generation systems.
I. INTRODUCTION
Networks are the backbone of modern high-performance computing (HPC) systems. They serve as critical infrastructure that ties together applications, analytics, and visualization. Yet, we do not have a complete understanding of the performance and utilization of emerging network technologies under various workloads. As the race towards exascale continues and network technologies change and improve accordingly, a principled reexamination of network performance is prudent so that the HPC community can understand (1) the communication characteristics of modern workloads, (2) opportunities for optimizing application performance or network power usage, and subsequently, (3) design improvements for future HPC networks.
*Sandia National Laboratories is a multiprogram laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
In this work, we address these issues by performing a comprehensive, simulation-based study of large dragonfly networks and application motifs that represent workloads and communication patterns important to the HPC community. A primary motivating question was whether and how varying dragonfly network bandwidth and global link configuration can improve overall network efficiency. We analyze the relationship among a network's stalled, active and idle port cycles, and we combine these metrics into a useful visualization approach for understanding network performance. Furthermore, we study how varying power usage of network links impacts application runtimes.
For this study, we use the Structural Simulation Toolkit (SST) [17] and a range of relevant workloads on a dragonfly topology of 110,592 nodes to examine network design tradeoffs among execution time, power, bandwidth, and the number of global links. Our contributions are: 1) An evaluation of how tapering global dragonfly links impacts 11 important workloads; 2) An evaluation of the performance impact of link-width reduction on dragonfly networks at scales significantly larger than previously studied; 3) A scalable and information-rich approach for visualizing network utilization and performance; 4) Estimates of the potential for static and dynamic network power savings at exascale, derived through empirical measurements and existing models in the literature; 5) Enhancements to the SST simulator, which extend the statistics provided by the router and NIC components.
After providing a general background on the SST simulation framework in §II, we describe the specifics of our simulated hardware environment and workloads in §III. In §IV, we present the methodology and results of 88 simulations, examining the tradeoffs among power, performance and the design of the network. Finally, we review related work in §V and conclude in §VI.
II. BACKGROUND
When simulating large-scale systems we must consider many components, such as the network, I/O, memory and CPU. It is important to select a simulator that balances scalability, complexity and accuracy. For these reasons, we have chosen a flexible, modular and scalable simulator called the Structural Simulation Toolkit (SST).
In this paper we are primarily interested in the power and performance of large-scale dragonfly networks, which see use in many modern systems [12]. Dragonfly networks combine high-radix routers into virtual routers called groups, which are fully connected to other groups by optical links. Throughout this paper we refer to local, group and global ports. Local ports connect a router to a compute node or NIC. Group ports connect routers within the same group. Global ports carry inter-group traffic and use optical links so that they can span longer distances than is practical for electrical cables. One downside of optical cables is that they can cost more than their electrical counterparts (depending on distance and quantity purchased). For a visual reference the reader may refer to Fig 1.
Because of our interest in networks, we have elected to use lightweight, scalable modules to simulate computation, while dedicating the majority of our simulation resources to the network and communication. Specifically, we use SST to accurately represent the packet-level routing, buffering, and internal switch characteristics of 100,000 node dragonfly networks, as well as the MPI semantics and message matching.
A. SST
SST is a simulation framework that allows different components to be connected using a parallel discrete event simulation core. Along with the simulation core, SST provides a number of ready-to-use component libraries. SST is widely used by both industry and academic researchers. Throughout its history, the accuracy of SST has been validated in peer-reviewed publications and by hardware vendors [17], [10], [21].
1) Ember:
One of the SST libraries/components is Ember. Ember is a lightweight state-machine based event engine which replicates application communication patterns at a simulation end point. We term a single logical communication pattern a motif, drawing on a theme similar to Colella's computational dwarfs [4]. A collection of motifs can then be arranged within each end point to represent a more complex single application or even a more complete workflow.
A motif works by creating, when prompted, a sequence of events that contain primitive operations for communication (e.g. send, receive, etc.), computation, waits or timing markers. The events are added to a queue and then executed one by one until the queue empties. Once the queue is empty, the motif is prompted to refill it with additional events (while being able to see the effects and return values of the previously executed set). Thus, motifs are able to execute short sprints of events punctuated by querying or logic.
Events which relate to communication are translated into operations at the message interface layer (Firefly). For instance, a communication event is encoded in Ember and then converted into operations by Firefly which tracks the semantics associated with the request.
By using short sprints of events and very small amounts of state at each end point, Ember is able to scale to very large simulated node counts without placing significant constraints on the amount of memory or processing required for the simulation to progress. Despite such simplicity, a collection of statistics relating to message sizes, message timing, message types etc. can be collected with ease.
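The generate, drain, refill cycle described above can be illustrated with a minimal Python sketch. This is not Ember's actual API: the class `PingMotif`, the function `run_endpoint`, and the tuple-based event encoding are hypothetical names chosen only to show how a motif might hand out short batches of events while keeping very little state per end point.

```python
from collections import deque


class PingMotif:
    """Hypothetical motif: emits one send/recv/compute batch per prompt."""

    def __init__(self, iterations, peer):
        self.remaining = iterations  # the only per-endpoint state kept
        self.peer = peer

    def generate(self, last_results):
        """Return the next batch of events, or an empty list when finished.

        last_results holds the return values of the previously executed
        batch, so a motif can apply logic between sprints.
        """
        if self.remaining == 0:
            return []
        self.remaining -= 1
        return [
            ("send", self.peer, 1024),
            ("recv", self.peer, 1024),
            ("compute", 1),
        ]


def run_endpoint(motif, execute):
    """Drain the event queue one event at a time, refilling when empty."""
    queue = deque()
    results = []
    executed = 0
    while True:
        batch = motif.generate(results)
        if not batch:
            break
        queue.extend(batch)
        results = []
        while queue:
            results.append(execute(queue.popleft()))
            executed += 1
    return executed


# Three iterations of three events each.
count = run_endpoint(PingMotif(3, peer=1), lambda event: event[0])
```

Because only the iteration counter persists between batches, memory per simulated end point stays small, which is the property the text credits for Ember's scalability.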
2) Firefly: Firefly is a pair of state-machines that implement high level functional models of the host based communication library and the network interface card (NIC) logic.
The library state-machine supports point-to-point (e.g. send, receive, wait, etc.) and collective (e.g. alltoall, reduce, etc.) operations. Message data movement between network endpoints is based on an eager/rendezvous message protocol model. Message matching is based on message tag and source. The library state-machine has several parameters, such as the maximum length of an eager message, the latency to check a posted receive for a match, the latency to copy message data between buffers, and latencies that mimic the time spent in various code paths of a library.
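As a rough illustration of the two mechanisms named above, the following Python sketch picks a protocol from the message length and matches an incoming message against posted receives by source and tag. It is a simplified model rather than Firefly's implementation: the names `Envelope`, `choose_protocol`, and `match`, and the 16 KiB eager limit, are assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed value for illustration only; the paper treats the maximum eager
# message length as a configurable parameter and does not state it here.
EAGER_LIMIT = 16 * 1024


@dataclass(frozen=True)
class Envelope:
    src: int     # sending rank
    tag: int     # message tag
    nbytes: int  # payload length in bytes


def choose_protocol(nbytes, eager_limit=EAGER_LIMIT):
    """Short messages go eagerly; longer ones negotiate via rendezvous."""
    return "eager" if nbytes <= eager_limit else "rendezvous"


def match(posted_receives, incoming):
    """Return the index of the first posted (src, tag) matching incoming.

    Returns None when no posted receive matches (an unexpected message).
    """
    for index, (src, tag) in enumerate(posted_receives):
        if src == incoming.src and tag == incoming.tag:
            return index
    return None


posted = [(0, 7), (3, 7)]
hit = match(posted, Envelope(src=3, tag=7, nbytes=1024))
```

In a fuller model, each call to `match` would also charge the configured match-check latency, and an eager message that finds no match would incur a buffer copy.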
The NIC state-machine functionally moves data to and from the host over a bandwidth constrained path (i.e. bus). It also has a mailbox interface to the host that the host uses to initiate sends and gets. It models latencies through a NIC and latencies of data movement over a bus. It has parameters such as host bus bandwidth, transmit and receive latency, and NIC to host latency.
3) Merlin: Merlin consists of a set of components that allows a user to model a detailed network fabric. Merlin may be configured for a range of different network topologies. It also provides a number of tunable parameters, including buffer sizes, latencies, routing modes, and arbitration schemes.
The primary Merlin component used in this simulation is a high radix router model called hr_router. This component models a single router, including input/output buffers, a single large crossbar, and routing capabilities. The routing capability is controlled via a loadable "topology" model. The library currently supports mesh/torus, fat-tree and dragonfly topologies. All the topology models use deterministic (minimal) routing. Additionally, the dragonfly model adds two other routing modes: valiant and adaptive-local. Valiant routing chooses an intermediate group to route to first before routing to the final destination. Adaptive-local adaptively chooses between the minimal and valiant routes using a user-defined threshold. The valiant route is chosen when the output buffer for the direct port has N (where N is the threshold) times more occupied space than the valiant route (i.e. since there are extra hops in the valiant path, the direct path must be more congested before the valiant path is chosen).
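The adaptive-local rule can be written as a one-line comparison. The sketch below is an interpretation of the prose, not Merlin source code: the function name `choose_route` is hypothetical, and the strict inequality and the unit of occupancy are assumptions.

```python
def choose_route(direct_occupancy, valiant_occupancy, threshold):
    """Pick a route from output-buffer occupancy on the two candidate ports.

    The valiant path has extra hops, so it is taken only when the direct
    port is more than `threshold` times as occupied as the valiant port.
    The strict '>' is an assumption of this sketch.
    """
    if direct_occupancy > threshold * valiant_occupancy:
        return "valiant"
    return "minimal"


# With a threshold of 2, a direct port holding 10 units against a valiant
# port holding 4 is congested enough (10 > 8) to divert the packet.
route = choose_route(direct_occupancy=10, valiant_occupancy=4, threshold=2.0)
```

A larger threshold biases the router toward minimal routing, which keeps hop counts low but tolerates more queuing on the direct port before traffic is spread across intermediate groups.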
B. Stalled, Active and Idle
In addition to providing functional simulation capabilities, SST tracks a number of statistics during simulation. For this paper we have added to the statistics associated with the network fabric in Merlin. These counters record data pertinent to the usage of each simulated router's ports. We report these counters as three metrics (idle time, stalled time and active time), defined as:
active: percent of time a port is transmitting data.
idle: percent of time a port has no data queued for transmit.
stalled: percent of time a port is neither active nor idle.
By combining all three metrics, we provide a rich description of network usage that no single metric can achieve. For brevity we refer to the collection of metrics as SAI (Stalled Active Idle). If we normalize these recorded times, SAI can be described by the following set of relationships:
$$\text{stalled} + \text{active} + \text{idle} = 1, \qquad 0 \le \text{stalled},\ \text{active},\ \text{idle} \le 1$$
These relationships limit the degrees of freedom to two, so we simplify the presentation of SAI by using a ternary plot. Throughout this work we use the SAI metric to explore the network utilization of each motif for a given network. Therefore, we provide the reader with a brief explanation of how to interpret the results. There are three axes in the plot (one for each metric), with arrows indicating the direction in which a particular metric increases. The red square represents a network port which is active for 100% of the run and is located at the top of the figure. Similar points can be seen for ports that are 100% stalled or 100% idle (green diamond and blue circle). The black star represents a port that spends one third of its time in each of the stalled, active and idle states. The orange triangle is a port which is stalled 10% of the time, active 10% of the time and idle 80% of the time.
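For readers who want to place their own port counters on such a plot, the following Python sketch normalizes raw stalled, active and idle times and maps them to 2D coordinates. The function names are hypothetical. Placing the fully active corner at the top follows the description above, while putting stalled at the bottom left and idle at the bottom right is an assumption of this sketch.

```python
import math


def sai_fractions(stalled, active, idle):
    """Normalize raw times or cycle counts so the three values sum to one."""
    total = stalled + active + idle
    if total <= 0:
        raise ValueError("at least one counter must be positive")
    return stalled / total, active / total, idle / total


def ternary_xy(stalled, active, idle):
    """Map an SAI triple to Cartesian coordinates in a unit-side triangle.

    Assumed corner layout: stalled at (0, 0), idle at (1, 0) and active
    at (0.5, sqrt(3)/2), i.e. the top of the figure.
    """
    _, a, i = sai_fractions(stalled, active, idle)
    x = i + 0.5 * a
    y = (math.sqrt(3) / 2.0) * a
    return x, y


# A port that is stalled 10%, active 10% and idle 80% of the run.
x, y = ternary_xy(10, 10, 80)
```

Since the three fractions sum to one, two coordinates carry all of the information, which is why a planar plot suffices.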
believed to be similar to that of future exascale systems. In our experiments, we vary our simulations by motif, bandwidth and the number of global links. Our dragonfly comprises 24 nodes per router, 48 routers per group and 96 fully connected groups. Throughout our experiments the number of inter-group (global) links varies, but every group has at least 1 and up to 12 connections to each other group in the network. In each experiment this is denoted in the legends by a number between 1 and 12 followed by glbl.
b) Router and NIC Parameters: We simulate links with a bandwidth varying between 12.5 and 25 GBps. These bandwidths are conservative relative to what can be expected in the exascale timeframe, which means they should be available circa 2018 and not cost-prohibitive by 2020, making them a reasonable, if conservative, target. The full parameters of the SST router and NIC components are provided in the following 
B. Processor Parameters
Each node in our system contains 5TF/s of processing power. This represents the minimum capability expected in the near future. Given that we simulate 110,592 nodes, 5TF/s per node represents only about half an EF/s of computational power. To achieve true exascale performance we would need to roughly double the FLOPs per node. While this does not present a technical challenge to the simulator, 5TF/s allows us to more closely match the expectations of next generation systems.
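The node count and aggregate compute figure follow directly from the topology parameters given earlier (24 nodes per router, 48 routers per group, 96 groups). A short Python check, with constant and function names chosen here purely for illustration:

```python
NODES_PER_ROUTER = 24
ROUTERS_PER_GROUP = 48
GROUPS = 96
TFLOPS_PER_NODE = 5.0


def total_nodes():
    """Number of compute nodes in the simulated dragonfly."""
    return NODES_PER_ROUTER * ROUTERS_PER_GROUP * GROUPS


def peak_exaflops():
    """Aggregate peak in EF/s (1 EF/s = 1e6 TF/s)."""
    return total_nodes() * TFLOPS_PER_NODE / 1e6


nodes = total_nodes()   # 110,592 nodes
peak = peak_exaflops()  # roughly 0.55 EF/s
```

Doubling `TFLOPS_PER_NODE` to 10 would put the same node count slightly above 1 EF/s, consistent with the statement that the per-node FLOPs would need to roughly double.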
From the viewpoint of our simulator, the difference in processing power manifests itself in the speed at which computation portions of motifs are completed. We assume that offload NICs are in use, so that CPU capabilities do not influence the speed at which network processing is done. Additionally, offload NICs allow for further simplifying assumptions, such as immunity to Network Induced Memory Contention [8].
C. Evaluated Motifs
The selection of motifs covers a broad set of important microbenchmarks and communication kernels in HPC. The wall time to simulate a motif varies depending on many factors, including the amount of simulated congestion, adaptive routing, the number of simulated events and the distribution of SST components across physical nodes. For simple motifs (like bcast), the wall time to run a simulation may be just a single minute on a small number of nodes, whereas more complicated motifs take longer to simulate and must be split across a larger number of nodes due to limited memory. For example, for our parameters SST simulates bcast in a single minute, while random and sweep3d may take an hour to simulate on 32 16-core nodes.
1) AllPingPong:
The AllPingPong motif is a simple workload that divides the network into two logical groups of processes and then creates pairs consisting of one process from each group. The performance of this motif is dependent on both the bisection bandwidth and the number of hops (diameter) of the network. The communication acts in a predictable manner that is easy to reason about. Our experiments perform 1000 iterations of ping-pong, sending messages of 1024 bytes.
2) Allreduce: In Allreduce, each node receives n-1 messages (one for each other process in the system) and then a reduction operation is performed. Our simulations measure the impact of a single iteration of Allreduce sending 4 bytes, using 1 ns of compute time per reduction operation.
3) AMR3D: This is a motif of 3D adaptive mesh refinement, based on miniAMR. In order to accurately replicate the communication and computational characteristics of the real workload, the AMR3D motif must be given a block file which details how the mesh should be refined and defines the communication and computation that will take place. Because block files must be produced on real systems running the real workload, we were limited to runs of 65,536 nodes, which use 59% of our simulated system. Depending on the phase that a block file represents, the communication requirements may change considerably. We simulate two different block files: one from an early point in the run, which is not particularly intensive, and one from midway through the run. We denote these two simulations as amr3d-lite and amr3d-heavy.
4) Bcast:
A simple broadcast is performed, such that each node receives a single message from the root node of the broadcast. We use the same parameters for broadcast as for Allreduce (one iteration, 4 bytes of data, 1 ns of compute), but with the root parameter set to rank 0.
5) FFT3D:
Our Fast Fourier Transform 3D motif uses block sizes of 1,992 for nx, ny and nz, with 125 FLOPs/element.
6) Halo3D: Problem sizes in the X, Y and Z directions are set to 100. There are 16 variables computed per cell.
7) Halo3D26: Halo3D26 uses the same parameters as Halo3D. However, this motif represents communication among a larger number of neighbors: each process communicates with 26 neighbors, where each neighbor represents an adjacent point (including diagonals) in three-dimensional space.
8) Random:
In the Random motif, at each iteration each node randomly selects one other node in the system and sends it a message. Like AllPingPong, Random uses the network resources significantly, but differences in the distribution of traffic may create hotspots on the network. The Random motif sends 1024-byte messages for 10 iterations, with a waitall synchronization between iterations. Random destinations are recalculated at each iteration.
9) Reduce:
Reduce is similar to Broadcast, except the flow of data is reversed (each node aggregates and reduces data rather than propagating it). Our reduce parameters are identical to broadcast for iterations, message size and compute time.
10) Sweep3D: Sweep3D models a wavefront propagating through a mesh, where each CPU represents a 2D column in a 3D mesh. We set pex and pey to 384 and 288, respectively. The problem size in the X, Y and Z dimensions is set to 100. There are 6 variables computed per cell, and the KBA (Nz-K blocking factor) is set to 10.
IV. METHODOLOGY AND RESULTS
In this section we review the results of the motif simulations, examine the network characteristics of each workload, and provide observations to assist in the design of next generation networks. We begin with an assessment of motif runtimes as we taper the number of global links in the network. This is followed by an analysis of how adjusting the bandwidth of the entire network (from 25GBps per link to 12.5GBps per link) impacts motif performance. Finally, we look at the potential for power savings in the network.
A. Performance impact of reducing the number of global links
Reducing the number of global links impacts application performance by reducing available bisection bandwidth. In our simulations we begin with a network that has half bisection bandwidth. This means that each group in the dragonfly topology has 12 connections to each other group. Given 96 groups, each group has 12 × 95 = 1,140 global ports. Because each link occupies one port in each of two groups, this yields 96 × 1,140 / 2 = 54,720 total global links. A network with this many global links would likely be prohibitively expensive. Additionally, the expense of procuring the extra links is only part of the total cost, which also includes powering the links and purchasing higher radix switches.
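The port and link counts for each tapering level used in this paper can be reproduced with a few lines of Python. The function names are illustrative only; the formula simply counts each inter-group link once per unordered pair of groups.

```python
GROUPS = 96


def global_ports_per_group(links_per_pair, groups=GROUPS):
    """Global ports a single group needs to reach every other group."""
    return links_per_pair * (groups - 1)


def total_global_links(links_per_pair, groups=GROUPS):
    """Total inter-group links; each link joins exactly two groups."""
    return links_per_pair * groups * (groups - 1) // 2


# 12-, 6-, 3- and 1-global configurations.
link_counts = {k: total_global_links(k) for k in (12, 6, 3, 1)}
```

For the 12-global network this gives 1,140 ports per group and 54,720 links. The tapered 6-, 3- and 1-global configurations scale linearly from there.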
1) Research Question:
In this section we ask: how many global links, or what percentage of full bisection bandwidth, does an exascale network require for reasonable performance?
2) Methodology: Our experiments evaluate the performance of the motifs and the network as we move towards quarter, 1/8 and 1/24th bisection bandwidth (27,360, 13,680 and 4,560 total global links, respectively). Throughout this work we use the term 12-global to refer to a half bisection bandwidth network for our simulated topology. Similarly, 6-global, 3-global and 1-global refer to quarter, 1/8 and 1/24th bisection bandwidth networks.
Fig 3 shows the impact of reducing the number of global links on the runtime of the evaluated motifs. Looking at the results for 25GBps networks, as we reduce global links to 1/24th of full bisection bandwidth, we see unacceptable increases in runtime for AllPingPong, AMR3D, FFT3D, Halo3D and Halo3D26 (806, 118, 189, 121 and 143%, respectively). The only motifs that see mild increases in runtime are those that use the network the least, such as Sweep3D, broadcast, and AMR3D-lite. While 1/24th bisection bandwidth is clearly a poor choice, the majority of motifs see minimal degradation on the 3-global topology. Specifically, all motifs other than AllPingPong see a more modest (0-11%) increase in runtime for a 75% reduction in global links. AllPingPong, which is essentially a measure of bisection bandwidth and network diameter, shows a more direct penalty to runtime as we decrease global links. In the ternary plot for AllPingPong (Fig 4), we see increased activity on global links, which continues to grow as we remove available global links, with the links becoming fully active for the 1-global network. Similar increases in global link activity are seen in Fig 5-Fig 7. However, in these figures group links see a corresponding increase in stalls as the traffic on each global link increases.
3) Outcomes:
Another observation is that the Halo3D and Halo3D26 motifs experience congestion and stalled cycles even with a 12-global network. Because these two motifs are stalled to begin with, they experience less of a relative increase to runtime than motifs like FFT3D which have relatively few stalls until placed on a 1-global network (Fig 5) . One of the reasons for the stalls observed in Halo3D26 is that the workload does not map to a dragonfly network as well as traditional mesh-based networks. In future work, more refined mapping strategies may be able to improve this.
For brevity we limit the ternary plots we present to those that are most interesting, but we note that across our results local links were generally clustered together more tightly, and were more idle, than global or group links. Additionally, the reader may notice in Fig 3 that Halo3D26 25GBps-6-glbl actually sees a performance improvement compared to the 12-global run. We believe this is likely due to network stalls hitting a critical threshold in the routing algorithm that enables adaptive routing more readily with this number of links.
B. Performance impact of reducing total bandwidth (link widths)
One of the common methods proposed for saving power on the network is reducing the link width, which only partially limits a link rather than completely disabling it.
1) Research Question:
In this section, we ask: what are the performance implications of reduced link width for the evaluated motifs?
2) Methodology: To evaluate this, our simulations only reduce link bandwidth and keep other parameters such as latency identical to those used in the 25GBps simulations.
3) Outcomes: Results in Fig 3 suggest that for the less bandwidth-intensive motifs (such as AllReduce, Random, Reduce and Sweep3D) we may be able to statically reduce link bandwidth by 50% for a run and see only modest increases in runtime. This topic has been explored before for smaller systems [14] using Cray SeaStar interconnects, and our results suggest that static reductions in network bandwidth may continue to provide power saving opportunities with modest runtime costs for a subset of workloads at exascale. Another interesting observation comes from comparing the runtimes and ternary plots of Halo3D26 simulations on 12.5GBps networks (Fig 8) and 25GBps networks (Fig 7). Decreasing the total available bandwidth for this bandwidth-sensitive application increases runtime as expected; however, given the volume of stalled cycles in the 25GBps run, we expected Fig 3 to show a greater than 2X increase in runtime for a 50% link reduction (25GBps to 12.5GBps). Examining Fig 8, we see that the proportion of time spent in stalled cycles decreases as ports spend an increasing proportion of time active. The reason for this behavior is that the adaptive routing threshold in Merlin is determined by the number of packets waiting in the outbound queue. As we decrease the bandwidth to 12.5GBps, switches begin to buffer larger numbers of packets, which leads to more frequent use of Valiant routing, which in turn reduces stalls. With the reduction in stalls on the group and global ports, the figures show that local ports transition to an almost entirely active or idle state.
C. Potential for power and energy savings
One of the goals of this paper was to examine potential power and energy savings on large scale dragonfly networks for relevant workloads. In this section we have several research questions, namely 1) What are the energy savings of a power proportional network? 2) What are the power savings from tapering global links? 3) What are the power savings from a static reduction to link width? 4) Is there potential for dynamic power saving solutions?
1) Methodology:
Since we measure the per-port idle time for each workload, we can derive upper bounds on the energy savings if a network were power proportional. A network is power proportional if it only consumes power corresponding to the amount of data in transmission. While modern networks are not power proportional, a large body of work has proposed dynamically and statically altering the width and frequency of network links to reduce the amount of wasted energy. The established work varies from shutting down the link completely [11], [20] to approaches that only partially reduce link frequency or width [14], [18], [5]. The success of each approach depends on a number of parameters, but dynamic solutions which adaptively alter links must ensure they can disable and re-enable a link within a window of idle time. Idle time windows smaller than the disable/enable time must be forfeited as opportunities for power savings.
In order to explore these questions, we require an estimate of how much power links use at different widths and frequencies. Many of the existing power estimates in this area are either theoretical or pertain to architectures that are more than a decade old. While we are not suggesting these estimates are invalid, we believe it is worthwhile to take empirical measurements of network power and include our findings in this section. Beginning with optical global links, 3W is the commonly used assumption for transceiver power [18], [1]. We were unable to perform empirical measurements on any optical interconnects for this work, so we rely on industry datasheets.
Considering electrical interconnects, Soteriou and Peh [19] reported 0.3W and 0.2W power consumption for IBM Infiniband 12X LPE TX and RX, respectively. This measurement can be used to determine potential power savings when reducing the width of an individual link. In our measurements we used a WattsUp! power measurement device to record switch (Mellanox MTX3600) power as we adjusted link widths and frequencies. In order to measure power savings on the NIC we used PowerInsight [13] to measure the power of Qlogic QDR Infiniband NICs. When we adjusted the network from a 10Gbps 4X network to a 2.5Gbps 1X network we found savings of 1W per port on the Mellanox switch and 0.57W for the Qlogic NIC. For a reduction from 4X to 2X the NIC saw a smaller savings of only 0.29W. Each reported power savings is the average of 40 measurements, with a standard deviation of 0.05W and 0.16W for switch port and NIC, respectively. Our measured power savings are significantly less than the theoretical savings commonly cited in the literature. This is in large part because the Serializer/Deserializers are not disabled on our measured hardware to optimize power savings. Regardless, these numbers provide a lower bound on what we would expect to save on future systems, and we can provide an upper bound using models and measurements from previous literature.
Given these power estimates, we can assume that the electrical switch ports on our network (local and intra-group ports) consume 0.65MW of power at 4X link width, or almost 2MW at 12X link width. The optical global links take up an additional 0.33MW of power for our half bisection topology. The NICs add another 0.76MW (6.83W/NIC in our measurements). Total estimates for the power consumed by the fabric are therefore between 1.73 and 3.02 MW.
Fig. 9. Normalized runtime (averaged across all the motifs of a given network simulation) against the total link power cost of the network. Networks with a good balance of power and performance include the 25GBps 6-global and 3-global, as well as the 12.5GBps 12-global networks. For the power estimates we assume 4X links. The per-port power costs in this figure are set to 0.5W per RX+TX for electrical ports and 3W per transceiver for optical links.
2) Outcomes - power proportional network: The best case for power savings would be a power proportional network, in which we only pay power costs for active time. Looking across all of the motifs evaluated, Halo3D26 had the largest share of active and stalled ports, so we use this motif as a lower bound on the amount of power that could be saved by a power proportional network. For this motif, the average idle time for global ports was 84%, which would be a power savings of 0.28MW for global ports alone. Electrical ports average 82% idle, which yields an additional power savings of 0.71MW to 2.13MW (4X and 12X, respectively). This totals 0.99 to 2.41 MW of potential power savings within the network of our simulated system for the most communication intensive motif we evaluated. Other motifs, like Sweep3D, use fewer network resources and leave links 99% idle on average in our simulations.
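The global-port figure above can be reproduced with a simple idle-fraction model. The sketch below is illustrative only: the function name is hypothetical, and it assumes that a power proportional port draws no power while idle. The transceiver count (two per link for 54,720 links) and the 3W per transceiver figure come from the text.

```python
def proportional_savings_watts(port_count, watts_per_port, idle_fraction):
    """Upper-bound savings if idle ports drew no power at all."""
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    return port_count * watts_per_port * idle_fraction


GLOBAL_TRANSCEIVERS = 2 * 54_720  # one transceiver at each end of a link
WATTS_PER_TRANSCEIVER = 3.0       # datasheet assumption cited in the text

global_savings_mw = proportional_savings_watts(
    GLOBAL_TRANSCEIVERS, WATTS_PER_TRANSCEIVER, 0.84
) / 1e6  # about 0.28 MW
```

The same function can be applied to the electrical ports once a per-port power figure is chosen. The result is an upper bound because real links cannot change power state instantaneously.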
3) Outcomes - reducing global links: As shown in previous sections, most of the motifs simulated do not require half bisection bandwidth to achieve satisfactory performance. By reducing the number of global links we not only save money on the initial system procurement, but also save on power costs throughout the lifetime of the system. Specifically, for the simulated 25GBps network, we can reduce global links by 50% and save 164kW of power. A further reduction to 1/4 or 1/12 of the global links results in savings of 246kW and 300kW, respectively. However, as demonstrated, a reduction to 1/12 the number of global links is impractical for performance reasons. Reducing global links to 12.5% of full bisection incurs some performance penalty for the Halo3D and FFT3D motifs. However, this reduction may be practical for 75 GBps HDR networks, projected for 2017.
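These savings follow from the link counts and the 3W per transceiver assumption. The Python sketch below (function names are illustrative) assumes two transceivers per optical link and takes the 12-global network as the baseline.

```python
GROUPS = 96
WATTS_PER_TRANSCEIVER = 3.0  # datasheet assumption cited in the text


def global_link_power_watts(links_per_pair):
    """Optical power for a k-global network, two transceivers per link."""
    links = links_per_pair * GROUPS * (GROUPS - 1) // 2
    return links * 2 * WATTS_PER_TRANSCEIVER


def tapering_savings_kw(links_per_pair, baseline=12):
    """Power saved, in kW, relative to the 12-global baseline."""
    saved = global_link_power_watts(baseline) - global_link_power_watts(links_per_pair)
    return saved / 1000.0


savings = {k: tapering_savings_kw(k) for k in (6, 3, 1)}
```

Under these assumptions the 6-global and 3-global networks save about 164 kW and 246 kW, matching the figures above, and the 1-global network saves about 301 kW, within roughly 1 kW of the quoted 300 kW.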
4) Outcomes - reducing link width:
A static reduction in link width for the entire network is a transition that could be enacted at low frequency by a power-aware resource manager, depending on the characteristics of the workload. Given our simulations of 12.5 GBps networks, we can estimate the amount of power saved by reducing link widths. Because we keep latency constant throughout our simulations, this provides an accurate mapping to a reduction in link widths for small messages, where link width reductions do not significantly impact the performance (latency) of small, sparse message communication. Link width reductions are not combined with link speed reductions, so the impact of width reductions is mostly limited to medium and large messages. Here, we present a pessimistic estimate based on our empirical measurements of power reduction and a more optimistic estimate derived from the power savings reported in the literature.
First, let's examine a pessimistic model of static power savings. Given 327,168 electrical group and local switch ports, each of which would save at minimum 0.5W for a 2X reduction in link width, our group and local switch ports could save 164kW. Adding the 110,592 NICs, which minimally might save 0.29W each for a reduction from 4X to 2X link width, we gain an additional 32kW of savings. If we consider the global links for a network that has quarter-bisection bandwidth, there are an additional 109,440 ports to derive savings from. If we assume each of these ports could save 1.5W for a reduction to 2X link width, we gain an additional 164kW of power. In total, our lower bound for power savings from a static reduction in link widths is 0.36MW. This is around 1.8% of the projected exascale power budget. If we consider a network built with half bisection bandwidth and more optimistic models of power savings (mentioned in §IV-C), we can increase this estimate to 0.60MW, or 3% of an exascale power budget.
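The pessimistic estimate is a sum of three products, which the following Python sketch reproduces from the port counts and per-port savings quoted above. The 20 MW system budget is an assumption of this sketch, inferred from the stated 1.8% figure rather than given explicitly in the text.

```python
# (count, watts saved per unit) for each component class, from the text.
COMPONENTS = {
    "electrical switch ports": (327_168, 0.5),
    "NICs": (110_592, 0.29),
    "global ports": (109_440, 1.5),
}

# Assumed exascale power budget; inferred, not stated in the text.
ASSUMED_BUDGET_WATTS = 20e6


def static_savings_watts(components=COMPONENTS):
    """Total savings from a static 4X to 2X link-width reduction."""
    return sum(count * watts for count, watts in components.values())


total_watts = static_savings_watts()  # about 0.36 MW
share_pct = 100.0 * total_watts / ASSUMED_BUDGET_WATTS  # about 1.8%
```

Swapping in larger per-port savings from the literature would move the total toward the more optimistic estimate.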
5) Outcomes -dynamic link width:
The difference between the static 0.60MW power savings and the 0.99MW power savings possible in a 4X power-proportional network may be reclaimed by dynamic network power strategies such as those proposed in [11], [18], [19], [20], [5]. However, all of these dynamic strategies require idle intervals on network ports that are long enough to disable or slow down some portion of a network link and bring it back up before it becomes active. While we do not propose any new solutions for predicting idle and active intervals in this work, we present summary statistics of the duration and percentage of time that the network is idle in our simulations, which can inform future endeavors in this area. Typically, clock-matching Phase-Locked Loops are viewed as the bottleneck in increasing a link's width after it has been decreased (aligning input and output phases takes around 400ns [6]). For our work, we consider any idle interval greater than 1μs as an opportunity for dynamic power savings strategies. Additionally, we present the percentage of idle events whose duration is longer than 1ms.
In Table I we report the median idle time for four workloads as well as the percentage of idle events greater than 1μs and 1ms in duration. It is clear that most of the network ports remain idle for a majority of the runtime. These results suggest that a large portion of the idle events (8-66%) could be targeted by dynamic power savings strategies. Only Sweep3D has idle periods longer than 1ms (0.4% of idle events).
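As an illustration of how summary statistics of this kind could be derived, the Python sketch below computes the median idle duration and the percentage of idle events exceeding the 1μs and 1ms thresholds from a list of idle-interval durations. The function name and the sample durations are hypothetical and are not taken from our simulation output.

```python
from statistics import median


def idle_summary(idle_ns, thresholds_ns=(1_000, 1_000_000)):
    """Return (median idle duration, {threshold: % of idle events longer than it}).

    Durations and thresholds are in nanoseconds; the defaults are 1 us and 1 ms.
    """
    if not idle_ns:
        raise ValueError("no idle intervals recorded")
    pct_longer = {
        threshold: 100.0 * sum(d > threshold for d in idle_ns) / len(idle_ns)
        for threshold in thresholds_ns
    }
    return median(idle_ns), pct_longer


# Made-up idle-interval durations in nanoseconds, for illustration only.
sample = [200, 450, 800, 1_500, 3_000, 12_000, 2_000_000, 90]
med, pct = idle_summary(sample)
print(f"median idle: {med} ns")
print(f"idle events > 1 us: {pct[1_000]:.1f}%")
print(f"idle events > 1 ms: {pct[1_000_000]:.1f}%")
```

Only intervals above the 1μs threshold would be counted as candidates for a dynamic link-width change, since shorter intervals leave no room for the roughly 400ns phase realignment.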
V. RELATED WORK
A. Exascale Network Simulation
In [2], Ahn et al. present large-scale simulations of the HyperX network topology, comparing it to other popular topologies. While we are interested in comparing dragonfly networks with other topologies (e.g., Clos trees), it is not the focus of this work. Another simulator, LogGOPSim [9], is a LogGOP-based simulator that runs on inputs of MPI traces. While it is sufficient in many cases, LogGOPSim simulates a fully connected, single-hop network that does not allow for a detailed study of port-level statistics. xSim is another large-scale simulator, developed at Oak Ridge National Laboratory. While xSim has achieved considerable scale in simulation, it utilizes a simpler model of the network and has only recently incorporated models for network congestion [7]. The Rensselaer Optimistic Simulation System (ROSS) has been used to simulate large-scale systems. Liu et al. have examined torus networks at exascale size, strong scaling ROSS to use 128K cores on a Blue Gene/P system [15]. More recently, the ROSS simulator was extended to include dragonfly networks, but simulation was limited to the evaluation of MPI collectives [16].
B. Evaluating Power and Performance of HPC Workloads
Dickov et al. explored link idle times and the potential for power savings in fat-tree networks using the Venus-Dimemas simulator [5]. Our work has a broader focus than network power savings alone, examining best practices of large-scale topology design for dragonfly networks. Bhatele et al. [3] explored how nearby jobs create contention in mesh-based networks. Our work is interested in similar phenomena, but focuses on dragonfly topologies, which are significantly less prone to delays caused by fragmented job placement, since the diameter of a dragonfly network is constant, whereas the diameter of a mesh grows with the number of nodes. Laros et al. [14] examined how static reductions in fabric link width impacted application performance and energy. While their work was done on real systems rather than in simulation, simulation allows us to take a more detailed look at the network fabric, as well as examine larger systems with newer topology designs. Zahn et al. used OMNET++ simulations of 64-node runs with an integrated power model to study the link utilization of networks running the Graph500 benchmark and NAMD on a 3D torus network [22]. They found that there were significant portions of idle time that could be exploited for power and energy savings. Unlike the work in this paper, they studied an integrated NIC/switch network, EXTOLL (Tourmalet), with different workloads and at smaller scale.
VI. CONCLUSIONS
This paper has explored the potential for network power savings at exascale through simulation of dragonfly networks for a variety of workloads. We have found that significant power savings can be realized by scaling back links during idle periods, such that 3-10% of the total system power budget may be reclaimed. We assessed how link width and global link tapering impact motif performance. While some motifs were sensitive to these reductions, we observed that 7 out of 11 motifs were able to withstand significant reductions in available bandwidth with only minor impact on runtime. In systems in which applications are not run across the whole machine, heterogeneous networks may be a possibility; this merits further study for organizations that do not run demanding application types at whole-system scale. In addition, we have shown which configurations of network bandwidth and global link count provide the best balance between power costs and execution time for the workloads studied.
