As scientific applications target exascale, challenges related to data and energy are becoming dominant concerns. For example, coupled simulation workflows are increasingly adopting in-situ data processing and analysis techniques to address the costs and overheads of data movement and I/O. However, it is also critical to understand these overheads and the associated trade-offs from an energy perspective. The goal of this paper is to explore data-related energy/performance trade-offs for end-to-end simulation workflows running at scale on current high-end computing systems. Specifically, this paper presents: (1) an analysis of the data-related behaviors of a combustion simulation workflow with an in-situ data analytics pipeline, running on the Titan system at ORNL; (2) a power model based on system power and data exchange patterns, which is empirically validated; and (3) the use of the model to characterize the energy behavior of the workflow and to explore energy/performance trade-offs on current as well as emerging systems.
INTRODUCTION
The path to exascale computing poses a number of significant challenges as researchers and potential exascale vendors attempt to deliver a hundred times performance improvement relative to today's fastest supercomputers. While simulations running at these extreme scales have the potential for achieving unprecedented levels of accuracy and providing dramatic insights into complex phenomena, they are also presenting new challenges. Key among these are the challenges related to the large data volumes and data rates, and the costs (power and latencies) associated with transporting, managing, and processing this data.
For example, the anticipated changes in system architectures and the anticipated tight power budgets are motivating a shift in scientific simulation workflows, away from a postprocessing paradigm towards a framework in which simulation output is analyzed concurrently as it is generated. A variety of potential workflow options are being explored, including in-situ processing, in which data is analyzed on compute resources shared with the simulation, in-transit processing, in which secondary compute resources on the system are leveraged, and hybrid in-situ/in-transit processing that combines the approaches. Recent efforts [8] have explored trade-offs between these workflow options from a latency and performance perspective for a variety of algorithms on current architectures.
However, as we look to the future, it is imperative to consider energy-performance behaviors and trade-offs, determine how to optimize codes for future processor designs, and provide computer architects with a clear understanding of the characteristics, complexities, and power needs of the applications that will be deployed on these systems.
The goal of this paper is to explore data-related energy/performance trade-offs at extreme scales. Specifically, we analyze the behavior of a combustion simulation workflow with an in-situ data analytics pipeline running on a current high-end computing platform (i.e., the Titan Cray XK7 system at Oak Ridge National Laboratory) and use machine-independent code characteristics (e.g., computational profiles, data access and exchange patterns, messaging profiles, etc.) to develop a power model, which is then empirically validated using an instrumented platform. We use this model to explore energy/performance trade-offs on current systems, to help answer system design questions, and to analyze the power requirements for emerging architectures. Our key contributions include the following:
• a power model based on machine-independent algorithm characteristics, validated through empirical studies,
• a case study of the impacts of system architecture, algorithm design choices, and deployment options (e.g., node mapping) on data exchange patterns and overall energy-performance profiles, and
• a discussion of how our methodology can be extended to help answer questions related to the design of future architectures.
The rest of this paper is organized as follows. Section 2 presents related work. Section 3 presents background material, including the specific use cases and the scientific rationale for the paper. Section 4 presents our power model, and Section 5 presents a validation of this model. Section 6 uses our model to investigate power/performance behaviors at scale and explores power-related co-design issues. Section 7 concludes the paper and outlines directions for future work.
RELATED WORK
In-situ Analysis. Early in-situ analysis workflows focus largely on visualization for monitoring purposes [30, 50, 58, 67, 69, 74]. More recent efforts allow for the coupling of simulation codes with popular visualization and analysis toolkits, such as VisIt [11] and ParaView [26], exposing a broader suite of analytics tools to simulation scientists. As we look ahead to exascale, performance design issues and trade-offs associated with these workflows are becoming increasingly important, leading to the investigation of latency and run-time performance trade-offs between in-situ and in-transit [2, 3, 20, 21, 22, 70, 75] workflows, as discussed by Bennett et al. [8]. In this paper we investigate the roles of system architecture, algorithm design choices, and deployment options on overall energy-performance profiles for a particular system, and present a methodology that can be extended to answer co-design questions for emerging architectures.
Topological Analysis. Topology-based techniques have proven useful in the analysis of a wide variety of simulation data due to their efficient representation of the feature space of scalar functions [7, 12, 33, 43, 52, 53]. One class of algorithms, which includes Reeb graphs [62] and their variants, contour trees and merge trees [17], encodes the level set behavior of a function. These structures are ideally suited for encoding threshold-based feature definitions and have been used successfully in a number of large-scale science applications, including the analysis of extinction regions in non-premixed turbulent combustion [52], a study of lean pre-mixed hydrogen flames [12], and the detection of bubbles in turbulent mixing [43]. Current techniques for computing these structures for large-scale data fall into two categories: 1) streaming out-of-core computation [14, 60] and 2) divide-and-conquer parallel approaches [56, 59] that rely on successive k-nary merging of regions of the domain.
Characterization Tools.
The most common way to identify how an application is utilizing the underlying hardware is to monitor hardware performance counters. The PAPI library [15] abstracts hardware-specific counters into a consistent set of named counters. Unfortunately, the set of available counters is predefined, and typically only a certain number may be active at any one time. Furthermore, the details of the behaviors and features can differ across processor generations and vendors and can also be inconsistent across runs of an application [71]. Static analysis using source-to-source translation tools such as ROSE [61] has the disadvantage that a substantial amount of application behavior (e.g., iterations needed for convergence) is not known at compile time. In addition, advanced compiler-optimization techniques can significantly impact code characteristics, and the details of code generation for a specific target architecture are typically difficult to capture. Binary-code modification (patching an executable to include instrumentation code), as embodied in tools such as DynInst [16] and Pin [36, 49], is too far removed from application source code; it can be difficult to draw connections between the application behavior as understood by the developer and the architecture-specific behavior that such tools report. The approach we take in this work is to instrument code at compile time (using a view of the application close to the developer's) but gather data at run time (using run-time knowledge of what actually gets executed). The implementation is via a custom LLVM [1, 44] compiler pass that operates on the compiler's intermediate representation of the application. This enables us to maintain both programming-language and hardware-architecture independence, while simultaneously providing a wealth of information that characterizes the application in a manner meaningful for energy analysis.
Power and Energy Modeling.
There exists a large body of literature on power and energy modeling at different levels, ranging from the microprocessor to entire systems. Our models are built using a traditional coarse-grained system-level energy/power formulation [10, 24, 31, 39, 64, 65, 68]. These models are simple, fast, have low overhead, and are accurate enough to characterize energy at the system level. Power models often rely on performance counters, and power is usually correlated with these events or estimated using accurate linear power models [5, 19, 37, 38, 45, 47, 51, 55]. In our work, we estimate subsystem activity based on application-centric metrics instead of hardware counters. Detailed subsystem power models have also considered the use of local events to represent power [32, 34, 35, 40, 46, 63] as well as dynamic processor frequency scaling approaches [25, 27, 28, 41] and thermal considerations [6, 9, 66]. Our work focuses on system-level power. Note that we do not use fine-grained power metering data in our extrapolations to Titan, as such data is not typically available on leadership-class computing facilities. However, our approach does provide us with a framework to explore relative behaviors and costs and the associated trade-offs. Our recent work has also focused on characterizing and quantifying energy/power behaviors of data-centric scientific workflows [29], but not yet at scale. Existing work by Shalf et al. [23] has also explored energy efficiency for extreme-scale scientific applications and addressed software/architecture co-design by comparing different architectural alternatives such as multi-cores, GPUs, and many-cores [42]. However, challenges remain on this front and, to the best of our knowledge, the performance/energy trade-offs and co-design aspects that we address in this paper (i.e., combining software, runtime, and architectural issues) have not been addressed at petascale (i.e., on Titan) and beyond.
BACKGROUND
This section describes the specific use case chosen for our study, the different degrees of freedom inherent in the particular algorithm, as well as the scientific rationale behind it.
Workflow. In this paper we consider an in-situ analysis workflow integrated with S3D [18], a massively parallel turbulent combustion code. S3D performs first-principles-based direct numerical simulations of turbulent combustion in which both turbulence and chemical kinetics associated with burning gas-phase hydrocarbon fuels introduce spatial and temporal scales spanning typically at least five decades. S3D is used to glean insights into fundamental turbulence-chemistry interactions and therefore, for production simulations, the time steps taken to advance the solution are smaller than the smallest time scales. In this paper we consider two different S3D simulations. The first is a turbulent auto-ignitive mixture of di-methyl-ether and air under typical homogeneous charge compression ignition (HCCI) conditions [4]. This simulation is aimed at understanding the ignition characteristics of typical bio-fuels for automotive applications and has a domain size of over 175 million grid points. The second simulation describes a lifted ethylene jet flame [73], involving a reduced chemical mechanism for ethylene-air combustion, with a domain size of 1.3 billion grid points.
Of interest to scientists is the behavior of small, intermittent features within the simulations. Understanding the structures' shape and size distributions and tracking their interactions over time provides key insights into complex turbulence-chemistry interactions such as autoignition and extinction. Examples of such features include structures defined by the scalar dissipation rate in the lifted ethylene jet simulation and structures defined by the reaction rate of OH in the HCCI simulation.
Topological analysis tools, such as merge trees, have been used to define these features of interest, provide statistical summaries of their characteristics, and track them over time [7, 13, 14, 54, 72]. The advantages of using such topological analysis tools are twofold. First, features of interest are typically defined in terms of a specific input parameter value (e.g., threshold or isovalue), and topological techniques provide a multi-resolution representation that captures these feature definitions over a range of input parameter values. Second, the results are stored in a compact fashion, allowing for dramatic data reductions while still maintaining complete flexibility in representing the features of interest.
Post-processing analysis workflows are not adequate as S3D currently saves only approximately every 500th timestep of the simulation to disk to mitigate I/O costs. However, due to their time scales, a frequency at least twice this is required to track such features of interest, motivating a shift to in-situ topological techniques. The in-situ merge tree analysis code in this work comprises three main stages: 1) a local compute stage in which families of features are identified locally on each processor and a local tree is generated to capture feature interactions over the range of input parameter values; 2) a merge stage in which the local trees from each processor are joined in a k-nary merge pattern to capture global relationships; and 3) a correction phase, in which the local features at each processor are updated to reflect the global relationships uncovered during the various merge stages. Figure 1 shows a corresponding flow graph for four local processors using a binary merge.
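The merge stage described above can be sketched as a message schedule. The following is a minimal illustration, not the authors' implementation: it assumes that in every round each group of `fan_in` active ranks sends its tree to the lowest rank of the group, and the function name `merge_schedule` is hypothetical.

```python
from typing import List, Tuple


def merge_schedule(num_ranks: int, fan_in: int) -> List[List[Tuple[int, int]]]:
    """Return one list of (src, dest) pairs per merge round.

    Assumption for illustration: the lowest rank of each group acts as the
    merge target, and only merge targets stay active in the next round.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    rounds = []
    stride = 1  # distance between ranks that are still active
    while stride < num_ranks:
        group = stride * fan_in  # span of ranks merged together this round
        pairs = []
        for src in range(0, num_ranks, stride):
            dest = src - src % group  # lowest rank of the group
            if src != dest:
                pairs.append((src, dest))
        rounds.append(pairs)
        stride = group
    return rounds


if __name__ == "__main__":
    # Four local processors with a binary merge, as in the flow graph example.
    for i, pairs in enumerate(merge_schedule(4, 2)):
        print(f"round {i}: {pairs}")
```

Under these assumptions every rank except rank 0 sends exactly one message, so a larger fan-in shortens the chain of rounds without changing the total message count.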
Here we explore two degrees of freedom in setting up the algorithm and study their impacts on the power profile of the overall in-situ workflow. First, each node in the diagram of Figure 1 represents an independent compute kernel which could be placed on an arbitrary MPI rank/compute core. In practice, both the local compute and the local corrections need access to the original data. To avoid excessive data movement, these are therefore typically co-located with their corresponding sub-domain. However, there are exceptions. For example, since node-internal memory access is relatively fast and cheap, one could gather the data of all cores onto fewer cores, or even a single core, for processing. This reduces the overall core count of the analysis code, which enables it to run in a more scalable regime. In this case the dataflow would gain an initial setup phase to collect the local data. However, the merge computation would be less dependent on data locality since all messages are typically small. There exist different strategies for the placement of the various compute kernels. For example, one can ensure that either the early merge phases or the late merge phases happen among cores on the same node, which would prevent MPI traffic in the corresponding portion of the algorithm. The second major aspect of the algorithm's performance, and a controllable input parameter, is the degree of fan-in during the merge stage. Larger fan-ins produce a shallower merging hierarchy with fewer dependencies and shorter chains of messages. However, larger fan-ins also reduce the number of active cores, which more quickly results in an unbalanced load, potentially introducing performance problems. Moreover, the fan-in can affect the overall amount of on- vs. off-node communication.
For example, choosing a fan-in larger than the available core count on each node forces significantly more messages to travel on the network, which slows down the algorithm and increases its power consumption. Below, we empirically investigate the impact these factors can have on the amount of on- vs. off-node communication required as well as their impact on overall power efficiency.
Profiling and Characterization. As detailed in Section 4, our power model accounts for both on-node and off-node communication characteristics of the analysis code. One of the primary goals is to develop a methodology that is not limited to the performance characteristics of a single machine, but rather is flexible enough to map to potential future architectures. Because envisioned exascale architectures are not yet available, we cannot use hardware performance counters or binary instrumentation techniques. Furthermore, architecture simulators are prohibitively slow for characterizing complete applications at scale. To address this challenge, we use compiler-based application analysis with Byfl [57] to identify key data-centric operations in an architecture-independent way. These include counts of the total number of memory accesses as well as the total number of operations performed. To capture communication patterns, we have instrumented our code to generate trace information that encodes the size as well as the rank IDs of the source and destination of all messages sent during application execution.
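To make the on-node versus off-node distinction concrete, the sketch below classifies trace records of the kind just described (source rank, destination rank, size). It is illustrative only: the `Message` record, the consecutive rank-to-node mapping, and the synthetic `first_round_trace` generator are assumptions, not the instrumentation used in this work.

```python
from typing import Iterable, List, NamedTuple


class Message(NamedTuple):
    src: int   # sending MPI rank
    dest: int  # receiving MPI rank
    size: int  # payload size in bytes


def node_of(rank: int, ranks_per_node: int) -> int:
    """Assumed consecutive mapping: ranks 0..r-1 on node 0, r..2r-1 on node 1, ..."""
    return rank // ranks_per_node


def local_fraction(trace: Iterable[Message], ranks_per_node: int) -> float:
    """Fraction of messages whose source and destination share a node."""
    msgs = list(trace)
    if not msgs:
        return 0.0
    local = sum(
        1 for m in msgs
        if node_of(m.src, ranks_per_node) == node_of(m.dest, ranks_per_node)
    )
    return local / len(msgs)


def first_round_trace(num_ranks: int, fan_in: int, size: int = 1024) -> List[Message]:
    """Hypothetical trace of the first merge round only: every rank sends
    to the lowest rank of its group of fan_in consecutive ranks."""
    return [
        Message(r, r - r % fan_in, size)
        for r in range(num_ranks)
        if r % fan_in != 0
    ]


if __name__ == "__main__":
    for fan_in in (8, 16, 32):
        frac = local_fraction(first_round_trace(64, fan_in), ranks_per_node=8)
        print(f"fan-in {fan_in}: {frac:.2f} of first-round messages stay on-node")
```

With 8 ranks per node, this toy first round keeps all messages on-node for a fan-in of 8, 7/15 of them for a fan-in of 16, and 7/31 for a fan-in of 32, which illustrates why a fan-in larger than the per-node rank count pushes traffic onto the network.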
MODEL FORMULATION
This section describes our analytical model, which is inspired by traditional coarse-grained system-level energy/power formulations that are typically based on performance counters or linear models. In contrast, we model energy/power based on architectural characteristics and architecture-independent information provided by Byfl and MPI messages (see Section 3) instead of architecture-dependent performance counters. Other analysis methods (e.g., using data staged on disk) can also be studied using our framework by modeling the behavior and power cost of the subsystems involved. Note that the goal of this formulation is to use global, coarse-grained, machine-independent information so we can extrapolate performance/energy behavior and trade-offs at scale (see Section 6).
Energy modeling. Since the system-level information provided by Byfl is global for the entire application execution (i.e., can be averaged) and data communication information is available for each MPI message, we consider the system and data communication energy consumption separately:
Energy results from integrating power over time. We consider both static (or idle) and dynamic power, i.e.,
Static system power can be decomposed into the power of each contributing subsystem at idle state: processor (cpu), memory (mem), network interface (nic), and other components such as control circuitry, power distribution, etc. (misc). We consider $P^{idle}_{cpu}$, $P^{idle}_{mem}$, $P^{idle}_{nic}$, and $P^{idle}_{misc}$ as architecture constants that can be estimated using hardware specifications or measured empirically. Thus,
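As a reading aid, the decomposition described so far can be written out as follows. This is a sketch under stated assumptions rather than the paper's exact equations: the symbol names $E_{system}$, $E_{comm}$, $P^{static}_{system}$, and $P^{dynamic}_{system}$ are taken from how they are used elsewhere in the text, and the run time $T$ is introduced here for illustration.

```latex
\begin{align*}
E &= E_{system} + E_{comm} \\
E_{system} &= \int_{0}^{T} \left( P^{static}_{system} + P^{dynamic}_{system}(t) \right) dt \\
P^{static}_{system} &= P^{idle}_{cpu} + P^{idle}_{mem} + P^{idle}_{nic} + P^{idle}_{misc}
\end{align*}
```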
Dynamic power only includes processor and memory contributions because the network interface's dynamic power is accounted for in $E_{comm}$. Dynamic power can be determined using dynamic power dissipation models. For example, processor dynamic power can be formulated from its capacitance ($C$), an activity factor or switching activity ($\alpha$), and the operational voltage ($V$) and frequency ($f$), i.e.,
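For reference, the textbook CMOS dynamic power relation in terms of the four quantities just named is commonly written as shown below. The symbol $P^{dynamic}_{cpu}$ is an assumed name, and some formulations include an additional constant factor.

```latex
P^{dynamic}_{cpu} = \alpha \, C \, V^{2} f
```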
Analogous to α in Equation 4, we approximate dynamic processor and memory power dissipation using an activity factor extracted from the Byfl data:
These activity factors are computed from the number of operations per second ($mips$) and the number of memory accesses per second ($mem_{bw}$), as well as a normalization factor that represents the maximum capacity.
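The following sketch shows one way the system energy term could be evaluated from such activity factors. It is an illustration under explicit assumptions, not the model as published: dynamic power is assumed to scale linearly with the activity factor up to a per-subsystem maximum, power is averaged over the run, and all names (`NodeParams`, `p_dyn_cpu_max`, and so on) and numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeParams:
    # Idle (static) power per subsystem, in watts.
    p_idle_cpu: float
    p_idle_mem: float
    p_idle_nic: float
    p_idle_misc: float
    # Assumed maximum dynamic power per subsystem, in watts.
    p_dyn_cpu_max: float
    p_dyn_mem_max: float
    # Normalization factors: maximum capacity of each subsystem.
    ops_per_sec_max: float
    mem_accesses_per_sec_max: float


def activity(rate: float, capacity: float) -> float:
    """Activity factor: observed rate normalized by maximum capacity, capped at 1."""
    return min(rate / capacity, 1.0)


def system_energy(p: NodeParams, ops: float, mem_accesses: float, seconds: float) -> float:
    """System energy in joules from global (run-averaged) operation counts."""
    alpha_cpu = activity(ops / seconds, p.ops_per_sec_max)
    alpha_mem = activity(mem_accesses / seconds, p.mem_accesses_per_sec_max)
    p_static = p.p_idle_cpu + p.p_idle_mem + p.p_idle_nic + p.p_idle_misc
    # Assumption: linear scaling of dynamic power with the activity factor.
    p_dynamic = alpha_cpu * p.p_dyn_cpu_max + alpha_mem * p.p_dyn_mem_max
    return (p_static + p_dynamic) * seconds


if __name__ == "__main__":
    params = NodeParams(10.0, 5.0, 3.0, 2.0, 40.0, 10.0, 1e9, 1e8)  # made-up values
    print(system_energy(params, ops=5e9, mem_accesses=2.5e8, seconds=10.0), "J")
```

The network interface appears only through its idle power here, in line with the statement that its dynamic contribution is accounted for in the communication energy.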
The energy consumption from data communication depends on the set of MPI messages ($M$). Each MPI message is composed of a source MPI rank ($src$), a destination MPI rank ($dest$), and the data size of the message ($data$). Thus,
$$M = \left( \{src_1, dest_1, data_1\}, \ldots, \{src_m, dest_m, data_m\} \right) \tag{7}$$
MODEL VALIDATION
In order to empirically validate our model, we conducted a set of experiments using a cluster platform at Rutgers ("Dell cluster", hereafter). The cluster consists of two Dell PowerEdge M1000E blade enclosures, each maximally configured with sixteen PowerEdge M610 nodes (blades), each node having two Intel Xeon E5504 Nehalem processors at 2.4 GHz. Each node also has 6 GB RAM (6×1 GB DIMM Hynix DDR3-1333 ECC) and 73 GB of local disk storage. The network infrastructure comprises an integrated 16-port Mellanox M2401G InfiniBand switch within each blade chassis (running at 50W, 36.8W dissipated + 1.65W per port), each switch having eight uplink ports and linked via eight InfiniBand lanes to the uplink ports on the switch in the other chassis. All blades have Mellanox ConnectX MT26428 Quad-Data-Rate (QDR) InfiniBand interface cards. There is also an integrated (redundant) 1 Gigabit Ethernet network within each chassis, with two pairs of 10 Gigabit uplinks in each chassis. Each blade enclosure also includes metering functionality that can be queried at run time using the Simple Network Management Protocol (SNMP).
Table 1 lists the system parameters that are used in our model. Note that $P^{static}_{system}$, $BW_{net}$, and $BW_{mem}$ were obtained empirically using standard benchmarking. The remaining parameters are theoretical values taken from the vendors' specifications. We separately validate our energy model and the associated data communication ($E_{comm}$; see Equation 8) because data communication is an important aspect of our study.
Data communication model validation. We ran an MPI variant of ping-pong tests five times using 2, . . . , 32 nodes (i.e., in pairs). The experiments were run with the whole cluster idle, and provided us with an empirical measurement of the network bandwidth ($BW_{net}$ in Table 1). Figure 2(a) plots the average power dissipation of the system when running the ping-pong tests.
We plot the average power and the variability in the experiments since power is independent of the data size. We also plot the power ($P_{transfer}$) obtained using our model (see Equation 8) for the Dell cluster, where $P_{over}$ is the power from processor and memory associated with system overheads:
Figure 2(b) shows the accuracy of our model. The figure plots variability as well as the average value (99.43%).
Model validation at small scale. We ran the in-situ topological analysis application using one and two blade enclosures of the Dell cluster (i.e., 256 and 512 MPI processes, respectively) with the HCCI and Lifted data sets and fan-in={2, 4, 8, 16}. We instrumented the application as described in Section 3 in order to collect data about architecture-independent operations and MPI communications. We also ran the same set of experiments ten times to measure real execution times and collect power readings. Figure 3(a) displays the energy consumption and variability for the set of experiments described above, from empirical measurements (observed) and from our model. The error bars correspond to the standard deviation. Figure 3(b) shows the accuracy of the model for the different experiments, which is 98.59% on average.
EXPLORING POWER/PERFORMANCE TRADE-OFFS AT SCALE
This section describes how we extrapolate our power model to Titan, and presents the results of experiments in which we have used our model to explore the impacts of system architecture, algorithm design choices, and deployment options on data exchange patterns and overall energy performance.
Power Model Extrapolation
System power model. We extrapolate our system power model to Titan based on its known power requirements and hardware vendor specifications with the following reasoning. Titan draws 8,209kW at full capacity (source: http://www.top500.org), which results in 439W per node (from a total of 18,688 nodes). Each node has one Opteron 6274 processor ($P_{cpu}$), 32GB of RAM in four 8GB DIMMs ($P_{mem}$), and one NVIDIA K20x GPU ($P_{gpu}$), and every two nodes share a Cray Gemini interconnect ASIC (power information not publicly available). Consequently, the remaining 89W of power (i.e., $P^{dynamic}_{system} - P_{cpu} - P_{mem} - P_{gpu}$) comes from the Gemini interconnect and other components (e.g., control circuitry). As a rough estimate, we have fixed the amount of power attributed to the network interface to 95% of this remaining power (i.e., $P_{nic}$ = 85W). However, we address this problem from a co-design perspective, as discussed in Section 6.2.
Gemini interconnect model. Titan contains 9,344 Gemini SoC ASIC controllers, each of them attached to two compute nodes via its two NICs (see Figure 4). The Gemini controllers are arranged in a 25×32×24 3D torus topology with 6 links per controller to its 6 surrounding neighbors. In contrast to switched networks such as InfiniBand, Gemini is a direct network, which means that the processors are integrated directly into the network fabric. As a result, the energy required to transfer data depends on locality. Our communication energy model ($P_{transfer}$ in Equation 8) therefore takes into account the number of Gemini controllers that are traversed to send a message from a source MPI rank to a destination MPI rank (i.e., the number of hops).
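The per-node budget arithmetic in the system power model paragraph can be reproduced with a few lines. This is a worked check only: the node count is derived here from the 9,344 Gemini controllers with two nodes each, the 89W remainder is taken directly from the text rather than recomputed (the individual CPU, memory, and GPU figures are not given), and the function names are hypothetical.

```python
TITAN_POWER_W = 8_209_000      # full-capacity draw quoted in the text (8,209 kW)
GEMINI_CONTROLLERS = 9_344     # each controller serves two compute nodes
REMAINING_W = 89.0             # per-node power left after CPU, memory, and GPU (from the text)
NIC_SHARE = 0.95               # rough share of the remainder attributed to the NIC


def per_node_power_w() -> float:
    """Average full-capacity power per compute node, in watts."""
    nodes = 2 * GEMINI_CONTROLLERS
    return TITAN_POWER_W / nodes


def nic_power_w(share: float = NIC_SHARE) -> float:
    """Power attributed to the network interface for a given share of the remainder."""
    return share * REMAINING_W


if __name__ == "__main__":
    print(f"per-node power: {per_node_power_w():.1f} W")
    print(f"NIC power at {NIC_SHARE:.0%} of the remainder: {nic_power_w():.1f} W")
```

Passing a different `share` to `nic_power_w` gives the kind of sweep over NIC power that a co-design study would vary.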
We have implemented a simulator that, given a set of MPI messages $M$ (see Equation 7) and an MPI rank-core mapping policy, returns the number of hops required by each message using the steps listed below:
1. Create the given $\{N_X, N_Y, N_Z\}$ 3D torus network topology ({25×32×24} in our case).
2. Map MPI ranks onto the 3D torus (e.g., onto consecutive cores to improve memory locality).
3. Find the shortest path (in number of hops) from the source to the destination MPI rank for each message.
We obtain the shortest path between two MPI ranks by dividing the three-dimensional space associated with the 3D torus into three simpler one-dimensional spaces. Specifically, to find the shortest path from an MPI rank mapped onto $\{X_S, Y_S, Z_S\}$ to an MPI rank mapped onto $\{X_D, Y_D, Z_D\}$, we follow the two main steps below.
1. Find the shortest path for each dimension individually:
• $P_X$, i.e., from $\{X_S, 0, 0\}$ to $\{X_D, 0, 0\}$
• $P_Y$, i.e., from $\{0, Y_S, 0\}$ to $\{0, Y_D, 0\}$
• $P_Z$, i.e., from $\{0, 0, Z_S\}$ to $\{0, 0, Z_D\}$
2. Sum the distances of paths $P_X$, $P_Y$, and $P_Z$.
Each dimension (X, for example) contains $N_X$ nodes connected as a ring (i.e., the first node is connected to the second node, the second to the third, etc., and the last ($N_X$-th) node is connected to the first node). Consequently, we need to consider the following two paths in order to find the shortest distance ($D$): the direct path between the two positions and the path that wraps around the ring.
$$D_{1,X} = |X_S - X_D| \tag{10}$$
$$D_{2,X} = \min(X_S, X_D) + N_X - \max(X_S, X_D) \tag{11}$$
As a result, the shortest path $D_X$ is the minimum of Equations 10 and 11. The same reasoning can be applied to the Y and Z dimensions.
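The per-dimension shortest-path procedure above can be sketched as follows. This is a minimal illustration, not the simulator used in this work: the function names are hypothetical, and `rank_to_coord` assumes a simple consecutive mapping with one node per torus position, which ignores the fact that two nodes share each Gemini controller.

```python
from typing import Tuple

Coord = Tuple[int, int, int]


def ring_distance(a: int, b: int, n: int) -> int:
    """Shortest hop count between positions a and b on a ring of n nodes."""
    direct = abs(a - b)                    # path that does not wrap around
    wrapped = min(a, b) + n - max(a, b)    # path through the wrap-around link
    return min(direct, wrapped)


def torus_hops(src: Coord, dst: Coord, dims: Coord) -> int:
    """Sum of the per-dimension ring distances on a 3D torus."""
    return sum(ring_distance(s, d, n) for s, d, n in zip(src, dst, dims))


def rank_to_coord(rank: int, ranks_per_node: int, dims: Coord) -> Coord:
    """Assumed consecutive mapping: fill X first, then Y, then Z."""
    nx, ny, nz = dims
    node = rank // ranks_per_node
    if node >= nx * ny * nz:
        raise ValueError("rank does not fit on the torus")
    return (node % nx, (node // nx) % ny, node // (nx * ny))


if __name__ == "__main__":
    dims = (25, 32, 24)
    src = rank_to_coord(0, 16, dims)
    dst = rank_to_coord(16 * 24, 16, dims)  # node 24 sits at X = 24
    print(src, dst, torus_hops(src, dst, dims))
```

In this example the two nodes sit at opposite ends of the X ring, so the wrap-around link gives a distance of one hop instead of 24.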
Experimental Results
We performed a number of experiments using different fan-in parameters, domain decompositions, and node layouts, and the results of our analysis are captured in Table 3 and Figures 5, 6, and 7. Figure 5 summarizes the macro communication behavior of the topological analysis of the lifted ethylene jet data set. The information was obtained through a collection of experiments performed using algorithmic variations, including total number of cores, domain decomposition, and merge fan-in, as well as varying the node-mapping strategies. Each column in the graphic corresponds to a different node-mapping configuration, while each row in the graphic corresponds to variations in fan-in merge values. Each column of each subfigure displays a histogram of the number of hops that MPI messages require to reach the destination rank from the source rank. The x axis of each subfigure corresponds to the number of cores involved in the run, which in turn determined the domain decomposition used. The default mapping algorithm maps MPI ranks consecutively across allocated cores, while the three subfigures in the rightmost column use a random mapping algorithm (within a partition of the torus). The thickness of each point represents the number of messages that required the corresponding number of hops to be transferred, and the color of the point represents the relative number of messages with respect to the maximum number of messages for the set of experiments.
Algorithmic effects. Figures 5 and 6 show the amount of local communication, which gives an estimate of the number of MPI messages (i.e., network data communication) for each configuration. On examination of the figures, we see several clear trends. First, total network data communication increases as the fan-in increases. Second, the total number of messages increases as the number of cores increases, but the impact on the number of hops is not significant unless the random mapping algorithm is used.
This suggests that the algorithm is highly scalable in terms of network communications, which is expected as data communication is structured as a tree. The mapping algorithm can significantly impact the amount of local ("on-chip") communication, and therefore choosing the optimal fan-in will depend mainly on the number of MPI ranks that can be allocated per node and, consequently, the number of available cores. This has implications for architecture co-design, as discussed at the end of this section.
Effects of runtime MPI rank mapping. As we can see from Figure 5, the mapping algorithm clearly impacts the number of hops required to send messages. Specifically, as the number of ranks per node increases, there is a corresponding decrease in the total data communication while the number of ranks per node is greater than the fan-in merge parameter. When the number of ranks per node is equal to or larger than the fan-in merge parameter, most of the communication is local. As can be observed by looking at the diagonals of the graphic in Figure 5 from top-left to bottom-right, the mapping algorithm and fan-in are actually highly correlated ($\text{hops} \sim \frac{\text{fan-in}}{\text{ranks per node}}$). Since the communications are structured in a hierarchical merge pattern, when the fan-in is twice the number of ranks per node, approximately half of the messages require off-node communication. For example, if we map 8 ranks per node, with a fan-in of 8, most of the communication will be local. However, with fan-ins of 16 and 32, only around 50% and 25% of the communications, respectively, will be local (see Figure 6). Moreover, as would be expected, the consecutive mapping algorithm presents better data locality than the random MPI mapping algorithm, regardless of the number of ranks per node. Random allocation results in a larger number of hops, which is not scalable from either a latency or an energy perspective, as can be observed in Figure 8 (note the changes in scale along the Y axis across the various plots). This supports the argument that data locality is essential for energy efficiency at scale. It is also interesting to note that in Figure 5, under a random distribution of MPI ranks, communications require more hops on average than with one MPI rank allocated per node, in spite of the fact that under the random distribution there are 16 ranks per node. However, upon examination of Figure 6, we note that the percentage of local communication is similar for these two configurations.
System architecture effects. Detailed power information is not always available for every component of the system. For example, we have little information about the power drawn by the Gemini NIC. Furthermore, it is often desirable to examine co-design questions such as, "What would be the change in full-system power if a future-generation NIC were to consume less power than current-generation NICs?" Our power-modeling approach is robust to both these types of missing information. Specifically, we can examine ranges of power consumption to bound the impact of any component on the power consumed by the whole system.
Figure 7 presents an example of such a study. The x axis represents the number of cores, and the y axis represents system power. Each curve represents a different value of NIC power (from 0% to 100%) as a percentage of the power remaining after subtracting the other subsystems' power consumption, as discussed in Section 6.1. As Figure 7 shows, the model indicates, for example, that if NIC power were reduced from 100% to 0%, the system power when running the in-situ topological analysis at 30,000 cores would drop by about 40%. This information can be invaluable when co-designing a system because it helps examine power/performance trade-offs in a system- and application-centric manner. Furthermore, we believe the current operational power region of the network interface to lie within the dark shaded area of Figure 7 (i.e., between 33% and 66% of the remaining node power is used by the NIC). Architectural solutions could also mitigate the algorithm and runtime rank-mapping effects discussed above. For example, Figure 8(a) shows that the percentage of energy consumed for data communication is high when the fan-in is larger than the number of ranks mapped per node. An algorithmic solution to this problem is to use a lower fan-in, but this would incur the associated penalties in execution time and energy. A potential architectural solution, however, would be to use two shared-memory multi-processors instead of four independent processors (see Figure 8(b)), which would also reduce the percentage of energy consumed for data communication.
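The interaction between fan-in, ranks per node, and rank mapping can be explored with a small message-counting model. The sketch below is a simplified illustration under stated assumptions, not the in-situ analysis code: it assumes that each group of `fan_in` consecutive active ranks sends one message to the first rank of the group at every merge level, that all messages count equally regardless of size, and that a message is "remote" whenever sender and receiver sit on different nodes. The names `merge_messages` and `remote_fraction` are hypothetical.

```python
import random


def merge_messages(num_ranks, fan_in):
    """Yield (sender, receiver) pairs of a hierarchical merge with the given fan-in."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    active = list(range(num_ranks))
    while len(active) > 1:
        leaders = []
        for start in range(0, len(active), fan_in):
            group = active[start:start + fan_in]
            leader = group[0]          # first rank of the group receives the merge
            leaders.append(leader)
            for sender in group[1:]:
                yield sender, leader
        active = leaders               # only the leaders take part in the next level


def remote_fraction(num_ranks, fan_in, ranks_per_node, mapping="consecutive", seed=0):
    """Fraction of merge messages that cross node boundaries."""
    slots = list(range(num_ranks))
    if mapping == "random":
        random.Random(seed).shuffle(slots)   # ranks placed on arbitrary node slots
    elif mapping != "consecutive":
        raise ValueError("mapping must be 'consecutive' or 'random'")
    node_of = [slot // ranks_per_node for slot in slots]
    total = remote = 0
    for sender, receiver in merge_messages(num_ranks, fan_in):
        total += 1
        remote += node_of[sender] != node_of[receiver]
    return remote / total


for fan_in in (8, 16):
    for mapping in ("consecutive", "random"):
        share = remote_fraction(512, fan_in, 8, mapping)
        print(f"fan-in {fan_in:2d}, {mapping:11s}: {share:.1%} remote messages")
```

In this toy model, with 512 ranks and 8 consecutive ranks per node, a fan-in of 8 yields roughly 12% off-node messages and a fan-in of 16 roughly 56%, while random placement makes nearly all messages remote. These values are properties of the sketch and are not expected to reproduce Figures 5, 6, or 8, which also reflect message sizes and network hops.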
CONCLUSION AND FUTURE WORK
As scientific applications target exascale, challenges related to data and energy are becoming dominating concerns. In this paper we explored data-related energy/performance trade-offs for end-to-end simulation workflows running at scale on current high-end computing systems. Specifically, we have 1) developed and validated a power model based on machine-independent algorithm characteristics, 2) used the model to explore the impacts of system architecture, algorithm design choices, and deployment options on data exchange patterns and overall energy performance, and 3) discussed how to extend our model to help answer design questions for emerging architectures. For example, as Figure 7 demonstrates, it may be possible to use our model to co-design applications and networks for power efficiency. That is, given network power/performance trade-offs from a network vendor, we can model the overall system power consumption of in-situ topological analysis to jointly determine how to maximize the ratio of code performance to power consumption. While our current model considers macro system behaviors, we plan to extend it to account for deep memory hierarchies, and to compare and contrast the behaviors of a larger set of use-case scenarios on potential architectures to further study the impact of various co-design trade-offs. Our ongoing work also includes the use of finer-grained modeling and system-specific parameters when they are available.
subcontract number 4000110839 from UT Battelle, by the US National Science Foundation (NSF) via grant numbers ACI 1339036, ACI 1310283, DMS 1228203 and IIP 0758566, and by an IBM Faculty Award. The research at Rutgers was conducted as part of the NSF Cloud and Autonomic Computing (CAC) Center at Rutgers University and the Rutgers Discovery Informatics Institute (RDI2). The authors wish to thank the members of the ExaCT Center for Exascale Simulation of Combustion in Turbulence for useful discussions and support.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. Los Alamos National Laboratory is operated by Los Alamos National Security LLC for the U.S. Department of Energy under contract DE-AC52-06NA25396.
