Data intensive applications comprise a considerable portion of HPC center workloads. Whether large amounts of data transfer occur before, during or after an application is executed, this cost must be considered, not just in terms of performance (e.g. time to completion), but also in terms of the power consumed to complete these necessary tasks. At the system level, scheduling and resource management tools are capable of recording performance metrics and other constraints, and making performance aware decisions. These tools are a natural choice for making power aware decisions as well, and more specifically, power aware decisions about data transfer costs for the entire application workflow. This research focuses on developing data motion power cost models and integrating these models into a task scheduler framework to enable complete power aware scheduling of an entire HPC workflow. We have taken an incremental approach to developing a hierarchical, system wide power model for data motion that starts with core data motion and will eventually encompass data motion across facilities. In this paper, we discuss our current research, which addresses multicore data motion and data motion between nodes.
INTRODUCTION
Data sharing and data manipulation applications have become an integral part of the HPC science application workflow. At petascale computing levels with a trajectory toward exascale, HPC applications generate, store and share terabytes of data. As we approach exascale, the research community is preparing for petabytes to exabytes of data that will need to be stored and shared [12, 2, 1]. Whether large amounts of data transfer occur before, during or after an application's execution, this cost must be considered, not just in terms of performance (e.g. time to completion), but also in terms of the power consumed to complete these necessary tasks. With a proposed annual power budget of 20 MW per exaflop for HPC facilities [1], the power consuming characteristics must be known for all significant activity, and every management level within a system must be power aware. This includes monitoring and understanding the power consumed by components related to data motion.
At the system level, scheduling and resource management tools are capable of recording performance metrics and other constraints, then making performance aware decisions. These tools make decisions about the workload on an HPC system at any given time based on constraints and requirements provided by the users and on constraints that are inherent in the system. They are therefore a natural choice for making power aware decisions, and more specifically, power aware decisions about data transfer costs for the entire application workflow.
While power aware computing research and tools development in HPC have been going on for at least the past decade, very few research efforts and tools have targeted scheduling and resource management tools for power aware data transfer. This research focuses on developing data motion power cost models and integrating these models into a task scheduler framework to enable complete power aware task scheduling of an entire HPC workflow.
In order to achieve this mission, we have taken an incremental approach to developing a hierarchical, system wide power model for data motion. This hierarchical characterization of data motion is based on the distance data has to be transferred, which is analogous to the hierarchical descriptions of memory architectures where memory components are categorized based on their distance from the CPU (i.e. average cycles to access data). Our hierarchy begins with core level data motion and will extend up through the hierarchy to eventually encompass data motion across facilities. Figure 1 illustrates our characterization of the data motion hierarchy. This hierarchy allows us to focus on one layer of the hierarchy at a time to incrementally develop a data motion cost model that will ultimately incorporate every level.
Figure 1: Characterization of data motion hierarchy
The current stages of this research have addressed core and node level data motion. At core level, we assume data is being accessed by multiple cores on the same node that share three levels of cache and DRAM, and that critical data motion power consumption is a result of cache and DRAM memory accesses. To develop power models that represent cache and DRAM accesses, we began by measuring the power consumed by these memory units during the execution of multi-threaded parallel applications in the PARSEC Benchmark Suite [4]. We used the Sniper multi-core, micro-architectural simulator [5] to isolate, measure and collect power consumption data for memory accesses. With this data, we derived linear functions that represent average data motion power consumption, as well as bounding functions to represent error and fluctuations in the model.
Because every level of the data motion hierarchy is partly an aggregation of the levels below it, we devise node level models as an accumulation of core level data motion power consumption with power consumed by network cards and switching devices that connect multiple nodes. The accrued data motion power consumed by all cores on a node is categorized as intra-node data motion power, and power consumed by data that is transferred across multiple nodes is called inter-node data motion power. For inter-node data motion, we assume critical power consumption is a result of data transferred through Network Interface Cards (NICs) and switch devices. The models for representing inter-node data motion are derived from electrical characteristics of switches and NICs, their specification data, and related simulation research.
In this paper, we discuss details of the core and node level data motion power consumption models. In Sections 2 and 3, we describe our approach to deriving core and node level data motion power models, respectively, and present our models. Section 4 gives a brief discussion of related work, and we conclude this paper with a review of our current research and future directions in Section 5.
MULTICORE DATA MOTION POWER MODEL
In many HPC environments, data motion is initiated by a processor core. At this level, data travels through the memory hierarchy from L1 cache to main memory, inducing costs [6] in execution time as well as in power consumption. The execution time costs and relative latency of transferring data through the memory hierarchy are well understood, whereas there is much less understanding of the power consumption costs for moving data through the memory hierarchy. In this section we describe our approach to understanding power consumption costs of data accesses by many cores, and present prototype cost models that can be used to estimate data motion power for a given multi-core architecture.
Approach to Multicore Data Motion Power Model Formulation
We began our study of core level data motion by measuring power consumption for cache and DRAM data accesses on a multi-core CPU. We accomplished this using the Sniper multi-core, x86 micro-architectural simulator. Sniper extends the Graphite [11] simulator, and implements an interval core model for simulating multi-core and multiprocessor systems. Sniper also incorporates the Multicore Power, Area, and Timing (McPAT) framework [8] for power and area modeling. Figure 2 provides an illustration of the integrated Sniper and McPAT framework. We use this simulation framework to measure power consumed in watts (W) for L1-L3 cache and DRAM memory accesses. To understand the implications of memory protocol decisions on power consumption, we set up experiments for various cache and prefetching protocols across a range of core counts on a chip.
The parallel applications we chose for our power data collection experiments are part of the Princeton Application Repository for Shared-Memory Computers (PARSEC) Benchmark Suite [4]. PARSEC is a suite of multithreaded applications that are representative of a variety of shared memory workloads and data access patterns. We selected three applications from PARSEC to simulate. The first application, bodytrack, tracks 3D poses from multiple cameras by dividing the body into multiple segments and extracting edge and foreground information. The second application, canneal, selects routing paths for connecting components on a CPU chip to minimize delay. The last application, dedup, is a deduplication compression algorithm for large scale data streams. Each of these applications has a distinct data access pattern that also occurs commonly in parallel applications, which allows us to further characterize our core level power models by the data access pattern of the application.
Once the data was collected, we performed regression analyses to derive functional representations for our power cost models. With regression analysis, we infer some initial properties of the function based on observations of the data. We then iteratively refine the function until the function provides a reasonable estimate of the original data points. Because power measurements may not be exact for any two experiments with identical input parameters, we want a function that is an approximation of the data set under analysis and not an exact representation. This approximate function is thus a practical representation of power consumed for core level data motion under similar conditions.
In our initial inspections of the core level power data, we observed linear trends in the power consumption measurements for cache and RAM accesses as core counts increased for various memory protocols. This led us to perform linear regression analysis where we derived a linear model of core level data motion for a given architecture and memory protocol. Because the model is an approximation, we also derive upper and lower bounds to serve as comparative margins of error. Details of the model are provided in the next section.
Core Level Experimentation and Model Development
As mentioned in the previous section, we performed multicore experiments using the Sniper simulator to collect measurements of the power for L1-L3 cache and DRAM accesses. We simulated the Intel Ivy Bridge microarchitecture with 32KB L1 cache, 256KB L2 cache, and 20MB L3 cache per core, a total DRAM bandwidth of 59.7 GB/s, and a maximum memory latency of 45ns. The experiments on the PARSEC benchmarks were performed for a range of 8 to 48 cores per node under three memory configurations: MESI coherence with prefetching, MESI coherence without prefetching, and MSI coherence without prefetching. The benchmark suite defines three input sets for each benchmark that are suitable for use in a simulator environment (simsmall, simmedium, and simlarge); of the three, we chose simlarge. Figures 3, 4 and 5 show the power consumption of cache and DRAM accesses plotted for each of the benchmarks with MESI memory coherence and prefetching enabled for 8 to 48 cores.
As can be seen, each of the plots exhibits a linear trend in the data. These linear trends indicate that a linear model could provide a suitable approximation of power consumption for multicore processors in a similar environment. In our linear regression analysis, we determined three parameters to be critical to the construction of our data motion power consumption model: the number of cores, n; the steady state power consumption (watts), P0; and the memory power consumed per core (watts/core), Pc. Our analysis resulted in the general model in equation 1, and for each of the memory configurations, Tables 1, 2 and 3 give values for steady state power and memory power per core for L1-L3 cache and DRAM.
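The fitting step described above can be sketched in Python. This is a minimal sketch, assuming the general model has the linear form P(n) = P0 + Pc * n suggested by the three parameters. The measurement values and the function names `fit_linear_power_model` and `predict_power` are hypothetical and are not taken from the experiments.

```python
from statistics import mean


def fit_linear_power_model(cores, watts):
    """Ordinary least-squares fit of the assumed form P(n) = P0 + Pc * n.

    Returns (P0, Pc): steady state power in watts and power per core
    in watts/core.
    """
    n_bar = mean(cores)
    p_bar = mean(watts)
    sxx = sum((n - n_bar) ** 2 for n in cores)
    sxy = sum((n - n_bar) * (p - p_bar) for n, p in zip(cores, watts))
    pc = sxy / sxx            # slope: watts per additional core
    p0 = p_bar - pc * n_bar   # intercept: steady state watts
    return p0, pc


def predict_power(p0, pc, n):
    """Evaluate the fitted linear model for n cores."""
    return p0 + pc * n


if __name__ == "__main__":
    # Hypothetical power readings for one memory component, for illustration only.
    cores = [8, 16, 24, 32, 40, 48]
    watts = [2.1, 3.0, 4.2, 4.9, 6.1, 7.0]
    p0, pc = fit_linear_power_model(cores, watts)
    print(f"P0 = {p0:.3f} W, Pc = {pc:.4f} W/core")
    print(f"Estimated power at 64 cores: {predict_power(p0, pc, 64):.2f} W")
```

One fit of this kind would be performed per memory component (L1, L2, L3, DRAM) and per memory configuration, giving one (P0, Pc) pair for each table entry.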
Table 2: Parameters for core cache and DRAM power model for MESI coherence and no prefetching
Further observations of the power data revealed consistent oscillations in the plotted data that could be used to facilitate a comparative definition of error in the form of linear upper and lower bound functions. The upper and lower bound functions have the same form and set of parameters as the core data motion power consumption model. To construct the bounding functions, we plot both the measured power consumption and the core power model for each of the memory configurations. For each plot, we determine the points from our power measurements that fall above (greater than) and below (less than) the approximation from the model. Using the points above the model, we select one to three points that are farthest from the model and derive a linear function for the upper bound, such that the function is greater than or equal to all measured power consumption data. For the lower bound, we similarly select one to three points farthest below the model and derive a linear function that is less than or equal to all measured power consumption data. Tables 4, 5 and 6 list the parameter values for the upper and lower bound functions for the memory configurations in our experiments. Figure 6 also provides an illustration of the upper and lower bound functions for two memory configurations.
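The idea of linear bounding functions can be illustrated with a simplified Python sketch. It does not reproduce the point selection procedure described above: as a stated simplification, it keeps the slope of the fitted model and shifts the intercept by the largest residual above and below the line, which is enough to enclose all measurements. The data, the model parameters, and the name `bound_lines` are hypothetical.

```python
def bound_lines(cores, watts, p0, pc):
    """Return ((P0_upper, Pc), (P0_lower, Pc)) enclosing all measurements.

    Simplification: both bounds reuse the model slope Pc and differ only
    in the intercept, shifted by the extreme residuals of the data.
    """
    residuals = [p - (p0 + pc * n) for n, p in zip(cores, watts)]
    upper_shift = max(max(residuals), 0.0)  # farthest point above the model
    lower_shift = min(min(residuals), 0.0)  # farthest point below the model
    return (p0 + upper_shift, pc), (p0 + lower_shift, pc)


if __name__ == "__main__":
    # Hypothetical measurements and an assumed fitted model (P0 = 1.5, Pc = 0.25).
    cores = [8, 16, 24, 32]
    watts = [3.6, 5.3, 7.7, 9.4]
    upper, lower = bound_lines(cores, watts, 1.5, 0.25)
    print("upper bound (P0, Pc):", upper)
    print("lower bound (P0, Pc):", lower)
```

Bounds with their own slopes, as the separate parameter tables suggest, would require fitting a line through the selected extreme points instead of shifting the model.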
NODE LEVEL DATA MOTION POWER MODEL
At the node level, we consider data motion that occurs on the node and data motion between nodes. We characterize these as intra-node data motion, for data accesses on a node, and inter-node data motion, for data transfers that occur between nodes. This means we continue to consider core data motion, as well as additional components relevant to inter-node data motion. We provide details of both intra-node and inter-node data motion in the next sections.
Intra-Node Data Motion Power Cost Model
A node is partly a higher abstraction of the cores it contains. The compute and memory architecture of the cores on the node is also the compute and memory architecture of the node. This means a cumulative model of core data motion is representative of data motion within a node (i.e. intra-node data motion). To this extent, we use the model described in Section 2 to define an intra-node data motion power model. Put simply, intra-node data motion power consumption is equivalent to the power consumed by cache and DRAM, as shown in equation 2.
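The cumulative relationship can be illustrated with a short Python sketch. It assumes each memory component follows a linear core model of the form P0 + Pc * n and that intra-node power is the sum over the cache levels and DRAM. The parameter values below are hypothetical placeholders, not values from the tables.

```python
def component_power(p0, pc, n_cores):
    """Assumed linear core model for one memory component, in watts."""
    return p0 + pc * n_cores


def intra_node_power(params, n_cores):
    """Sum the component models (cache levels and DRAM) for one node.

    params maps a component name to its (P0, Pc) pair.
    """
    return sum(component_power(p0, pc, n_cores) for p0, pc in params.values())


if __name__ == "__main__":
    # Hypothetical (P0, Pc) pairs for illustration only.
    params = {
        "L1": (0.5, 0.10),
        "L2": (0.3, 0.05),
        "L3": (1.0, 0.02),
        "DRAM": (2.0, 0.20),
    }
    print(f"Intra-node power at 16 cores: {intra_node_power(params, 16):.2f} W")
```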
It is important to note that when selecting parameter values for this model, not only is similarity of architecture important, but application similarity is also important. Our core model is most appropriately used to approximate data motion power for applications with similar data access patterns and input data sizes as the applications selected in our experiments. Therefore, our core model parameters form a database that is expanded as we perform new experiments.
Inter-Node Data Motion Power Cost Model
In addition to nodes being a higher abstraction of their cores, nodes are also connected to create a network. Within a network of nodes, devices are used to manage the transfer of data between the nodes. The two primary devices that enable this data transfer are Network Interface Cards (NICs) and switches. As the principal devices for data transfer between nodes, NICs and switches are the components whose power consumption we model for inter-node data motion.
Table 6: Parameters for upper and lower bound of power model for MSI coherence and no prefetching
We were able to define a more scalable approach to determining PNIC and Pswitch, as well as utilize previous research on power consumed by network interface cards. The research in [13] provides a generic linear formulation of PNIC that is shown in equation 4, where Pidle is the power consumption of a NIC in an idle state, Pmax is the maximum power consumption of the device, and pps is the packets per second value of the device.
For our purposes, we replace pps with l, which is the ratio of the average data transfer rate to the maximum transfer rate of the NIC. We also define an easily extensible method for determining Pidle and Pmax, and define a linear formulation of Pswitch.
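A load-based NIC model of this kind can be sketched in Python. The exact form of equation 4 is not reproduced here, so the sketch assumes a common linear interpolation between idle and maximum power, Pidle + (Pmax - Pidle) * l, with l defined as above. The function name `nic_power` and the example values are hypothetical.

```python
def nic_power(p_idle, p_max, avg_rate, max_rate):
    """Estimate NIC power in watts from its load ratio l = avg_rate / max_rate.

    Assumed form: linear interpolation between idle and maximum power.
    The load ratio is clamped to the range [0, 1].
    """
    if max_rate <= 0:
        raise ValueError("max_rate must be positive")
    load = min(max(avg_rate / max_rate, 0.0), 1.0)
    return p_idle + (p_max - p_idle) * load


if __name__ == "__main__":
    # Hypothetical NIC: 2 W idle, 10 W maximum, averaging 4 of 10 Gb/s.
    print(f"Estimated NIC power: {nic_power(2.0, 10.0, 4.0, 10.0):.2f} W")
```

Using a rate ratio rather than packets per second keeps the model independent of packet size, since only the average and maximum transfer rates of the device are needed.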
Defining Inter-Node Parameters
The basis of our approach to defining the idle and max power values lies in understanding the characteristics of the NIC and switch devices, and utilizing the data provided in the device specifications.
Switch Power
To demonstrate our methodology, we begin with an example of a switch specification shown in Figure 7 [10]. The critical information for this spec is circled. Pmax is given directly as the maximum wattage (W), while the remainder of the information is essential to determining Pidle. The first piece of this essential information is VAC. VAC is known as apparent power, and is commonly referred to as volt-ampere (VA) for AC circuit devices. Apparent power is the potential power consumption of the device and can be viewed as input power, whereas W is the true power, or the power that is actually consumed by the device. The ratio of true power to apparent power gives us what is known as the power factor (pf), shown in equation 5.
Since the power factor is a consequence of the circuit design, once determined, it can be used to compute the true power, W, of a device for a given VAC. In the example specification, we are given a range for VAC. The lower number in the range can be interpreted as the minimum operating input power, and the higher number as the maximum input power. Since we are provided the maximum VAC and maximum W, we compute pf with these values. Once we have pf, we take the minimum VAC and compute Pidle (see equation 6).
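The two steps above, the power factor as the ratio of true to apparent power and the idle power as the power factor applied to the minimum VAC, can be written as a small Python sketch. The specification values in the example are hypothetical and do not come from the switch in Figure 7. The function names are illustrative.

```python
def power_factor(max_true_power_w, max_apparent_power_va):
    """pf = true power (W) / apparent power (VA), using the maximum values."""
    if max_apparent_power_va <= 0:
        raise ValueError("apparent power must be positive")
    return max_true_power_w / max_apparent_power_va


def idle_power(pf, min_apparent_power_va):
    """Pidle in watts: the power factor applied to the minimum input power."""
    return pf * min_apparent_power_va


if __name__ == "__main__":
    # Hypothetical spec sheet: maximum 180 W, VAC range 100 to 240.
    pf = power_factor(180.0, 240.0)
    print(f"pf = {pf:.2f}")
    print(f"Pidle = {idle_power(pf, 100.0):.1f} W")
```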
As indicated in the example specification, the true power consumption is also dependent on the type of cable that is used. An active interconnect cable has an embedded chip that enhances data transmission, and a passive cable does not have this chip. The maximum W used to compute pf should therefore be selected according to the type of interconnect cable used with the switching device.
The last piece of essential information provided in a switch specification is the percent usage of the power supply (PS) unit fan. This percentage can be interpreted as power consumed as a result of fan operation. In our model, we compute this as a percentage of the maximum W, represented as Pfan. Equation 7 gives the resulting model of Pswitch.
NIC Power
Network interface cards are powered through the PCI Express (PCIe) bus, and as a result, they have power characteristics that are partially determined by PCIe power management protocols. When examining NIC specifications, we are only provided with one essential piece of information: the true maximum power consumed by the device. The remainder of the information is determined by examining the PCIe specification [14].
In the PCIe specification, the maximum input power is determined by the size and functionality of the device it powers. According to the specification, the maximum input power of a NIC is 25VA. With this information, we compute pf, which is used to compute Pidle for the NIC.
Just as the PCIe protocol determines the maximum input power, the protocol also defines power management states for the various operating conditions of devices it powers. In order to determine Pidle of a NIC, we first determined the power attributes of the relevant idle state. For the purposes of our model, the relevant state is the logical idle state. In the idle state, the NIC is awaiting active data in the form of a packet or special ordered set that needs to be transmitted or received, and is transmitting and receiving idle data until such an event occurs. Idle data is the encoded data byte 0. The input power requirement of this state is a maximum idle differential voltage of 100mV. There is not an explicit requirement for idle current, but the maximum transmission current for any state is 90mA. We use this information along with the power factor to compute a worst case Pidle for a NIC. This calculation is shown in equation 8, where N is the number of transmit and receive lanes.
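A worst case idle estimate along these lines can be sketched in Python. The exact form of equation 8 is not reproduced here, so this sketch assumes that the idle input power per lane is the product of the 100mV idle voltage and the 90mA maximum current, scaled by the number of lanes N and by the power factor computed from the NIC's maximum wattage and the 25VA input limit. The function name and the example NIC values are hypothetical.

```python
PCIE_MAX_INPUT_VA = 25.0   # maximum NIC input power stated in the text
IDLE_VOLTAGE_V = 0.100     # maximum idle differential voltage (100mV)
MAX_TX_CURRENT_A = 0.090   # maximum transmission current for any state (90mA)


def nic_idle_power(nic_max_w, lanes):
    """Worst case NIC Pidle in watts under the assumed form:

        Pidle = pf * N * V_idle * I_max,  with  pf = Pmax / 25VA
    """
    if lanes < 0:
        raise ValueError("lanes must be non-negative")
    pf = nic_max_w / PCIE_MAX_INPUT_VA
    return pf * lanes * IDLE_VOLTAGE_V * MAX_TX_CURRENT_A


if __name__ == "__main__":
    # Hypothetical NIC: 12.5 W maximum true power, 8 transmit and receive lanes.
    print(f"Worst case Pidle: {nic_idle_power(12.5, 8):.4f} W")
```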
RELATED WORK
There have been numerous research efforts on power-aware scheduling. However, most of these efforts have focused primarily on computing rather than data motion.
In [3], [7], [15] and [16], researchers developed techniques that exploit dynamic voltage and frequency scaling (DVFS) capabilities for scheduling tasks in distributed systems. The work in [3] reduces CPU energy consumption using intertask voltage scheduling. This scheduling technique includes three phases: an offline computation of optimal speed given a worst-case workload, online speed reduction based on the actual workload, and online adaptive and speculative speed adjustment based on average-case workload information. In [7], space-shared and time-shared policies are used to schedule bags-of-tasks with deadlines. The scheduling algorithm sets the supply voltage for a processing element, and tasks are only scheduled if they can finish within the deadline. In [15], a framework is developed that integrates scheduling and task voltage assignments to minimize the energy consumption of the system. In [16], virtual machines are allocated in clusters with minimal processor frequencies set.
Some other power-aware scheduling frameworks that do not directly rely on DVFS are [17] and [9]. The framework in [17] proposes to save energy by reducing the number of speed changes, which in turn reduces overhead. This framework also schedules tasks to reduce shared memory contention, which optimizes performance and potentially optimizes energy consumption, and implements scheduling decisions to reduce idle time (i.e. unused/wasted energy). In [9], a constraint graph is formulated that initially encompasses tasks, timing constraints, schedules and slack properties of the schedule. Power profiles that represent the power consumption of a task execution are applied to schedules based on the constraint graph. Then, the scheduler makes decisions to maintain the minimum power consumption needed to retain some desired level of activity, and also to ensure the total power consumption of all running tasks does not exceed a maximum power constraint.
All of these solutions are based on power models that are relative to the compute time and performance of a workload. The goal of this work is to develop a set of power models for data motion so that schedulers will be capable of making accurate power-aware decisions relative to data motion.
CONTINUING RESEARCH AND DEVELOPMENT
In order to develop a power-aware scheduling and resource management framework for data intensive applications, we must be able to adequately determine the power costs for moving data across systems. To create this knowledge base, we propose to develop data motion power models for each level of the data motion hierarchy in Figure 1. To date, this research has addressed the core and node level power models for data motion. In this paper, we presented models for cache and DRAM accesses on multicore nodes, as well as models for approximating power consumed on NIC and switch devices for data transferred between nodes. We will continue our research efforts, and leverage our experience and the methods used in developing these models to construct power models for the remaining levels in the hierarchy.
ACKNOWLEDGMENTS
This work was supported by the United States Department of Defense and used resources of the Extreme Scale Systems Center at Oak Ridge National Laboratory. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
