Energy efficiency and energy-proportional computing have become major constraints in the design of modern exascale platforms. Dynamic voltage and frequency scaling (DVFS) is one of the most commonly used and effective techniques to dynamically reduce power consumption based on workload characteristics. Many energy-saving strategies, however, employ DVFS without considering its applicability level in a given multicore processor. The present work demonstrates that disregarding the DVFS granularity as a design constraint while developing energy-saving strategies may reduce the efficacy of the strategy for certain application classes. Specifically, DVFS applicability levels (called granularities) are determined here for three different Intel processor microarchitectures. Then, the degradation in energy savings is evaluated for the case in which the DVFS granularity is disregarded. Experiments were conducted on the widely used quantum chemistry packages, General Atomic and Molecular Electronic Structure System (GAMESS) and NWChem, and show that, for granularity-unaware energy-saving techniques, the overall energy consumption may increase by as much as 19%.
Introduction
Power consumption has become a major concern in the design of upcoming exascale systems. For the current topmost petascale computing platforms 1 in the world, it is typical to consume power on the order of several megawatts, which at current prices may cost on the order of several million dollars annually. For operational sustenance of the exascale machines, the power consumption growth rate must slow down and deliver more calculations per unit of power. To address this challenge, power and energy optimizations have been proposed in modern computing platforms at all levels: application, system software, and hardware.
It is well established that the CPU and the memory subsystem are the major power consumers in a computer system. For example, the CPU consumes about 50% of the total power, as investigated in Ge et al. (2010), considering both static and dynamic power consumption. Memory power consumption is a significant component of a server's power profile; it is comparable to, and may even surpass, processor power consumption for memory-intensive (MI) workloads.
The current generation of Intel processors provides different performance states (P-states) for dynamic voltage and frequency scaling (DVFS). More specifically, the Intel "Haswell" microarchitecture provides a total of 15 P-states. The delay of switching from one state to another depends on the relative ordering of the current and desired states, as discussed, for example, in Park et al. (2010) . The user may write a value to model-specific registers (MSRs) to change the P-state of the processor.
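As a rough illustration of what writing a P-state request to an MSR can look like, the sketch below encodes a target frequency for IA32_PERF_CTL. It rests on several assumptions that are not stated in this article: a Linux system with the `msr` kernel module loaded and root privileges, a 100 MHz bus clock, and the target ratio residing in bits 15:8 of the register, as on Sandy Bridge and later Intel processors. Older processors may use a different encoding. The helper names are hypothetical.

```python
import os
import struct

IA32_PERF_CTL = 0x199   # MSR address of the P-state request register
BUS_CLOCK_MHZ = 100     # assumed bus clock; ratio = frequency / bus clock


def perf_ctl_value(freq_mhz: int) -> int:
    """Encode a target frequency as a ratio in bits 15:8 (assumed layout)."""
    if freq_mhz % BUS_CLOCK_MHZ != 0:
        raise ValueError("frequency must be a multiple of the bus clock")
    ratio = freq_mhz // BUS_CLOCK_MHZ
    if not 0 < ratio <= 0xFF:
        raise ValueError("ratio does not fit into 8 bits")
    return ratio << 8


def msr_device_path(core: int) -> str:
    """Path of the per-core MSR device exposed by the Linux msr driver."""
    return f"/dev/cpu/{core}/msr"


def request_frequency(core: int, freq_mhz: int) -> None:
    """Write the request; needs root and the msr module (not run here)."""
    fd = os.open(msr_device_path(core), os.O_WRONLY)
    try:
        # The msr device expects an 8-byte value at offset = MSR address.
        os.pwrite(fd, struct.pack("<Q", perf_ctl_value(freq_mhz)), IA32_PERF_CTL)
    finally:
        os.close(fd)
```

The write is only a request; the frequency the hardware actually applies depends on the DVFS granularity of the processor.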
DVFS granularity refers to the grouping of cores in a processor socket with respect to independent frequency and voltage scaling domains. Most past energy-saving strategies (Ge et al., 2007; Rountree et al., 2009), except for one of the most recent (Bhalachandra et al., 2017), do not consider DVFS granularity as a design parameter. Chemistry applications based on "global arrays" (such as NWChem and GAMESS) tend to segregate the available threads (processes) into two types: compute processes and data servers. Therefore, DVFS granularity becomes quite important in energy-saving strategies for these kinds of applications or, in general, for the class of applications using distributed shared memory. This point can be substantiated by examining the micro-operations executed per unit time on cores 0 and 4 of a hex-core Haswell-EP system (Figure 1) named Marquez, executing a 31-atom system using the RIMP2 algorithm in NWChem (Valiev et al., 2010).
The micro-operations metric is being shown here as it is one of the most prominent metrics, used to quantify the variation in performance with respect to frequency scaling. Moreover, the micro-operations retired for cores 0 and 4 are shown because they are on the same socket. The huge variation in the values of micro-operations retired on the two cores emphasizes the need to consider DVFS granularity while designing energy-saving strategies.
This work studies the possible effects of DVFS granularity on the effective energy savings delivered by an energy-saving strategy. More specifically, three Intel processors were used along with the quantum chemistry application GAMESS (Schmidt et al., 1993) to demonstrate that obliviousness to DVFS granularity can significantly degrade the potential of an energy-saving strategy. In this work, the DVFS granularity of three different platforms is determined using the Embarrassingly Parallel (EP) benchmark from the NAS suite (Bailey et al., 1991). The EP benchmark is chosen because it is very sensitive to frequency scaling, that is, it has close to nil off-chip time and its execution time scales linearly with changes in processor frequency. Therefore, any change in the effective frequency is reflected immediately in the execution time and node power consumption of the EP benchmark.

The rest of the article is organized as follows. Section 2 provides an overview of related work with an emphasis on a few energy-saving techniques and their salient features. Section 3 describes the method of determining the DVFS granularity of the different Intel processors used in this work. Section 4 depicts the effect of varying DVFS granularities on an energy-saving strategy using GAMESS as an application benchmark. Section 5 provides conclusions and possible future directions.
Related work
There have been two general approaches for obtaining energy savings during parallel application execution. The first approach focuses on bottleneck determination during application execution through architectural parameters from performance counters, as proposed in Ge et al. (2007), Hsu and Feng (2005), and Huang and Feng (2009). Adagio (Rountree et al., 2009), apart from using performance counters, performs a critical path analysis to determine which tasks may be slowed down to minimize the performance loss in the parallel execution. Besides communications, Adagio also monitors the computation parts of the application to determine suitable opportunities for the application of DVFS.
The second approach determines the communication phases to apply DVFS as, for example, in Lim et al. (2006) and Freeh and Lowenthal (2005). Etinski et al. (2009) propose a technique that applies both DVFS and over-clocking to the CPUs to save energy and to improve execution time. In Kandalla et al. (2010), algorithms to save energy in the collectives, such as MPI_Alltoall and MPI_Bcast, are proposed. The work in Ioannou et al. (2011) describes a runtime system for the Intel single-chip cloud computer (SCC) processor. This system detects repeatable communication phases and then applies frequency scaling. To mitigate the effects of voltage and frequency domains on SCC, the system proposes different frequency scaling determination mechanisms, such as allmean (Section 4), which have been shown to result in increased energy consumption.
A common shortcoming of all these approaches is that they do not consider the granularity of DVFS available on the underlying processor as a design parameter while devising the energy-saving strategy.
In Ge et al. (2010), the authors have developed a tool, PowerPack, which estimates power consumption characteristics of a parallel application in terms of various CPU components. Employing a combination of tuned micro-benchmarks and incremental step methodology, Kestor et al. (2013) estimate the energy cost of moving data between any two levels of the memory hierarchy. The authors present an analysis in Czechowski et al. (2014) in which it is shown that increased complexity of the core microarchitecture in Intel processors has improved both performance and energy efficiency to execute an application. A roofline model for energy consumption is proposed in the study by Choi et al. (2013), which does not focus on making exact predictions but attempts to explain the relationship between time, energy, and power cost of an algorithm.

Workload prediction mechanisms fall into two broad types: history-window and trace-based. The former considers a window of past k granularity-unit values (e.g. phases) and predicts the next value as a function of the preceding k ones.
For k = 1, this predictor is termed the last value. The trace-based predictors record traces of past execution in a data structure to predict the future behavior (Isci and Martonosi 2003). CPU Miser (Ge et al., 2007) assigns an a priori value to the performance loss that will be tolerated by, for example, letting a user provide an input value. CPU Miser divides the execution of an application into intervals of a particular duration (typically 250 ms) and predicts the execution characteristics, such as memory stalls, of the upcoming interval based on similar recent intervals, like the history-window predictor. CPU Miser primarily considers memory accesses, even though it may use the I/O and idle times (provided by the /proc/stat file in Linux), to choose a suitable frequency for a given time slice. The CPU Miser strategy attempts to determine the precise number of stalls in an out-of-order execution engine, which is quite involved as compared to other strategies, such as those estimating the change in the application performance based on micro-operations retired. Although CPU Miser has been shown to attain significant energy gains (Ge et al., 2007), it does not consider contention among the cores while modeling the DRAM access delay. Therefore, it either overestimates or underestimates the number of stalls in an application, which leads to an inaccurate estimation of the frequency. CPU Miser also does not consider the instantaneous power consumption of the unit under test (UUT) when choosing a suitable frequency.
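The history-window and last-value predictors described above can be sketched in a few lines. This is a minimal illustration, not the predictor of any particular strategy: the class name is hypothetical, and the choice of the arithmetic mean as the combining function is an assumption, since the text only says that the next value is "a function of the preceding k ones".

```python
from collections import deque
from statistics import mean


class HistoryWindowPredictor:
    """Predict the next value from the last k observed values."""

    def __init__(self, k, combine=mean):
        if k < 1:
            raise ValueError("window size k must be at least 1")
        # deque(maxlen=k) silently discards values older than the window.
        self.window = deque(maxlen=k)
        self.combine = combine  # assumed combining function (mean by default)

    def observe(self, value):
        """Record the value measured for the phase or interval just finished."""
        self.window.append(value)

    def predict(self):
        """Return the prediction for the upcoming phase or interval."""
        if not self.window:
            raise ValueError("no history available yet")
        return self.combine(self.window)


# With k = 1 the predictor degenerates to the last-value predictor.
last_value = HistoryWindowPredictor(k=1)
```

A strategy operating on time slices would call `observe` at the end of each slice and `predict` before choosing the frequency for the next one.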
Adagio ( Rountree et al., 2009) , apart from using performance counters, performs critical path analysis to determine which tasks may be slowed down to minimize the performance loss in the parallel execution. This analysis appears beneficial when applications have computation or communication imbalances among participating processes, which is typically not the case for a highly efficient parallel application and was not observed in the NAS benchmarks, for example. The workload prediction mechanism used in Adagio is similar to the last-value predictor but with feedback added, such that the future behavior of a communication call is predicted based on its last invocation, and the resulting error is used as feedback for future predictions. In addition to communications, Adagio monitors the computational profile of an application to determine suitable opportunities to apply DVFS. It does not provide the user with an option to set a desired performance loss constraint.
A strategy by Lim et al. (2006) proposes a runtime system that dynamically applies DVFS during communication phases in Message Passing Interface (MPI) applications. It dynamically determines repetitive communication phases and selects a suitable CPU frequency to minimize energy consumption. This strategy does not apply DVFS in computation portions of an application and, therefore, may not save a significant amount of energy for an application that has a relatively low communication activity while being MI, that is, featuring many memory accesses during which the CPU is lightly loaded or idle. Its workload prediction mechanism is the last value, which is applied in a manner similar to its application in Adagio barring the feedback.
Earlier work by the authors (Sundriyal et al., 2014) proposed a novel DVFS application strategy, termed PhIT (for "DVFS application in Phase and Interphase with Throttling"). PhIT maximizes energy savings by selecting appropriate values for DVFS and throttling based on the predicted communication phases considering both the CPU offload (provided by the network protocol, such as Infiniband) during communication and the architectural stalls during computation. PhIT relies on an a priori set (e.g. defined by the user) of performance loss constraints and chooses the frequency level for the next phase based on the current phase (as in the last-value predictor type).
In their previous work (Sundriyal and Sosonkina 2013) , the authors identify potential pitfalls of the performance loss-based approaches and propose a strategy that depends on the instantaneous power consumption of the computing platform. The workload prediction is still of the last-value type. The strategy, termed here InstPA, is augmented with regression analysis to estimate the instructions retired at the available frequencies and, eventually, to choose a suitable frequency to minimize the energy consumption.
In a recent work by the authors (Sundriyal and Sosonkina, 2016) , a runtime system termed ProcMem is proposed which addresses both processor and memory frequency scaling based on a detailed performance model that minimizes the energy consumption of an application. The runtime system operates on time slices and uses a history-window approach to predict the future behavior of the application. Table 1 summarizes the discussed energy-saving strategies (shown in the column labeled Name) according to the seven broad features, denoted as P-L based, P-C aware, Profiling, W-P type, Topology, DVFS granularity, and DRAM, respectively. In the column labeled W-P type, HW stands for "history window", LV for "last value", and LVF for "last value with feedback".
Test bed for DVFS granularity
Three different test beds are used in this work, each of them having a particular DVFS granularity:
1. Bolt comprises 18 Infiniband quad data rate (QDR)-connected compute nodes, each with 32 GB of main memory and an Intel Xeon CPU E5-1650 "Sandy Bridge" six-core processor (single socket). The Intel Xeon CPU E5-1650 provides 15 P-states ranging from 1.2 to 3.2 GHz.
2. Dynamo comprises 35 Infiniband dual data rate (DDR)-connected compute nodes, each of which has 16 GB of main memory and two Intel Xeon E5450 quad-core processors arranged as two sockets with four P-states ranging from 2 GHz to 3 GHz in steps of 0.33 GHz and eight levels of throttling from T0 to T7.
3. Marquez comprises a single node with two Intel Xeon CPU E5-2630 v3 "Haswell-EP" eight-core processors (two sockets) with 32 GB of main memory. The Intel Xeon E5-2630 v3 processor provides 13 P-states ranging from 1.2 to 2.4 GHz.
For measuring the node power and energy consumption, a Wattsup 5 power meter is used with a sampling rate of 1 Hz.
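Since the meter reports power samples at a fixed rate, energy has to be derived by integrating those samples over time. The sketch below shows one simple way to do this, assuming each sample represents the average power over its sampling interval (a rectangle rule). The function names are hypothetical and the sample values in the usage note are made up for illustration.

```python
def energy_joules(power_watts, interval_s=1.0):
    """Integrate fixed-rate power samples into energy (rectangle rule).

    power_watts: iterable of power readings in watts
    interval_s:  time between readings in seconds (1.0 for a 1 Hz meter)
    """
    return sum(power_watts) * interval_s


def average_power_watts(power_watts):
    """Mean power over the sampled run."""
    samples = list(power_watts)
    if not samples:
        raise ValueError("no samples")
    return sum(samples) / len(samples)
```

For example, three 1 Hz readings of 100 W, 110 W, and 90 W correspond to 300 J under this rule. Short power spikes between samples are invisible to such a meter, which is why a cross-check against a finer-grained energy source is useful.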
Wattsup accuracy
To verify the accuracy of the Wattsup meter measurements, given the meter's low resolution, five NAS benchmarks (Bailey et al., 1991) were picked, and their energy consumption obtained from Wattsup was compared with that obtained from the running average power limit (RAPL) energy metering. Specifically, the energy consumption for both the sockets running an application on Marquez was obtained using RAPL at a resolution of 200 ms. Then, the total energy consumption $E_a$ for the application was calculated as
where $E^m_i$, $E^p_i$, and $E^s_i$ are the RAPL measurements at a time slice $i$ ($i = 1, \ldots, N$) specifying the energy consumptions, respectively, for the energy status meter, for only CPU and RAM when idling, and for the rest of the system when idling. Table 2 depicts the percentage of difference in the energy consumption as obtained from equation (1) with respect to the Wattsup measurements. It can be observed that the average difference across all the five benchmarks is 0.24%, which is negligible if system variations are considered. Therefore, the low resolution of the Wattsup meter does not hamper the accuracy of the measurements.

Bolt: Figure 2 shows the change in execution time and power consumption of the EP benchmark executed on a single core when its frequency is scaled down from 3.2 GHz to 1.2 GHz while the rest of the cores operate at 3.2 GHz. It can be observed from Figure 2 that the execution time of the EP benchmark increases linearly with the frequency decrease, while the node power consumption decreases by 15%. Hence, when only a single task is active, per core frequency scaling is supported.
Next, all of the cores are kept active, with each executing the EP benchmark, while the frequency of each core is reduced one by one with a certain interval between reductions. Figure 3 shows the change in node power consumption when the frequency of each core, starting from core 0, is switched from the highest (3.2 GHz) to the lowest (1.2 GHz) every 5 s. The power consumption stays nearly the same until 25 s, demonstrating that the effective frequency stays at 3.2 GHz even though the frequencies of cores 0-4 have been switched to 1.2 GHz. Once the frequency of core 5 is switched to the lowest, the power consumption dips by 36%, and only at this point does the effective frequency become 1.2 GHz. Thus, it can be concluded that the granularity of DVFS on Bolt is "per socket."

Dynamo: The DVFS granularity of the Dynamo CPU was determined in Sundriyal et al. (2012) as operable only in domains comprising pairs of cores that share an L2 cache. Denote such a DVFS domain as "twin-core."
Marquez: The Intel Xeon CPU E5-2630 v3 ("Haswell-EP") processor on Marquez has multiple fully integrated voltage regulators providing an individual voltage for each core which results in per core P-states (PCPS) (Hackenberg et al., 2015) . This enables independent frequency scaling "per core" rather than "per-socket" or "twin-core" as seen in Bolt and Dynamo, respectively.
The same experimental steps are followed as for Figure 3, and Figure 4 shows the change in node power consumption when the frequency of each core, starting from core 0, is switched from the highest (2.4 GHz) to the lowest (1.2 GHz) every 5 s. It can be observed that the node power consumption decreases after each 5 s interval, giving a staircase-like power curve that confirms the presence of PCPS on Marquez.
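The step-down experiment above lends itself to a simple automated check: record the node power after each core is switched to the lowest frequency and see at which steps the power drops. The sketch below is one possible way to do this and is not the authors' tooling. The 2% drop threshold, the function names, and the twin-core pattern (a drop after every second core) are assumptions; the article reports the per-socket and per-core patterns only.

```python
def drop_steps(power, min_rel_drop=0.02):
    """Return the step indices k at which power fell noticeably.

    power[0] is the reading with all cores at the highest frequency;
    power[k] is the reading after core k-1 was switched to the lowest.
    """
    steps = []
    for k in range(1, len(power)):
        if (power[k - 1] - power[k]) / power[k - 1] >= min_rel_drop:
            steps.append(k)
    return steps


def classify_granularity(power, min_rel_drop=0.02):
    """Classify the DVFS granularity from a step-down power trace."""
    n = len(power) - 1  # number of cores that were stepped down
    steps = drop_steps(power, min_rel_drop)
    if steps == [n]:
        return "per socket"   # single dip once the last core is switched
    if steps == list(range(1, n + 1)):
        return "per core"     # staircase: a drop after every core
    if n % 2 == 0 and steps == list(range(2, n + 1, 2)):
        return "twin-core"    # assumed pattern: a drop after every pair
    return "unknown"
```

The threshold should be chosen above the measurement noise of the meter; with a 1 Hz meter, averaging several readings per step would make the classification more robust.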
Effective frequency: The effective frequency $f_{\mathrm{eff}}$ can be defined as the frequency experienced by a multicore node, when each core $i$ ($i = 0, \ldots, n - 1$) is in a certain P-state $f_i$.
It can be noted here that changing the frequency by modifying the IA32_PERF_CTL MSR is only a request to the hardware to switch to that particular frequency; whether the request is granted depends on various factors. On a platform having $n$ cores, where $n$ is even, and supporting a specific level of the DVFS application, the effective frequency $f_{\mathrm{eff}}$ may be expressed in one of the following ways:
1. For the socket level (as in Bolt)
2. For the twin-core level (as in Dynamo)
3. For the PCPS level (as in Marquez)
$f_{\mathrm{eff}} = f_i$ (4)
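The three granularity levels can be illustrated with a small sketch that maps the per-core requested frequencies to the frequencies the cores would actually run at. It assumes that a DVFS domain runs at the highest frequency requested by any core in it, which is the behavior the article describes for the socket level; applying the same rule to twin-core domains, and pairing cores as (0, 1), (2, 3), and so on, are assumptions made here for illustration. Function names are hypothetical.

```python
def f_eff_socket(freqs):
    """Socket level: every core runs at the highest requested frequency."""
    return [max(freqs)] * len(freqs)


def f_eff_twin(freqs):
    """Twin-core level: each pair (2k, 2k+1) runs at the pair's maximum.

    The pairing of adjacent core indices is an assumption of this sketch.
    """
    if len(freqs) % 2 != 0:
        raise ValueError("twin-core domains need an even number of cores")
    result = []
    for k in range(0, len(freqs), 2):
        pair_max = max(freqs[k], freqs[k + 1])
        result += [pair_max, pair_max]
    return result


def f_eff_pcps(freqs):
    """PCPS level: each core runs at its own requested frequency."""
    return list(freqs)
```

For instance, with requests of 2400, 2400, 2400, and 2000 MHz, the socket-level and twin-core rules both leave the last core at 2400 MHz, whereas PCPS lets it run at 2000 MHz.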
Experimental evaluation
In this section, the topology and DVFS capabilities of the Bolt and Dynamo platforms are emulated on Marquez to determine the loss in energy savings due to the absence of PCPS. For the experimental evaluation, two kinds of scenarios are considered. In the first scenario, the threads executing on a platform originate from different applications and can have variable workload behavior, that is, a socket has a mix of processor-intensive (PI) and MI threads.
Since the latency bound computations are considered memory/input-output (IO)/stalls bound, the energy-saving strategy and the consideration for DVFS granularity within a processor socket will not be affected. In the other scenario, the mix of PI and MI threads originates from the same application. These two scenarios are relevant to the workloads used in data centers and high-performance computing, respectively.
The performance loss for an application was set to 10%, so that an appropriate processor frequency would keep the performance degradation below 10% and maximize the energy savings under that envelope. The appropriate frequency for an application executing on each core was determined statically based on an off-line analysis. The effective frequency scaling determination mechanism for the socket (Bolt) and the twin cores (Dynamo) level was determined in three ways. In the emulated case, each core on Marquez determines an appropriate frequency as per the off-line analysis, but the effective frequency of the core(s) is determined by the DVFS granularity of the platform under test, that is, Bolt (emu-socket) and Dynamo (emu-twin). Next, the allmean case takes the average of all the individual core frequencies in the same time slice and executes the whole socket at that value. Finally, the native case executes the application natively based on the PCPS capability of the Marquez platform.
Case of threads from multiple applications
The workload for this scenario is constructed using the SPEC CPU 2006 benchmarks. More specifically, two single-threaded benchmarks, namely hmmer (Eddy, 1995) and mcf (Lobel, 2000), are used as they are PI and MI (Sundriyal and Sosonkina, 2016), respectively.
Bolt: Figure 5 depicts the difference in energy consumption for the two effective frequency scaling determination mechanisms when the socket-level DVFS characteristic of Bolt is emulated. A negative value points to an overall increase in energy consumption, compared to the scenario where all the cores are executing at the maximum frequency of the processor (denoted as the maxfreq mechanism). Using the two benchmarks (hmmer and mcf), different combinations of workloads are created in which a configuration is denoted as W(x, y), where x and y denote the number of PI and MI threads, respectively. Therefore, in the case of emulating Bolt, $x = 6 - y$, and for Dynamo, $x = 8 - y$, which follows from the number of cores in a socket on each of these platforms.
For all of the workloads shown in Figure 5, the emu-socket mechanism results in nil energy savings. This is because Bolt supports only socket-level DVFS, so the whole socket executes at the maximum of the frequencies requested by its cores. Therefore, even though the strategy chooses 2.4 and 2.2 GHz, respectively, for the PI and MI workloads throughout the execution, the socket always executes at 2.4 GHz, and no energy savings are obtained. As long as there is one PI thread in the workload, no energy savings are obtained.
For the native mechanism, the highest energy savings are obtained for all the workloads. More specifically, the maximum and minimum energy savings of 2.3% and 0.5%, respectively, are achieved using the native mechanism. Since each core is able to scale its frequency and voltage independently of the other cores, the PI threads execute at the maximum platform frequency (2.4 GHz), whereas the MI threads execute at frequencies ranging from 2.2 GHz to 1.4 GHz on workloads W(5,1) and W(1,5), respectively. The frequency chosen for the MI threads decreases as their number in the workload increases, because contention slows their memory accesses.
Lastly, the allmean mechanism performs the worst, and not only negates energy savings but also produces energy losses for all the workloads by an average of 5.3%. This is primarily due to the linear increase in execution time of the PI thread which is not compensated by the corresponding decrease in power consumption.
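The effect of the allmean mechanism on a PI thread can be made concrete with a short sketch. It computes the socket frequency as the mean of the per-core choices and estimates the slowdown of a purely frequency-bound thread as the ratio of its preferred frequency to the frequency it is forced to run at. The inverse-proportional slowdown model, the optional snapping to an available P-state, and the function names are assumptions of this sketch; the frequencies in the usage note are illustrative.

```python
def allmean(freqs_mhz):
    """Mean of the per-core frequency choices, applied to the whole socket."""
    if not freqs_mhz:
        raise ValueError("no frequencies given")
    return sum(freqs_mhz) / len(freqs_mhz)


def nearest_pstate(freq_mhz, pstates_mhz):
    """Snap a frequency to the closest available P-state (assumed step)."""
    return min(pstates_mhz, key=lambda p: abs(p - freq_mhz))


def pi_slowdown(preferred_mhz, applied_mhz):
    """Execution-time ratio of a frequency-bound thread (assumed model)."""
    return preferred_mhz / applied_mhz
```

With three cores choosing 2400 MHz and three choosing 1200 MHz, `allmean` yields 1800 MHz, and a frequency-bound thread that wanted 2400 MHz would take about 1.33 times as long. A power reduction smaller than that factor leaves the total energy higher than at the maximum frequency.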
Figure 5. Energy difference for the two effective frequency scaling determination mechanisms, namely native and allmean, when socket-level DVFS is emulated (Bolt) and threads from different applications execute on a node, compared to the maxfreq mechanism. The emu-socket mechanism is not shown here since it did not change the energy consumption.

Dynamo: process mapping: Unlike Bolt, the different mappings of the PI and MI processes matter in the case of the emu-twin mechanism for Dynamo, because it supports twin-core DVFS granularity. Since a node of Dynamo has four sets of twin cores, the particular mapping of a given number of PI and MI threads decides the energy-saving potential of a strategy. Therefore, the emu-twin mechanism, in the case of Dynamo only, is further divided into two cases: (1) the PI and MI threads are mapped onto the processor cores in a continuous (linear) manner so that they stick together (emu-twin-c) and (2) the PI and MI threads are scattered across twin cores (emu-twin-s-1 and emu-twin-s-2). Figures 6(a) and 7(a) depict the emu-twin-c mapping for the W(6,2) and W(4,4) workloads, respectively. Similarly, emu-twin-s-1 is shown in Figures 6(b) and 7(b) and emu-twin-s-2 in Figure 7(c).

Figure 8 shows the energy difference for the three effective frequency scaling determination mechanisms when the twin-core DVFS characteristic of Dynamo is emulated, compared to the maxfreq mechanism. For the W(7,1) workload mix, only one mapping (emu-twin-c) is possible: in any permutation, three twin cores host six PI threads and the remaining twin core hosts one PI and one MI thread. The chosen frequency for the single MI thread is 2.0 GHz, but because of the twin-core DVFS granularity, the whole socket executes at 2.4 GHz for the continuous case and no energy savings are obtained, similar to allmean. Only the native mechanism saves 1.4% energy for the W(7,1) configuration.
In the case of the W(6,2) configuration, two different mappings of the processes are possible: the emu-twin-c mapping places the MI processes on the same twin core, which enables their frequency scaling (Figure 6(a)), whereas emu-twin-s-1 spreads them out (Figure 6(b)). Therefore, emu-twin-c and native end up saving 1.43% energy, whereas allmean and emu-twin-s-1 execute the whole socket at 2.4 GHz and do not save any energy.
Energy savings of 1.52% are obtained for the W(5,3) configuration operated under the emu-twin-c mechanism, whereas no energy savings are obtained for the emu-twin-s-1 case owing to the placement of MI processes on the twin cores. The allmean mechanism chooses 2.0 GHz for the whole socket resulting in significant performance degradation for the PI thread and resulting in an energy loss of 6%.
The W(4,4) configuration has three different mappings as shown in Figure 7, where the emu-twin-c, emu-twin-s-1, and emu-twin-s-2 mappings place 4, 2, and 0 MI processes on the twin cores, respectively. Accordingly, the mechanisms emu-twin-c (Figure 7(a)), emu-twin-s-1 (Figure 7(b)), and emu-twin-s-2 (Figure 7(c)) end up saving 1.61%, 1.18%, and 0% energy, respectively. The allmean mechanism again results in an overall energy loss of 3.6%.
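The dependence of the emu-twin savings on the mapping can be reproduced by counting how many MI threads sit on a twin core shared only with another MI thread, since only those can be scaled down. The sketch below does this for mappings written as strings of "P" (PI thread) and "M" (MI thread), one character per core. The adjacent-index pairing of cores and the concrete mapping strings in the usage note are assumptions, since the actual layouts are given only in the figures.

```python
def scalable_mi_threads(mapping):
    """Count MI threads located on twin cores that host MI threads only.

    mapping: string such as "PPPPMMMM", one character per core, where
             cores (2k, 2k+1) are assumed to form one twin-core domain.
    """
    if len(mapping) % 2 != 0:
        raise ValueError("twin-core domains need an even number of cores")
    if set(mapping) - {"P", "M"}:
        raise ValueError("mapping may contain only 'P' and 'M'")
    count = 0
    for k in range(0, len(mapping), 2):
        if mapping[k] == "M" and mapping[k + 1] == "M":
            count += 2  # both cores of this domain can be slowed down
    return count
```

For W(4,4), a continuous mapping such as "PPPPMMMM" gives 4 scalable MI threads, a partly scattered one such as "PPPMPMMM" gives 2, and a fully interleaved one such as "PMPMPMPM" gives 0, which matches the 4, 2, and 0 MI processes reported for emu-twin-c, emu-twin-s-1, and emu-twin-s-2.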
The next three configurations, namely W(3,5), W(2,6), and W(1,7), are symmetrical to W(5,3), W(6,2), and W(7,1), respectively, with the difference being that the numbers of PI and MI threads are interchanged while the number of possible mappings remains the same. As the number of MI threads keeps increasing in the workload configuration, the energy savings obtained by the native mechanism increase as well. The allmean mechanism ends up saving 2% energy for the W(2,6) and W(1,7) configurations since the proportion of PI threads is very low and the execution time is dominated by MI threads.
Case of threads from single application
For obtaining a workload with variable processor and MI behavior, the widely used quantum chemistry package GAMESS (Gordon and Schmidt, 2005; Schmidt et al., 1993) was used, which is capable of performing molecular structure and property calculations by a rich variety of ab initio methods. A wide range of quantum chemistry computations may be accomplished using GAMESS, ranging from basic Hartree-Fock (HF) and density functional theory to high-accuracy multi-reference and coupled-cluster methods.
The initial parallel model in GAMESS was based on replicated-data message passing and later moved to MPI-1. Fletcher et al. (2000) developed the Distributed Data Interface (DDI) in 1999, which has been the parallel communication interface for GAMESS ever since. Later, DDI was adapted to symmetric multiprocessor environments featuring shared memory communications within a node (Olson et al., 2003) and was generalized as GDDI in Fedorov et al. (2004) to form groups out of the available nodes and schedule tasks to these groups. GDDI ensures that all the processes independently access and modify any element in a logically global but possibly physically distributed data array. GDDI implements the Partitioned Global Address Spaces (PGAS) programming model in which an additional process, called data server, is created for each compute process of GAMESS. While the compute process performs electronic structure calculations, the data server services requests for the data associated with the distributed arrays.
Depending on the installation options, communications between the compute and data server processes occur either via Transmission Control Protocol (TCP)/IP or MPI. A data server responds to data requests initiated by the corresponding compute process, for which it constantly waits. If this waiting is implemented with MPI, then the CPU is polled continuously for the incoming message, thereby being always busy. Therefore, in multicore platforms, it is preferred that a compute process and data server do not share a core to avoid significant performance degradation. Now, assume a multicore platform has $n$ cores ($n$ is even). Then, each core $c_i$ ($i = 0, \ldots, \frac{n}{2} - 1$) with a compute process has a corresponding core $d_j$ ($j = \frac{n}{2}, \ldots, n - 1$) with a data server, such that $j = i + \frac{n}{2}$.

The central task of quantum chemistry is to find an (approximate) solution of the Schrödinger equation for a given molecular system. An approximate (uncorrelated) solution is initially found using the HF method via an iterative self-consistent field (SCF) approach and then is improved using various electron-correlated methods, such as second-order Møller-Plesset perturbation theory (MP2). The SCF-HF and MP2 methods are implemented in two forms, namely direct and conventional, which differ in the handling of (typically, a very large number of) electron repulsion integrals (denoted as ERIs, also known as two-electron integrals). Specifically, in the conventional mode, all ERIs are calculated once at the beginning of the iterations and stored on the disk for reuse, whereas, in the direct mode, ERIs are recalculated for each iteration as necessary. The SCF-HF iterations and the subsequent MP2 correction find the energy of the molecular system, which is followed by the evaluation of energy gradients.
To reduce the computational complexity of quantum chemical calculations, a fragmentation approach is used, in which a large molecular system is divided into fragments and a quantum chemical method is applied to each of the fragments, after which the fragment interactions are taken into account. In the fragment molecular orbital (FMO) method (Gordon et al., 2012), the long-range (electrostatic) interactions between fragments are accounted for by the iterative calculations of each fragment in the electrostatic field of all the other fragments (the FMO1 approximation). The short-range interactions are accounted for by explicit computations of fragment dimers (i.e. pairs of fragments considered as a single fragment) and optionally trimers (i.e. triples of fragments considered as a single fragment). The calculation of short-range interactions via dimer and trimer computations is referred to as the FMO2 and FMO3 approximations, respectively.
As a test case, a cluster of 20 water molecules ( Figure 9 ) has been considered. FMO3 approximations were used with MP2 calculations performed for each trimer and the conventional mode SCF is used.
Figure 8. Energy difference for the three effective frequency scaling determination mechanisms, namely native, allmean, and emulated (further broken down into emu-twin-c, emu-twin-s-1, and emu-twin-s-2), when twin-core DVFS granularity is emulated (Dynamo) and threads from different applications execute on a node, compared to the maxfreq mechanism.

Bolt: Of the six processes executing on Bolt, the first three ranks are compute processes and the next three are data servers. Since data servers do not actually take part in any computation, their frequency can be simply set to the minimum value (1.2 GHz) to achieve energy savings without any performance degradation. Furthermore, when the native mechanism is used, the compute processes and data servers execute at 2.4 GHz and 1.2 GHz, respectively, for nearly the whole execution time. Therefore, no performance degradation is observed, and 9% energy savings are achieved. The allmean mechanism executes all the processes at a low frequency of 1.8 GHz, resulting in a performance loss of 31% and, hence, an energy loss of 15%. As for the emulated mode, only one (emu-socket-c) mapping of processes is possible since GAMESS compute processes and data servers are always ordered consecutively, such that the first half of the allocated cores are occupied by compute processes and the second half by data servers. Therefore, the emu-socket mechanism does not result in any energy savings due to the socket-level DVFS feature of Bolt.
Dynamo: In the case of Dynamo, the data servers are mapped to the twin cores such that four of them reside on two twin cores. Therefore, the native and emulated (emu-twin-c) mechanisms both save nearly the same amount of energy (9%) since, in each case, the data servers execute at 1.2 GHz, whereas the compute processes remain at 2.4 GHz. In the case of the allmean mechanism, all the compute processes and data servers execute at the same 1.8 GHz, resulting in a performance loss of 33% and, hence, an energy loss of 19.2%.
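The allmean figures reported for Bolt and Dynamo can be related through the identity that energy equals average power multiplied by execution time. The sketch below is an illustration rather than part of the measurement procedure. It assumes that "performance loss" denotes the relative increase in execution time with respect to the maxfreq run, and the helper name `implied_power_ratio` is hypothetical.

```python
def implied_power_ratio(time_increase_pct, energy_increase_pct):
    """Average-power ratio (scaled run / maxfreq run) implied by E = P_avg * T.

    Both arguments are percentage increases relative to the maxfreq run.
    Assumption: 'performance loss' means an increase in execution time.
    """
    time_ratio = 1.0 + time_increase_pct / 100.0
    energy_ratio = 1.0 + energy_increase_pct / 100.0
    if time_ratio <= 0.0:
        raise ValueError("execution time ratio must be positive")
    return energy_ratio / time_ratio


if __name__ == "__main__":
    # allmean values quoted in the text for the two platforms.
    for platform, dt, de in [("Bolt", 31.0, 15.0), ("Dynamo", 33.0, 19.2)]:
        ratio = implied_power_ratio(dt, de)
        print(f"{platform}: implied average power ratio = {ratio:.3f}")
```

Under this reading, running every process at 1.8 GHz lowers average power by only about 10 to 12%, which is too little to compensate for an execution time that grows by roughly a third, so the total energy increases.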
Requested versus real frequency
A change in processor frequency is requested by writing a specific value to the IA32_PERF_CTL MSR; whether the processor cores actually execute at the requested frequency depends on various factors, such as the number of active cores and power limits. Intel processors expose hardware coordination of the P-states of different cores through the IA32_MPERF and IA32_APERF MSRs (Intel). The ratio of the increments of these two counters over an interval, scaled by the nominal frequency, gives the actual frequency at which the processor executed. Hence, it is worth verifying whether the hardware did execute at the requested frequency. To obtain the actual frequency during application execution, the turbostat command (footnote 7) was used, which reports the real frequency of the processor cores in the Avg_MHz column, defined as the ratio of the total number of cycles executed to the time elapsed. Figure 10 depicts the change in the real frequency (Avg_MHz reported by turbostat) with respect to the requested frequency (set through IA32_PERF_CTL) for a 25-s trace of the execution of a W(4,4) workload on core 0. It can be observed from Figure 10 that there is virtually no difference between the requested and actual frequency. Similar results were obtained for the other cores and for the other workload configurations tested. This agreement is expected, since all the cores remain active during the experiments conducted in this work and no power limits were enforced.
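As an illustration of how an actual frequency can be derived from the IA32_APERF and IA32_MPERF counters, the following minimal sketch samples both MSRs twice and computes two estimates. It is not the procedure used in this work, which relied on turbostat. It assumes a Linux system with the `msr` kernel module loaded and root privileges, and the function names and the `base_mhz` parameter (the nominal frequency at which IA32_MPERF counts) are hypothetical. The MSR addresses 0xE7 (IA32_MPERF) and 0xE8 (IA32_APERF) are those listed in the Intel manuals.

```python
import os
import struct
import time

IA32_MPERF = 0xE7  # counts at a fixed reference rate while the core is active
IA32_APERF = 0xE8  # counts at the actual core clock while the core is active
MASK64 = (1 << 64) - 1


def read_msr(cpu, reg):
    """Read a 64-bit MSR through /dev/cpu/<cpu>/msr (needs root, msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def busy_mhz(aperf0, mperf0, aperf1, mperf1, base_mhz):
    """Average frequency while the core was active: base * dAPERF / dMPERF."""
    d_aperf = (aperf1 - aperf0) & MASK64  # tolerate counter wraparound
    d_mperf = (mperf1 - mperf0) & MASK64
    if d_mperf == 0:
        raise ValueError("IA32_MPERF did not advance; core was idle")
    return base_mhz * d_aperf / d_mperf


def avg_mhz(aperf0, aperf1, elapsed_s):
    """Cycles executed divided by elapsed time, in the spirit of Avg_MHz."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return ((aperf1 - aperf0) & MASK64) / (elapsed_s * 1e6)


def sample(cpu, base_mhz, interval_s=1.0):
    """Sample one core over interval_s seconds; returns (busy MHz, avg MHz)."""
    t0 = time.monotonic()
    a0, m0 = read_msr(cpu, IA32_APERF), read_msr(cpu, IA32_MPERF)
    time.sleep(interval_s)
    a1, m1 = read_msr(cpu, IA32_APERF), read_msr(cpu, IA32_MPERF)
    elapsed = time.monotonic() - t0
    return busy_mhz(a0, m0, a1, m1, base_mhz), avg_mhz(a0, a1, elapsed)
```

When a core is fully busy for the whole interval, as in the experiments described here, the two estimates coincide; they diverge only when the core spends part of the interval idle.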
Table 3. Energy difference, as a percentage of the original execution, for the three effective frequency scaling determination mechanisms, namely native, allmean, and emulated (emu-socket-c and emu-twin-c), when socket-level (Bolt) and twin-core level (Dynamo) DVFS are emulated and threads from the same GAMESS application execute on a node, compared to the maxfreq mechanism.
Conclusions
In this article, the effects of DVFS application granularity on the efficacy of an energy-saving strategy are studied on a given set of modern Intel processors. Specifically, first, the DVFS granularity is determined for three Intel processor types. The chosen processors depict the generational change in Intel processors and are somewhat representative of the variable DVFS granularities that might be present on the Advanced Micro Devices (AMD) side, to which the proposed analysis can be easily extended. Then, using a variety of single- and multi-threaded workloads, the effects on energy savings are recorded by emulating the socket and twin-core level DVFS granularity on a platform with PCPS. It has been observed that the absence of PCPS, combined with an energy-saving strategy that disregards this absence, may hamper energy savings. For example, in the worst-case scenario considered here, an energy-saving strategy increased the energy consumption by 19%.
Future work will consider implementing novel energy-saving strategies that take into account the DVFS granularity of the different processor types while choosing a ceiling frequency. DRAM frequency scaling also needs to be analyzed accordingly since all the memory modules execute at the same frequency.
Additionally, the performance benefits of finer DVFS granularity are to be analyzed for user-accessible DVFS frequencies, turbo frequencies, and the mixed-frequency case that occurs under RAPL power limits.
