Energy consumption is an important concern in modern multicore processors. The energy consumed during the execution of an application can be minimized by tuning hardware knobs such as frequency and voltage. The existing theoretical work on energy minimization using global DVFS (Dynamic Voltage and Frequency Scaling), despite being thorough, ignores the energy consumed by the CPU on memory accesses and the dynamic energy consumed by idle cores. This article presents an analytical energy-performance model for parallel workloads that accounts for the energy consumed by the CPU chip on memory accesses in addition to the energy consumed on CPU instructions. The model also accounts for the dynamic energy consumed by idle cores. We build an analytical framework around this model to predict the operating frequencies for global DVFS that minimize the overall CPU energy consumption. We show how the optimal frequencies in our model differ from those in a model that does not account for memory accesses, and how the memory intensity of an application affects the optimal frequencies.
INTRODUCTION
Energy consumption and performance are two of the most important, and conflicting, design criteria for modern multicore processors. The power consumption of a CMP (chip multiprocessor) is an increasing function of the operating frequency of the chip and can thus be reduced by lowering the frequency. Dynamic Voltage and Frequency Scaling (DVFS) is a popular energy minimization technique. Most of the theoretical work on the energy-delay trade-off deals with local DVFS [3], where the frequency can be set separately for each core. We study the problem of energy minimization under a performance constraint using global DVFS, where all the cores on the chip run at the same frequency. While local DVFS has more freedom in choosing clock frequencies and can therefore save more energy, it is not easy to implement.

SPAA '16, July 11-13, 2016, Pacific Grove, CA, USA

When we think of parallelism, we often think of performance gains and tend to ignore its ramifications for energy consumption. Because increasing parallelism improves performance, it also allows us to save energy by reducing the frequency without violating a performance constraint. The relationship of parallelism with energy and performance was first studied by Cho and Melhem [2]. Gerards et al. further formalized the problem for task graphs [3]. They showed that using a single clock frequency during the execution of an application does not lead to optimal energy consumption, and presented an approach that varies the frequency with the amount of parallelism, assigning a separate frequency to each number of active cores.

The analytical model of [3], however, completely ignores the energy consumed by the CPU while it waits for data accesses to main memory. Since it does not account for the time overhead of memory access latency, it can lead to an imprecise estimate of the slack between the time to completion and the given performance budget. CPU energy optimization techniques that save energy by lowering the operating frequencies of the cores, at the cost of an increased delay, need to be tuned to account for memory access latencies. Another assumption in [3] is that the frequency of idle cores can be brought down to zero. This is not always possible in practice: idle cores cannot be shut down completely and do consume some dynamic energy.
In this article, we present a new model for the energy and performance of multicore systems. It accounts for the energy consumed by the CMP while waiting for memory accesses, in addition to the energy consumed on CPU instructions, and for the dynamic energy consumed by idle cores.
Application Model: We consider a parallel application consisting of a set of $N$ tasks. The application can be depicted as a DAG (directed acyclic graph) in which the nodes represent the tasks and the directed edges represent the precedence constraints. A task is characterized by its compute workload and its data workload. The compute workload of a task is the number of clock cycles required to perform its computations, and its data workload is the number of memory accesses it has to make. We assume an application-wide parameter $d$, called the data-to-CPU quotient, which is the ratio of the data workload to the compute workload of the application.
Power Model: We consider two components of power, dynamic power and static power. Let $f$ be the frequency of the CMP at some time $t$. The dynamic power of an active core at time $t$ can be expressed as $p_{\text{DynamicActive}}(f) = c_1 f^{\alpha}$, where the constant $c_1 > 0$ is a characteristic of the computing platform and $2 \le \alpha \le 3$. At any given point in time, an inactive core consumes less dynamic power owing to its reduced activity factor. We model this difference by assuming that the constant for inactive cores, $c_1'$, is smaller than $c_1$, so that the ratio $K = c_1'/c_1 < 1$. The dynamic power of an inactive core can then be expressed as $p_{\text{DynamicInactive}}(f) = c_1' f^{\alpha} = K c_1 f^{\alpha}$. The static power can be expressed as $p_{\text{Static}}(f) = c_2 f + c_3$. The total power at a given point in time, with $m$ of the $M$ cores active at frequency $f$, is the sum of these components:

$$p_m(f) = c_1 \left[ m + K(M - m) \right] f^{\alpha} + c_2 f + c_3. \qquad (1)$$
From this point on, we denote $m + K(M - m)$ by $m'$. Dividing both sides of equation 1 by $f$ gives the energy per CPU cycle, which we denote by $\bar{p}_m$, where

$$\bar{p}_m(f) = c_1 m' f^{\alpha - 1} + c_2 + \frac{c_3}{f}.$$
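As a small numerical illustration of the power model, the following Python sketch evaluates the chip power and the energy per cycle. It assumes that the total power is the sum of the dynamic power of the $m$ active cores, the dynamic power of the $M - m$ idle cores, and a chip-wide static term $c_2 f + c_3$. The function names and the constants in the example are hypothetical and are not taken from any measured platform.

```python
def effective_cores(m, M, K):
    """Effective core count m' = m + K * (M - m).

    m active cores count fully; the M - m idle cores count with weight K < 1.
    """
    return m + K * (M - m)


def total_power(f, m, M, c1, c2, c3, K, alpha):
    """Chip power with m of M cores active at frequency f (assumed form)."""
    dynamic = c1 * effective_cores(m, M, K) * f ** alpha
    static = c2 * f + c3
    return dynamic + static


def energy_per_cycle(f, m, M, c1, c2, c3, K, alpha):
    """Energy per CPU cycle: total power divided by the frequency."""
    return total_power(f, m, M, c1, c2, c3, K, alpha) / f


if __name__ == "__main__":
    # Hypothetical constants, chosen only to make the arithmetic easy to follow.
    params = dict(m=2, M=4, c1=1.0, c2=0.5, c3=0.25, K=0.5, alpha=2)
    print(total_power(2.0, **params))       # 13.25
    print(energy_per_cycle(2.0, **params))  # 6.625
```

With $K = 0.5$, two active cores out of four give $m' = 3$, so the two idle cores together cost as much dynamic power as one active core.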
The overall energy consumption of an application can be expressed in terms of its amount of parallelism. We refer the reader to [3] for a full treatment of power modeling in terms of parallelism. For an application with $N$ tasks running on a CMP with $M$ cores, the amount of parallelism for a given schedule is defined formally as a vector $[w_1, w_2, \ldots, w_m, \ldots, w_M]$, where $w_m$ is the total number of CPU cycles for which exactly $m$ cores are active. Using the idea that a constant frequency for a fixed number of active cores leads to optimal energy consumption, the task of energy optimization reduces to finding a vector $\mathbf{f} = [f_1, f_2, \ldots, f_m, \ldots, f_M]$, where $f_m$ is the optimal frequency for $m$ active cores. For a given amount of parallelism $w_m$, the CPU waits for memory accesses for a duration of $w_m d\, t_a$, where $t_a$ is the memory access latency. At frequency $f_m$, the additional cycles expended per core on memory accesses for $w_m$ are thus $w_m d\, t_a f_m$. The total energy consumed by the CPU, in terms of the parallelism and the corresponding frequencies, can be expressed as

$$E_{\text{total}} = \sum_{m=1}^{M} \bar{p}_m(f_m)\, w_m \left( 1 + d\, t_a f_m \right).$$
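The definitions above can be sketched in a few lines of Python. The sketch assumes that each of the $w_m (1 + d\, t_a f_m)$ cycles spent in a region with $m$ active cores costs the energy per cycle $c_1 m' f_m^{\alpha-1} + c_2 + c_3/f_m$, and that the execution time is the sum of the compute time $w_m / f_m$ and the memory wait $w_m d\, t_a$ over all regions. The function names, the per-cycle trace format, and the numbers are hypothetical.

```python
def parallelism_vector(active_per_cycle, M):
    """Build [w_1, ..., w_M] from a trace of active-core counts, one per cycle.

    w[m - 1] is the number of cycles in which exactly m cores are active.
    Cycles with no active core are ignored.
    """
    w = [0] * M
    for m in active_per_cycle:
        if m > 0:
            w[m - 1] += 1
    return w


def execution_time(w, f, d, t_a):
    """Compute time w_m / f_m plus memory wait w_m * d * t_a, summed over regions."""
    return sum(wm / fm + wm * d * t_a for wm, fm in zip(w, f) if wm > 0)


def total_energy(w, f, M, d, t_a, c1, c2, c3, K, alpha):
    """Energy per cycle times (compute + memory-wait) cycles, summed over regions."""
    energy = 0.0
    for m, (wm, fm) in enumerate(zip(w, f), start=1):
        if wm == 0:
            continue  # no region with exactly m active cores
        m_eff = m + K * (M - m)                    # effective core count m'
        p_bar = c1 * m_eff * fm ** (alpha - 1) + c2 + c3 / fm
        cycles = wm * (1 + d * t_a * fm)           # compute + memory-wait cycles
        energy += p_bar * cycles
    return energy


if __name__ == "__main__":
    w = parallelism_vector([1, 1, 2, 2, 1, 1], M=2)    # [4, 2]
    f = [2.0, 1.0]                                      # one frequency per region
    print(execution_time(w, f, d=0.5, t_a=0.5))         # 5.5
    print(total_energy(w, f, M=2, d=0.5, t_a=0.5,
                       c1=1.0, c2=0.0, c3=0.0, K=0.5, alpha=2))  # 23.0
```

Raising a region's frequency shortens its compute time but leaves the memory wait $w_m d\, t_a$ unchanged, which is why the slack available for frequency reduction shrinks as $d$ grows.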
MAIN RESULTS
We investigate how the optimal frequencies relate to the memory intensity of an application, and whether and how the relationship between the optimal frequencies and the number of active cores changes in the presence of memory accesses. We present our main results in the form of three lemmas.

Lemma 2. For an optimal solution $\mathbf{f} = [f_1, f_2, \ldots, f_M]$ to the constrained energy optimization problem without the static energy, the following holds for every pair $n, m \in \{1, 2, \ldots, M\}$ with $w_m, w_n > 0$:

$$m' f_m^{\alpha} \left[ (\alpha - 1) + \alpha\, d\, t_a f_m \right] = n' f_n^{\alpha} \left[ (\alpha - 1) + \alpha\, d\, t_a f_n \right].$$
Lemma 2 relates the optimal frequencies of two different parallel regions. This is in contrast to the corresponding relationship in [3], which is

$$\frac{f_m}{f_n} = \left( \frac{n}{m} \right)^{1/\alpha}.$$
Including $d$ and $K$ makes the relation more precise.
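A relation of this kind can be obtained from a standard Lagrange multiplier argument, sketched below. The sketch is an illustration, not the proof of the lemma. It assumes that, without static energy, the total energy is $E = \sum_m c_1 m' f_m^{\alpha-1} w_m (1 + d\, t_a f_m)$ with $m' = m + K(M - m)$, and that the performance constraint bounds the total time $\sum_m w_m (1/f_m + d\, t_a)$ by a deadline. The symbols $T_{\max}$ and $\lambda$ are introduced here for illustration only.

```latex
\begin{align*}
E(\mathbf{f}) &= \sum_{m=1}^{M} c_1 m' f_m^{\alpha-1}\, w_m \left(1 + d\, t_a f_m\right), \\
T(\mathbf{f}) &= \sum_{m=1}^{M} w_m \left(\frac{1}{f_m} + d\, t_a\right) \le T_{\max}, \\
\frac{\partial E}{\partial f_m} &= c_1 m' w_m f_m^{\alpha-2}
  \left[(\alpha-1) + \alpha\, d\, t_a f_m\right],
\qquad
\frac{\partial T}{\partial f_m} = -\frac{w_m}{f_m^{2}}, \\
\frac{\partial E}{\partial f_m} + \lambda\, \frac{\partial T}{\partial f_m} = 0
&\;\Longrightarrow\;
c_1 m' f_m^{\alpha} \left[(\alpha-1) + \alpha\, d\, t_a f_m\right] = \lambda .
\end{align*}
```

Since $\lambda$ is the same for every $m$ with $w_m > 0$, the left-hand side takes the same value for any two such regions $m$ and $n$. Setting $d = 0$ and $K = 0$ reduces the condition to $m f_m^{\alpha} = n f_n^{\alpha}$, that is, an optimal frequency inversely proportional to $\sqrt[\alpha]{m}$.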
The next step is to analytically relate an optimal frequency for any parallel region to the optimal frequency of the serial region. Lemma 3 gives such a relation for α = 2.
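Lemma 3. For $\alpha = 2$, the ratio $x_m = f_m / f_1$ of the optimal frequency $f_m$ for a parallel region of the schedule with $m$ active cores to the optimal frequency $f_1$ for the serial region is a solution to the following cubic equation:

$$2\, d\, t_a f_1\, m' x_m^{3} + m' x_m^{2} - 1' \left( 1 + 2\, d\, t_a f_1 \right) = 0, \qquad (3)$$

where $1'$ is a constant equal to $KM + (1 - K)$, that is, $m'$ evaluated at $m = 1$.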
Note that it is possible for a schedule to have no serial region at all. The purpose of expressing $x_m$ in terms of $f_1$ is to show by how much the frequency for a given amount of parallelism differs from $f_1$. One can think of $f_1$ as a reference frequency for a given hardware and application combination. For the sake of analysis, we call the term $2\, d\, t_a f_1$ the memory overload factor. A careful analysis of equation 3 in relation to the memory overload factor reveals that, as the memory overload factor increases, the optimal frequency for $m$ active cores tends to be inversely proportional to $\sqrt[3]{m}$. Without accounting for memory accesses, the optimal frequency for $m$ active cores is inversely proportional to $\sqrt{m}$. Thus, accounting for memory accesses does not allow as much reduction in the frequencies for parallel regions as predicted in [3]. This confirms that the energy savings predicted by [3] are over-optimistic, especially for memory-intensive applications running on slow hardware under a tight performance budget. In general, from Theorem 1 and by generalizing the above exposition, it can be established that the optimal frequency for $m$ active cores for a memory-intensive application (with $\alpha\, d\, t_a f_1$ sufficiently large) is inversely proportional to $\sqrt[\alpha+1]{m}$.
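The two limiting behaviours can be checked numerically. The Python sketch below assumes that, for $\alpha = 2$ and without static energy, the ratio $x_m$ satisfies $m' x_m^2 (1 + \mu x_m) = 1' (1 + \mu)$, where $\mu = 2\, d\, t_a f_1$ is the memory overload factor, $m' = m + K(M - m)$ and $1' = KM + (1 - K)$. The left-hand side is increasing in $x_m$ and the root lies in $(0, 1]$, so plain bisection suffices. The function name and the parameter values are hypothetical.

```python
def frequency_ratio(m, M, K, mu, iters=200):
    """Solve m' x^2 (1 + mu x) = 1' (1 + mu) for x = f_m / f_1 by bisection.

    mu is the memory overload factor 2 * d * t_a * f_1 (alpha = 2 assumed).
    """
    one_eff = 1 + K * (M - 1)        # 1' = K*M + (1 - K)
    m_eff = m + K * (M - m)          # m'
    rhs = one_eff * (1 + mu)

    def g(x):
        return m_eff * x * x * (1 + mu * x) - rhs

    lo, hi = 0.0, 1.0                # g(0) < 0 and g(1) >= 0 because m' >= 1'
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    M, K = 4, 0.0                    # K = 0: idle cores consume no dynamic power
    for mu in (0.0, 1.0, 10.0, 1e3, 1e9):
        print(mu, frequency_ratio(4, M, K, mu))
    # mu = 0 recovers the square-root law (1'/m')**(1/2);
    # very large mu approaches the cube-root law (1'/m')**(1/3).
```

With $K = 0$ we have $m' = m$, so for $m = 4$ the ratio moves from $4^{-1/2} = 0.5$ at $\mu = 0$ towards $4^{-1/3} \approx 0.63$ as $\mu$ grows, which illustrates why a memory-intensive application permits less frequency reduction in its parallel regions.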
