Due to the increasing power consumption of modern computing systems, energy management has become an important research area over the last decade. Recently, multicore has emerged as an energy-efficient architecture that exploits the parallelism in modern applications. However, as the number of cores on a single chip continues to increase, effectively managing the energy efficiency of multicore-based systems has become a grand challenge. In this paper, based on the voltage island and dynamic voltage and frequency scaling (DVFS) techniques, we investigate the energy efficiency of block-partitioned multicore processors, where cores are grouped into blocks and the cores on one block share a DVFS-enabled power supply. Depending on the number of cores on each block, we study both symmetric and asymmetric block configurations. We develop a system-level power model (which can support various power management techniques) and derive both block- and system-wide energy-efficient frequencies for systems with block-partitioned multicore processors. Based on the power model, we prove that, for embarrassingly parallel applications, having all cores on a single block can achieve the same energy savings as the individual block configuration (where each core forms a single block and has its own power supply). However, for applications with limited degrees of parallelism, we show the superiority of the buddy-asymmetric block configuration, where the number of required blocks (and power supplies) is logarithmically related to the number of cores on the chip, in that it can achieve the same energy savings as the individual block configuration. The energy efficiency of different block configurations is further evaluated through extensive simulations with both synthetic applications and a real-life application.
Introduction
The performance improvements in computing systems have resulted in drastic increases in power density, and energy is now considered a first-class system resource, especially for battery-powered embedded computing devices. Therefore, energy management has become an important research area in the past decade, with many hardware and software power management techniques having been proposed for various computing systems, from battery-powered devices [1, 2, 3] to high-performance servers connected directly to the power grid [4, 5]. However, effective energy management remains one of the grand challenges for the research and engineering community, both in industry and academia [6]. One common strategy to save energy in computing systems is to operate system components at low-performance (thus, low-power) states whenever possible. For instance, as a popular and widely deployed power management technique, dynamic voltage and frequency scaling (DVFS) scales down both the processing frequency and supply voltage of processors to save energy [7], which exploits the convex relation between a processor's dynamic power consumption and its processing frequency and supply voltage [8]. Based on the DVFS technique, there have been many research studies on managing the power consumption of processors in various systems [1, 3, 9]. Moreover, considering the power consumption of other components (such as main memory [2] and I/O devices [10, 11]) and the increasing static/leakage power in processors due to scaled feature sizes [12], executing applications at the lowest frequency and supply voltage may consume more system energy, and system-wide energy management becomes a necessity [13, 14, 15]. As an energy-efficient high-performance architecture, the chip multiprocessor (CMP), which integrates multiple processing cores on a single chip, has been introduced [16].
Such a technique has been quickly adopted by major chip manufacturers, and many multicore processors (e.g., Intel Core2 Quad [17], AMD Phenom [18] and Sun Niagara [19]) have been developed, which have become popular and powerful computing engines for modern computing systems. In general, multicore processors exploit the parallelism in applications for energy efficiency. For instance, to achieve the same level of performance (such as response time), an application can be executed in parallel on multiple processing cores with each core running at a lower frequency and supply voltage to save energy. Several recent studies have explored such features of multicore processors to save energy for different applications [20, 21, 22, 23, 24]. The available multicore processors normally integrate various advanced power management features, such as power saving states (e.g., Halt, Sleep and Off [25, 26]) and DVFS. However, most state-of-the-art commercial multicore processors have only one common power supply voltage for all cores on the chip [17, 27], which may require the cores to run at the same processing frequency and limit the flexibility for power management (and thus result in sub-optimal energy savings). With the advancement of voltage island techniques [28, 29, 30] and fast on-chip voltage regulators [31], it is expected that future multicore processors can have multiple supply voltage domains on a chip [19]. However, the increased design complexity and associated area overhead of the additional supply voltages and on-chip voltage regulators [32] would make it very costly to provide a separate DVFS-enabled supply voltage for each individual core, especially considering that the number of cores on a single chip continues to increase (extensive research is underway to build chips with potentially tens and even hundreds of cores [33, 34, 35]).
Therefore, the problem of how to optimally place the cores on different voltage islands in multicore processors for effective system-level power management is not trivial and warrants investigation.
In this paper, based on the voltage island and DVFS with on-chip voltage regulator techniques, we study block-partitioned core configurations for multicore processors and evaluate the energy efficiency of different block configurations. Specifically, the cores on a multicore chip are grouped into several blocks (with each block being essentially a DVFS-enabled voltage island) and the cores within each block share a common supply voltage (and thus will have the same processing frequency). Depending on how the cores are partitioned into blocks, we study both symmetric and asymmetric block configurations, which contain the same and different numbers of cores on each block, respectively. In particular, we study a buddy-asymmetric block configuration, which has two appealing features: it is flexible enough to support different system workloads energy-efficiently, and its number of blocks (and required power supplies) is only logarithmically related to the number of cores.
For such block-partitioned multicore systems, we develop a system-level power model that can effectively support various levels of power management techniques and derive both block- and system-level energy-efficient frequencies. Based on the power model, we also analyze the energy efficiency of different block configurations. First, for embarrassingly parallel applications, we prove that the common block configuration that has all cores on a single block (and thus needs only one power supply) can achieve the same energy savings as the individual block configuration (where each core forms a single block and has its own power supply). Then, for applications with limited degrees of parallelism, we show the superiority of the buddy-asymmetric block configuration in that it can achieve energy savings comparable to those of the individual block configuration. Moreover, the energy efficiency of block-partitioned multicore systems is further evaluated through extensive simulations with both synthetic applications and a real-life application, and the results confirm our analysis.
To the best of our knowledge, this is the first work that systematically analyzes and evaluates the energy efficiency of both symmetric and asymmetric block configurations of multicore systems for applications with various degrees of parallelism. To summarize, the main contributions of this paper are threefold:
• We propose block-partitioned core organizations for multicore processors, where both symmetric and asymmetric block configurations are considered; in particular, a buddy-asymmetric block configuration is studied.
• We develop a system-level power model for systems with such block-partitioned multicore processors and derive both block- and system-wide energy-efficient frequencies.
• We analyze the energy efficiency of block-partitioned multicore systems for both embarrassingly and limited parallel applications, where the analysis results are confirmed through extensive simulations.
The remainder of this paper is organized as follows. The closely related work is reviewed in Section 2. Section 3 presents block-partitioned core configurations for multicore processors and the corresponding system-level power model. In Section 4, we analyze the energy efficiency of different block configurations in multicore systems for applications with various degrees of parallelism. The simulation results are presented and discussed in Section 5. Section 6 concludes the paper.
Related Work
Power-aware computing has become an important research area that has attracted extensive attention in the last decade. As the dynamic energy consumption of CMOS devices is quadratically related to their supply voltage [8], the dynamic voltage and frequency scaling (DVFS) technique, which scales down the supply voltage (and the corresponding processing frequency) of CMOS devices, can lead to significant energy savings [7]. Based on the DVFS technique, various power management schemes have been developed for uniprocessor real-time systems with different scheduling policies [1, 3, 36]. However, the research on power-aware scheduling for multiprocessor systems is comparatively limited, especially for multicore-based systems.
Based on the partitioned scheduling policy, Aydin et al. studied the problem of how to partition real-time tasks among processors to minimize the energy consumption of multiprocessor systems [37]. They showed that, for earliest deadline first (EDF) scheduling, balancing the workload evenly among all the processors gives the optimal energy consumption, and that the general partition problem for minimizing energy consumption in multiprocessor real-time systems is NP-hard [37]. The work was extended to consider rate monotonic scheduling (RMS) in their later work [38]. Anderson and Baruah investigated how to synthesize a multiprocessor real-time system with periodic tasks such that the energy consumption is minimized at run-time [39]. Chen et al. proposed a series of approximation scheduling algorithms for maximizing the energy efficiency of multiprocessor real-time systems, where both frame-based tasks and periodic tasks are considered, with and without leakage power consideration [40, 41, 42, 43]. In our previous work, based on global scheduling, power-aware algorithms have been developed for real-time multiprocessor systems, which exploit slack reclamation and slack sharing for energy savings [9]. More recently, Choi and Melhem studied the interplay between the parallelism of an application, program performance, and energy consumption [44]. For an application with a given ratio of serial and parallel portions and a given number of processors, the authors derived the optimal frequencies allocated to the serial and parallel regions of the application to minimize either the total energy consumption or the energy-delay product.
For soft real-time applications running on multicore systems, Bautista et al. recently studied a novel fairness-based power-aware scheduler that uses a global frequency for all cores and evaluated the power efficiency of multicore processors [20]. The scheduler seeks to minimize the number of DVFS transitions by increasing or decreasing the voltage and frequency of all the cores at the same time. Along the same line of assuming that all cores on a chip share the same frequency, Seo et al. also studied a dynamic re-partitioning algorithm for real-time systems, which dynamically balances the task loads on the cores to optimize the overall power consumption [24]. For non-real-time applications, Donald et al. concluded that the most effective approach for energy management in multicore-based systems is the combination of core throttling (i.e., powering off the unused cores) and per-core DVFS that adjusts the processing frequency and supply voltage of cores independently [45]. Although it is possible to provide a separate power supply voltage for each individual core to get the maximum flexibility, as the number of cores on a chip continues to increase (with tens [46] or even hundreds of cores [34]), the associated overhead can be prohibitive [32], in addition to the extra circuit complexity, stability, and power delivery problems [33, 35, 47].
In [47], based on detailed VLSI circuit simulations, Herbert et al. found that the potential energy gains of per-core DVFS are likely to remain too modest to justify the design complications. A similar conclusion has been reached independently in [48], which states that the additional energy gains from per-core DVFS would not be substantial for reasonably balanced workload-to-core distributions. Therefore, several recent studies [29, 30, 33] have focused on the voltage/frequency island technique, where the cores on a multicore chip are partitioned into a few groups with each group being placed on a voltage island that has a separate power supply and an independent voltage regulator. The cores on the same voltage island have the same supply voltage and processing frequency, which are adjusted simultaneously. However, the cores may independently enter low-power sleep states to save energy when idle [49]. Some very recent experimental studies indicate that innovative techniques could facilitate the implementation of fast on-chip voltage regulators [31].
Following this line of research, in this work, we investigate energy-efficient organizations of cores on multicore processors. Specifically, we study block-partitioned multicore processors, where the processing cores are grouped into different blocks and each block is essentially a DVFS-enabled voltage island. However, different from the existing work that focused only on voltage islands containing the same number of cores [29, 30, 33, 47], we consider both symmetric blocks (where the number of cores is the same on each block) and asymmetric blocks (where the numbers of cores on the blocks are different), and evaluate the energy efficiency of these configurations for applications with various degrees of parallelism.
Block-Partitioned Multicore Processors
In this section, we first present both symmetric and asymmetric block configurations for the cores in multicore processors. Then, for such block-partitioned multicore systems, we develop a simple system-level power model which at the same time can effectively support various power management techniques. Based on the power model, we derive the energy efficient frequencies, which form the foundation and will be utilized to analyze the energy efficiency of different block configurations for applications with various degrees of parallelism in Section 4.
Partition Cores to Blocks
As previously mentioned, in terms of flexible energy management, the most efficient approach is to have a separate DVFS-enabled supply voltage for each individual core in multicore systems [45]. However, such a configuration requires a separate voltage island and on-chip voltage regulator for each core, where the increased design complexity and associated overhead cost can be prohibitive (e.g., a single on-chip voltage regulator can take 12.7% of the whole chip area for the 90 nm technology [32]). Moreover, with reasonably balanced distributions of workload to cores, the additional energy savings from such a per-core DVFS feature can be limited [47, 48]. Therefore, it would be more appropriate to place several cores on one voltage island and allow them to share the supply voltage (and thus have the same processing frequency) [29, 30], especially considering the fact that the number of cores on a chip will continue to increase [33, 34, 35]. In general, having more cores on a single voltage island reduces the number of required power supply voltages (and associated on-chip voltage regulators) and thus reduces the design complexity and overhead cost, but it may limit the energy management opportunities, and vice versa. Therefore, there is an interesting tradeoff between the overhead cost (i.e., the number of required power supply voltages and associated on-chip voltage regulators) and energy efficiency for multicore processors.
In this work, it is assumed that the multicore processor under consideration has $n$ homogeneous processing cores, where the cores are identical and have the same processing functionalities. Moreover, for simplicity and ease of discussion, we further assume that $n$ is a power-of-two value (i.e., $n = 2^k$, $k \ge 1$). To reduce the design complexity and the number of required power supplies, the cores will be partitioned into blocks, where each block is essentially a DVFS-enabled voltage island. The cores on the same block will share the same supply voltage. Although it is possible for the cores to run at different lower processing frequencies with the same high supply voltage, doing so is not energy efficient and the additional circuits needed will make the design more complex. For simplicity, we will not consider this aspect in this paper. That is, we assume that all cores on the same block will have the same highest allowable processing frequency for a given supply voltage. Therefore, without introducing ambiguity, we will use frequency scaling to stand for adjusting both voltage and frequency simultaneously in the remaining part of this paper.
When cores are grouped into blocks, we can have the same number of cores on every block or each block can have a different number of cores. In what follows, depending on the number of cores on blocks, we study both symmetric and asymmetric block configurations.
Symmetric Blocks
First, we consider symmetric blocks, where all blocks have the same number of cores. However, for a multicore processor with a given number of cores, depending on the number of blocks to be adopted, we can have different symmetric configurations, which will require different numbers of power supply voltages and have different design complexity and overhead cost. Note that precisely quantifying the design complexity and overhead cost of the supply voltages and associated on-chip voltage regulators under different block configurations requires detailed architecture information and is well beyond the scope of this paper. For simplicity, in this paper, we use the number of blocks (and thus the number of required supply voltages and on-chip voltage regulators) to represent such complexity and overhead for different block configurations.
As a concrete example, suppose that we have a multicore processor that has eight cores. One case is to have a separate power supply voltage for each core that forms an individual block (denoted as Symm-1) as shown in Fig.1(a) . Here, the dotted rectangles represent the blocks. In this case, depending on the workload on each core, the cores can run at different processing frequencies through independent DVFS. Moreover, when some cores finish their work and become idle, they can be powered off without affecting other cores for more energy savings. On the other hand, we can have all cores sharing a single power supply to form a common block (denoted as Symm-8) as shown in Fig.1(b) . Here, the supply voltage will be determined by the most loaded core and all cores will operate at the same high processing frequency. Although the idle cores can be put to power saving sleep states, we cannot power off the block for better energy savings even if there is only one core that is actively processing its workload. Therefore, we can see that the Symm-1 configuration can provide the best flexibility for power management, where each core can be independently managed. However, it will require 8 separate power supply voltages (and associated on-chip voltage regulators), which will lead to high design complexity and overhead cost. For comparison, the common block configuration (i.e., Symm-8) needs only a single power supply voltage with the lowest design complexity and overhead cost. However, having all cores sharing one power supply will limit the power management opportunities and thus result in sub-optimal energy savings. For a better tradeoff between power management flexibility and design complexity/cost, we can have either two blocks for the processor with four cores on each block (denoted as Symm-4 as shown in Fig.1(c) ) or four blocks with two cores on each block (denoted as Symm-2), which will require 2 and 4 power supply voltages, respectively. 
Once again, the most loaded core will determine the lowest supply voltage and all cores within a block have the same processing frequency.
Recall that the number of cores $n$ is assumed to be a power-of-two value. Therefore, for multicore processors configured with symmetric blocks, the number of blocks $b$ will be a power-of-two value as well. For multicore processors with a given number of cores, the energy efficiency of different symmetric block configurations will be analyzed and evaluated in Sections 4 and 5, respectively.
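As a minimal sketch of the symmetric configurations discussed above, the following Python snippet enumerates every Symm-c option for a power-of-two core count together with the number of blocks (and thus supply voltages) each one would need. The function name `symmetric_configs` is hypothetical and introduced only for illustration.

```python
def symmetric_configs(n):
    """Map cores-per-block -> number of blocks for each symmetric
    configuration of n cores (n is assumed to be a power of two)."""
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two")
    configs = {}
    cores_per_block = 1
    while cores_per_block <= n:
        # Symm-<cores_per_block> needs one supply voltage per block.
        configs[cores_per_block] = n // cores_per_block
        cores_per_block *= 2
    return configs


# For the eight-core example: Symm-1, Symm-2, Symm-4 and Symm-8.
print(symmetric_configs(8))
```

For eight cores this lists the four configurations from the example, requiring 8, 4, 2 and 1 supply voltages, respectively.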
Asymmetric Buddy Blocks
Instead of having the same number of cores on each block, we can have asymmetric blocks, where blocks may have different numbers of cores. Note that, for multicore processors with a given number of cores, the number of different asymmetric block configurations can be extremely large, which makes it difficult (if not impossible) to analyze and evaluate all of them. With the objectives of reducing the number of blocks (and thus the number of power supply voltages and associated on-chip voltage regulators) while providing flexible power management support for various workloads, we consider in particular the asymmetric buddy blocks in this work. The configuration follows an idea similar to buddy memory allocation in operating systems [50]. For the above example with 8 cores, Fig. 1(d) shows the asymmetric buddy configuration, where there are 4 blocks and the numbers of cores on the blocks are 4, 2, 1, and 1, respectively. That is, in the asymmetric buddy configuration, the first block contains half of the cores, the second block contains half of the remaining cores, and so on. The last two blocks will contain one core each.
In general, for a multicore processor with $n = 2^k$ ($k \ge 1$) processing cores, there will be $(k + 1)$ blocks for the asymmetric buddy configuration and the numbers of cores on the blocks will be $2^{k-1}, 2^{k-2}, \ldots, 2^1, 2^0$ and $2^0$, respectively. Therefore, the number of blocks, which is also the number of required supply voltages (and on-chip voltage regulators), under the asymmetric buddy configuration, $N_b^{buddy}$, is logarithmically related to the number of cores on a multicore chip:

$$N_b^{buddy} = k + 1 = \log_2 n + 1$$
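The halving rule behind the asymmetric buddy configuration can be sketched in a few lines of Python. This is only an illustration: the function name `buddy_blocks` is hypothetical, and the core count is assumed to be a power of two with at least two cores, as in the text.

```python
def buddy_blocks(n):
    """Return the block sizes of the asymmetric buddy configuration
    for n = 2**k cores (k >= 1), largest block first."""
    if n < 2 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two with n >= 2")
    sizes = []
    remaining = n
    while remaining > 1:
        half = remaining // 2   # each block takes half of the remaining cores
        sizes.append(half)
        remaining -= half
    sizes.append(remaining)     # the last block holds the final single core
    return sizes


print(buddy_blocks(8))          # the eight-core example: 4, 2, 1 and 1 cores
print(len(buddy_blocks(64)))    # k + 1 blocks for n = 2**k
```

The length of the returned list grows by one each time the core count doubles, which is the logarithmic relation stated above.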
Note that, for multicore processors with the asymmetric buddy configuration, to satisfy the performance requirements of various workloads, we can adopt exactly $p$ ($1 \le p \le 2^k$) cores by appropriately selecting the blocks to be powered on (while the remaining blocks/cores can be powered off for energy savings). More specifically, the block selection process is as follows: the first block to be selected will be the largest block $B_i$ with its number of cores $n_i$ being no more than $p$. If that block contains exactly $p$ cores (i.e., $n_i = p$), only the block $B_i$ will be selected. Otherwise, more blocks are needed to provide $(p - n_i)$ cores. The next block to be selected will be the one (from the remaining blocks) that has the largest number of cores $n_j$ being no larger than $(p - n_i)$. The above steps will be repeated until the total number of cores from the selected blocks is exactly $p$. When $1 \le p \le n$ (where $n = 2^k$), from number theory, we know that the above process will always be able to find appropriate blocks that have $p$ cores in total.
For the above example, if a given workload needs 5 processing cores to obtain the maximum energy savings, we can use the blocks $B_0$ and $B_2$ for the buddy asymmetric configuration in Fig. 1(d). For comparison, with the Symm-4 configuration as shown in Fig. 1(c), we need to use both blocks to satisfy the performance requirements, which could be sub-optimal for energy savings. Although it is also possible to select exactly 5 cores for the Symm-1 configuration as shown in Fig. 1(a), the Symm-1 configuration requires 8 power supply voltages compared to the 4 power supply voltages needed by the asymmetric buddy configuration. Furthermore, in Sections 4 and 5, we will show that the resulting energy efficiency of the asymmetric buddy configuration is always the same as that of the Symm-1 (i.e., individual block) configuration for various workloads with different degrees of parallelism.
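The block selection process and the five-core example above can be sketched as the following greedy loop. The function name `select_blocks` is hypothetical, and the block sizes are assumed to be listed from largest to smallest, so that index 0 corresponds to $B_0$.

```python
def select_blocks(sizes, p):
    """Greedily pick block indices whose sizes sum to exactly p.

    `sizes` lists the cores per block, largest first (an assumption of
    this sketch). Each step takes the largest remaining block that does
    not exceed the number of cores still needed.
    """
    if not 1 <= p <= sum(sizes):
        raise ValueError("p must be between 1 and the total core count")
    selected = []
    needed = p
    for index, size in enumerate(sizes):
        if size <= needed:
            selected.append(index)
            needed -= size
        if needed == 0:
            break
    if needed != 0:
        raise ValueError("these block sizes cannot provide exactly p cores")
    return selected


buddy_sizes = [4, 2, 1, 1]            # eight-core buddy configuration
print(select_blocks(buddy_sizes, 5))  # blocks B_0 and B_2, as in the example
```

With buddy block sizes the loop reaches exactly $p$ cores for every $p$ between 1 and $n$; the final check only guards against arbitrary size lists for which the greedy choice could fall short.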
System-Level Power Model for Block-Partitioned Multicore Systems
To effectively evaluate the energy efficiency of different block configurations, we develop in this subsection the power model for computing systems with a block-partitioned multicore processor. Note that power management schemes that focus on individual components may not be energy efficient at the system level. For instance, DVFS tends to run a system at the lowest processing frequency that satisfies a given performance requirement to save the processor's energy consumption [7]. However, when a processor runs at low processing frequencies, it will need more time to execute the application under consideration and thus incur more energy consumption from memory and I/O devices. Therefore, system-wide power management becomes a necessity and has caught researchers' attention recently [13, 15, 51, 52].
For systems with a traditional single-core processor, by dividing the power consumption into three parts, we have studied in our previous work a simple system-level power model [13, 53], where the power consumption of a computer system running at frequency $f$ can be modeled as:

$$P(f) = P_s + \hbar (P_{ind} + P_d) = P_s + \hbar (P_{ind} + C_{ef} f^m) \qquad (1)$$

Here, $P_s$ is the static power that includes the power to maintain the basic circuits of a system (e.g., keeping the clock running) as well as part of the power consumed by memory and I/O devices. It is assumed that $P_s$ can only be removed by powering off the whole system. Whenever the system is executing some workload, it is said to be active (i.e., $\hbar = 1$) and the active power, which has two parts $P_{ind}$ and $P_d$, will be consumed. Here, $P_{ind}$ denotes the frequency-independent active power that includes part of the processor static/leakage power as well as any power that can be effectively removed by putting the power manageable components into sleep states. $P_{ind}$ is assumed to be a constant. $P_d$ denotes the frequency-dependent active power, which includes the processor's dynamic power as well as any power that depends on system supply voltages and processing frequencies [8, 54]. The switch capacitance $C_{ef}$ and the power exponent $m$ are assumed to be system-dependent constants. In general, $2 \le m \le 3$ [8]. Despite its simplicity, this power model incorporates all essential power components of a system and can support various power management techniques (e.g., DVFS and sleep states).
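To make the three-part single-core model concrete, here is a minimal Python sketch. It assumes that the frequency-dependent active power takes the form $C_{ef} f^m$ and that a unit of work takes time inversely proportional to the normalized frequency; the function names and all numeric values are hypothetical and chosen only for illustration.

```python
def single_core_power(f, active, p_s, p_ind, c_ef, m):
    """System power at normalized frequency f.

    The static power p_s is always consumed; the active power
    (p_ind plus the frequency-dependent part) is added only while
    the system is executing some workload.
    """
    if not active:
        return p_s
    return p_s + p_ind + c_ef * f ** m


def active_energy(f, work, p_ind, c_ef, m):
    """Active energy for `work` units of computation at frequency f,
    assuming the execution time is work / f (an assumption of this sketch)."""
    return (p_ind + c_ef * f ** m) * (work / f)


# Hypothetical parameters: p_ind = 0.1, c_ef = 1, m = 3.
for f in (1.0, 0.5, 0.2):
    print(f, active_energy(f, work=1.0, p_ind=0.1, c_ef=1.0, m=3))
```

With these illustrative parameters the active energy at f = 0.2 is higher than at f = 0.5, because the frequency-independent power is consumed for a longer time. This is the effect that motivates looking for an energy-efficient frequency instead of always running at the lowest one.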
Footnote (on the single-core power model above): Similar models have been proposed by other researchers as well [15, 51]. More recently, assuming that multiple components (such as CPU, system bus and memory) have individually adjustable frequencies, a more accurate but more complex power model has been studied in [52].

Footnote (on the power consumed by memory and I/O devices): Although it is possible to manage the power consumption of I/O devices independently in a fine-controlled manner [11, 52], exploring such possibilities is beyond the scope of this paper and will be considered in our future work.

Following a similar idea and extending the above power model, we develop in what follows a system-level power model for systems with a single block-partitioned multicore processor. Note that the processing cores in modern multicore processors can be efficiently (e.g., in a few cycles) put into power-saving sleep states [25, 26]. With the processing cores being partitioned into blocks and each block having a DVFS-enabled supply voltage, there are several places at different levels to manage the power consumption of a block-partitioned multicore system. First, at the processing core level, we can exploit DVFS techniques and scale down the processing frequency (and corresponding supply voltage) for all cores within a block to save energy, provided that the performance requirement of the most loaded core(s) can still be met. Second, whenever a processing core finishes its workload and becomes idle, we can put it into sleep states for more energy savings. Third, at the block level, if all cores on a block are in sleep states and are expected to remain in sleep states in the near future, we can switch off the power supply for the block and put it into the off state to save part of the static and leakage power. Finally, at the system level, we may completely power off the whole system when it is not in use, and all power consumption (of processor, memory and I/O devices) will be removed. However, considering the excessive time overhead for completely powering on/off a computing system (e.g., tens of seconds [4]), we assume that the system under consideration is never powered off completely and thus focus, in this paper, on the power management techniques at the core and block levels.
To effectively support these different power management opportunities in block-partitioned multicore systems, following the principles of the power model shown in (1), we also divide the system power consumption into several distinct components. First, associated with each core, there is frequency-dependent active power that depends on the processing frequency (and supply voltage) of the block. However, when a core is idle, it is assumed that its frequency-dependent active power can be efficiently removed by putting the core into power-saving sleep states. The frequency-independent active power is assumed to be associated with a block; it is proportional to the number of cores on the block (as the leakage power normally depends on the number of circuits) and can be removed by switching off the power supply for the block. Finally, there is a static system power that includes the power consumption of memory and other I/O devices and is assumed to be a constant. Before formally presenting the system-level power model, we first define a few important terms for systems with a block-partitioned multicore processor:
• $b$: the number of blocks on the multicore processor in the system under consideration;
• $B_i$: the $i$-th block of the multicore processor, where $i = 0, \ldots, b - 1$;
• $n_i$: the number of cores on the block $B_i$. We have $\sum_{i=0}^{b-1} n_i = n$;
• $f_i(t)$: the processing frequency for the cores on the block $B_i$ at time $t$;
• $x_i(t)$: a binary variable to indicate the state of block $B_i$ at time $t$. If the block is powered off, $x_i(t) = 0$; otherwise, $x_i(t) = 1$;
• $y_{i,j}(t)$: a binary variable to indicate the state of the $j$-th ($j = 1, \ldots, n_i$) core on block $B_i$ at time $t$. If the core is in sleep state, $y_{i,j}(t) = 0$; otherwise, $y_{i,j}(t) = 1$.
It can be seen that the power consumption of a block-partitioned multicore system depends on the states of its blocks as well as individual cores. For a given run-time state of the system at time t, which is defined by x i (t), f i (t) and y i,j (t) (i = 0, . . . , b − 1 and j = 1, . . . , n i ), the power consumption P (t) of a system with a block-partitioned multicore processor can be modeled as:
where $P_s$ is the system static power. With the assumption that the system is never powered off completely due to the prohibitive switching overhead, $P_s$ is always consumed. That is, in this paper we focus on managing the active power of block-partitioned multicore systems. Recall that all cores on block $B_i$ run at the same frequency $f_i$ due to the limitation of the block-wide power supply. For ease of presentation, the maximum frequency-dependent active power for one core at the maximum frequency $f_{max}$ is defined as $P_d^{max} = C_{ef} \cdot f_{max}^m$.
In addition, we further assume that each core contributes a portion to the frequency-independent active power of the block it resides in, which is assumed to be $\beta \cdot P_d^{max}$, where $\beta$ is a parameter that captures the relationship between the frequency-independent and frequency-dependent active powers. The value of $\beta$ is system related and depends on the processor technology. For single-core processors, values in the range 0.1 to 0.5 have been used [42, 53], and we will use values in the same range in this paper. That is, for block $B_i$ with $n_i$ cores, its frequency-independent power will be $n_i \cdot \beta \cdot P_d^{max}$.
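The power model described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the names `Block` and `system_power` are hypothetical, the frequency-dependent power of an awake core is taken as $C_{ef} \cdot f_i^m$, and the default values ($C_{ef} = 1$, $m = 3$, $P_s = 0.01$, $f_{max} = 1$) are the ones used later in the analysis.

```python
from dataclasses import dataclass
from typing import List

C_EF = 1.0    # assumed effective switching capacitance
M = 3.0       # assumed dynamic power exponent
F_MAX = 1.0   # normalized maximum frequency
P_D_MAX = C_EF * F_MAX ** M  # per-core frequency-dependent power at f_max


@dataclass
class Block:
    freq: float             # f_i(t), shared by all cores on the block
    powered_on: bool        # x_i(t)
    core_awake: List[bool]  # y_{i,j}(t), one flag per core (length n_i)


def system_power(blocks: List[Block], beta: float, p_s: float = 0.01) -> float:
    """Instantaneous system power P(t) for a given run-time state."""
    total = p_s  # static power is always consumed
    for blk in blocks:
        if not blk.powered_on:
            continue  # a powered-off block contributes no active power
        n_i = len(blk.core_awake)
        # Frequency-independent power: every core on a powered block pays it.
        total += n_i * beta * P_D_MAX
        # Frequency-dependent power: only awake cores pay it.
        total += sum(blk.core_awake) * C_EF * blk.freq ** M
    return total
```

For example, a powered block with four cores, two of them awake at $f_{max}$, plus a second block that is switched off, consumes the static power, four shares of frequency-independent power and two shares of frequency-dependent power.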
Note that modern DVFS-enabled processors (e.g., Intel [17] and AMD [18]) normally have only a few frequency levels. In this work, we assume that there are $k$ frequency levels $f_1, \ldots, f_k$, where $f_1 = f_{min}$ is the minimum available frequency and $f_k = f_{max}$ is the maximum frequency. Moreover, we use normalized frequencies in this work and it is assumed that $f_{max} = 1$. For simplicity, the time overhead of adjusting the frequency (and supply voltage) of the blocks is assumed to be negligible. Such overhead can be easily incorporated into the execution time of the applications under consideration when exploiting slack time to scale down the processing frequency [1, 9].
Energy-Efficient Frequencies for Block-Partitioned Multicore Systems
From previous discussions, we can see that, for block-partitioned multicore systems, processing frequencies for cores will be determined at the block level. That is, depending on the allocated workload and the number of active cores on each block, blocks may have different processing frequencies for better energy savings. Focusing on a single block $B_i$ that has $n_i$ processing cores, we first derive in what follows the optimal frequency settings for the block to achieve the maximum energy savings, which will be exploited in Section 4 to analyze the energy efficiency of different block configurations for various parallel applications.
For the cores on block $B_i$, not all of them will be active all the time due to, for example, the limited degrees of parallelism in applications. Note that energy is the integral of power over time. Therefore, although running at a lower frequency can reduce the energy consumption due to the frequency-dependent power of the active cores on block $B_i$, the increased execution time will lead to more energy consumption from the frequency-independent power of the block. That is, similar to the energy-efficient frequency for systems with a single-core processor [13, 53], a block-wide energy-efficient frequency exists, and the active cores should not run below this frequency as doing so will consume more energy. From (3), we can see that the frequency-dependent power component is associated with cores and depends on the number of active cores that have workload to process. Intuitively, when more cores are active, the energy-efficient frequency is lower, as more energy savings can be expected from the frequency-dependent power (which can compensate for the additional energy consumption due to the frequency-independent power), and vice versa.
Recall that there are $n_i$ cores on block $B_i$. Suppose that the number of active cores is $a_i$ ($\le n_i$) and each active core has the same amount of workload $w_i$, which will be executed at the scaled frequency $f_i$. The time needed for the active cores to finish processing their workload will be $t = \frac{w_i}{f_i}$. After putting the idle cores into power-saving sleep states, the active energy consumption for block $B_i$ to execute the required workload can be given as:
$$E_i(f_i, a_i) = \left( n_i \cdot \beta \cdot P_d^{max} + a_i \cdot C_{ef} \cdot f_i^m \right) \cdot \frac{w_i}{f_i} \tag{5}$$
From (5), we can see that the active energy consumption of block $B_i$ to execute the required workload is a convex function of $f_i$. Differentiating $E_i(\cdot)$ with respect to $f_i$ and setting the derivative to zero, we find that $E_i(\cdot)$ is minimized when $f_i$ equals the following energy-efficient frequency:
$$f_{ee,i}(a_i) = \sqrt[m]{\frac{\beta \cdot n_i}{(m-1) \cdot a_i}} \cdot f_{max} \tag{6}$$
Even if there is more time available to execute the workload, the active cores should not run at a frequency lower than $f_{ee,i}(a_i)$, as doing so will consume more energy. Moreover, as the number of active cores $a_i$ becomes smaller, the energy-efficient frequency $f_{ee,i}$ of block $B_i$ becomes larger and the active cores need to run faster for better energy savings. When all cores on block $B_i$ are active (i.e., $a_i = n_i$), (6) can be simplified as:
$$f_{ee} = \sqrt[m]{\frac{\beta}{m-1}} \cdot f_{max}$$
That is, for any block that has all its cores active, the same energy-efficient frequency $f_{ee}$ is obtained, which also denotes the system-wide energy-efficient frequency. In addition to the number of active cores within a block, from (6), the energy-efficient frequency also depends on $\beta$ (i.e., the frequency-independent active power). Assuming that $C_{ef} = 1$ and $m = 3$ [8], for different frequency-independent active power (i.e., different values of $\beta$), Figs. 2(a) and 2(b) show the energy-efficient frequencies of blocks with 8 and 16 cores, respectively, as the number of active cores changes. From the figures, we can see that as the number of active cores decreases, the energy-efficient frequency becomes higher and the cores need to run faster to achieve the best energy savings. If there are only a few (e.g., 2) active cores, having them run at $f_{max}$ can be the most energy-efficient approach, especially when the block's frequency-independent power is large (i.e., for larger values of $\beta$). In what follows, the energy-efficient frequency will be utilized to evaluate the energy efficiency of different block configurations for applications with different degrees of parallelism.
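The dependence of the block-wide energy-efficient frequency on the number of active cores can be checked numerically. The sketch below minimizes the block energy described above in closed form, assuming $P_d^{max} = C_{ef} \cdot f_{max}^m$; the function name is hypothetical, and capping the result at $f_{max}$ is an assumption made here because cores cannot run faster than the maximum frequency.

```python
def energy_efficient_frequency(n_i, a_i, beta, m=3.0, f_max=1.0):
    """Block-wide energy-efficient frequency for a_i active cores out of n_i.

    Closed-form minimizer of (n_i*beta*P_d_max + a_i*C_ef*f**m) * w / f,
    with P_d_max = C_ef * f_max**m (assumption), capped at f_max.
    """
    if not 1 <= a_i <= n_i:
        raise ValueError("need 1 <= a_i <= n_i")
    f_ee = f_max * (beta * n_i / ((m - 1.0) * a_i)) ** (1.0 / m)
    return min(f_ee, f_max)


if __name__ == "__main__":
    # Sweep the number of active cores on an 8-core block.
    for beta in (0.1, 0.5):
        row = [round(energy_efficient_frequency(8, a, beta), 3) for a in range(1, 9)]
        print(f"beta={beta}: {row}")
```

With $\beta = 0.5$ and $m = 3$, an 8-core block with only 2 active cores already reaches $f_{max}$, which matches the observation about blocks with few active cores.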
Energy Efficiency of Block-Partitioned Multicores for Parallel Applications
For parallel applications running on a system with a single block-partitioned multicore processor that has $n$ cores, we analyze the energy efficiency of different block configurations, which provide different flexibility to scale down the processing frequency and/or power off individual cores for energy savings. For applications that have an unlimited degree of parallelism and enough workload, all cores will be utilized all the time and no power management can be applied. In such cases, the system will consume the same power (and energy) regardless of the block configuration of the cores. However, for applications (or different phases in an application) that have a limited degree of parallelism, not all cores will be employed all the time and the system may consume different amounts of energy when the processor has different block configurations. For ease of presentation, in what follows, normalized energy consumption is reported, where the system energy consumption under the common block configuration (where all cores are on a single block and share a common power supply) is used as the baseline.
To simplify the analysis, the workload $W$ of the application under consideration is assumed to have the degree of parallelism $DP$. For applications that have varying degrees of parallelism in different phases, the same analysis can be applied to each phase separately. Moreover, we assume that the application needs to be processed within a period of time $T$, which can be derived from the user-specified performance requirements or the inter-arrival time of a repetitive application. The system load is defined as
$$\delta = \frac{W}{n \cdot f_{max} \cdot T}$$
which is the ratio of the application's workload $W$ over the total computation capacity of the system at the maximum frequency $f_{max}$. Note that the system load does not take the workload's parallelism into consideration. When the degree of parallelism is limited, an application may not be able to complete its workload in time even if $\delta \le 1$. Depending on the degree of parallelism, we focus our analysis on two types of applications: embarrassingly parallel applications (i.e., $DP = \infty$) and applications with a limited degree of parallelism (i.e., $DP \le n$).
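The distinction between system load and feasibility can be made concrete with a small sketch. The helper names are hypothetical; the feasibility test assumes that the workload is evenly divided over at most $\min(DP, n)$ cores running at $f_{max}$.

```python
def system_load(workload, n_cores, period, f_max=1.0):
    """Ratio of the workload to the system capacity n * f_max * T."""
    return workload / (n_cores * f_max * period)


def can_finish_in_time(workload, degree_of_parallelism, n_cores, period, f_max=1.0):
    """True if the workload fits in the period on the usable cores at f_max.

    Assumption: the workload is split evenly over min(DP, n) cores.
    """
    usable = min(degree_of_parallelism, n_cores)
    return workload / (usable * f_max) <= period
```

For instance, on a 64-core system with $T = 1$ and $W = 16$, the system load is 0.25, yet an application with $DP = 8$ cannot finish in time, whereas one with $DP = 16$ just can.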
Embarrassingly Parallel Applications
For applications with embarrassing parallelism, it is assumed that the workload $W$ can be arbitrarily divided and executed in parallel on any number of cores (for instance, the large number of small requests to be processed in a large interval in Web applications) [55]. That is, for a system with $n$ cores, we can evenly distribute the workload $W$ among all cores. Therefore, the minimum amount of time required to process the workload at the maximum frequency $f_{max}$ ($= 1$) will be $t_{min} = \frac{W}{n}$. If $t_{min} > T$ (i.e., $\delta > 1$), the workload cannot be processed in time. Otherwise (i.e., $t_{min} \le T$ and $\delta \le 1$), we can scale down the processing frequency of the cores to save energy. In what follows, we prove that the same amount of energy savings can be obtained under different block configurations for embarrassingly parallel applications with $\delta \le 1$.
Theorem 1. For an embarrassingly parallel application with workload $W$ to be executed on a multicore system with $n$ cores, if the system load $\delta \le 1$, the same amount of energy will be consumed (i.e., the same energy savings can be obtained) regardless of the block configuration.
Proof. Recall that the system will not be completely powered off and the static power component $P_s$ is always consumed, which will be the same for all block configurations. In what follows, we focus on the energy consumption from the frequency-dependent and frequency-independent active power.
For an embarrassingly parallel application with system load $\delta \le 1$, the minimum amount of time to complete its workload $W$ will be $t_{min} = \frac{W}{n} \le T$ when the application is executed on all $n$ cores at the maximum frequency $f_{max} = 1$. Instead of executing the application at the maximum frequency $f_{max}$, we can scale down the processing frequency (and corresponding supply voltage) of all cores to reduce the system energy consumption. The lowest processing frequency for the application to complete in time can be found as $f = \frac{W}{n \cdot T} = \delta$.
Recall that there is a system-wide energy-efficient frequency $f_{ee}$, which limits the lowest frequency to execute an application (see Subsection 3.3). Depending on the value of $f = \delta$, there are two cases:
Case 1 ($f \ge f_{ee}$). In this case, there is enough workload and we show that all $n$ cores should be utilized to minimize the system energy consumption (and thus to maximize energy savings).
When all cores are utilized to execute the application at the scaled frequency $f$, we can pack the execution on the cores one after another as shown in Fig. 3(a). It can also be seen as executing the application on a single core for time $n \cdot T$. The total energy consumption will be $E(f, n) = P_s \cdot T + (\beta \cdot P_d^{max} + C_{ef} \cdot f^m) \cdot n \cdot T$.
Suppose that $k$ ($< n$) cores are utilized to execute the application. The scaled frequency will be $f' = \frac{W}{k \cdot T} > f$. Such an execution can also be seen as executing the workload on one core for time $k \cdot T$ (as shown in Fig. 3(b)). Here, the total energy consumption will be $E(f', k) = P_s \cdot T + (\beta \cdot P_d^{max} + C_{ef} \cdot f'^m) \cdot k \cdot T$.
Note that the idle cores are put to sleep for better energy savings. Due to the convexity of the system power function and the fact that $f' > f \ge f_{ee}$, we have $E(f', k) > E(f, n)$. Therefore, to minimize the energy consumption and maximize the energy savings, all cores should be utilized and execute the application at the scaled frequency $f$ within the interval $T$. That is, regardless of the block configuration, all blocks will be powered on and the same amount of energy savings will be obtained.
Case 2 ($f < f_{ee}$). In this case, to maximize the energy savings, the application should be executed at the energy-efficient frequency $f_{ee}$.
If all cores are utilized, the workload can be completed in time $t = \frac{W}{n \cdot f_{ee}} < T$. After that, all cores can be put to sleep (and thus all blocks can be powered off) to further save the energy consumption from the frequency-independent active power. Again, we pack the execution of the application on the cores one after another sequentially as shown in Fig. 4(a). Here, the overall system energy consumption can be calculated as $E(n, f_{ee}) = P_s \cdot T + (\beta \cdot P_d^{max} + C_{ef} \cdot f_{ee}^m) \cdot \frac{W}{f_{ee}}$.
If $k = \frac{W}{f_{ee} \cdot T}$ cores are utilized to execute the application (while the other cores are put to sleep for energy savings), we get the packed execution shown in Fig. 4(b). Note that using fewer than $k$ cores will force them to execute at a frequency higher than $f_{ee}$, which is not energy efficient. Here, the total execution time in the gray area will be $\frac{W}{f_{ee}}$, and the energy consumption will be $E(k, f_{ee}) = P_s \cdot T + (\beta \cdot P_d^{max} + C_{ef} \cdot f_{ee}^m) \cdot \frac{W}{f_{ee}}$, which is the same as that of executing the workload on $n$ cores.
That is, for the case of $f = \delta < f_{ee}$, by enforcing the execution of the workload at $f_{ee}$ (i.e., with no fewer than $k = \frac{W}{f_{ee} \cdot T}$ cores), the same minimum energy will be consumed regardless of the block configuration.
Therefore, for embarrassingly parallel applications to be executed on a block-partitioned multicore system, the same energy saving can be obtained regardless of how the cores are grouped into different blocks.
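The two cases of the proof can be checked numerically with a small sketch. It follows the idealized packed view of the proof, i.e., only the cores that are used contribute frequency-independent power and they are put to sleep once the workload completes; the static energy $P_s \cdot T$ is omitted since it is identical in all cases. The function name and the example values ($n = 8$, $T = 1$, $\beta = 0.5$) are assumptions chosen for illustration.

```python
def active_energy(workload, cores, period, beta, c_ef=1.0, m=3.0, f_max=1.0):
    """Active energy when `cores` cores share the workload evenly.

    Idealized packed view: only the used cores draw frequency-independent
    power, and they sleep as soon as the workload completes.
    """
    p_d_max = c_ef * f_max ** m
    f_ee = f_max * (beta / (m - 1.0)) ** (1.0 / m)  # system-wide f_ee
    f_required = workload / (cores * period)
    if f_required > f_max:
        raise ValueError("workload cannot be completed in time")
    f = max(f_required, f_ee)  # never run below the energy-efficient frequency
    busy_time_per_core = workload / (cores * f)
    return cores * busy_time_per_core * (beta * p_d_max + c_ef * f ** m)


if __name__ == "__main__":
    # Case 1: f = 0.8 >= f_ee, so using fewer cores costs more energy.
    print(active_energy(6.4, 8, 1.0, 0.5), active_energy(6.4, 7, 1.0, 0.5))
    # Case 2: f = 0.25 < f_ee, so 4 to 8 cores all yield the same energy.
    print([round(active_energy(2.0, k, 1.0, 0.5), 6) for k in range(3, 9)])
```

In the second case, only the run with 3 cores differs, because 3 cores would have to run above $f_{ee}$ to finish in time.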
Applications with Limited Degree of Parallelism
Due to various data and control dependencies, most applications have only a limited degree of parallelism. For applications with a limited degree of parallelism $DP \le n$, where $n$ is the number of available cores, we discuss in this subsection how different block configurations affect system energy efficiency. Here, the maximum number of cores that can be utilized will be $DP$, the degree of parallelism of the application. When such an application with workload $W$ is processed on $DP$ cores, the minimum amount of time needed will be $t_{min} = \frac{W}{DP \cdot f_{max}}$. That is, we assume that the workload of the application can be evenly distributed among the deployed cores.
If $t_{min} > T$, it will be impossible to process the application within the given time interval $T$. Thus, we consider in this work the case with $t_{min} \le T$. When the workload $W$ is low, we may utilize fewer cores and/or scale down their processing frequencies for the maximum energy savings. However, because the cores should not run below the energy-efficient frequency, utilizing fewer than $DP$ cores does not yield more energy savings, as stated in the following theorem. The reasoning is similar to that for Theorem 1 and is omitted for brevity.
Theorem 2. For an application with a limited degree of parallelism $DP \le n$, if the application can be processed in time on $DP$ cores (i.e., $\frac{W}{DP \cdot f_{max}} \le T$), utilizing $DP$ cores always leads to the minimum energy consumption for processing the application, regardless of its workload $W$.
Therefore, in what follows, for applications with degree of parallelism $DP$, we will execute them with $DP$ cores and scale down their processing frequency accordingly to $f = \frac{t_{min}}{T} \cdot f_{max}$ for energy savings. However, for different block configurations, the flexibility to choose exactly $DP$ cores is different, which will lead to different amounts of energy consumption.
From previous discussions, we know that it is always possible to obtain exactly $DP$ ($\le n$) cores by selecting appropriate blocks in the buddy-asymmetric configuration. Moreover, for the individual block configuration, where each block consists of only one core, we can select $DP$ blocks to get the needed $DP$ cores. For other symmetric block configurations where each block has $x$ ($\le n$) cores, the number of blocks that should be selected to provide $DP$ cores will be $\lceil \frac{DP}{x} \rceil$. Note that, because some symmetric block configurations cannot provide exactly $DP$ cores, for applications with extremely low workload $W$ it is possible to use fewer cores (and thus fewer blocks that exactly provide those cores) for better energy savings; this case is not considered in this paper for simplicity.
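The block selection discussed above can be sketched as follows. The function names are hypothetical. In particular, the buddy-asymmetric block sizes are assumed here to be 1, 1, 2, 4, ..., $n/2$ for $n$ a power of two, which gives $\log_2 n + 1$ blocks; this layout is an assumption of the sketch and should be checked against the definition of the configuration given earlier in the paper.

```python
import math


def symmetric_blocks_needed(dp, cores_per_block):
    """Number of equal-sized blocks that must be powered to provide dp cores."""
    return math.ceil(dp / cores_per_block)


def buddy_block_sizes(n):
    """ASSUMED buddy-asymmetric layout: sizes 1, 1, 2, 4, ..., n/2."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    sizes = [1]
    size = 1
    while sum(sizes) < n:
        sizes.append(size)
        size *= 2
    return sizes


def select_buddy_blocks(dp, n):
    """Greedily pick blocks (largest first) whose sizes sum to exactly dp."""
    if not 1 <= dp <= n:
        raise ValueError("need 1 <= dp <= n")
    chosen, remaining = [], dp
    for size in sorted(buddy_block_sizes(n), reverse=True):
        if size <= remaining:
            chosen.append(size)
            remaining -= size
    return chosen
```

Under this assumed layout, an application with $DP = 19$ on 64 cores uses blocks of 16, 2 and 1 cores, whereas a configuration with 4 cores per block must power 5 blocks (20 cores).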
For applications with different degrees of parallelism, we present in what follows the analysis results on the normalized energy consumption of a system with different block configurations, where the energy consumption of the system with a common block (i.e., all cores on the same block) is used as the baseline for comparison. Here, we consider a system with a 64-core processor, which can be configured as Symm-$x$ ($x = 2^k$ and $k = 0, \ldots, 6$), i.e., with $x$ cores per block, or as the buddy-asymmetric block configuration. Focusing on the system active power, the following parameters for the power model are used in the analysis: $P_s = 0.01$, $C_{ef} = 1$ and $m = 3$.
The system load of the application is assumed to be $\delta = 0.25$, which indicates that the minimum degree of parallelism should be $DP_{min} = 16$. That is, when 16 cores are utilized to process the application at the maximum frequency $f_{max}$, the application can finish its execution just in time. Assuming that the application has different degrees of parallelism $DP$ ($16 \le DP \le 64$), Figs. 5(a) and 5(b) show the normalized energy consumption of the system for $\beta = 0.1$ and $\beta = 0.5$ (i.e., different frequency-independent power), respectively. In addition to even values of $DP$ (e.g., 16, 24 and 32), odd values of $DP$ (e.g., 19, 27 and 35) are also considered.
Note that, for some values of $DP$ (e.g., $DP$ = 16, 32 and 64), in addition to the individual and buddy-asymmetric block configurations, most other configurations (such as Symm-4, Symm-8 and Symm-16) can also provide exactly $DP$ cores. For such scenarios, from the figures, we can see that those configurations lead to the same minimum energy consumption, which confirms our conclusion in Theorem 2. However, when the application has other degrees of parallelism (i.e., other values of $DP$), only the buddy-asymmetric block configuration can always provide exactly $DP$ cores and achieve the minimum energy consumption (in addition to the individual block configuration, which is not shown in the figures). Moreover, for systems with higher frequency-independent active power (e.g., $\beta = 0.5$), as shown in Fig. 5(b), configurations that have fewer cores on each block can generally achieve more energy savings than the common block configuration. The reason is that such configurations can power off blocks for smaller values of $DP$, which results in more energy savings from the frequency-independent active power.
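The trend described above can be reproduced qualitatively with a simplified sketch. It is not the analysis used to produce the figures: it assumes continuous frequencies, always uses exactly $DP$ cores, powers $\lceil DP/x \rceil$ blocks of $x$ cores, and clips the frequency between the required frequency and $f_{max}$ around the block-level energy-efficient frequency. All function names are hypothetical.

```python
import math


def config_energy(dp, cores_per_block, workload, period, beta,
                  p_s=0.01, c_ef=1.0, m=3.0, f_max=1.0):
    """Energy of a Symm-x configuration that runs the workload on dp cores."""
    p_d_max = c_ef * f_max ** m
    # Cores that must stay powered: whole blocks covering the dp active cores.
    powered = math.ceil(dp / cores_per_block) * cores_per_block
    f_required = workload / (dp * period)
    if f_required > f_max:
        raise ValueError("workload cannot be completed in time on dp cores")
    # Energy-efficient frequency with `powered` cores on and dp cores active.
    f_ee = f_max * (beta * powered / ((m - 1.0) * dp)) ** (1.0 / m)
    f = min(f_max, max(f_required, f_ee))
    busy = workload / (dp * f)  # blocks are powered off afterwards
    return p_s * period + busy * (powered * beta * p_d_max + dp * c_ef * f ** m)


def normalized_energy(dp, cores_per_block, n=64, load=0.25, beta=0.1, period=1.0):
    """Energy relative to the common block configuration (all n cores in one block)."""
    workload = load * n * period  # normalized f_max = 1
    return (config_energy(dp, cores_per_block, workload, period, beta)
            / config_energy(dp, n, workload, period, beta))


if __name__ == "__main__":
    for dp in (16, 19, 32, 35, 64):
        row = {x: round(normalized_energy(dp, x, beta=0.5), 3) for x in (1, 4, 16, 64)}
        print(dp, row)
```

Here Symm-1 corresponds to the individual block configuration and Symm-64 to the common block baseline; configurations that can provide exactly $DP$ cores coincide with Symm-1.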
Evaluations and Discussions
In addition to the analysis results discussed in Subsection 4.2, extensive simulations have also been conducted to evaluate and validate the energy-efficiency of different block configurations. For such a purpose, we have built a discrete-time event simulator in C. For a given (synthetic or real-life) application with the known workload and degree of parallelism, the simulator will select the optimal number of blocks and appropriate scaled frequency that lead to the minimum energy consumption under different block configurations.
Moreover, as in the analysis, the following parameters of the system power model are adopted: $P_s = 0.01$, $C_{ef} = 1$ and $m = 3$. Note that similar parameters have been utilized in other work and can closely model systems with modern processors [41] [42]. Also, we assume that there are five normalized frequency levels for the cores considered, namely {0.2, 0.4, 0.6, 0.8, 1.0}. Furthermore, as in the analysis, normalized energy consumption is reported and the energy consumption under the common block configuration (where all cores form a single block) is used as the baseline.
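With discrete frequency levels, the frequency actually chosen for a block has to be one of the available levels. The following sketch shows one plausible way to make that choice, namely picking, among the levels that meet the timing requirement, the one that minimizes the block energy per unit of work. This is an assumption for illustration and not necessarily the selection rule implemented in the simulator; the function name and its parameters are hypothetical.

```python
FREQ_LEVELS = (0.2, 0.4, 0.6, 0.8, 1.0)  # normalized levels from the text


def best_level(required, powered, active, beta,
               levels=FREQ_LEVELS, c_ef=1.0, m=3.0):
    """Pick the feasible level with the lowest block energy per unit of work.

    required: lowest frequency that still meets the deadline
    powered:  number of cores on powered blocks (pay frequency-independent power)
    active:   number of cores actually executing (pay frequency-dependent power)
    """
    feasible = [f for f in levels if f >= required]
    if not feasible:
        raise ValueError("no frequency level meets the timing requirement")
    p_d_max = c_ef * max(levels) ** m

    def energy_per_unit_work(f):
        # One unit of per-core work takes time 1/f at frequency f.
        return (powered * beta * p_d_max + active * c_ef * f ** m) / f

    return min(feasible, key=energy_per_unit_work)
```

For a fully active 8-core block with $\beta = 0.5$, the continuous energy-efficient frequency is about 0.63, and the sketch selects the level 0.6 when the timing requirement allows it.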
Synthetic Workload
First, we consider a set of synthetic applications. To generate the workload for a synthetic application, we define $DP_{min}$ and $DP_{max}$ as the minimum and maximum degrees of parallelism, respectively. We assume that the workload will be executed on a 64-core system and the time period is $T = 1000$ time units. For the workload within each time period, its degree of parallelism $DP$ is first randomly generated between $DP_{min}$ and $DP_{max}$. Then, the workload $W$ is generated within the range $[1, DP \cdot T]$ following the uniform distribution. That is, on average, the scaled frequency of the deployed cores will be $f = 0.5$. For each point in the results, we generate the workload for 1,000,000 time periods and the average result is reported.
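The workload generation described above can be sketched as follows. The function name is hypothetical, and two details are assumptions: the degree of parallelism is drawn as a uniformly distributed integer, and the workload is drawn as a continuous value.

```python
import random


def generate_period(dp_min, dp_max, period=1000, rng=random):
    """Draw the degree of parallelism and the workload for one time period.

    Assumptions: DP is a uniform integer in [dp_min, dp_max] and the
    workload W is a uniform real number in [1, DP * period].
    """
    dp = rng.randint(dp_min, dp_max)
    workload = rng.uniform(1, dp * period)
    return dp, workload


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that runs are repeatable
    periods = [generate_period(1, 64, 1000, rng) for _ in range(100000)]
    avg_f = sum(w / (dp * 1000) for dp, w in periods) / len(periods)
    print(f"average required frequency: {avg_f:.3f}")
```

The average required frequency $\frac{W}{DP \cdot T}$ over many periods is close to 0.5, in line with the statement above.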
When the minimum degree of parallelism of the application is fixed as $DP_{min} = 1$, Fig. 6(a) first shows the normalized energy consumption for different block configurations when we vary the maximum degree of parallelism $DP_{max}$. Here, we have $\beta = 0.5$ (similar results have been obtained for $\beta = 0.1$, which are not shown in the paper). From the figure, we can see that, when the degree of parallelism in the workload is very limited (i.e., smaller values of $DP_{max}$), the energy savings for configurations with larger blocks (e.g., 32 cores per block) are rather limited. The reason is that at least one block must be selected, which limits the energy savings from the frequency-independent active power. However, for configurations with smaller blocks (e.g., 4 cores per block), significant energy savings can be obtained compared to the common block configuration, as the unused blocks can be effectively powered off to save the energy consumption from the frequency-independent active power. When $DP_{max}$ becomes larger, the energy savings decrease, as more cores are needed to process the workload and the number of blocks that can be powered off becomes smaller. Again, due to its flexibility to support various workloads, the buddy-asymmetric block configuration always leads to the maximum energy savings, which are essentially the same as those of the individual block configuration (which is not shown in the figure). To illustrate the variance in the normalized energy consumption, Fig. 6(c) shows the standard deviation of the normalized energy consumption for the Symm-16 configuration (i.e., 16 cores per block). Note that, for smaller values of $DP_{max}$, the variation in the degree of parallelism of the workload is limited. Thus, the difference in the required number of cores (and blocks) is small, which in turn leads to rather stable normalized energy consumption for the Symm-16 configuration and a smaller standard deviation. However, for larger values of $DP_{max}$, the workload has a large variance in its degree of parallelism across periods. The number of cores (and blocks) required in each period therefore differs more as well, which leads to widely varying normalized energy consumption and larger standard deviations.
When the maximum degree of parallelism is fixed as $DP_{max} = 64$, Figs. 6(b) and 6(d) further show the normalized energy consumption and its variation for different block configurations with varying $DP_{min}$. Following the same reasoning, as the degree of parallelism increases, less energy can be saved for the different block configurations. However, as $DP_{min}$ increases, the variance in the degree of parallelism of the workload becomes smaller, which results in a smaller standard deviation of the normalized energy consumption. When $DP_{min} = 64$, the workload always has the same degree of parallelism (which is 64) and all cores will be utilized, which results in the same amount of energy consumption for all block configurations, and the standard deviation will be zero.
Automated Target Recognition (ATR)
In this subsection, we consider a real-life application: automated target recognition (ATR). ATR has been widely used (e.g., in military systems such as missiles) and usually requires multiple processing units to achieve real-time processing [56]. The normal steps of ATR are as follows: first, ATR searches for regions of interest (ROIs) in an image frame; then, if ROIs are detected, each ROI is compared against some specific templates (e.g., tanks or civilian vehicles) for a match. Therefore, it is composed of a prescreen phase, a detection phase, and a template comparison phase, as shown in Fig. 7(a). Note that each phase of ATR has a different degree of parallelism, which depends on the number of ROIs detected and the number of templates that need to be compared against.
For our trace data, we have instrumented the ATR application to record the execution time of each parallel section and processed 179 consecutive frames on a Pentium-III 500 MHz with 128 MB memory. Fig. 7(b) shows the statistical execution time information for the different tasks in ATR. Here, the maximum number of ROIs detected in one image frame is 8 and there are 3 different templates to be compared against. Therefore, the maximum degrees of parallelism are 8 and 24 in the detection phase and the template comparison phase, respectively. We use a 32-core system to evaluate the energy efficiency of different block configurations for the ATR application. Note that, on a 32-core system, the processing of one image frame can take up to $t_{max} = 2.621$ ms, when every phase takes its maximum execution time. In the simulations, we assume that the time period $T = \gamma \cdot t_{max}$ to process one frame varies from $1.2 \cdot t_{max}$ to $2.4 \cdot t_{max}$, which emulates different system loads and allows the cores to run at different scaled frequencies. Within each frame, all deployed cores are scaled down to an initial frequency of $f = \frac{t_{max}}{T}$. Note that uniformly scaling down the execution of all phases with different parallelism is not optimal for energy savings [44]. However, exploring the optimal slack/time allocation to the phases with different degrees of parallelism for maximizing energy savings is orthogonal to the evaluation of energy efficiency for different block configurations.
Note that the actual execution time of the prescreen phase is usually a fraction of its worst-case execution time. Moreover, the degrees of parallelism of the second and third phases are usually smaller than the maximum possible degree of parallelism. Therefore, within the processing of each frame, after the prescreen phase, we further slow down the second and third phases and select an appropriate number of cores/blocks to achieve more energy savings, since the degrees of parallelism of the second and third phases are known at that moment. Fig. 8 shows the results based on the trace data of the ATR application. Although the maximum degrees of parallelism for the second and third phases are 8 and 24, respectively, the number of ROIs in each frame (i.e., the degree of parallelism in the second phase) is usually only 2 to 4. If all 32 cores form a common block, the frequency-independent power of all cores will be consumed. For configurations that have smaller block sizes, unused blocks can be effectively powered off to save substantial energy from the frequency-independent power. For instance, when $\beta = 0.1$, with only 2 blocks, each of which has 16 cores, the energy consumption can be reduced by up to 30% compared to the common block configuration. The energy savings can be more significant for the larger value $\beta = 0.5$ (i.e., higher frequency-independent power), as shown in Fig. 8(b). Moreover, the results further confirm that the buddy-asymmetric block configuration can achieve the same level of energy efficiency as the individual block configuration. When the amount of static slack increases, the cores can generally run at a lower frequency and less frequency-dependent active power is consumed. However, with higher frequency-independent power (e.g., $\beta = 0.5$), the energy savings from powering off unused blocks dominate, which leads to almost the same energy consumption for different amounts of slack.
Conclusion
Energy management has become an important research area in the last decade. As an energy-efficient architecture, multicore has been widely adopted. However, with the number of cores on a single chip continuing to increase, it has been a grand challenge to effectively manage the energy efficiency of multicore-based systems. In this paper, based on voltage island and dynamic voltage and frequency scaling (DVFS) techniques, we investigated the energy efficiency of block-partitioned multicore processors, where cores are grouped into blocks and each block has a DVFS-enabled power supply. Depending on the number of cores on each block, we studied both symmetric and asymmetric block configurations, which contain the same and different numbers of cores on each block, respectively. In particular, a buddy-asymmetric block configuration was studied for its flexibility to efficiently support different system workloads. The number of required blocks (i.e., power supplies) in the buddy-asymmetric block configuration is logarithmically related to the number of cores on the chip.
For such block-partitioned multicore systems, we developed a system-level power model that can effectively support various levels of power management techniques and derived both block- and system-level energy-efficient frequencies. Based on the power model, we proved that, for embarrassingly parallel applications, having all cores on a single block can achieve the same energy savings as the individual block configuration (where each core forms a single block and has its own power supply). However, for applications with limited degrees of parallelism, we showed the superiority of the buddy-asymmetric block configuration, in that it can achieve the same amount of energy savings as the individual block configuration. The energy efficiency of block-partitioned multicore systems was further evaluated through extensive simulations with both synthetic and real-life applications.
