Several studies and recent real-world designs have promoted sharing of underutilized resources between cores in a multicore processor to achieve better performance/power. It has been argued that when utilization of such resources is low, sharing has a negligible impact on performance while offering considerable area and power benefits. In this article, we investigate the performance and performance/watt implications of sharing large and underutilized resources between pairs of cores in a multicore. We first study sharing of the entire floating-point datapath (including reservation stations and execution units) by two cores, similar to AMD's Bulldozer. We find that while this architecture results in power savings for certain workload combinations, it also results in significant performance loss of up to 28%. Next, we study an alternative sharing architecture where only the floating-point execution units are shared, while the individual cores retain their reservation stations. This reduces the highest performance loss to 14%. We then extend the study to include sharing of other large execution units that are used infrequently, namely, the integer multiply and divide units. Subsequently, we analyze the impact of sharing hardware resources in Simultaneously Multithreaded (SMT) processors where multiple threads run concurrently on the same core. We observe that sharing improves performance/watt at a negligible performance cost only if the shared units have high throughput. Sharing low-throughput units reduces both performance and performance/watt. To increase the throughput of the shared units, we propose the use of Dynamic Voltage and Frequency Boosting (DVFB) of only the shared units that can be placed on a separate voltage island. Our results indicate that the use of DVFB improves both performance and performance/watt by as much as 22% and 10%, respectively.
INTRODUCTION
Several studies have promoted sharing of large but underutilized resources between cores in a multicore processor [Dolbeau and Seznec 2002; Kumar et al. 2004; Butler et al. 2011] to reduce silicon area at a marginal loss of performance. For example, AMD in its Bulldozer architecture [Butler et al. 2011] has implemented sharing of the entire floating-point unit, including reservation stations and execution units. Several research publications [Dolbeau and Seznec 2002; Kumar et al. 2004] go beyond FP units and suggest sharing of caches, crossbars, branch predictors, and long-latency units. Most previous work only explores the performance impact of such sharing, leaving the following questions unanswered:
(1) What is the impact of sharing on performance/power? While sharing clearly results in power savings, for certain workloads, the performance loss may be too large.
(2) What are the most important parameters affecting performance and performance/power in sharing? We show that the latency and throughput of the shared resources are the dominant determinants of performance and performance/power, while most previous studies paid limited attention to these.
(3) How does sharing of resources play out for Big cores versus Small cores? Mainstream computing cores can be broadly classified into performance efficient (Big cores) and power efficient (Small cores). It is thus necessary to study the impact of sharing resources in both such architectures.
(4) What is the impact of sharing in Simultaneously Multithreaded (SMT) processors? In particular, does sharing in SMT make performance or performance/power better or worse? Given that most mainstream cores are SMT capable, studying the impact of increased resource utilization due to sharing is important.
In this article, we investigate the performance and performance/watt implications of sharing large and underutilized resources between a pair of cores in a multicore processor. We first study sharing of the entire floating-point datapath by two cores, similar to AMD's Bulldozer [Butler et al. 2011], where the issue queue (ISQ) and the FP execution units are shared. Using a combination of workloads from various benchmarks, we analyze both the performance and performance/watt when compared to the baseline architecture that does not involve sharing. Our findings show that while sharing results in considerable power savings, the performance penalty may be high (up to 28%) for certain workload combinations.
To mitigate the impact on performance while still retaining some of the power benefits of sharing, sharing should be limited to the underutilized execution units. For most workloads, FP instructions are not frequently encountered. Hence, we first explore sharing of just the FP execution units while the individual cores retain their reservation stations. This modification yields higher performance compared to the previous scheme. Still, a worst-case performance loss of 14% is observed. Integer divide and multiply instructions are also encountered infrequently. Therefore, we extend our study to include the corresponding units. We find that sharing the integer divide and multiply units has only a small impact on both performance and performance/watt. An overview of the resource sharing options explored in this article is shown in Figure 1.
The utilization of the shared units depends on the width of the fetch and execution path. Accordingly, we target cores at opposite ends of the power/performance spectrum. On the higher end of the performance scale, we consider a superscalar processor analogous in resources to Intel's Nehalem and AMD's K10 architectures (Big core). At the lower end of the power scale, we consider a processor similar in resources to Intel's Atom and AMD's Bobcat architectures (Small core). Our study includes both single-threaded and SMT processor architectures. We also analyze the sensitivity to communication latency between the cores and the shared units. Our results show that while architectures that share execution units do provide power benefits at a negligible performance penalty (∼5% on average), such benefits hold only when the shared units have low latency and are highly pipelined. Performance and performance/watt losses are observed for workloads that exhibit high contention for the shared execution units. To reduce the performance loss due to contention, we propose increasing the throughput of the shared resources via Dynamic Voltage and Frequency Boosting (DVFB) based on the observed occupancy rate. Our results show that such dynamic boosting not only overcomes losses due to contention but also results in significant increases in both performance (up to 13%) and performance/watt (up to 14%) while achieving considerable savings in area (∼7%-10% per core).
The following are the key contributions of this article:
(1) We analyze the performance and performance/watt implications of three resource sharing alternatives for a dual-core processor.
(2) We study the performance and performance/watt implications of resource sharing in SMT cores.
(3) We analyze the sensitivity of resource sharing architectures to the latency and performance of the shared resources.
(4) We show that while execution unit sharing has a negligible impact on performance and a positive impact on performance/watt for most benchmark combinations, there are cases where resource contention results in a penalty as high as 22%.
(5) We present a DVFB scheme for the shared resources to mitigate the impact of resource contention that not only compensates for the loss but also increases the performance of most workload combinations.
(6) Finally, we describe a novel hardware-based feedback control mechanism for DVFB that automates the dynamic control process.
The rest of the article is organized as follows. Recent work on shared resource architectures is reviewed in Section 2. An overview of the studied architecture is presented in Section 3. The experimental setup is described in Section 4, followed by the results on static execution unit sharing in Section 5. In Section 6, performance and power analysis of sharing resources in SMT processors is presented. Results on the proposed DVFB are presented in Section 7. Practical implementation of the dynamic boosting mechanism is presented in Section 8. Estimated area savings possible due to resource sharing are outlined in Section 9. Finally, we present conclusions in Section 10.
RELATED WORK
The idea of sharing resources for performance or performance/watt in a multicore has seen several manifestations. SMT [Tullsen et al. 1995; Levy et al. 1996] was introduced to improve the utilization of resources in microprocessors. In SMT, multiple threads run on the same core and share and compete for core resources. Thus, dynamic resource sharing occurs naturally in SMT processors. Dolbeau and Seznec [2002] explore intermediate design points between the CMP and SMT architectures, where sharing of the caches, branch predictors, and long-latency execution units is explored. A similar study was presented in Kumar et al. [2004], where the caches, crossbars, and floating-point units were shared. Significant area savings at a minor loss of performance were reported. Both of these schemes focus on performance and do not consider performance/watt. In addition, neither the impact of the shared resource access latency nor the effects on SMT processors were studied. Watanabe et al. [2010] explore flexible sharing of a pool of "execution engines" among various processor cores. By ensuring that a producer and its immediate consumers are sent to the same engine, efficient usage of the shared units was made possible. Still, each engine requires a queue and other data to keep track of producers and consumers, which results in a complex design. Borodin et al. [2011] propose the sharing of functional units across cores in a 3D stacked die for online testing and/or performance improvement. A similar approach to 3D resource sharing was proposed in Homayoun et al. [2012], where the Reorder Buffer (ROB), register file, instruction queue, and load/store queues were shared.
Dynamic exchange of execution units between pairs of cores was investigated in Rodrigues et al. [2011, 2013]. There, depending on the current workload characteristics, the cores may exchange execution units to maximize performance/watt. The major advantage of such an architecture is that resource contention between the two cores is avoided, but the design of the two cores is complicated. Furthermore, this scheme incurs the hardware and power overheads of two sets of execution units compared to the single set in our scheme.
The first resource sharing architecture we study is similar to AMD's Bulldozer design [Butler et al. 2011] . In it, the fetch, decode, and entire FP execution (reservation stations and execution units) are shared between the two cores in a dual-core processor. In our study, we also analyze a design that involves the sharing of the FP execution only.
SHARED RESOURCE MULTICORE ARCHITECTURE
We now describe the target of our study: the shared resource multicore architecture. Hardware modifications necessary to support such an architecture are also presented. A high-level view of the studied architectures is shown in Figure 1 . We consider the following three resource sharing alternatives.
Sharing the FP Issue Queue (ISQ) and Execution Unit (S_FP_QX)
Here, the FP ISQ and FP execution units are shared between two cores as depicted in Figure 1 (a). This architecture is similar to AMD's Bulldozer, but note that the Bulldozer design also shares the fetch and decode units. Sharing leads to contention for resources, and the first point of contention in the S_FP_QX scheme is the FP ISQ. Whenever FP instructions are ready to be scheduled, the control logic first checks to see if there is a slot available in the shared ISQ. Since the ISQ is shared, the number of entries available per core is reduced. Hence, whenever both cores sharing the ISQ run FP intensive applications, the ISQ is expected to become a bottleneck, which may lead to pipeline stalls and performance loss. Another source of stalls is the shared execution units. Just like the ISQ, the effective number of execution units available is reduced in the dual-core architecture. Hence, a higher number of stalls is expected when FP-intensive applications are run on the cores that share FP units.
Sharing the FP Execution Unit Only (S_FP_X)
This scheme supports sharing of only the FP execution units as shown in Figure 1 (b) . Unlike the previous case, the only source of contention here is the availability of the FP execution units. Hence, we expect a lower performance loss but also lower power savings compared to the previous scheme.
Sharing the FP Execution Units as Well as the Integer Divide and Multiply Units (S_FP_INT)
Here, in addition to the FP execution unit, the integer divide and multiply units are also shared (see Figure 1(c)). The number of stalls for this scheme is expected to be higher than for the S_FP_X architecture, but greater power savings are expected. Since resources are shared in all three architectures, there is a need for a centralized control mechanism that grants access to the requesting core. This is accomplished by means of an arbiter, shown in Figure 1. The arbiter accepts requests and, depending on the availability of the shared resource, grants access. Note that for all three schemes, accesses to the shared execution units are independent, and hence multiple requests may be sent to them at the same time. Once execution is complete, the result must be forwarded to the core that generated the request. This is accomplished by another arbiter that forwards the result to its rightful owner. The arbiter design is fairly straightforward, and we do not provide its full implementation details.
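As an illustration of one such design, the following minimal C++ sketch (ours; the article leaves the arbiter unspecified, so the class and method names are hypothetical) grants a shared unit to one of two requesting cores per cycle, rotating priority on conflicts so that neither core is starved:

```cpp
#include <array>
#include <optional>

// Hypothetical request arbiter: each cycle, two cores may compete for a
// shared execution unit. Ties are broken round-robin to avoid starvation.
class SharedUnitArbiter {
public:
    // request[i] is true if core i wants the unit this cycle; unitBusy is
    // true if the unit cannot accept a new operation this cycle.
    // Returns the id of the granted core, if any.
    std::optional<int> arbitrate(const std::array<bool, 2>& request,
                                 bool unitBusy) {
        if (unitBusy) return std::nullopt;      // structural hazard: stall
        if (request[0] && request[1]) {         // conflict: rotate priority
            lastGranted_ = 1 - lastGranted_;
            return lastGranted_;
        }
        for (int core = 0; core < 2; ++core) {
            if (request[core]) { lastGranted_ = core; return core; }
        }
        return std::nullopt;                    // no requests this cycle
    }

private:
    int lastGranted_ = 0;  // most recently granted core (deprioritized next)
};
```

A second, symmetric arbiter would tag each operation with the id of the requesting core so that the result can be routed back to its rightful owner, as described above.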
EXPERIMENTAL SETUP
To evaluate the idea of sharing infrequently used execution units for a wide variety of architectures, we considered processor cores at the two ends of the performance/power spectrum, that is, a high-performance core (Big) and a low-power core (Small). These cores are representative of the Intel Nehalem/AMD K10 and the Intel Atom/AMD Bobcat architectures, respectively. In the rest of this article, we will refer to them as Big and Small.
In Tables I and II, we list the resource sizes and execution resource characteristics for the two core types. The parameters were inspired by commercial architectures [Fog 2012].
SESC was used for architectural performance simulation [Renau 2005]. We made significant modifications to the simulator to enable shared resource execution with arbitration. Power was estimated using Wattch [Brooks et al. 2000] and Cacti [Shivakumar et al. 2003]. In the experiments, we targeted 15 benchmarks: seven from the SPLASH-2 suite [Woo et al. 1995] (barnes, cholesky, fmm, lu, radix, raytrace, water) and eight from the SPEC 2000 benchmark suite [SPEC2000 2000]. These workloads were chosen for their instruction distribution and performance diversity. Several combinations of workloads were considered for the two cores running single threads. We also considered the case of SMT, where each core runs two threads from a set of four. Homogeneous workload combinations were created using multiple threads from the SPLASH-2 workloads. We also created heterogeneous workloads by combining threads from the SPEC 2000 suite. The created workload sets are summarized in Tables III and IV for the single-threaded and SMT experiments, respectively. We thus tried to evaluate the studied architectures over a broad spectrum of potential workloads. Each workload was run until the sum of the instructions retired on the two cores equaled 500 million instructions. The instruction distribution of each individual thread run is shown in Figure 2.
ANALYSIS OF RESOURCE SHARING IN SINGLE-THREADED PROCESSORS
We first present results for two cores that share resources according to the S_FP_QX, S_FP_X, and S_FP_INT schemes (described in Section 3), with each core executing a single thread from the workload combinations listed in Table III. The performance and performance/watt of the studied architectures relative to the one where no sharing takes place are presented. Sensitivity to the shared resource access latency is also analyzed. To compare the resource sharing architectures with the architecture that does not share, three speedup metrics are used, namely, weighted, geometric, and harmonic speedup. For the sake of brevity, only the results using the harmonic metric are presented. This metric also happens to be the most conservative of the three.
The harmonic speedup metric for performance over the $n$ threads of a workload is calculated as follows:

\[ \text{Harmonic speedup} = \frac{n}{\sum_{i=1}^{n} IPC_i^{baseline} / IPC_i^{shared}} \]
Here, baseline refers to the case where the cores do not share any unit. The performance/watt speedup/slowdown is calculated similarly.
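For concreteness, the following sketch (ours; the function and variable names are illustrative) computes the metric from per-thread IPC measurements:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Harmonic speedup over n threads: n / sum_i(IPC_baseline_i / IPC_shared_i).
// A value below 1.0 indicates a slowdown relative to the no-sharing baseline.
double harmonicSpeedup(const std::vector<double>& ipcShared,
                       const std::vector<double>& ipcBaseline) {
    assert(ipcShared.size() == ipcBaseline.size() && !ipcShared.empty());
    double sumInverseSpeedups = 0.0;
    for (std::size_t i = 0; i < ipcShared.size(); ++i) {
        // Each term is the reciprocal of that thread's individual speedup.
        sumInverseSpeedups += ipcBaseline[i] / ipcShared[i];
    }
    return static_cast<double>(ipcShared.size()) / sumInverseSpeedups;
}
```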
Sharing the FP ISQ and Execution Units (S_FP_QX)
5.1.1. Performance Analysis. The performance of the Big and Small cores in the S_FP_QX configuration relative to the nonsharing architecture is shown in Figure 3. Shared resource access latencies of zero, one, and two cycles were considered. A communication latency of zero cycles represents the ideal case where the design has been optimized to support sharing. It can be seen that even in this scenario, a significant performance loss is observed for both core types. Specifically, a worst-case performance penalty of 28% and 18% (workload cholesky_cholesky run on both cores) is observed for the Big and Small cores, respectively. This architecture shares the FP ISQ and the FP execution units; thus, two potential bottlenecks exist in the system, yielding a large performance penalty. Increasing the communication latency results in an even larger performance penalty, as expected. This clearly shows the sensitivity of such a resource sharing architecture to communication latency. On average, an ∼5% to 10% performance penalty is observed for both core types, which increases with access latency. These results show that when sharing resources between cores, special consideration must be given to the resource access latency. The workloads that do not experience a slowdown are the ones with little or no FP instructions in the mix (e.g., equake, art, gzip, gcc). Interestingly, the Small core does not suffer as much as the Big core with respect to performance. The Small core is moderately sized when compared to the Big core, and consequently, the bottleneck has a greater effect in the case of the Big core.

5.1.2. Performance/Watt Analysis. The performance/watt resulting from the sharing of the FP ISQ and the FP execution units (S_FP_QX) relative to the nonsharing architecture is shown in Figure 4 for both core types. It can be seen that performance/watt improvements are achieved for most workloads on both core types, especially for the ones with no FP instructions. In general, FP instructions are not as frequently encountered as integer ones, and hence, for a majority of the workloads, this architecture results in power savings. However, there are workloads where the performance/watt of the Big core degrades by as much as 10% (e.g., cholesky_cholesky) even with an access latency of zero cycles. This indicates that even though, in general, this architecture results in power savings, for workloads that compete for the shared resources, the performance/watt will degrade. On average, a 2.5% improvement for the Big core and a 3.5% improvement for the Small core were observed when the access latency was set to zero cycles. Increased latency reduces this improvement.
Even though the S_FP_QX architecture results in power savings in general, the experienced performance penalty can be very large (∼28%). This results in poor performance/watt and hence, we explored alternative sharing schemes to help mitigate the performance penalty.
Sharing Only the FP Units (S_FP_X)
5.2.1. Performance Analysis. The performance of the S_FP_X architecture relative to the one where each core has its own execution units is shown in Figure 5 for the Big and Small cores. For zero-cycle access latency, there is no notable performance penalty for the Big core. Even for cases where both threads highly utilize the shared units, no performance penalty was observed (e.g., cholesky_cholesky, radix_radix, flops_fbench). This is because the Big core has large and fast execution units that are fully pipelined, and unless contention takes place in the same cycle, no performance penalty will be experienced. This indicates that for a high-performance core, contention-related performance loss will rarely be a problem when the considered execution units are shared. The worst-case performance penalty has dropped to less than 1% for the Big core, which is a significant improvement when compared to the S_FP_QX architecture (∼28% performance loss in the worst case). This indicates that in the Big core, the major bottleneck is the FP ISQ. Increasing its size may help mitigate the performance penalty but may consume more power; such an analysis is, however, out of the scope of this article. With an increase in access latency, there is a notable drop in performance. Still, for small latencies (one to two cycles), the performance penalty is well within reasonable limits (within 5% even for an access latency of two cycles). Note that an access latency of zero to one cycles is realistic; see, for example, Dolbeau and Seznec [2002] and Kumar et al. [2004]. Hence, for cores such as the Big core, with small shared resource communication latencies, the performance loss is acceptable if FP execution units are shared between pairs of cores.
The results obtained for the Small core do show notable performance penalties, even for the ideal case of zero access latency. This is due to the nonpipelined and relatively higher-latency execution units present in the Small core (see Table II). Nonpipelined shared execution units have a greater chance of contention. For example, a nonpipelined multiplier with a latency of 10 cycles cannot accept a second request during those 10 cycles. The performance loss is worst for barnes_barnes and flops_fbench (13%-14%). In both cases, the workloads running on each core exhibit a significant proportion of FP instructions, and as a result, contention for the shared resources is very high. The average performance loss is about 8% for a two-cycle access latency. It is thus clear that for cores with nonpipelined and high-latency execution units, sharing may result in significant performance loss. When compared to the S_FP_QX architecture, the average performance loss for the Small core drops from the observed 7% (for a zero-cycle communication latency) to around 3%. Hence, this architecture certainly results in a lower performance penalty.
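The structural hazard responsible for this loss is easy to model. The sketch below (ours, illustrative) captures why a nonpipelined unit rejects new requests for its full latency, whereas a fully pipelined unit could accept a new operation every cycle:

```cpp
// Occupancy model of a nonpipelined shared execution unit (illustrative).
// A fully pipelined unit would accept one new operation per cycle
// regardless of its latency.
struct NonPipelinedUnit {
    int latency;           // e.g., 10 cycles for the Small core's multiplier
    long freeAtCycle = 0;  // first cycle at which a new request is accepted

    // Returns true if the request is accepted; false means the requesting
    // core must stall and retry, which is the contention measured above.
    bool tryIssue(long currentCycle) {
        if (currentCycle < freeAtCycle) return false;  // unit still busy
        freeAtCycle = currentCycle + latency;          // occupy the unit
        return true;
    }
};
```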
5.2.2. Performance/Watt Analysis. Sharing the large and infrequently used execution units results in static power savings. This is expected to improve performance/watt, especially for the cases where no notable performance penalty is observed. However, the power savings are not as large as those observed for the S_FP_QX architecture. The performance/watt results obtained for both core types are shown in Figure 6. We have already seen that for the Big core, there is no notable performance loss even for a communication latency of two cycles. A relative performance/watt greater than one was observed for the Big core with a communication latency of one cycle. We may conclude that for the Big core, sharing of large execution units results in performance/watt gains when considering realistic shared resource access latencies.
For the Small core, sharing results in significant performance loss for several workloads even under idealized conditions (access latency of zero cycles). As a result, performance/watt gains, if any, are very modest, with a few workloads experiencing a performance/watt loss. Still, on average, the relative performance/watt is greater than one for zero-cycle access latency. As with the Big core, increasing this latency beyond one cycle results in an overall performance/watt loss. It is important to note that apart from two workloads (barnes_barnes, flops_fbench), all other workloads show a small improvement in performance/watt. From Figure 5, it is observed that apart from those two workloads, there were also others, such as fmm_fmm and raytrace_raytrace, that showed performance loss but, when considering performance/watt, showed improvements over the baseline. Hence, execution unit sharing architectures do in general improve performance/watt.
Based on these results, we can conclude that for Big cores, sharing FP execution units results in almost no performance loss but can result in small performance/watt gains. In contrast, for Small cores, even though there is a small performance/watt gain for low shared resource access latencies (between the core and the shared units), performance and performance/watt losses observed for a few workload combinations make the sharing of FP execution units between such cores questionable. This architecture provides slightly lower performance/watt than the S_FP_QX architecture without considerable performance penalties, which is a significant advantage.
Extending the Sharing to Include INT Divide and Multiply Units (S_FP_INT)
Most prior work has explored the sharing of only the FP units between pairs of cores [Dolbeau and Seznec 2002; Kumar et al. 2004]. However, from Figure 2, it can be seen that apart from the workload lu_lu, no other workload contains a notable fraction of INT divide or multiply instructions. Thus, sharing these units in addition to the FP units is a natural extension; we call the resulting architecture the S_FP_INT sharing architecture. We analyzed such additional sharing, and the average results obtained over all workloads when run on each core type following the S_FP_X and S_FP_INT sharing schemes are plotted for performance and performance/watt in Figures 7(a) and 7(b), respectively. All results are shown relative to the architecture that does not share execution units. In general, it can be seen that for both core types, S_FP_X sharing is slightly better than S_FP_INT sharing with respect to performance, and the opposite trend is observed with respect to performance/watt. However, the differences are too small to prefer one architecture over the other. But since the INT divide and multiply units are relatively large, sharing them certainly yields area savings (detailed in Section 9). Hence, we conclude that S_FP_INT sharing enhances the benefits of the S_FP_X sharing architecture.
ANALYSIS OF SHARING IN SMT PROCESSORS
We now present results on the effect of sharing resources in SMT processors. In these experiments, each core runs two threads. The various workload combinations considered are shown in Table IV . For the sake of brevity, only average and minimum speedup over all the considered workloads for each of the three resource sharing architectures relative to the baseline (where no sharing is implemented) are presented.
Performance Analysis
The average and minimum performance achieved by the three resource sharing architectures relative to the one with no sharing is shown in Figure 8.

6.1.1. The S_FP_X and S_FP_INT Architectures. In general, we found that the architectures that only share execution units result in about the same level of performance for both the Big and Small core types. For the Big core, ignoring communication latency, a 1% performance loss is observed in the worst case and an even smaller penalty is seen on average. This result is similar to that observed when running only a single thread per core. This indicates that even when up to four threads compete for the execution resources of the Big core, limited performance penalty will be experienced, which is mainly attributed to the large and fully pipelined execution units.
For the Small core, a larger performance penalty was observed when compared to the Big core. A worst case of 22% performance loss was observed for the workload barnes+barnes_barnes+barnes, which constitutes an increase of 8% over the observed 14% when running the workload barnes_barnes in the earlier experiments. There were also some low IPC workloads such as raytrace+raytrace_raytrace+raytrace for which the performance penalty was smaller than that obtained while running raytrace_raytrace. For such workload combinations, stalls in the execution core mitigate the impact on performance of resource sharing. On average, a 3% performance penalty was observed for the Small core.
In summary, we find that even in SMT processors, sharing execution resources between cores is expected to result in a negligible performance penalty in Big cores and sometimes a notable performance penalty in Small cores.
6.1.2. S_FP_QX. From Figure 8, it is clear that the S_FP_QX architecture results in a larger performance penalty than S_FP_X and S_FP_INT for both core types. Ignoring access latency, an average performance loss of 4% and 5% and a worst-case loss of 22% and 25% were observed for the Big and Small cores, respectively. This performance loss increases with access latency, as expected.
On the Big core, in the single-threaded experiments, the workload cholesky_cholesky experienced the worst case of 28% performance loss. The loss was reduced to 16% when running the workload cholesky+cholesky_cholesky+cholesky in SMT mode. The reason for this drop in penalty is that in SMT mode, a system IPC of 0.35 was observed, which was a drop from the observed IPC of 0.5 in the single-threaded experiments. Thus, additional stalls due to resource sharing do not have a high impact on the performance. A worst-case performance loss of 25% was observed for the workload water+water_water+water. This constitutes an 8% increase in the observed 17% performance penalty when running the workload water_water. Hence, for the workload water, increasing the number of thread contexts per core results in an increased penalty for the Big core. The performance loss is higher by 4% on average when compared to the S_FP_X and S_FP_INT architectures for the Big core.
For the Small core, just as for the S_FP_X and S_FP_INT architectures, the worst case was observed for the workload barnes+barnes_barnes+barnes. Another workload that exhibited a significant (17%) performance penalty was radix+radix_radix+radix. No performance penalty was observed for the same workload when running on the S_FP_X and S_FP_INT architectures. This workload suffers mainly from stalls in acquiring reservation station slots on the Small core. Overall, the performance loss goes up by 2% on average when compared to the S_FP_X and S_FP_INT architectures for the Small core.
In summary, performance is expected to degrade for a few workloads in either of the sharing architectures. For the Big core, performance penalty is expected only in the S_FP_QX design. When compared to the experiments where only single threads were run on each core, performance penalty may sometimes be lower for SMT processors. In SMT mode, resource utilization is higher, and as a result, if the IPC is low, the performance penalty due to sharing is also low.
Performance/Watt Analysis
The performance/watt of the various resource sharing architectures relative to the one with no sharing is shown in Figure 9.

6.2.1. S_FP_X and S_FP_INT. For these architectures, little or no performance penalty was observed on the Big core. Consequently, the power savings that result from sharing resources lead to performance/watt gains. Such gains drop with an increase in the shared resource access latency. On average, performance/watt gains of 3.1% and 3.5% were observed for the S_FP_X and S_FP_INT designs, respectively, on the Big core. On the Small core, we observed a significant performance penalty for some workloads, with a worst case of an 8% loss for barnes+barnes_barnes+barnes. However, on average, small performance/watt gains of around 1.7% and 1.4% were observed for the S_FP_X and S_FP_INT architectures, respectively, for the Small cores. Note that the relative performance/watt does not drop below one for either configuration on both the Big and Small cores, even with a two-cycle access latency.

Fig. 9. Performance/watt of the Big and Small cores in the S_FP_QX, S_FP_X, and S_FP_INT configurations relative to a dual core that does not share resources, for various communication latencies. Two threads were run on each core.
6.2.2. S_FP_QX. In general, the performance loss of this architecture was larger than for the S_FP_X and S_FP_INT architectures. However, the power savings were far greater. Hence, even though a worst-case performance/watt loss of 8% was observed on the Big core, an average gain of 5% was achieved, with a maximum gain of 11% for the workload radix+radix_radix+radix.
A similar result was observed on the Small core, where an average performance/watt gain of 3.5% and a maximum gain of 7% were achieved for the workload raytrace+raytrace_raytrace+raytrace.
In summary, this architecture results in better performance/watt than the other two on average. However, certain workload combinations may suffer significantly.
MITIGATING THE PERFORMANCE IMPACT OF SHARING LOW-THROUGHPUT RESOURCES VIA DYNAMIC BOOSTING
We have seen that some workload combinations experience a significant performance loss due to resource sharing in Small cores. As indicated earlier, there are two reasons for this degradation: the first is contention for the shared resources, and the second is the access latency between the core and the shared resources. Performance loss due to contention can be mitigated if the shared resources run faster. This may be achieved by replacing the existing high-latency execution units with more powerful, lower-latency units [Dolbeau and Seznec 2002]. However, as was observed in Figures 3 and 5, the performance of most workloads does not degrade when sharing resources. Hence, increasing the strength of the execution units would result in power inefficiency for these workloads. Therefore, we propose the use of Dynamic Frequency Boosting (DFB) or DVFB where, depending on the characteristics of the currently executing workload, the voltage and/or frequency of only the shared execution units is increased. We only consider boosting of the shared execution units and not the shared ISQ in the case of the S_FP_QX configuration, as accelerating the ISQ is not expected to yield any benefit.

Selective boosting of the shared execution units is achieved by establishing Voltage and Frequency Islands (VFIs) [Lackey et al. 2002; Garg et al. 2009; Jang et al. 2010; Semeraro et al. 2002]. In these designs, two (or more) parts of the processor core may be operated at different voltages and/or frequencies. For example, Ghosh et al. [2010] make use of voltage-scalable hybrid arithmetic units for power benefits. Most previous work uses this concept for energy savings. Our objective is performance improvement of the shared resources only during periods of resource contention, which may potentially also improve performance/watt. Given that the shared execution units are already separated from the cores (see Figure 1), placing them in an island is relatively simple. We did not consider full-chip voltage and frequency boosting due to its inherent power inefficiency.

Performance boosting may be achieved by increasing the frequency of the shared units. Often, power is the limiting factor that governs operating frequency. The frequency may be increased as long as package thermal limits are not exceeded and circuit timing margins are not violated. Since the execution units are shared, increasing their operating frequency results in a much smaller power increase than full-chip boosting. Hence, if the circuits allow increasing the frequency of operation on demand, the implementation is simple. We call this mode the High-Frequency Mode (HFM). For some circuits, the voltage may also need to be increased to meet the timing requirements. We call this mode the High-Voltage and -Frequency Mode (HVFM); this mode is expected to incur a higher energy penalty. Note that these two modes are mutually exclusive for a given design and are analyzed here for completeness of the evaluation: either the circuit allows HFM, making HVFM unnecessary, or HVFM is the only option. Thus, in the shared resource VFI, three modes are considered: the Nominal Mode (NM) with nominal voltage and frequency, the HFM, and the HVFM. The voltage and frequency levels used for both core types in all three modes are shown in Table V. These values were obtained from Eyerman and Eeckhout [2011] and from data available on Intel's turbo boost technology.
These boosted modes can potentially mitigate the performance loss due to resource sharing. On the other hand, a power overhead is also expected. It is thus necessary to limit the use of these modes to only those instances when the shared resources are overwhelmed.
In order to model the HFMs in our experiments, the latency of the shared execution units was reduced proportionally to the increase in frequency. Latencies are set back to the nominal values when the system returns to the NM mode. Cycles are always measured in the units of the NM frequency. Hence, we continue to use performance/watt as the metric to measure relative speedup even though the shared resource island may switch between NM and HFM/HVFM.
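Concretely, a shared unit whose latency is $L_{NM}$ cycles at the nominal frequency is modeled with an effective latency of

\[ L_{boost} = \left\lceil L_{NM} \cdot \frac{f_{NM}}{f_{boost}} \right\rceil \]

NM cycles while the island runs at the boosted frequency $f_{boost}$ (our formalization of the scaling just described).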
We first present performance and performance/watt results when operating the Small core in the HFM/HVFM throughout the execution. A dynamic scheme to switch between operating modes is then presented. We do not explore boosting the performance of the Big core since the shared execution units in such architectures are not expected to be a bottleneck.
Static Voltage Frequency Scaling
For this study, the shared execution units always operate in the boosted mode (HFM/HVFM) irrespective of the workload characteristics. Such a scheme results in higher power dissipation but is an interesting case to study as a potential upper bound on the performance penalty mitigation possible by frequency boosting. The calculated harmonic performance and performance/watt speedups for all the considered workloads when executed on Small cores in SMT mode for the NM, HFM, and HVFM operating modes are shown in Figures 10 and 11, respectively. A shared resource communication latency of zero cycles was considered to get a representative picture without loss of generality.

Fig. 10. The performance of the three resource sharing designs of the Small core relative to the design that does not share resources, for various workloads when operated in the NM, HFM, and HVFM modes. A latency of zero cycles was assumed.

Fig. 11. The performance/watt of the three resource sharing designs of the Small core relative to the design that does not share resources, for various workloads when operated in the NM, HFM, and HVFM modes. A latency of zero cycles was assumed.
7.1.1. Performance Analysis. It can be seen that performance is significantly improved in the boosted modes (HFM/HVFM) for several workloads. In particular, the workloads barnes+barnes_barnes+barnes and radix+radix_radix+radix and any workload running flops and fbench show a considerable performance gain (7%-20%) in the boosted modes of operation. There are also several workloads, such as cholesky+cholesky_cholesky+cholesky, fmm+fmm_fmm+fmm, equake+art_gzip+ammp, and mcf+gcc_art+ammp, where no notable improvement is observed. There is no difference between the HFM and HVFM modes with respect to performance, as is evident from the figures. The boosted modes achieve an average improvement of 4% to 5% and a maximum improvement of 20% in performance over the NM mode. Clearly, from a performance standpoint, operating in the boosted mode is a good option.

7.1.2. Performance/Watt Analysis. With respect to performance/watt, there are workloads that benefit from the HFM/HVFM. For example, barnes+barnes_barnes+barnes and radix+radix_radix+radix show a 6% to 7% improvement in performance/watt. However, there are several workloads where the performance/watt in the NM mode is the highest, for example, cholesky+cholesky_cholesky+cholesky, fmm+fmm_fmm+fmm, and the workload combinations equake+art_gzip+ammp and mcf+gcc_art+ammp. These were the workloads where no notable performance improvement was observed (see Figure 10). Between the HFM and HVFM, the HVFM performs worse, which is expected: this mode requires a higher voltage and hence results in a larger power penalty than the HFM. These results clearly show that operating in the HFM or HVFM modes throughout the execution is not desirable with respect to performance/watt for several workloads. A dynamic scheme may yield better results.
Dynamic Frequency and Voltage-Frequency Boosting
A feedback control mechanism is needed in order to determine the best mode to operate in as a function of currently executing workload characteristics. To that end, we developed a simple hardware scheme to enable switching between the NM and HFM/HVFM modes. The shared resources are expected to become a bottleneck whenever the contention for any shared unit increases. Occupancy or utilization of the shared execution units can potentially provide a good estimate of whether the bottleneck exists. By monitoring these, we can decide when to switch between the NM and the boosted modes of execution.
Performance monitoring counters are available in most modern microprocessors [Contreras and Martonosi 2005; Singh et al. 2009]. For our purposes, we need as many counters as there are shared units to count the number of busy cycles for each execution unit. Whenever the occupancy of any shared unit exceeds an upper threshold, the boosted mode is enabled. Switching back to the NM takes place when the utilization of all shared units drops below a lower threshold. As the occupancy of the execution units changes over time, it is necessary to keep monitoring the utilization within small intervals. At the end of each interval, all the counters are reset to zero so that counting for the new interval may begin afresh. Furthermore, to avoid too frequent voltage and/or frequency changes, a switch is initiated only if the decision to switch was observed for at least 90% of the last HisD windows, where HisD is referred to as the history depth. For example, if we set HisD = 10, a switch in operating mode is effected only if the decision to switch was observed for at least nine of the 10 most recent windows. In the rest of this article, we refer to the scheme that switches between NM and HFM as DFB and the scheme that switches between NM and HVFM as DVFB.
A simple illustration of the mechanism that controls the mode switching is shown in Figure 12. There is a utilization counter for each shared execution unit. Control logic monitors these counters and accepts as input the following parameters: the interval length, the history depth, and the upper and lower thresholds (described next). The utilization during each window is calculated, and depending on the current operating mode of the VFI and the values of the input parameters, a change in operating mode may be effected. Note that utilization (the proportion of busy cycles) is always measured with respect to the cycle time of NM. Since the execution units are accelerated in the boosted modes, boosting effectively reduces their utilization, potentially mitigating the bottleneck. The following four parameters of the dynamic mechanism need to be determined:
(1) The window or interval length (IntLen) in cycles after which the utilization counters must be sampled. Choosing too small a value may result in noisy behavior, while too large a value may result in missing potential opportunities.
(2) The number of intervals to wait until high-confidence decisions may be made, called the history depth (HisD). A switch in mode is initiated only if the decision to switch was observed for 90% of the last HisD windows. Here as well, choosing too small a depth may result in frequent mode switches, while too large a depth may result in missing opportunities to perform a mode switch.
(3) The threshold to enter HFM/HVFM from NM, which we call Threshold Upper (ThU). This mode switch takes place only when the utilization of one of the shared execution units exceeds ThU.
(4) The threshold to go back to NM from HFM/HVFM, which we call Threshold Lower (ThL). This mode switch takes place only when the utilization of all shared execution units goes below ThL.
We carried out an exploratory experiment to find the set of values for the aforementioned parameters. In these experiments, we used the workloads barnes+barnes_barnes+barnes, raytrace+raytrace_raytrace+raytrace, and equake+art_ flops+fbench for offline training experiments. Based on these experiments, the selected parameters are IntLen = 20, HisD = 50, ThU = 85%, ThL = 50%.
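To make the control loop concrete, the following minimal C++ sketch (ours; the article realizes this mechanism in hardware with counters and comparators, so the class name and software form are illustrative) implements the interval sampling, thresholding, and history voting using the trained parameter values above:

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Sketch of the DFB/DVFB feedback controller. One busy-cycle counter per
// shared execution unit; every IntLen cycles the counters are sampled and
// reset, and a mode switch is effected only when at least 90% of the last
// HisD per-interval decisions agree.
class BoostController {
public:
    explicit BoostController(std::size_t numUnits) : busy_(numUnits, 0) {}

    // Call once for each shared unit that is busy during the current cycle.
    void recordBusy(std::size_t unit) { ++busy_[unit]; }

    // Call at the end of every cycle; returns true while boosted (HFM/HVFM).
    bool tick(long cycle) {
        if ((cycle + 1) % kIntLen != 0) return boosted_;

        // End of interval: utilization is measured in NM cycles.
        double maxUtil = 0.0;
        for (long b : busy_)
            maxUtil = std::max(maxUtil, static_cast<double>(b) / kIntLen);
        std::fill(busy_.begin(), busy_.end(), 0);  // restart counting

        // Per-interval decision: enter boost if ANY unit exceeds ThU;
        // return to NM only when ALL units have fallen below ThL.
        bool wantSwitch = boosted_ ? (maxUtil < kThL) : (maxUtil > kThU);
        history_.push_back(wantSwitch);
        if (history_.size() > kHisD) history_.pop_front();

        if (history_.size() == kHisD &&
            std::count(history_.begin(), history_.end(), true) >=
                static_cast<long>(0.9 * kHisD)) {
            boosted_ = !boosted_;  // effect the mode switch
            history_.clear();
        }
        return boosted_;
    }

private:
    // Parameter values selected by the offline training described above.
    static constexpr int kIntLen = 20;        // interval length (cycles)
    static constexpr std::size_t kHisD = 50;  // history depth (intervals)
    static constexpr double kThU = 0.85;      // upper utilization threshold
    static constexpr double kThL = 0.50;      // lower utilization threshold

    std::vector<long> busy_;    // busy-cycle counter per shared unit
    std::deque<bool> history_;  // recent per-interval switch decisions
    bool boosted_ = false;      // false = NM, true = HFM/HVFM
};
```

In a hardware realization, the decision history reduces to a small shift register and the comparisons to fixed-threshold comparators.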
Performance and Performance/Watt Analysis When Using the Proposed DFB/DVFB Schemes
We now present the performance and performance/watt achieved by the resource sharing architectures equipped with DFB and DVFB. Results for the Big core are not shown as the shared execution units were not found to be a bottleneck.
7.3.1. Performance Analysis. The average, maximum, and minimum relative performance of the DFB scheme over the baseline in which no sharing takes place is shown in Figure 13 for the S_FP_X, S_FP_QX, and S_FP_INT configurations, for communication latencies of zero to two cycles. Results are shown for both single-threaded and SMT workloads. A comparison of the relative performance obtained in the NM, DFB, and DVFB modes is presented in Figure 14. Considering the single-threaded workloads, the worst cases observed in the NM for the S_FP_X and S_FP_INT configurations were for the workloads barnes_barnes, with a relative performance of 0.86, and flops_fbench, with a relative performance of 0.87. The performance of these workloads was significantly increased by 13% to 15%, with observed relative performance of 0.99 and 1.026 for these two workloads, respectively. On average, performance was boosted by 3% for the S_FP_X configuration and by 4.5% for the S_FP_INT configuration when compared to the NM mode. Maximum improvements in performance of 3% and 13% were observed for the S_FP_X and S_FP_INT configurations, respectively, over the baseline. There were instances where the integer divide and multiply units became bottlenecks for a few workloads (containing raytrace or lu). Boosting the performance of these units resulted in significant performance gains of as high as 13% for lu_lu. For the S_FP_QX configuration, the worst case was observed for barnes_barnes, cholesky_cholesky, water_water, and flops_fbench, with relative performances of 0.84, 0.82, 0.88, and 0.84, respectively. With DFB, the relative performance of these workloads was increased to 0.96, 0.83, 0.89, and 1.02, respectively; not all workloads showed such notable improvement because some suffered more from stalls in the ISQ than in the execution units. On average, a performance improvement of 4% was observed for the S_FP_QX configuration when compared to the NM. Increasing the latency of the shared resources results in a 2% to 3% drop in performance, demonstrating the sensitivity of these architectures to the shared resource access latency.
With respect to the SMT workloads, for all three configurations, the workload barnes+barnes_barnes+barnes showed the worst-case relative performance of 0.78. This was boosted to 0.96 in all three configurations, representing a 23% improvement in performance. On average, performance was improved by 4%, 3%, and 5% for the S_FP_X, S_FP_QX, and S_FP_INT configurations, respectively, relative to the NM mode. These architectures also compare well against the baseline architecture. The S_FP_X, S_FP_QX, and S_FP_INT configurations achieve performance of 1.01, 0.98, and 1.029, respectively, relative to the baseline.
From Figure 14 , we conclude that the benefits of the DFB and DVFB modes are very similar, although they differ in the mode switching overhead (DFB requiring 10 cycles vs. 20 cycles for DVFB).
7.3.2. Performance/Watt Analysis. The performance/watt results are summarized in Figure 15 for the DFB scheme and in Figure 16 for the NM, DFB, and DVFB schemes. Just as was the case with performance, the DFB scheme significantly improves the performance/watt.
For the single-threaded workloads, the worst-case workload combinations for the S_FP_X and S_FP_INT configurations were barnes_barnes and flops_fbench, with a relative performance/watt of 0.96. This loss was mitigated, with a 5% improvement in performance/watt in the DFB mode. For the S_FP_QX configuration, the workloads barnes_barnes, cholesky_cholesky, and flops_fbench have a relative performance/watt of around 0.98. Among these, the relative performance/watt of barnes_barnes and flops_fbench improved to 1.04 and 1.07, respectively, while that of cholesky_cholesky was only improved to 0.98. Once again, stalls in the ISQ were the reason. Maximum improvements of 5%, 11%, and 12% and average improvements of 3%, 5%, and 4.5% were observed for S_FP_X, S_FP_QX, and S_FP_INT, respectively, over the baseline. The corresponding average improvements were 2%, 2%, and 3% for S_FP_X, S_FP_QX, and S_FP_INT, respectively, over the NM mode.
For the SMT workloads, a worst-case relative performance/watt of 0.91 was observed when running the workload barnes+barnes_barnes+barnes on both S_FP_X and S_FP_INT configurations in the NM mode. This was improved to 1.01 and 1.005, respectively, by the DFB scheme. The worst case for the S_FP_QX was a relative performance/watt of 0.94 running the same workload in NM mode. This was improved to 1.039 by running in DFB. Maximum improvements of 8%, 9%, and 8.3% and average improvements of 3.1%, 4.7%, and 4.3% in performance/watt were observed for S_FP_X, S_FP_QX, and S_FP_INT, respectively, over the baseline. This yields an average improvement of 2% to 3% in performance/watt when compared to the NM mode.
From Figure 16 , we note that the benefits of the DFB and DVFB mechanisms are similar, with DFB doing a little better (by 1%-2%) since it does not incur the voltage regulator power overhead.
7.3.3. Percentage of Execution Time Spent in the Boosted Modes. The boosted modes should not be used all the time; if they were, it would mean the processor was not properly sized and the results may be biased and misleading. In Figures 17 and 18, the percentage of time spent in the boosted mode under the DFB scheme is shown for the single-threaded and SMT workloads, respectively. Results are shown for all three sharing configurations for a shared resource communication latency of one cycle.
For the single-threaded workload flops_fbench, all three configurations were run in the boosted mode for 100% of the time. This shows that for this workload, the shared execution unit was a severe bottleneck. Other workloads that were executed for most of the time (75%-80%) in the boosted mode were lu_lu and raytrace_raytrace when run in the S_FP_INT configuration. These workloads resulted in contention for the integer multiply and divide operations. The DFB scheme detected this and accordingly operated in the boosted mode. The remaining nine workloads operate in the boosted mode for 0% to 40% of the time. On average, the Small core was operated in the boosted mode for 17% to 25% of the time for all three configurations. Similar results were obtained for the DVFB mode.
For the SMT workloads, the number of workloads run in the boosted mode for all three configurations is higher. Six of the considered 15 workload combinations used the boosted mode for 70% to 100% of the time. In these experiments, contention for the shared resources is higher than that observed when running single-threaded workloads. On average, the Small core operated in the boosted mode for 32% to 40% of the time for all three schemes and nine of the 15 workloads operated in the boosted mode for less than 20% of the time. These results show that while some workloads prefer to run in the boosted mode for a longer duration than others, there are also several workloads for which the NM suffices, indicating that the target architecture was sized appropriately.
IMPLEMENTING THE DYNAMIC BOOSTING MECHANISMS
The proposed DFB and DVFB schemes have shown a significant potential to not only mitigate performance loss but also, in some cases, result in both performance and performance/watt improvements over the baseline. However, implementing such mechanisms may incur hardware and performance overheads. We now discuss these overheads and present the resulting area overhead in the next subsection.
Power Overheads
The expected power overhead for DFB is negligible, but not for DVFB. Assuming that the on-chip voltage regulator has a conversion efficiency of 90% [Kim et al. 2008], 10% of the power is wasted. We have found this power to be around 1% of the total power expended in the processor, and it therefore constitutes a very small overhead. This should be compared to the 12.5% of total power consumed by the execution units (measured during simulation) in conventional processors where no sharing takes place. Clearly, the overheads are far lower than the benefits provided by the boosting schemes.
Performance Overheads
The dynamic boosting schemes effect a shift in voltage and/or frequency whenever deemed necessary. Two issues arise when employing such dynamic control: (1) cycles lost during the transition in voltage and/or frequency and (2) synchronization between the VFIs.
8.2.1. Cycles Lost During Operating Mode Transition. For the DFB scheme, only a few cycles are lost during the frequency transition. IBM's PowerTune technology [Lichtenau et al. 2004] generates multiple frequencies that are selected using multiplexers; the overhead to switch between frequencies was reported to be one cycle. Even if a separate PLL is used to generate the additional frequency, the transition overhead is not expected to be significant. We have pessimistically assumed an overhead of 10 cycles for the DFB mode. For the DVFB mode, a voltage transition is needed in addition to the frequency transition. Eyerman and Eeckhout [2011] reported that dV/dt for on-chip voltage regulators is around 20mV/ns. In our scheme, the cores transition between 1.1V and 1.35V. Hence, the time to transition between the two voltages is around 12.5ns. Considering that the Small core operates at 1.5GHz, the overhead for a voltage transition is about 20 cycles. Note that during this period, the shared execution units are not accessible, to avoid loss of signal integrity. Hence, whenever mode switches are made, this penalty is always incurred.
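Restating the arithmetic behind these numbers:

\[ t_{trans} = \frac{\Delta V}{dV/dt} = \frac{1.35\,\text{V} - 1.1\,\text{V}}{20\,\text{mV/ns}} = 12.5\,\text{ns}, \qquad 12.5\,\text{ns} \times 1.5\,\text{GHz} \approx 19\ \text{cycles}, \]

which is rounded up to the 20-cycle penalty charged in our experiments.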
8.2.2. Synchronization Between the VFIs. Since the VFIs can sometimes operate at different frequencies and/or voltages, synchronization problems may arise between the islands, possibly leading to lost cycles. Synchronization problems can be avoided if buffers are inserted at the boundary of the two VFIs. In all the considered designs, such buffers are already present (the ISQ). Furthermore, by making use of certain types of FIFO buffers [Semeraro et al. 2002], any penalty due to synchronization can be completely avoided. Hence, in our experiments, we do not consider any overhead due to synchronization.
AREA SAVINGS
In the target architecture, large and infrequently used resources are shared between cores, resulting in area savings. Kumar et al. [2004] report that the area savings of sharing just the FP units are around 6.1%. Hence, the S_FP_X configuration is expected to result in around 6% to 7% area savings per core. Shivakumar et al. [2003] specify that the area occupied by the INT and FP execution units is approximately 12% to 13%. In Figure 19, the floorplan of the Intel Nehalem processor is shown, along with the approximate area occupied by the execution units and the out-of-order (OOO) scheduling logic (integer/FP ISQ and ROB). The execution units occupy around 18% of the area of the core. Considering that the ALUs account for a very small portion of the 18% occupied by the execution units, the S_FP_INT configuration is expected to yield around 8% to 9% area savings per core. The OOO logic occupies 14% of the core area; assuming that half of that is occupied by the ROB and the other half by the integer and FP ISQs, the approximate area savings per core for the S_FP_QX configuration are around 9% to 10%. These area savings are expected to be considerably larger than the real estate required for controlling access to the shared units.
Next, we estimate the area requirement for an on-die voltage converter. Hazucha et al. [2005] report an area of 0.008mm² for an output power of 0.1 watts in 90nm technology. We therefore estimate an area of 0.16mm² (20×) for an on-die voltage converter with 2 watts of output power. Considering that the die area of the Atom processor is around 24-26mm², the area of the on-chip voltage regulator is negligible compared to the execution core area.
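This estimate assumes that converter area scales roughly linearly with output power:

\[ A_{reg} \approx 0.008\,\text{mm}^2 \times \frac{2\,\text{W}}{0.1\,\text{W}} = 0.16\,\text{mm}^2, \]

well under 1% of the Atom die area.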
CONCLUSIONS
In this article, we have investigated the performance and performance/watt of multicore processors that share infrequently accessed execution resources. Inspired by the AMD Bulldozer architecture, we studied the impact of sharing the floating-point (FP) execution unit and issue queue between two cores in a dual-core processor. We then expanded the scope of the study by considering a Big core that is akin to Intel's Nehalem processor and a Small core that is akin to Intel's Atom processor. A variety of multiprogrammed and multithreaded workload combinations were studied in single-threaded and SMT modes. We found that this architecture can sometimes result in a significant loss in performance (up to 28%). To mitigate this performance loss, we limited the sharing to just the execution units, including the FP and integer divide and multiply units. This reduced the performance penalty to 14%. The sensitivity of the performance and performance/watt of such architectures to the shared resource access latency was also investigated. It was found that both performance and performance/watt are highly sensitive to the shared resource access latency. Our sensitivity study further indicated that as long as the cores share high-throughput execution units, for most of the workloads, a small gain in performance/watt is achieved at the expense of a small loss in performance. In order to mitigate such loss in performance, a DVFB scheme was presented to accelerate execution in the shared resources. Such dynamic boosting was found to completely negate the performance losses and resulted in significant performance/watt gains. The dynamic scheme improves the performance and performance/watt of resource sharing architectures by as much as 22% and 10%, respectively. We also observed a performance and performance/watt improvement of 13% and 14%, respectively, over nonsharing cores. Furthermore, the performance/watt/area improves by as much as 26.2%, increasing the attractiveness of sharing.
