Scale-out workloads like media streaming or Web search serve millions of users, operate on massive amounts of data, and hence require enormous computational power. As the number of users increases and the size of the data expands, even more computational power is needed to power such workloads. Data centers with thousands of servers provide the computational power necessary for executing scale-out workloads. As operating data centers requires enormous capital outlay, it is important to optimize them to execute scale-out workloads efficiently. Server processors contribute significantly to the data center capital outlay, and hence are a prime candidate for optimization. Although data centers are power-constrained and power consumption is one of the major components of the total cost of ownership (TCO), a recently introduced scale-out design methodology optimizes server processors for data centers using performance per unit area. In this work, we use the more relevant performance-per-power metric as the criterion for optimizing server processors and reevaluate the scale-out design methodology. Interestingly, we show that a scale-out processor that delivers the maximum performance per unit area also delivers the highest performance per unit power.
INTRODUCTION
Companies like Google, Facebook, and Microsoft rely on their data centers to deliver scale-out services like streaming, social networking, and search. Such high-throughput data centers consume enormous energy while executing scale-out applications. Data centers consume more than three percent of total global energy and contribute two percent of total CO2 emissions [1]. Economic and environmental concerns necessitate making data centers more energy efficient.
Server processors contribute significantly to the power consumption of data centers [6]. Data centers use conventional server processors [11, 24] in which highly speculative general-purpose processor cores are paired with large last-level caches (LLCs). As technology scaling provides more transistors, more cores with highly capable memory controllers and larger caches are employed in conventional processors. Prior work reveals that core capabilities, off-chip memory bandwidth, interconnection-network bandwidth, and cache size are over-provisioned with respect to what scale-out workloads need [3, 4, 11-13]. Accordingly, using the conventional methodology for scale-out data centers is a poor choice with respect to both performance and energy efficiency.
In another approach, tiled processors, which have many small and energy-efficient cores, replace conventional processors for the purpose of increasing per-server throughput as a result of using more processor cores [31]. Just like conventional processors, more cores and larger caches are placed in tiled processors as a result of technology scaling. Although tiled processors improve energy efficiency and performance as compared to conventional processors [21], they make suboptimal use of silicon real estate [22]. Large caches found in tiled designs are not effective for scale-out workloads because such caches are much smaller than the data sets and much larger than the instruction footprint: they cannot capture the data sets anyway and are beyond what is needed for instructions. Not only do large caches have long access latencies, but they also impose high power usage. Moreover, in tiled processors, as the number of tiles increases, the access latency of the LLC also increases [31]. Consequently, the tiled methodology is not a good candidate for today's and especially tomorrow's energy-efficient designs.
Recent work proposed a scalable processor architecture that is based on the scale-out design methodology to maximize performance density (i.e., performance per unit area) [22]. The building block of the resulting processors, named scale-out processors, is a pod. A pod is a module that combines a few cores with a small LLC to form a server. A pod runs an operating system and has its own software stack. A scale-out processor consists of one or more pods with no inter-pod connectivity. With scale-out processors, technology scaling results in increasing the number of pods. Prior work showed that scale-out processors maximize performance density (PD) [22] and improve total cost of ownership (TCO) [15] as compared to tiled and conventional processors.
Previous work optimized scale-out processors using performance per unit area due to the importance of die area at the 40 nm fabrication technology. But in technologies below 20 nm, both at the chip level and in data centers, power and energy are the primary constraints [6, 10]. While scale-out processors offer the highest performance density [22], it is not clear whether they are optimal with respect to energy efficiency. To shed light on this issue, in this work, we use a similar methodology as prior work [22] but use performance per power (P3) as the optimization criterion.
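A minimal formalization of this criterion, assuming performance is the aggregate user-level throughput of the cores and, consistent with the system described in Section 3, power includes both the processor chip and its DRAM:

P3 = Throughput / (P_chip + P_DRAM)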
Our experiments show that scale-out processors that are optimized for performance density are also optimal with respect to energy efficiency and vice versa. In this work, we make the following contributions:
• We use a system that consists of both processors and DRAM to evaluate the energy efficiency of various processor organizations.
• We show that for the technology node that we considered, the optimal pod configuration using performance per power is the same as what has been obtained using performance density.
• We show that the optimal pod configuration does not change for a large variety of technology nodes and DRAM parameters.
METHODOLOGY
Prior work [22] developed a methodology to allocate the limited resources (mainly area) to the various components of a multi-core processor, targeting maximization of throughput per unit area. In this work, we attempt to efficiently allocate power to the various components, targeting maximization of energy efficiency. Furthermore, we discuss how the optimal pod configuration changes if various characteristics of the system change. We use a combination of cycle-accurate simulation, analytic models, and technical reports for this study.
Design and technology parameters
We analyze various designs in 14 nm technology using 0.8 V for the chip supply voltage. Our area constraint is set to 280 mm2, and the power budget for all designs is set to 95 W. We also use up to 6 single-channel DDR4 interfaces in our chip designs. Table 1 contains a summary of design parameters. Reported power numbers are estimates of the actual power consumed by our workloads. We use three different core types in our study. Conventional processors represent an aggressive, 4-way, highly speculative core microarchitecture. Tiled and scale-out processors are evaluated with two different core types: the first model is a 3-way high-performance out-of-order core representing the ARM Cortex-A15 [27], and the second model is a dual-issue in-order core, similar to the Cortex-A8 [5]. We set all cores' frequencies to 2 GHz to make the comparison between designs fair. The area of different SoC components is derived from scaling a micrograph of a Nehalem processor in 45 nm technology [19]. We extract DDR4 DRAM power-consumption parameters using published DDR4 power characteristics [29]. As DRAM cannot easily be scaled beyond the 20 nm technology [2, 26], we assume 20 nm DRAM in this study. We consider at most 70% utilization for DDR4 memory channels [9]. The power of the other SoC components is estimated by modeling a Sun UltraSPARC T2 configured in McPAT v0.8 [20].
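To illustrate how such published parameters can be combined, the following sketch estimates per-channel DDR4 power from a background term plus an activity-proportional term capped at 70% channel utilization; the coefficient names and example values are illustrative placeholders, not the figures from the cited characteristics [29].

# Hypothetical DDR4 channel power model: background power plus an
# activity-proportional term, capped at 70% channel utilization [9].
# All coefficients below are illustrative placeholders, not the
# published DDR4 power characteristics we actually use [29].

MAX_CHANNEL_UTILIZATION = 0.70

def ddr4_channel_power(background_w, energy_per_bit_pj, peak_gb_per_s, utilization):
    """Return the estimated power (in W) of one DDR4 channel."""
    utilization = min(utilization, MAX_CHANNEL_UTILIZATION)
    bits_per_second = utilization * peak_gb_per_s * 1e9 * 8
    dynamic_w = bits_per_second * energy_per_bit_pj * 1e-12
    return background_w + dynamic_w

# Example: 0.6 W background, 35 pJ/bit, 19.2 GB/s peak, 50% demand (made-up values)
print(ddr4_channel_power(0.6, 35.0, 19.2, 0.5))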
Chip organizations
For each design, we use as many cores and as much cache as we can without violating any constraints in area, power or memory bandwidth. Maximum required memory bandwidth determines the number of memory controllers in our designs. Performance and power estimation methodologies are described in Sections 2.4 and 2.5, respectively.
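The sizing procedure can be viewed as a simple search that keeps adding cores (with their share of cache) until the first area, power, or bandwidth constraint is hit. The sketch below illustrates this loop with hypothetical per-component area and power figures; it is not the exact procedure or the numbers behind our designs.

import math

# Illustrative chip sizing: grow the core count until the first area,
# power, or memory-bandwidth constraint is violated.  Every per-core
# and per-megabyte figure passed in is a hypothetical placeholder.

AREA_BUDGET_MM2 = 280.0      # die area budget from Section 2.1
POWER_BUDGET_W  = 95.0       # chip power budget from Section 2.1
MAX_CHANNELS    = 6          # at most 6 single-channel DDR4 interfaces

def size_chip(core_area, core_power, llc_mb, llc_area_per_mb, llc_power_per_mb,
              bw_per_core, bw_per_channel):
    """Return (max core count, memory channels) that fit within all budgets."""
    cores = 0
    while True:
        n = cores + 1
        area = n * core_area + llc_mb * llc_area_per_mb
        power = n * core_power + llc_mb * llc_power_per_mb
        channels = math.ceil(n * bw_per_core / bw_per_channel)
        if area > AREA_BUDGET_MM2 or power > POWER_BUDGET_W or channels > MAX_CHANNELS:
            return cores, math.ceil(cores * bw_per_core / bw_per_channel)
        cores = n

# Example with made-up numbers: 2.5 mm^2 / 1.2 W per core, 48 MB LLC,
# 1 mm^2 and 0.3 W per MB of LLC, 0.8 GB/s per core, 12 GB/s per channel.
print(size_chip(2.5, 1.2, 48, 1.0, 0.3, 0.8, 12.0))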
2.2.1 Conventional.
The conventional processor can accommodate at most 17 cores before reaching the specified power budget. Three DDR4 memory controllers are sufficient to serve the off-chip demands. We use 48 MB of LLC in this processor. Cores and caches are connected through a crossbar interconnect.
2.2.2 Tiled with OoO cores.
The tiled OoO processor can accommodate 139 cores before reaching the power constraint. We use 80 MB of LLC in this processor. A mesh interconnect with a 3-cycle delay per hop for both link and router is used for all tiled designs.
2.2.3 Tiled with in-order cores. Keeping the same LLC size as the tiled OoO design, the tiled in-order design has 225 cores and 80 MB of LLC. In this design, the power constraint restricts the number of cores.
2.2.4 Scale-out. To determine the core count and cache size of the scale-out design, we perform an extensive evaluation, sweeping the LLC size from 1 to 8 MB and the core count from 1 to 256, as sketched below.
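A sketch of this exploration is shown below: for every (core count, LLC size) pair, a performance estimate and a power estimate are combined into P3, and the configuration with the highest value is kept. The predict_performance and estimate_power callables are placeholders standing in for the models of Sections 2.4 and 2.5.

# Exhaustive pod-configuration sweep (sketch).  predict_performance()
# and estimate_power() stand in for the analytic performance model of
# Section 2.4 and the power model of Section 2.5, respectively.

def find_optimal_pod(predict_performance, estimate_power):
    best = None
    for cache_mb in range(1, 9):                 # 1 to 8 MB of LLC
        for cores in range(1, 257):              # 1 to 256 cores
            perf = predict_performance(cores, cache_mb)
            power = estimate_power(cores, cache_mb)
            p3 = perf / power
            if best is None or p3 > best[0]:
                best = (p3, cores, cache_mb)
    return best   # (best P3, core count, LLC size in MB)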
Scale-out workloads
We take scale-out workloads from CloudSuite. Our workloads include Data Serving, MapReduce, SAT Solver, Web Frontend, and Web Search. We have two MapReduce workloads in our suite, classification (MapReduce-C) and word count (MapReduce-W).
Performance evaluation
As cycle-accurate full-system simulation is 100,000 times slower than real hardware [30], it is impractical to search the whole design space with time-consuming simulations. Instead, we use an analytic model [16, 22] whose parameters are derived from simulations. This model predicts performance based on cache size, cache miss ratio, core count, cache access latency, and memory access latency.
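As a rough illustration of the kind of model involved (not the exact formulation of [16, 22]), the sketch below estimates per-core throughput from the average memory access time implied by the LLC miss ratio and scales it by the core count; all parameter values in the example are made up.

# First-order analytic throughput model (illustrative only; the actual
# model follows prior work [16, 22] with parameters fit to simulation).

def pod_throughput(cores, base_cpi, miss_ratio, accesses_per_instr,
                   llc_latency, mem_latency, mlp=1.0):
    """Estimated user instructions per cycle for one pod."""
    # Average memory-hierarchy latency per LLC access, and the stall
    # cycles per instruction it implies, discounted by the memory-level
    # parallelism (mlp) the core can extract.
    amat = llc_latency + miss_ratio * mem_latency
    stall_cpi = accesses_per_instr * amat / mlp
    per_core_ipc = 1.0 / (base_cpi + stall_cpi)
    return cores * per_core_ipc

# Example with made-up parameters: 16 cores, base CPI of 1.2, 10% LLC miss
# ratio, 0.05 LLC accesses per instruction, 20-cycle LLC, 200-cycle DRAM, MLP of 4.
print(pod_throughput(16, 1.2, 0.10, 0.05, 20, 200, mlp=4))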
To derive the parameters of the analytic model, we use full-system simulation. For full-system simulation of different pod sizes, we use Flexus [14], which is built on top of Virtutech Simics. Flexus extends the Simics functional model with detailed models of OoO and in-order cores and the cache hierarchy.
We evaluate 10 seconds of execution of each workload using the SimFlex sampling methodology [30]. For each measurement, we load checkpoints with warmed caches and branch predictors, and then run 100 K cycles to reach the steady state before collecting measurements for the subsequent 50 K cycles. For the Data Serving workload, we need to run the simulations for 2000 K cycles to reach the steady state. We use the ratio of the number of application instructions to the total number of cycles (including the cycles spent executing operating-system code) to measure performance; this metric has been shown to accurately reflect overall system throughput [30]. Throughput is measured with 95% confidence and an average error lower than 4%.
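For completeness, a minimal sketch of how such a metric could be computed from a set of measurement samples; it assumes each sample records the application instructions committed and the total cycles elapsed, and it uses a simple normal approximation for the confidence interval rather than the full SimFlex matched-pair machinery.

import statistics

# User-IPC across sampling-based measurements: application instructions
# over total cycles (OS cycles included), with an approximate 95%
# confidence half-width.  'samples' is a list of
# (user_instructions, total_cycles) pairs (hypothetical input format).

def user_ipc(samples):
    per_sample = [instr / cycles for instr, cycles in samples]
    mean = statistics.fmean(per_sample)
    half_width = 1.96 * statistics.stdev(per_sample) / len(per_sample) ** 0.5
    return mean, half_width

# Example: three made-up samples of 50 K measured cycles each.
print(user_ipc([(30_000, 50_000), (28_000, 50_000), (31_000, 50_000)]))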
Power evaluation
We use McPAT for power estimation of the SoC components. For cores, however, recent studies show that McPAT is not accurate for power analysis due to the differences between a core's structure and its implementation [32]. As an alternative, prior work has shown that instructions per cycle (IPC) is strongly correlated with power consumption [7, 8, 25]. For example, Bircher and John [7] report an average error of only 3% in a core's power usage when compared to measured CPU power. Moreover, Rodrigues et al. show that it is possible to estimate a core's power usage with an average error rate of less than 3.9% using performance counters [25]. Using these approaches requires power and energy numbers for the examined cores. For this purpose, we use the empirical power numbers from a published technical report [28] for the cores' power estimation.
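To make the approach concrete, the sketch below estimates a core's power as a linear interpolation between idle and peak power driven by the measured IPC; the idle power, peak power, and peak IPC shown are made-up placeholders, not the figures from [28].

# IPC-correlated core power model (sketch).  Idle and peak power are
# placeholders; in our study they come from an empirical report [28].

def core_power(ipc, idle_power_w, peak_power_w, peak_ipc):
    """Linearly interpolate between idle and peak power based on IPC."""
    activity = min(max(ipc / peak_ipc, 0.0), 1.0)
    return idle_power_w + activity * (peak_power_w - idle_power_w)

# Example: a core measured at 1.1 IPC, assuming 0.3 W idle power,
# 1.5 W peak power, and a peak IPC of 3 (all made-up values).
print(core_power(1.1, 0.3, 1.5, 3.0))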
RESULTS
We first find the optimal pod size for each core type and then replicate pods in each design to reach one of the constraints. Subsequently, we compare the resulting scale-out processor with tiled and conventional architectures. Finally, we compare performance-density-optimal scale-out processors against their performance-per-power-optimal counterparts.
System with out-of-order cores
Average performance per power (P3) across all workloads for four different LLC sizes is shown in Figure 1. Larger cache sizes are not investigated as they deteriorate performance per power. Each graph contains three lines corresponding to three different interconnect types.
We observe that in all designs, regardless of cache size and interconnect, performance per power diminishes as the number of cores starts to exceed 32. A system with 16 cores, 4 MB of LLC, and a crossbar interconnect maximizes P3. This is identical to the pod that maximizes performance per unit area [22].
Based on the constraints discussed in Section 2, our scale-out processor design at 14 nm is power-limited and can accommodate eight pods. The resulting system area and power are 253 mm2 and 87 W (130 W with DRAM), respectively.
The scale-out design with out-of-order cores achieves 3.95× higher P3 compared to the conventional processor due to using simpler cores and a smaller LLC. The scale-out design also has notable advantages over tiled designs with respect to P3: its overall P3 is 26% higher than the tiled design. This advantage stems from the inefficiently large cache and the long inter-hop latency of tiled architectures.
System with in-order cores
Figure 2 shows the average performance per power of different processors across all workloads. Based on these results, a P3-optimal pod contains 32 cores with 4 MB of LLC and a crossbar interconnect. Again, the P3-optimal pod with in-order cores is identical to the performance-density-optimal pod [22]. This is because scale-out processors are tuned for the characteristics of scale-out workloads: (1) massive request-level parallelism, (2) large instruction footprints, and (3) enormous datasets in main memory.
The resulting scale-out processor with in-order cores can afford seven pods before violating the power budget. With all peripherals and interconnect, the scale-out chip's total die area is 193 mm2 and it consumes 86 W (139 W with DRAM).
The scale-out chip with in-order cores offers 43% higher P3 compared to a tiled design. Furthermore, it achieves 3.2× higher P3 over conventional designs.
Sensitivity of Optimal Pod Configuration to Design Parameters
We perform a study on how the optimal pod changes if parameters of the design change. We use the small OoO cores in this study. All the remaining design aspects are the same as in the previous experiments. LLC power usage, the cores' static and dynamic power, and DRAM access energy are the main elements of this study. We sweep the energy usage of these components from 0.1× to 10× of their current values to see how these changes affect the optimal pod configuration, as sketched below. Figure 3 shows the results of our study. The solid rectangles indicate the state space, while the dotted rectangles show parts of the state space in which the optimal pod configuration does not change. The figure clearly shows that the 16-core, 4-MB pod remains the optimal pod configuration for a large range of parameters. Figure 3a shows that changing the cores' dynamic power by 10× does not change the optimal pod configuration. Moreover, the cores' static power affects the optimal pod configuration only when it is increased to 8× its current value. A power-hungry cache system that consumes at least 4.7× more power moves us towards a smaller pod with fewer cores and a smaller LLC. On the other hand, increasing the DRAM access energy by more than 8.5× does the exact opposite: a power-hungry DRAM calls for a pod with a larger LLC to filter out more data accesses. Figure 3b shows that a 10× decrease in core power or DRAM access energy does not change the optimal pod configuration. Moreover, a low-power LLC only affects the optimal pod configuration when its power usage drops to 0.3× of its current value.
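The study itself amounts to a straightforward scaling sweep: one component's energy is scaled by a factor between 0.1× and 10× while everything else is held fixed, and the pod search is rerun. A sketch, reusing the hypothetical helpers from the earlier listings:

# Sensitivity sweep (sketch): scale one component's power/energy at a
# time and re-run the pod search.  find_optimal_pod, predict_performance,
# and make_power_model are the hypothetical helpers sketched earlier;
# make_power_model(component, factor) returns an estimate_power(cores,
# cache_mb) function with the chosen component scaled by 'factor'.

SCALE_FACTORS = [0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 8.5, 10.0]
COMPONENTS = ["core_dynamic", "core_static", "llc", "dram"]

def sensitivity(find_optimal_pod, predict_performance, make_power_model):
    results = {}
    for component in COMPONENTS:
        for factor in SCALE_FACTORS:
            estimate_power = make_power_model(component, factor)
            results[(component, factor)] = find_optimal_pod(
                predict_performance, estimate_power)
    return results   # optimal (P3, cores, LLC MB) per scaled configuration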
This means that in more advanced technology nodes, in which the energy of the core, cache, and DRAM is not expected to change significantly, the optimal pod configuration is likely to remain the same. Table 2 summarizes the chip organization, limiting factor, area, performance, power consumption, PD, and P3 of each design in 14 nm technology. As we consider DRAM dynamic power in our study, the reported power numbers exceed the power budget that we set in Section 2; however, every chip by itself consumes less power than the limit. The Performance column shows the average user instructions per clock cycle [30] that the corresponding design can deliver.
Summary
Our study indicates that a single pod configuration is optimal for both energy efficiency and performance density. Also, many technological changes in the cache, core, or DRAM do not change the optimal pod configuration. We also showed how the pod configuration would change if characteristics of the components change significantly.
RELATED WORK
There are proposals that optimize data-center cost, power, and/or area with an efficient processor architecture. Such pieces of prior work [15, 17, 18, 22] partially share some of the insights and/or conclusions of this work. Our work is different from prior work on scale-out processors [15, 22] in many aspects. Unlike those studies that target area as the optimization criterion, we use energy efficiency. While prior work [15] showed that a performance-density (PD) optimal processor also offers better energy efficiency, this work is the first to show that a PD-optimal processor is also optimal with respect to energy efficiency. Moreover, unlike prior work, we included DRAM energy in our study. Finally, we study the effect of variations of subsystem characteristics on the optimal pod configuration.
CONCLUSION
As the primary constraint of data centers is power usage, server processors that are optimized for scale-out workloads should exhibit excellent energy efficiency. For this purpose, we revisited the scale-out design methodology with respect to energy efficiency. We found that in many real-world conditions (like the ones in our study), the scale-out processors that are optimal with respect to performance density are also optimal with respect to energy efficiency.
