Scale-Out Processors & Energy Efficiency by Esmaili-Dokht, Pouya et al.
Scale-Out Processors & Energy Efficiency
arXiv:1808.04864v1 [cs.AR], 14 Aug 2018
POUYA ESMAILI-DOKHT, Universitat Politècnica de Catalunya (UPC) & Barcelona Supercomputing Center (BSC)
MOHAMMAD BAKHSHALIPOUR, Sharif University of Technology & Institute for Research in Fundamental Sciences (IPM)
BEHNAM KHODABANDELOO, Institute for Research in Fundamental Sciences (IPM)
PEJMAN LOTFI-KAMRAN, Institute for Research in Fundamental Sciences (IPM)
HAMID SARBAZI-AZAD, Sharif University of Technology & Institute for Research in Fundamental Sciences (IPM)
Scale-out workloads like media streaming or Web search serve millions of users and operate on a massive
amount of data, and hence, require enormous computational power. As the number of users is increasing and
the size of data is expanding, even more computational power is necessary for powering up such workloads.
Data centers with thousands of servers are providing the computational power necessary for executing
scale-out workloads. As operating data centers requires enormous capital outlay, it is important to optimize
them to execute scale-out workloads efficiently. Server processors contribute significantly to the data center
capital outlay, and hence, are a prime candidate for optimization. While data centers are constrained by
power, and power consumption is one of the major components contributing to the total cost of ownership
(TCO), a recently-introduced scale-out design methodology optimizes server processors for data centers
using performance per unit area. In this work, we use a more relevant performance-per-power metric as
the optimization criterion for optimizing server processors and reevaluate the scale-out design methodology.
Interestingly, we show that a scale-out processor that delivers the maximum performance per unit area, also
delivers the highest performance per unit power.
Additional Key Words and Phrases: Scale-Out Workload, Data Center, Server Processor, Performance per Watt,
Performance per Unit Area.
1 INTRODUCTION
Companies like Google, Facebook, and Microsoft rely on their data centers to deliver scale-out
services like streaming, social networking, and search. Such high-throughput data centers consume
enormous energy while executing scale-out applications. As such, data centers consume more than
three percent of total global energy and contribute to two percent of the total CO2 emissions [1].
Economic and environmental concerns necessitate making data centers more energy efficient.
Server processors contribute significantly to the power consumption of data centers [6]. Data
centers use conventional server processors [11, 24] where highly speculative general-purpose
processor cores are surrounded by large last-level caches (LLCs). As technology scaling provides
more transistors, more cores with highly capable memory controllers and larger caches are employed
in conventional processors. Prior work reveals that cores’ capabilities, off-chip memory bandwidth,
interconnection networks’ bandwidth, and cache size are over-provisioned with respect to what
scale-out workloads need [3, 4, 11–13]. Accordingly, using the conventional methodology for
scale-out data centers is a poor choice with respect to both performance and energy efficiency.
In another approach, tiled processors, which have many small and energy-efficient cores, replace
conventional processors for the purpose of increasing per-server throughput as a result of using
more processor cores [31]. Just like conventional processors, more cores and larger caches are placed
in tiled processors as a result of technology scaling. Although tiled processors improve energy
efficiency and performance as compared to conventional processors [21], they make suboptimal
use of silicon real estate [22]. Large caches found in tiled designs are not effective for scale-out
workloads because such caches are much smaller than the size of the data sets and much larger
than the instruction footprint. So they cannot capture the data sets anyway and are beyond what is
needed for instructions. Large caches not only have long access latencies but also consume considerable
power. Moreover, in tiled processors, as the number of tiles increases, the access latency of
the LLC also increases [31]. Consequently, the tiled methodology is not a good candidate for today’s and
especially tomorrow’s energy-efficient designs.
Recent work proposed a scalable processor architecture that is based on the scale-out design
methodology to maximize performance density (i.e., performance per unit area) [22]. The building
block of the resulting processors, named scale-out processors, is a pod. A pod is a module that
combines a few cores with a small LLC to form a server. A pod runs an operating system and
has its own software stack. A scale-out processor consists of one or more pods with no inter-pod
connectivity. With scale-out processors, technology scaling results in increasing the number of pods.
Prior work showed that scale-out processors maximize performance density (PD) [22] and improve
total cost of ownership (TCO) [15] as compared to tiled and conventional processors.
Previous work optimized scale-out processors using performance per unit area due to the
importance of die area at 40 nm fabrication technology. But in technologies below 20 nm, both at
the chip level and at data centers, power and energy are the primary constraints [6, 10]. While
scale-out processors offer the highest performance density [22], it is not clear if they are optimal with
respect to energy efficiency. To shed light on this issue, in this work, we use a methodology similar
to that of prior work [22] but use performance per power (P3) as the optimization criterion.
Our experiments show that scale-out processors that are optimized for performance density are
also optimal with respect to energy efficiency and vice versa. In this work, we make the following
contributions:
• We use a system that consists of both processors and DRAM to evaluate the energy efficiency
of various processor organizations.
• We show that for the technology node that we considered, the optimal pod configuration
using performance per power is the same as what has been obtained using performance
density.
• We show that the optimal pod configuration does not change for a large variety of technology
nodes and DRAM parameters.
2 METHODOLOGY
Prior work [22] proposed a methodology to allocate the limited resources (mainly area) to the
various components of a multi-core processor with the goal of maximizing throughput per unit area.
In this work, we attempt to allocate power efficiently to the various components with the goal of maximizing
energy efficiency. Furthermore, we discuss how the optimal pod configuration changes if various
characteristics of the system change. We use a combination of cycle-accurate simulation, analytic
models, and technical reports for this study.
2.1 Design and technology parameters
We analyze various designs in 14 nm technology using 0.8 volts for chip supply voltage. Our area
constraint is set to 280 mm², and our power budget for all designs is set to 95 W. We also use up to 6
single-channel DDR4 interfaces in our chip designs.
Table 1 contains a summary of design parameters. Reported powers are estimates of the real power
drawn on our workloads. We use three different core types in our study. Conventional processors use an
aggressive, 4-way, highly speculative core microarchitecture. Tiled and scale-out processors are
evaluated with two different core types. The first model is a 3-way high-performance out-of-order
core representing ARM Cortex-A15 [27], and the second model is a dual-issue in-order core, similar
to Cortex-A8 [5]. We set all cores’ frequency to 2 GHz to make comparisons between
different architectures convenient. Cache area and power parameters are extracted from CACTI
6.5 [23].
Table 1. Estimated power and area values for different components at 14 nm
Component                           Area              Power
Core, conventional                  3.1 mm²           3.8 W
Core, out-of-order                  1.1 mm²           0.4 W
Core, in-order                      0.32 mm²          0.2 W
LLC, 16-way SA                      0.62 mm² per MB   0.2 W per MB
Interconnect                        0.2-4.5 mm²       <5 W
DDR4 controller (PHY+controller)    12 mm²            5.7 W
SoC components                      42 mm²            5 W
The area of different SoC components is derived by scaling the micrograph of a Nehalem processor in
45 nm technology [19]. We extract DDR4 DRAM power consumption parameters using published
DDR4 power characteristics [29]. As DRAM cannot easily be scaled beyond 20 nm technology [2, 26],
we assume 20 nm DRAM in this study. We consider at most 70% utilization for DDR4 memory
channels [9]. The power of other SoC components is estimated by modeling a Sun UltraSPARC T2 configured
using McPAT v0.8 [20].
2.2 Chip organizations
For each design, we use as many cores and as much cache as we can without violating any constraints
in area, power or memory bandwidth. Maximum required memory bandwidth determines the
number of memory controllers in our designs. Performance and power estimation methodologies
are described in Sections 2.4 and 2.5, respectively.
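As a rough illustration of this constraint check, the sketch below sums the per-component estimates of Table 1 for a candidate chip and tests the result against the area, power, and memory-channel limits of Section 2.1. The `Chip` class, its field names, the `controllers_needed` helper, and the choice to pass interconnect cost and channel peak bandwidth as inputs are assumptions made for illustration. The totals are a first-order sum and need not match the values reported for the evaluated designs, which come from the full evaluation flow.

```python
import math
from dataclasses import dataclass

# Per-component estimates from Table 1 (14 nm): (area in mm^2, power in W).
CORE = {
    "conventional": (3.1, 3.8),
    "ooo": (1.1, 0.4),
    "in_order": (0.32, 0.2),
}
LLC_PER_MB = (0.62, 0.2)
MEM_CTRL = (12.0, 5.7)          # one DDR4 PHY + controller
SOC = (42.0, 5.0)               # remaining SoC components
AREA_LIMIT_MM2 = 280.0          # Section 2.1
POWER_LIMIT_W = 95.0            # Section 2.1
MAX_MEM_CTRLS = 6               # Section 2.1
MAX_CHANNEL_UTILIZATION = 0.7   # Section 2.1


def controllers_needed(required_bw: float, channel_peak_bw: float) -> int:
    """Channels needed so that none runs above 70% of its peak bandwidth.

    Both arguments use the same unit (e.g. GB/s). The peak value is an
    input because the text does not state the DDR4 speed grade.
    """
    usable = channel_peak_bw * MAX_CHANNEL_UTILIZATION
    return max(1, math.ceil(required_bw / usable))


@dataclass
class Chip:
    """Hypothetical description of one candidate chip organization."""
    core_type: str
    cores: int
    llc_mb: float
    mem_ctrls: int
    noc_area: float = 0.0   # interconnect area (0.2-4.5 mm^2 in Table 1)
    noc_power: float = 0.0  # interconnect power (below 5 W in Table 1)

    def _total(self, idx: int, noc: float) -> float:
        # idx 0 selects area, idx 1 selects power in the tuples above.
        return (self.cores * CORE[self.core_type][idx]
                + self.llc_mb * LLC_PER_MB[idx]
                + self.mem_ctrls * MEM_CTRL[idx]
                + SOC[idx] + noc)

    def area(self) -> float:
        return self._total(0, self.noc_area)

    def power(self) -> float:
        return self._total(1, self.noc_power)

    def fits(self) -> bool:
        return (self.area() <= AREA_LIMIT_MM2
                and self.power() <= POWER_LIMIT_W
                and self.mem_ctrls <= MAX_MEM_CTRLS)
```

A design-space search would call `fits()` on each candidate and keep the largest organization for which it returns true.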
2.2.1 Conventional. The conventional processor can accommodate at most 17 cores before reaching
the specified power budget. Three DDR4 memory controllers are sufficient to serve the off-chip
demands. We use 48 MB of LLC in the processor. Cores and caches are connected through a crossbar
interconnect.
2.2.2 Tiled with OoO cores. The tiled OoO processor can accommodate 139 cores before reaching
the power constraint. We use 80 MB of LLC in this processor. A mesh interconnect with 3-cycle
delay per hop for both link and router is used for all tiled designs.
2.2.3 Tiled with in-order cores. Keeping the same LLC size as the tiled OoO design, the tiled in-order
design has 225 cores and 80 MB of LLC. In this design, the power constraint restricts the number of
cores.
2.2.4 Scale-out. To determine the core count and cache size of the scale-out design, we perform
an extensive evaluation, varying the cache size from 1 to 8 MB and the core count from 1 to 256.
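This sweep can be pictured as an exhaustive search over a small grid. The sketch below enumerates power-of-two core counts from 1 to 256 and LLC sizes of 1, 2, 4, and 8 MB (the points plotted in Figures 1 and 2) and returns the configuration with the highest score. The `score` callable stands in for whatever metric is being maximized, such as performance per power. The interconnect type, which is also varied in the study, is left out for brevity, and `toy_score` is an arbitrary placeholder unrelated to the models used here.

```python
from itertools import product
from typing import Callable, Iterator, Tuple

Pod = Tuple[int, int]  # (core count, LLC size in MB)

CORE_COUNTS = [2 ** i for i in range(9)]  # 1, 2, 4, ..., 256
LLC_SIZES_MB = [1, 2, 4, 8]


def candidate_pods() -> Iterator[Pod]:
    """Yield every (cores, llc_mb) pair on the grid."""
    return product(CORE_COUNTS, LLC_SIZES_MB)


def best_pod(score: Callable[[int, int], float]) -> Pod:
    """Return the grid point with the highest score."""
    return max(candidate_pods(), key=lambda pod: score(*pod))


def toy_score(cores: int, llc_mb: int) -> float:
    """Arbitrary placeholder with a single peak at (8, 2); not a real model."""
    return -abs(cores - 8) - abs(llc_mb - 2)
```

Calling `best_pod` with a function that evaluates the analytic performance model and the power estimates for a pod would reproduce the structure of the search, though not necessarily its exact procedure.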
2.3 Scale-out workloads
We take scale-out workloads from CloudSuite. Our workloads include Data Serving, MapReduce,
SAT Solver, Web Frontend, and Web Search. We have two MapReduce workloads in our suite:
classification (MapReduce-C) and word count (MapReduce-W).
[Figure 1: four panels, one per LLC size (1, 2, 4, and 8 MB). X-axis: number of cores (1 to 256). Y-axis: performance per power.]
Fig. 1. Performance per power (P3) for a design with OoO cores and various cache sizes
2.4 Performance evaluation
As cycle-accurate full-system simulation is 100,000 times slower than real hardware [30], it is
impractical to search the whole design space with time-consuming simulations. Instead, we use
an analytic model [16, 22] whose parameters are derived from simulations. This model predicts
performance based on cache size, cache miss ratio, core count, cache access latency, and memory
access latency.
To derive parameters of the analytic model, we use full-system simulation. For full-system
simulation of different pod sizes, we use Flexus [14], which is built on top of Virtutech Simics.
Flexus extends the Simics functional model with detailed models of OoO and in-order cores and the
cache hierarchy.
We evaluate 10 seconds of execution of each workload using the SimFlex sampling methodology [30].
For each measurement, we load checkpoints with warmed caches and branch predictors, and then
run 100 K cycles to reach the steady state before collecting measurements for the subsequent 50 K
cycles. For the Data Serving workload, we need to run the simulations for 2000 K cycles to reach the
steady state. We use the ratio of the number of application instructions to the total number of
cycles (including the cycles spent executing operating system code) to measure performance; this
metric has been shown to accurately reflect overall system throughput [30]. Throughput is measured
with 95% confidence and an average error below 4%.
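The performance metric and its sampling error can be sketched as follows. Each sample is a pair of application instructions and total cycles from one measurement window. The sketch computes the per-window ratio, then the mean and the half-width of a 95% confidence interval. The function names, the sample values, and the normal approximation (z = 1.96) are assumptions for illustration and are not taken from the SimFlex tooling.

```python
import math
import statistics
from typing import Sequence, Tuple

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def user_ipc(app_instructions: int, total_cycles: int) -> float:
    """Application instructions per cycle; OS cycles count in the denominator."""
    return app_instructions / total_cycles


def mean_with_ci(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half-width of the 95% confidence interval)."""
    if len(values) < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(values)
    half_width = Z_95 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


# Hypothetical measurement windows: (application instructions, total cycles).
samples = [(41_000, 50_000), (39_500, 50_000), (40_200, 50_000), (40_900, 50_000)]
ipcs = [user_ipc(instr, cycles) for instr, cycles in samples]
mean, half_width = mean_with_ci(ipcs)
relative_error = half_width / mean  # compare against the target error bound
```

More windows would be sampled until `relative_error` falls below the desired bound.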
2.5 Power evaluation
We use McPAT for power estimation of SoC components. For cores, however, recent studies show
that McPAT is not accurate for power analysis due to the differences between core structure and its
implementation [32]. As an alternative, prior work has shown that instructions per cycle (IPC) is
strongly correlated with power consumption [7, 8, 25]. For example, Bircher and John [7] report
an average error of only 3% in the estimated core power when compared to the measured CPU power.
Moreover, Rodrigues et al. show that it is possible to estimate a core’s power usage with an average
error rate of less than 3.9% using performance counters [25]. Using these approaches requires
power and energy numbers for the examined cores. For this purpose, we use the empirical
power figures from a published technical report [28] for the cores’ power estimation.
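One simple way to exploit such a correlation is a linear fit of measured power against IPC. The sketch below does this with ordinary least squares. It is a generic illustration, not necessarily the model form used in [7, 8, 25] or in this study, and the function names and any data passed to them are hypothetical.

```python
from typing import Sequence, Tuple


def fit_linear(ipc: Sequence[float], power_w: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for power = intercept + slope * ipc.

    Returns (slope, intercept).
    """
    if len(ipc) != len(power_w) or len(ipc) < 2:
        raise ValueError("need two equally long series with at least two points")
    n = len(ipc)
    mean_x = sum(ipc) / n
    mean_y = sum(power_w) / n
    sxx = sum((x - mean_x) ** 2 for x in ipc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ipc, power_w))
    if sxx == 0:
        raise ValueError("IPC values must not all be identical")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_power(ipc_value: float, slope: float, intercept: float) -> float:
    """Predict core power in watts from an observed IPC."""
    return intercept + slope * ipc_value
```

The intercept plays the role of IPC-independent power and the slope that of activity-dependent power, under the assumption that a straight line is an adequate fit.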
3 RESULTS
We first find the optimal pod size for each core type and then replicate pods in each design to reach
one of the constraints. Subsequently, we compare the resulting scale-out processor with tiled and
conventional architectures. Finally, we compare performance-density optimal scale-out processors
against their performance-per-power optimal counterparts.
[Figure 2: four panels, one per LLC size (1, 2, 4, and 8 MB). X-axis: number of cores (1 to 256). Y-axis: performance per power.]
Fig. 2. Performance per power (P3) for a design with in-order cores and different cache sizes
3.1 System with out-of-order cores
Average performance per power (P3) across all workloads for four different LLC sizes is shown in
Figure 1. Larger cache sizes are not investigated as they deteriorate performance per power. Each
graph contains three lines corresponding to three different interconnect types.
We observe that in all designs, regardless of cache size and interconnect, performance per
power diminishes as the number of cores starts to exceed 32. A system with 16 cores, 4 MB of LLC,
and a crossbar interconnect maximizes P3. This is identical to the pod that maximizes performance
per unit area [22].
Under the constraints discussed in Section 2, our scale-out processor design at 14 nm is power-limited
and can accommodate eight pods. The resulting system area and power are 253 mm² and
87 W (130 W with DRAM), respectively.
The scale-out design with out-of-order cores achieves 3.95× higher P3 as compared to the conventional
processor due to using simpler cores and a smaller LLC. Also, the scale-out design has
notable advantages over tiled designs with respect to P3: its overall P3 is 26% higher than that of the tiled
design. This advantage stems from the inefficiently large cache size and long inter-hop latency in tiled
architectures.
3.2 System with in-order cores
Figure 2 shows the average performance per power of different processors across all workloads.
Based on these results, a P3-optimal pod contains 32 cores with 4 MB of LLC and a crossbar
interconnect. Again, the P3-optimal pod with in-order cores is identical to the performance-density
optimal pod [22]. This is because scale-out processors are tuned for the characteristics of scale-out
workloads: (1) massive request-level parallelism, (2) large instruction footprint, and (3) enormous
datasets in the main memory.
The resulting scale-out processor with in-order cores can afford seven pods before violating the
power budget. With all peripherals and interconnect, the scale-out chip’s total die area is 193 mm², and it
consumes 86 W (139 W with DRAM).
The scale-out chip with in-order cores offers 43% higher P3 as compared to a tiled design.
Furthermore, it achieves 3.2× higher P3 over conventional designs.
3.3 Sensitivity of Optimal Pod Configuration to Design Parameters
We study how the optimal pod changes if parameters of the design change. We
use small OoO cores in this study. All the remaining design aspects are the same as in the previous
experiments. LLC power usage, cores’ static and dynamic power, and DRAM access energy are the
main elements of this study. We sweep the energy usage of these components from 0.1× to 10× of
the current values to see how these changes affect the optimal pod configuration. Figure 3 shows
the results of our study. The solid rectangles indicate the state space while the dotted rectangles
show parts of the state space in which the optimal pod configuration does not change. The figure
clearly shows that the 16-core, 4-MB pod remains the optimal pod configuration for a large range
of parameters.
[Figure 3: two panels, (a) and (b), each titled "Conditions In Which 16-Core, 4 MB Pod Remains Optimal", with axes DRAM Access Power, LLC Power, Cores’ Static Power, and Cores’ Dynamic Power. Panel (a) spans 1× to 10×; panel (b) spans 0.1× to 1×.]
Fig. 3. We vary cores’ dynamic power, cores’ static power, LLC power usage, and DRAM access power by up
to 10× up (down) with respect to their current values for a scale-out processor designed with OoO cores in
Part a (b). The solid rectangles show the state space. The dotted rectangles show parts of the state space in
which the 16-core, 4-MB pod remains optimal.
Figure 3a shows that changing cores’ dynamic power by 10× does not change the optimal pod
configuration. Moreover, cores’ static power affects the optimal pod configuration only when it
is increased to 8× its current value. A power-hungry cache system that consumes at least 4.7×
more power moves us towards a smaller pod with fewer cores and a smaller LLC. On the
other hand, increasing the DRAM access energy by more than 8.5× does the exact opposite. A
power-hungry DRAM calls for a pod with a larger LLC to filter out more data accesses.
Table 2. Resulting chips at 14 nm technology
Processor Design      Constraint     Cores  LLC (MB)  MCs  Area (mm²)  Performance  Power (W)  PD    P3
Conventional          Power-limited  17     48        3    161         23           105        0.14  0.22
Tiled (OoO)           Power-limited  139    80        3    280         86           128        0.31  0.67
Scale-Out (OoO)       Power-limited  128    32        5    253         109          130        0.43  0.84
Tiled (In-Order)      Power-limited  225    80        5    224         80           137        0.36  0.58
Scale-Out (In-Order)  Power-limited  224    28        6    193         116          139        0.60  0.83
Figure 3b shows that a 10× decrease in core power or DRAM access energy does not change the
optimal pod configuration. Moreover, a low-power LLC only affects the optimal pod configuration
when its power usage becomes 0.3× of its current value. This means that in more advanced
technology nodes in which the energy of the core, cache, and DRAM is not expected to change
significantly, the optimal pod configuration is likely to remain the same.
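The sensitivity study can be pictured as repeating the pod search while one component's power is scaled, then recording the range of scaling factors over which the optimum is unchanged. The sketch below captures only that bookkeeping. `best_for_factor` is a hypothetical callable that would rerun the pod search for a given factor, the factor grid is an assumed sampling of the 0.1× to 10× range, and `toy_best` is an arbitrary stand-in with a made-up flip point rather than a result of this study.

```python
from typing import Callable, Hashable, Sequence, Tuple


def stable_range(best_for_factor: Callable[[float], Hashable],
                 factors: Sequence[float]) -> Tuple[float, float]:
    """Return the contiguous range of factors around 1.0 with an unchanged optimum."""
    ordered = sorted(set(factors) | {1.0})
    reference = best_for_factor(1.0)  # optimum at the current (unscaled) values
    lo = hi = ordered.index(1.0)
    while lo > 0 and best_for_factor(ordered[lo - 1]) == reference:
        lo -= 1
    while hi < len(ordered) - 1 and best_for_factor(ordered[hi + 1]) == reference:
        hi += 1
    return ordered[lo], ordered[hi]


FACTORS = [0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0]  # assumed sampling of 0.1x to 10x


def toy_best(factor: float) -> Tuple[int, int]:
    """Arbitrary stand-in: the optimum flips once the factor reaches 3."""
    return (4, 1) if factor < 3.0 else (2, 1)
```

Running `stable_range` once per component (DRAM access, LLC, core static, core dynamic) yields one interval per axis, which is the kind of information summarized by the dotted rectangles of Figure 3.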
3.4 Summary
Table 2 summarizes the chip organization, limiting factor, area, performance,
power, PD, and P3 of each design in 14 nm technology. As we consider DRAM dynamic power in our study,
the reported powers exceed the power budget that we set in Section 2; however, all chips themselves
consume less power than the limit. The Performance column shows the average user instructions per clock
cycle [30] that the corresponding design can deliver.
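The PD and P3 columns of Table 2 follow from the other columns: PD is performance divided by area, and P3 is performance divided by power (here the system power including DRAM). The short check below recomputes both from the table values. The variable and function names are chosen for illustration.

```python
# (design, area in mm^2, performance, power in W incl. DRAM, PD, P3) from Table 2.
TABLE_2 = [
    ("Conventional", 161, 23, 105, 0.14, 0.22),
    ("Tiled (OoO)", 280, 86, 128, 0.31, 0.67),
    ("Scale-Out (OoO)", 253, 109, 130, 0.43, 0.84),
    ("Tiled (In-Order)", 224, 80, 137, 0.36, 0.58),
    ("Scale-Out (In-Order)", 193, 116, 139, 0.60, 0.83),
]


def performance_density(performance: float, area_mm2: float) -> float:
    """Performance per unit area (PD)."""
    return performance / area_mm2


def performance_per_power(performance: float, power_w: float) -> float:
    """Performance per unit power (P3)."""
    return performance / power_w


# Recompute both metrics and round to the two decimals used in the table.
recomputed = {
    name: (round(performance_density(perf, area), 2),
           round(performance_per_power(perf, power), 2))
    for name, area, perf, power, _pd, _p3 in TABLE_2
}
```

Comparing `recomputed` with the last two columns of each row confirms that the tabulated PD and P3 values are consistent with the area, performance, and power columns.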
Our study indicates that a single pod configuration is optimal for both energy efficiency and
performance density. Also, many technological changes in the cache, core, or DRAM do not
change the optimal pod configuration. We also showed how the pod configuration would change if
characteristics of the components change significantly.
4 RELATED WORK
There are proposals that optimize data-center cost, power, and/or area with an efficient processor
architecture. Such pieces of prior work [15, 17, 18, 22] partially share some of the insights and/or
conclusions of this work. Our work is different from prior work on scale-out processors [15, 22] in
many aspects. Unlike those studies that target area as the optimization criterion, we use energy
efficiency. While prior work [15] showed that a performance-density (PD) optimal processor also
offers better energy efficiency, this work is the first to show that a PD-optimal processor is also
optimal with respect to energy efficiency. Moreover, unlike prior work, we included DRAM energy
in our study. Finally, we study the effect of variations of subsystem characteristics on the optimal
pod configuration.
5 CONCLUSION
As the primary constraint of data centers is power usage, server processors that are optimized for
scale-out workloads should exhibit excellent energy efficiency. For this purpose, we revisited the
scale-out design methodology with respect to energy efficiency. We found that in many real-world
conditions (like the ones in our study), the scale-out processors that are optimal with respect to
performance density are also optimal with respect to energy efficiency.
REFERENCES
[1] 2007. Report to congress on server and data center energy efficiency public law 109-431. Public Law 109 (2007), 431.
[2] 2007. Semiconductor Industry Association. International technology roadmap for semiconductors, 2007.
[3] Mohammad Bakhshalipour, Pejman Lotfi-Kamran, Abbas Mazloumi, Farid Samandi, Mahmood Naderan, Mehdi
Modarressi, and Hamid Sarbazi-Azad. 2018. Fast Data Delivery for Many-Core Processors. IEEE Transactions on
Computers (TC) (2018).
[4] Mohammad Bakhshalipour, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2018. Domino Temporal Data Prefetcher.
In International Symposium on High Performance Computer Architecture (HPCA). IEEE, 131–142.
[5] Max Baron. 2006. The F1: TI’s 65nm Cortex-A8. Microprocessor Report 20, 7 (July 2006), 1–9.
[6] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. 2013. The Datacenter as a Computer: An Introduction to the Design
of Warehouse-Scale Machines (2nd ed.). Morgan and Claypool Publishers.
[7] W. L. Bircher and L. K. John. 2012. Complete System Power Estimation Using Processor Performance Events. IEEE
Trans. Comput. 61, 4 (April 2012), 563–577. DOI: http://dx.doi.org/10.1109/TC.2011.47
[8] Gilberto Contreras and Margaret Martonosi. 2005. Power Prediction for Intel XScale® Processors Using Performance
Monitoring Unit Events. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED).
221–226. DOI: http://dx.doi.org/10.1145/1077603.1077657
[9] Howard David, Chris Fallin, Eugene Gorbatov, Ulf R. Hanebutte, and Onur Mutlu. 2011. Memory Power Management via
Dynamic Voltage/Frequency Scaling. In Proceedings of the 8th ACM International Conference on Autonomic Computing.
31–40.
[10] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark Silicon
and the End of Multicore Scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture
(ISCA). 365–376.
[11] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu
Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. 2012. Clearing the Clouds: A Study of Emerging
Scale-Out Workloads on Modern Hardware. In Proceedings of the 17th International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS). 37–48.
[12] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu
Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. 2012. Quantifying the Mismatch between
Emerging Scale-Out Applications and Modern Processors. ACM Transactions on Computer Systems 30, 4 (Nov. 2012),
15:1–15:24.
[13] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B.
Falsafi. 2014. A Case for Specialized Processors for Scale-Out Workloads. IEEE Micro 34, 3 (May 2014), 31–42. DOI:
http://dx.doi.org/10.1109/MM.2014.41
[14] Flexus. 2012. http://parsa.epfl.ch/simflex. (2012).
[15] Boris Grot, Damien Hardy, Pejman Lotfi-Kamran, Babak Falsafi, Chrysostomos Nicopoulos, and Yiannakis Sazeides.
2012. Optimizing Data-Center TCO with Scale-Out Processors. IEEE Micro 32, 5 (Sept. 2012), 52–63.
[16] Nikos Hardavellas. 2009. Chip Multiprocessors for Server Workloads.
[17] Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. 2011. Toward Dark Silicon in Servers.
IEEE Micro 31, 4 (July-August 2011), 6–15.
[18] Taeho Kgil, Shaun D’Souza, Ali Saidi, Nathan Binkert, Ronald Dreslinski, Steven Reinhardt, Krisztian Flautner,
and Trevor Mudge. 2006. PicoServer: Using 3D Stacking Technology to Enable a Compact Energy Efficient Chip
Multiprocessor. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages
and Operating Systems (ASPLOS). 117–128.
[19] Rajesh Kumar and Glenn Hinton. 2009. A Family of 45nm IA Processors. In Proceedings of the IEEE International
Solid-State Circuits Conference. 58–59.
[20] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. 2009. McPAT: An
Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In Proceedings
of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 469–480.
[21] Kevin Lim, Parthasarathy Ranganathan, Jichuan Chang, Chandrakant Patel, Trevor Mudge, and Steven Reinhardt.
2008. Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments. In
Proceedings of the 35th Annual International Symposium on Computer Architecture (ISCA). 315–326.
[22] Pejman Lotfi-Kamran, Boris Grot, Michael Ferdman, Stavros Volos, Onur Kocberber, Javier Picorel, Almutaz Adileh,
Djordje Jevdjic, Sachin Idgunji, Emre Ozer, and Babak Falsafi. 2012. Scale-Out Processors. In Proceedings of the 39th
Annual International Symposium on Computer Architecture (ISCA). 500–511.
[23] Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. 2007. Optimizing NUCA Organizations
and Wiring Alternatives for Large Caches with CACTI 6.0. In Proceedings of the 40th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO). 3–14.
[24] Vijay Janapa Reddi, Benjamin C. Lee, Trishul Chilimbi, and Kushagra Vaid. 2010. Web Search Using Mobile Cores:
Quantifying and Mitigating the Price of Efficiency. In Proceedings of the 37th Annual International Symposium on
Computer Architecture (ISCA). 314–325.
[25] R. Rodrigues, A. Annamalai, I. Koren, and S. Kundu. 2013. A study on the use of performance counters to estimate power
in microprocessors. IEEE Transactions on Circuits and Systems II: Express Briefs 60 (12) (2013), 882–886.
[26] H. B. Sohail, B. Vamanan, and T. N. Vijaykumar. MigrantStore: Leveraging Virtual Memory in DRAM-PCM Memory
Architecture. Technical Report. ECE Technical Reports, TR-ECE-12-02, Purdue University.
[27] Jim Turley. 2010. Cortex-A15 “Eagle” Flies the Coop. Microprocessor Report 24, 11 (Nov. 2010), 1–11.
[28] Evangelos Vasilakis and Manolis G.H Katevenis. 2015. An Instruction Level Energy Characterization of ARM Processors.
Technical Report. Computer Architecture and VLSI Systems (CARV) Laboratory, Institute of Computer Science (ICS),
Foundation of Research and Technology Hellas (FORTH).
[29] T. Vogelsang. 2010. Understanding the Energy Consumption of Dynamic Random Access Memories. Proc. Int’l Symp.
Microarchitecture, IEEE CS Press (2010), 363–374.
[30] Thomas F. Wenisch, Roland E. Wunderlich, Michael Ferdman, Anastassia Ailamaki, Babak Falsafi, and James C. Hoe.
2006. SimFlex: Statistical Sampling of Computer System Simulation. IEEE Micro 26, 4 (July-August 2006), 18–31.
[31] Bob Wheeler. 2011. Tilera Sees Opening in Clouds. Microprocessor Report 25, 7 (July 2011), 13–16.
[32] Sam Xi, Hans Jacobson, Pradip Bose, Gu-Yeon Wei, and David Brooks. 2015. Quantifying Sources of Error in McPAT
and Potential Impacts on Architectural Studies. In International Symposium on High Performance Computer Architecture
(HPCA).
