Toward Dark Silicon in Servers by Hardavellas, Nikos et al.
TOWARD DARK SILICON IN SERVERS
SERVER CHIPS WILL NOT SCALE BEYOND A FEW TENS TO LOW HUNDREDS OF CORES,
AND AN INCREASING FRACTION OF THE CHIP IN FUTURE TECHNOLOGIES WILL BE DARK
SILICON THAT WE CANNOT AFFORD TO POWER. SPECIALIZED MULTICORE PROCESSORS,
HOWEVER, CAN LEVERAGE THE UNDERUTILIZED DIE AREA TO OVERCOME THE INITIAL
POWER BARRIER, DELIVERING SIGNIFICANTLY HIGHER PERFORMANCE FOR THE SAME
BANDWIDTH AND POWER ENVELOPES.
Although workloads with limited
parallelism pose performance challenges with
chip multiprocessors (CMPs), server work-
loads with abundant parallelism are believed
to be immune, capable of scaling to the par-
allelism available in the hardware. Contrary
to popular belief, however, CMPs are not a
panacea for server processor designs. Despite
the inherent scalability in threaded server
workloads, increasing core counts can’t
directly translate into performance improve-
ments because chips are physically con-
strained in power and off-chip bandwidth.
Whereas transistor counts grow exponen-
tially following Moore’s law, the transistor
threshold and supply voltages do not scale
commensurately,1 and the power consump-
tion of the additional transistors can no
longer be mitigated through circuit-level
techniques. Although a trade-off exists be-
tween performance and power consumption,
the transistors’ switching speed and fre-
quency cannot be reduced sufficiently to
keep at bay the power consumption of expo-
nentially more transistors and simultaneously
deliver reasonable performance. The multi-
plying core counts constitute a substantial
fraction of the chip’s transistors, contributing
to both dynamic and static power. Voltage-
frequency scaling (VFS) might lower the dy-
namic power, but static power dissipation
and performance requirements impose a
limit. At the same time, the lethargic drop
of supply voltages and the shrinking range
of operational voltage (in the last decade,
the difference between Vth and Vdd narrowed
by 70 percent1) dampen the impact of VFS.
Even if the core power limitations could
be eluded through highly efficient core
designs or low-operational-power transistors,
the rising core and thread counts drastically
increase pressure on the limited and nonscal-
able off-chip bandwidth, encountering the
bandwidth wall. Traditional approaches to al-
leviate off-chip bandwidth pressure call for
larger on-chip caches, which further drive
up the chip’s power consumption, reducing
the power available to the cores. Thus, with-
out a technological miracle to overcome the
power constraints imposed by thermal cool-
ing and power delivery, we will soon inevita-
bly enter an era of dark silicon, building dense
and fast devices that we can’t afford to power.
In this work, we show how we can use the
abundant, power-constrained, and underutil-
ized die real estate effectively to improve
server performance and power efficiency by
populating the die area with a large, diverse
array of application-specific heterogeneous
cores. These specialized CMPs can achieve
peak performance and power efficiency by
dynamically powering up only a small number
of cores specifically designed for the given
workload, with all but the most application-
specific cores remaining dark (dynamically
disabled) when not in use. Specialized CMPs
Nikos Hardavellas
Northwestern University
Michael Ferdman
Carnegie Mellon
University
Babak Falsafi
Anastasia Ailamaki
École Polytechnique
Fédérale de Lausanne
IEEE Micro, July/August 2011. Published by the IEEE Computer Society. 0272-1732/11/$26.00 © 2011 IEEE
show promise in improving the aggregate
performance and energy efficiency of server
workloads. This is especially true if the spe-
cialized CMPs are coupled with emerging
memory technologies, which mitigate the
off-chip bandwidth wall and fully expose
the processor to the power constraints.
Methodology
We consider aggregate server throughput
as the performance metric. To design CMPs
that attain peak performance while staying
within the physical constraints, it is imperative
to jointly optimize a large number of design
parameters. Complexity and runtime require-
ments make it impractical to rely on simula-
tion for a large-scale design-space-exploration
study. Instead, we rely on first-order analytical
models of the dominant components, relating
the effects of technology-driven physical con-
straints to the performance of server work-
loads running on future CMPs.
We construct detailed parameterized models
that conform to the International Technology
Roadmap for Semiconductors (ITRS, www.itrs.net)
projections of future manufacturing
technologies. Using the analytical models as
constraints, we derive peak-performance
designs by jointly optimizing supply and
threshold voltage, clock frequency, core
count, manufacturing process, cache size,
and memory technology.
Our models have been independently
vetted and used in a recent study of heteroge-
neous computing.2 Similar models were inde-
pendently developed and validated against
PARSEC benchmarks, demonstrating that
multicore performance will not scale with
technology.3 We corroborate these results,
propose specialized computing as a promising
solution, and evaluate its potential.
Hardware model
To evaluate a range of core design choices,
we model CMPs with cores built in one of
three ways: general-purpose (GPP), embedded
(EMB), or specialized (SP). GPP are scalar
four-way multithreaded in-order cores modeled
after Sun’s UltraSPARC,4 representing the class
of modern CMPs targeted for the high-
throughput server environment. A four-way
multithreaded core achieves speedup of 1.7
over a single-threaded core.5 EMB are advanced
low-power embedded cores similar to the cores
in the ARM11 MPCore (www.arm.com/products/CPUs/ARM11MPCoreMultiprocessor.html),
representing a power-conscious core design
paradigm, without a fundamental change in
processor performance when executing server
workloads compared to GPP cores.
To estimate the performance and power
efficiency of specialized cores, we evaluate
hypothetical application-specific cores. Rather
than representing a specific core design, we
envision a CMP populated with specialized
hardware such as GPUs, digital signal pro-
cessors (DSPs), and field-programmable
gate arrays (FPGAs) in addition to application-
specific integrated circuits (ASICs) imple-
menting common server operations. At any
given time, only the subset of these hardware
components that’s best suited to execute effi-
ciently the current workload would be pow-
ered up; we broadly use the label SP to
represent a single powered-up hardware com-
ponent within this design. Based on published
data on general-purpose cores and ASICs run-
ning an optimized context-adaptive binary
arithmetic coding (CABAC) segment of the
H.264 video encoder,6 we estimate that SP
cores can deliver 20× the performance of a
GPP core at 10× less power. This estimate is
conservative because the CABAC segment is
heavily control-intensive and hampers many
hardware optimizations.
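As a quick check of what this estimate implies, the following sketch computes the performance-per-watt ratio of an SP core relative to a GPP core from the two factors quoted above. The constant and function names are illustrative and do not come from the original study.

```python
# Relative figures quoted in the text: an SP core delivers 20x the
# performance of a GPP core while drawing 10x less power.
SP_RELATIVE_PERFORMANCE = 20.0   # SP performance / GPP performance
SP_RELATIVE_POWER = 1.0 / 10.0   # SP power / GPP power


def relative_efficiency(rel_performance: float, rel_power: float) -> float:
    """Performance per watt of a core type, relative to the baseline core."""
    return rel_performance / rel_power


sp_vs_gpp = relative_efficiency(SP_RELATIVE_PERFORMANCE, SP_RELATIVE_POWER)
print(f"SP performance per watt relative to GPP: {sp_vs_gpp:.0f}x")
```

Under these two factors alone, an SP core is roughly two orders of magnitude more power-efficient than a GPP core, which is why the later analysis finds SP designs limited by bandwidth rather than power.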
Each core is supported by 64-Kbyte Level 1
(L1) instruction and data caches and a shared
unified nonuniform cache architecture
(NUCA) L2 cache. We model a 2D torus
interconnect for the L2 cache and separately
optimize the cache size for each technology
node with CACTI 6.0.7 We do not evaluate
deeper cache hierarchies because prior re-
search shows that a NUCA organization out-
performs any multilevel cache design.8
Technology model
Projecting from current technologies to
future trends, we model CMPs across four
fabrication technologies: 65 nm, 45 nm,
32 nm (due in 2013), and 20 nm (due in
2017). We scale technology parameters across
technologies in accordance to the ITRS pro-
jections and model bulk planar CMOS for
the 65- and 45-nm nodes, ultra-thin-body
fully-depleted metal-oxide-semiconductor
field-effect transistors (MOSFETs) at 32 nm,
and double-gate FinFETs at 20 nm.
To mitigate the power wall, we evaluate lowering
the leakage current by using high-Vth transistors.
These low-operational-power (LOP)
transistors experience orders of magnitude
lower subthreshold leakage current, while
achieving 54 to 68 percent of the switching
speed of high-performance (HP) transistors.
We explore CMPs that use high-performance
transistors for the entire chip, high-performance
transistors for the cores and LOP for the
cache, and LOP transistors for the entire chip.
Area model
Based on the ITRS projections, we model a
310-mm² die. Our algorithms eliminate all candidate
designs that exceed the 310-mm² die
area. We proportionally allocate 72 percent of
the die for cores and cache,9 with the remaining
area allocated to the interconnect and system-
on-chip (SoC) components. We estimate the
core area by scaling existing designs.
For GPP cores, we scale an UltraSPARC T1
core,4 measuring 13.67 mm² at 65 nm. For
EMB cores, we scale an ARM11 core, measuring
2.48 mm² at 65 nm. The SP core area is
equal to the area of an EMB core in our
model. We use the ITRS to estimate the area
required for an ECC-protected L2 cache and
scale the cores and caches across technologies.
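The area constraint can be sketched as a simple budget calculation. The die size, the 72-percent allocation, and the 65-nm core areas below come from the text. The ideal quadratic area scaling with feature size and the cache area passed in by the caller are assumptions made only for illustration, since the study scales areas using ITRS data.

```python
import math

DIE_AREA_MM2 = 310.0          # die area modeled in the text
CORE_CACHE_FRACTION = 0.72    # share of the die allocated to cores and cache
CORE_AREA_65NM_MM2 = {"GPP": 13.67, "EMB": 2.48, "SP": 2.48}


def scaled_core_area(core_type: str, node_nm: float) -> float:
    """Core area at node_nm, ASSUMING ideal (node/65)^2 area scaling."""
    return CORE_AREA_65NM_MM2[core_type] * (node_nm / 65.0) ** 2


def max_cores(core_type: str, node_nm: float, cache_area_mm2: float) -> int:
    """Largest core count that fits beside a cache of the given area."""
    budget = DIE_AREA_MM2 * CORE_CACHE_FRACTION - cache_area_mm2
    if budget <= 0:
        return 0
    return math.floor(budget / scaled_core_area(core_type, node_nm))


# Hypothetical usage: the cache areas here are made-up inputs.
for node in (65, 45, 32, 20):
    print(node, "nm:", max_cores("GPP", node, cache_area_mm2=50.0), "GPP cores")
```

Any candidate whose cores and cache exceed the budget is discarded, which mirrors the elimination step described above.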
Performance model
Our performance model is based on
Amdahl’s law, assuming 99-percent applica-
tion parallelism unless otherwise noted. Be-
cause of space constraints, we present only
a summary of the model in this text; the
exact formulas and derivations used for the
presented results are available elsewhere.10,11
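For reference, the textbook form of Amdahl's law with a 99-percent parallel fraction can be sketched as follows. This is the standard symmetric formulation, shown under the assumption that all cores are identical; the exact formulas used for the presented results are in the cited works and may differ.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# With 99-percent parallelism the speedup saturates well below the core count.
for n in (16, 64, 256, 1024):
    print(f"{n:5d} cores -> speedup {amdahl_speedup(0.99, n):6.1f}")
```

The serial 1 percent caps the speedup at 100 regardless of core count, which is why adding many slow cores yields diminishing returns in the later analysis.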
We estimate the performance of a single
core by calculating the aggregate number
of user instructions committed per cycle
(UIPC), which is proportional to overall
server-system throughput.12 We compute
the UIPC by accounting for the memory ac-
cess latency as a function of load/store fre-
quency and cache miss rates, empirically
measuring the fraction of load/store instruc-
tions and the L1 miss rate of each application
using a 16-core full-system simulation in
Flexus.12 We use the L2 access latency
reported by CACTI 6.0.7 Memory-access
latency is projected using 7-percent annual
improvement in DRAM latency, starting
with 53 ns in 2007 (PC-667 at 65 nm).
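The DRAM latency projection can be sketched as a compounding reduction. The 53-ns starting point and the 7-percent annual improvement come from the text; interpreting "improvement" as a 7-percent reduction in latency compounded each year is an assumption.

```python
BASE_YEAR = 2007
BASE_LATENCY_NS = 53.0      # starting latency quoted in the text
ANNUAL_IMPROVEMENT = 0.07   # ASSUMED to compound as a yearly latency reduction


def dram_latency_ns(year: int) -> float:
    """Projected DRAM access latency for the given year."""
    years_elapsed = year - BASE_YEAR
    return BASE_LATENCY_NS * (1.0 - ANNUAL_IMPROVEMENT) ** years_elapsed


for year in (2007, 2010, 2013, 2017):
    print(year, f"{dram_latency_ns(year):.1f} ns")
```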
Table 1 presents our workloads. Unless
noted otherwise, the results are averaged
across all workloads.
L2 cache miss rate and data-set evolution models
The cache miss rate plays a dominant role in
performance. To estimate a workload’s cache
miss rate, we curve-fit empirical measurements
across L2 cache sizes between 256 Kbytes and
64 Mbytes. We find that the x-shifted power
law, $\alpha (x + \beta)^{\gamma}$, offers the closest fit for our
data, having only a 1.3-percent average error
across all measurements. Table 1 lists the
miss-rate model parameters for each workload.
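A minimal sketch of evaluating the shifted power law with the Table 1 parameters follows. Two points are assumptions rather than facts from the text: the exponent is applied with a negative sign so that the miss rate falls as the cache grows (the printed table shows only magnitudes), and the cache size is passed in Mbytes, since the units of the original fit are not stated here.

```python
# (scale, shift, exponent magnitude) as printed in Table 1.
MISS_RATE_PARAMS = {
    "OLTP": (0.5785, 0.4750, 0.589),
    "DSS": (0.5925, 0.5154, 0.327),
    "Apache": (1.0081, 2.1104, 0.503),
}


def l2_miss_rate(workload: str, cache_size: float) -> float:
    """Shifted power law: scale * (cache_size + shift) ** exponent."""
    alpha, beta, gamma_magnitude = MISS_RATE_PARAMS[workload]
    # ASSUMPTION: the exponent is negative, so larger caches miss less.
    return alpha * (cache_size + beta) ** (-gamma_magnitude)


for size in (1, 4, 16, 64):
    rates = {w: round(l2_miss_rate(w, size), 3) for w in MISS_RATE_PARAMS}
    print(f"{size:3d}: {rates}")
```

The outputs are relative values of the fitted curve, not validated miss rates; the full scaling formulas are in the cited prior work.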
Table 1. Workload and miss-rate model parameters.
Workload | Description | α | β | γ | Mean error (%) | Maximum error (%)
Online transaction processing (OLTP) | TPC-C v3.0 on IBM DB2 v8 ESE, 100 warehouses (10 Gbytes), 64 clients, 2-Gbyte buffer pool | 0.5785 | 0.4750 | 0.589 | 1.3 | 8.2
Decision support systems (DSS) | TPC-H throughput test on IBM DB2 v8 ESE, Queries 2, 6, 13, 16, 480-Mbyte buffer pool, 1-Gbyte database, 16 clients | 0.5925 | 0.5154 | 0.327 | 0.5 | 6.5
Web server (Apache) | SPECweb99 on Apache HTTP Server v2.0, 16,000 connections, fast common gateway interface (fastCGI), worker threading model | 1.0081 | 2.1104 | 0.503 | 1.2 | 4.9

We provide the full miss-rate scaling formulas
in a previous work,10 including details on the
curve fitting of miss rates and data-set growth
projections based on the TPC-A, -B, -C,
and -E workloads. When considering designs
in a future fabrication technology, we use the
year of the technology’s introduction to project
the application data-set size with 29 percent
annual growth. We adjusted cache miss
rates based on the projected data-set size for
each technology.
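The data-set growth projection can be sketched as compound growth at 29 percent per year. The base size and base year below are hypothetical inputs chosen for illustration; the text does not state the reference year for the measured data sets.

```python
ANNUAL_DATASET_GROWTH = 0.29  # annual growth rate quoted in the text


def projected_dataset_size(base_size: float, base_year: int, target_year: int) -> float:
    """Data-set size in target_year, compounding from base_year."""
    years_elapsed = target_year - base_year
    return base_size * (1.0 + ANNUAL_DATASET_GROWTH) ** years_elapsed


# Hypothetical example: a 10-Gbyte data set assumed to be measured in 2007.
for year in (2007, 2010, 2013, 2017):
    size = projected_dataset_size(10.0, 2007, year)
    print(year, f"{size:.1f} Gbytes")
```

At this rate a data set grows by more than an order of magnitude in ten years, so a fixed cache covers a shrinking share of the working set at later technology nodes.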
Off-chip bandwidth model
We model the chip bandwidth require-
ments by estimating the off-chip activity rate,
scaled from the simulation measurements.
The off-chip bandwidth usage is proportional
to the L2 miss rate, the number of cores, and
the activity of the cores (such as clock frequency
and core performance). The maximum avail-
able bandwidth is calculated on the basis of
the number of pads and the maximum off-
chip clock as provided by the ITRS for each
technology. Based on the ITRS, two-thirds of
the pads are used for power and ground, leaving
at most one-third available for off-chip memory
signaling. Our algorithms discard candidate
CMP designs with a bandwidth utilization
that exceeds available bandwidth.
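The bandwidth feasibility check can be sketched as follows. Only the one-third pad share and the proportionality to miss rate, core count, and core activity come from the text. The one-bit-per-pad-per-cycle signaling, the 512-bit transfer per miss, and every numeric input in the usage example are assumptions for illustration.

```python
def available_bandwidth_gbps(total_pads: int, offchip_clock_ghz: float,
                             bits_per_pad_per_cycle: float = 1.0) -> float:
    """Peak off-chip bandwidth; at most one-third of pads carry memory signals."""
    signal_pads = total_pads / 3.0
    return signal_pads * offchip_clock_ghz * bits_per_pad_per_cycle


def required_bandwidth_gbps(cores: int, l2_misses_per_instr: float,
                            instr_per_ns_per_core: float,
                            bits_per_miss: int = 512) -> float:
    """Off-chip demand, proportional to miss rate, core count, and activity."""
    return cores * l2_misses_per_instr * instr_per_ns_per_core * bits_per_miss


def is_bandwidth_feasible(required_gbps: float, available_gbps: float) -> bool:
    """Candidate designs that need more than the available bandwidth are discarded."""
    return required_gbps <= available_gbps


# Hypothetical inputs, not ITRS values.
supply = available_bandwidth_gbps(total_pads=3000, offchip_clock_ghz=2.0)
for n in (16, 128, 1000):
    demand = required_bandwidth_gbps(n, l2_misses_per_instr=0.01,
                                     instr_per_ns_per_core=1.0)
    print(n, "cores feasible:", is_bandwidth_feasible(demand, supply))
```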
Additionally, we evaluate CMPs that use
3D-stacked memory13 as a high-capacity
high-bandwidth L3 cache. We treat 3D-stacked
memory as a large L3 cache because the memory
it houses is insufficient for a large-scale
server software installation. We model a 3D-stacked
memory where each layer has a capacity
of 8 Gbits at 45-nm technology.13 The worst-case
power consumption for each 8-Gbit layer
is 3.7 W.11 We model eight layers, for a total
of 8 Gbytes, with an additional control/logic
layer, increasing the average chip temperature
by an estimated 10°C.13 When we evaluate
3D-stacked CMPs, we account for the effects
of the increased temperature on power dissipa-
tion. We estimate that memory access time
improves by 32.5 percent using 3D stacking
due to the more efficient communication be-
tween the cores and 3D memory.11
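The 3D-stack figures quoted above can be checked with a few lines of arithmetic. Summing the per-layer worst-case power (and ignoring the control/logic layer) and applying the 32.5-percent improvement as a multiplicative reduction of a baseline access time are assumptions about how the numbers combine.

```python
LAYERS = 8
GBITS_PER_LAYER = 8
WORST_CASE_WATTS_PER_LAYER = 3.7
ACCESS_TIME_IMPROVEMENT = 0.325

# Eight 8-Gbit layers: 64 Gbits, i.e. 8 Gbytes, as stated in the text.
capacity_gbytes = LAYERS * GBITS_PER_LAYER / 8

# ASSUMPTION: worst-case stack power is the sum over the memory layers only.
worst_case_power_w = LAYERS * WORST_CASE_WATTS_PER_LAYER


def stacked_access_time_ns(baseline_ns: float) -> float:
    """Access time with 3D stacking, ASSUMING a 32.5% multiplicative reduction."""
    return baseline_ns * (1.0 - ACCESS_TIME_IMPROVEMENT)


print(f"capacity: {capacity_gbytes:.0f} Gbytes")
print(f"worst-case memory power: {worst_case_power_w:.1f} W")
print(f"53 ns baseline -> {stacked_access_time_ns(53.0):.1f} ns")
```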
We compute L3 miss rates in the same way as L2 miss rates.
We present 3D-stacked DRAM as a case
study; photonics and other emerging technol-
ogies for mitigating the off-chip bandwidth
wall are expected to exhibit similar trends.
Power model
We use a first-order power model to com-
pute the total chip power by summing the
dynamic and static power of the individual
components (cores, cache, interconnect,
I/O, and SoC components). For each tech-
nology node we evaluate, we use the ITRS
data to dictate the maximum allowable
power for air-cooled chips with heat sink
for that technology. The power limits are
used as an input to the model to automati-
cally discard all candidate CMP designs
that exceed the power limits published by
the ITRS for that technology.
Based on the ITRS projections, power de-
livery to the die will pose an additional con-
straint owing to poor signal integrity when
large currents are delivered at low voltages.
If alternative cooling technologies, such as
liquid cooling, were employed, the power de-
livery would impose a CMP design constraint
for which no mainstream solutions have been
proposed to date. We therefore focus on air-
cooled systems in this study and consider
power limits based only on thermal cooling
rather than on power delivery.
We use the Sun UltraSPARC,4 ARM11
MPCore, and 10 percent of the UltraSPARC
core as reference points for the dynamic power
model of the GPP, EMB, and SP cores, re-
spectively. We compute the dynamic power
of N cores by scaling the reference core’s dy-
namic power PH proportionally to the gate
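capacitance of the target technology, the target
frequency, and the square of the supply voltage:
$$P_{D,N\text{cores}} = N \cdot P_H \cdot \frac{\text{Gate Capacitance}(R)}{\text{Gate Capacitance}(R_H)} \cdot \frac{\left(f_{V_{dd}} \cdot V_{dd}(R)\right)^2}{\left(V_{dd,H}\right)^2} \cdot \frac{F}{F_H}$$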
$V_{dd,H}$ and $F_H$ represent the supply voltage
and frequency of the reference core H in
technology node $R_H$, and $V_{dd}(R)$ is the
nominal supply voltage of technology R.
We use $f_{V_{dd}}$ to perform voltage-frequency
scaling, trading off clock frequency for
power. At temperature T K, the supply voltage
scaling is bounded by $2.3 \cdot V_{th}(R, T) \le f_{V_{dd}} \cdot V_{dd}(R) \le V_{dd}(R)$.
We quantize the scaling factor $f_{V_{dd}}$ in steps of 10 percent.
The frequency F is scaled with the supply
voltage, such that $F \le F_{max}(R) \cdot f_{F_{max}}(f_{V_{dd}})$.
We account for the nonlinear relationship
between frequency and voltage by
fitting published data to compute $f_{F_{max}}(\cdot)$.14
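A minimal sketch of the dynamic-power scaling described above follows. The function names are illustrative, all numeric inputs are hypothetical, and the lower bound of 2.3 times the threshold voltage on the scaled supply voltage reflects our reading of the voltage-scaling constraint in the text.

```python
def dynamic_power(n_cores: int, p_ref: float, gate_cap_ratio: float,
                  f_vdd: float, vdd_nominal: float, vdd_ref: float,
                  freq: float, freq_ref: float) -> float:
    """Scale the reference core's dynamic power to N cores at the target point.

    gate_cap_ratio is the target-to-reference gate capacitance ratio.
    """
    voltage_term = (f_vdd * vdd_nominal) ** 2 / vdd_ref ** 2
    return n_cores * p_ref * gate_cap_ratio * voltage_term * (freq / freq_ref)


def allowed_vdd_factors(vth: float, vdd_nominal: float) -> list:
    """Scaling factors in 10-percent steps that keep Vdd at or above 2.3 * Vth."""
    factors = []
    for tenths in range(10, 0, -1):
        f_vdd = tenths / 10.0
        if 2.3 * vth <= f_vdd * vdd_nominal:
            factors.append(f_vdd)
    return factors


# Hypothetical operating point: Vth = 0.2 V, nominal Vdd = 1.0 V.
for f in allowed_vdd_factors(vth=0.2, vdd_nominal=1.0):
    p = dynamic_power(16, p_ref=2.0, gate_cap_ratio=0.5, f_vdd=f,
                      vdd_nominal=1.0, vdd_ref=1.2, freq=2.0, freq_ref=1.2)
    print(f"f_vdd={f:.1f}: {p:.2f} W at fixed frequency")
```

The usage loop holds frequency fixed to isolate the quadratic voltage term; in the model itself the frequency also falls with the supply voltage, so the power savings from scaling are larger.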
We estimate the L2 dynamic power by
scaling published data for the Sun Ultra-
SPARC T1 cache.4 We scale the cache dy-
namic power across technologies similarly
to the core power and adjust it proportion-
ally to the cache access rate. We compute
cache activity from the measured L1 miss
rates, the core count, and the relative core
performance. We compute the network dy-
namic power based on network activity,
scaled over the same reference design. The
network activity conservatively equals the
cache activity, adjusted by the average hop
count on a 2D-torus interconnect. We esti-
mate the I/O subsystem power proportion-
ally to the reference GPP core design, the
L2 access rate by all participating cores, and
the L2 miss rate, scaled across technologies
similarly to core power. We cap the band-
width to the maximum allowed at each tech-
nology, assuming that worst-case power is
expended when all I/O pins are fully utilized.
Exact formulas and constants used for the
calculation of the dynamic power compo-
nents are available in previous works.10,11
To estimate static power, we model only
the subthreshold leakage because it will dom-
inate gate and junction leakage in future
technologies.15 The cache’s static power is di-
rectly proportional to its size, the supply volt-
age, the gate width, and the leakage current at
the corresponding temperature. We estimate
an average ratio of gate length to width of 3.0
across technologies and obtain gate lengths
from the ITRS. We account for core leakage
by estimating the number of transistors in a
core using the ITRS density projections and
assuming a 50-percent switching rate. We
scale the leakage current to a target tempera-
ture by fitting the subthreshold coefficient.16
We calculate leakage at 66C, a typical oper-
ating temperature of today’s chips.4
Analysis
We begin by explaining the progression of
our peak-performance search algorithm on
the basis of the results plotted in Figure 1a.
The plot shows the aggregate chip perfor-
mance as a function of the L2 cache size.
The area curve shows the performance of
designs that have unlimited power and off-
chip bandwidth and are constrained only
by the die area. We can leverage parallelism
in these designs to achieve high performance
by populating the entire die area with cores.
Following the area curve to the right, a
larger portion of the chip area is dedicated
to the L2 cache. Although larger caches mean
fewer cores, each core’s performance is higher
due to greater cache capacity, leading to
higher aggregate chip performance up to
approximately 64 Mbytes of L2 cache. The
performance benefit beyond 64 Mbytes is
outweighed by the cost of further reducing
the core count, leading to an aggregate
performance drop at larger cache sizes.

Figure 1. Performance of general-purpose (GPP) chip multiprocessors
(CMPs) running Apache, using high-performance (HP) transistors for both
cores and cache (a), HP transistors for cores and low-operational-power
(LOP) for cache (b), and LOP transistors for both cores and cache (c) at
20 nm. The peak-performing CMPs are bandwidth-constrained for small
caches and power-constrained for large caches, with the optimal design point
at the intersection of the two constraints. (VFS: voltage-frequency scaling.)
[Plots omitted. Each panel shows performance (1,000 × MIPS) against cache size (Mbytes, 1 to 512), with curves for area (maximum frequency), power (maximum frequency), bandwidth (VFS), area + power (VFS), and peak performance.]

The power curve shows designs populated
with cores running at the maximum fre-
quency, with power limited due to air-cooling
constraints, but having unlimited area and
off-chip bandwidth. The high power of each
core restricts these designs to a handful of
cores, severely limiting the aggregate chip per-
formance. Increasing cache size takes more
power away from the already limited number
of cores, dropping performance even further.
The bandwidth curve shows designs that
are limited only in off-chip bandwidth, per-
mitting unlimited area and power use. The
core count and core frequency are jointly opti-
mized to find the peak-performing configura-
tion in light of the bandwidth constraint.
Larger caches reduce off-chip bandwidth pres-
sure, allowing the bandwidth-limited designs
to achieve improved performance.
Conversely, the area+power curve shows
designs limited in area and power but per-
mitted to consume unlimited off-chip band-
width. However, unlike the power curve, the
area+power curve jointly optimizes the core
count, voltage, and frequency, selecting the
peak throughput design combination for
each evaluated L2 cache size.
Finally, the peak performance curve follows
the strictest constraint, showing the feasible
CMP designs. At small cache sizes, the off-
chip bandwidth serves as the performance-
limiting factor. Beyond 20 Mbytes, however,
the power consumed by the L2 cache restricts
the number of cores, penalizing performance.
Therefore, we conclude that the peak-performing
GPP design at 20 nm using HP transistors
should use approximately 20 Mbytes of L2
cache with the remaining power budget utilized
by cores. Moreover, the gap between the peak
performance curve and the area curve at
20 Mbytes of cache indicates that the best possible
20-nm GPP design with HP transistors cannot
use the majority of the die area for more cores,
because that area can’t be powered up.
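The search for the peak-performance design can be sketched as taking, at each cache size, the minimum over the constraint curves and then picking the cache size where that minimum is largest. The two toy curves below are invented purely to show the shape of the computation (one rising with cache size like the bandwidth limit, one falling like the area + power limit); they are not the study's models.

```python
def peak_design(cache_sizes, constraint_curves):
    """Return (cache_size, performance) where the strictest constraint peaks."""
    best_size, best_perf = None, float("-inf")
    for size in cache_sizes:
        # Feasible performance is bounded by the strictest constraint.
        perf = min(curve(size) for curve in constraint_curves)
        if perf > best_perf:
            best_size, best_perf = size, perf
    return best_size, best_perf


def toy_bandwidth_curve(cache_mbytes: float) -> float:
    """HYPOTHETICAL: larger caches relieve off-chip bandwidth pressure."""
    return 50.0 * cache_mbytes ** 0.5


def toy_area_power_curve(cache_mbytes: float) -> float:
    """HYPOTHETICAL: larger caches take power and area away from cores."""
    return 400.0 - 8.0 * cache_mbytes


sizes = [1, 2, 4, 8, 16, 32, 64]
print(peak_design(sizes, [toy_bandwidth_curve, toy_area_power_curve]))
```

With these toy curves the optimum falls where the two constraints cross, matching the qualitative behavior described for Figure 1: bandwidth-limited to the left of the optimum and power-limited to the right.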
Figures 1b and 1c extend the analysis to
designs that use slower low-leakage transis-
tors. Because designs using only HP transis-
tors are severely power limited, they can
power only 20 percent of the cores that fit
in the die at 20 nm. Using LOP transistors
for the cache (see Figure 1b) enables larger
caches that can support twice the number
of cores and yield higher performance, with
core leakage accounting for 20 percent of
the chip power. Due to power constraints,
the peak-performance designs must employ
clocks at least 43 percent lower than the
maximum frequency supported by the tech-
nology. Although LOP transistors are slower
than HP transistors, they retain 54 to 68 per-
cent of the maximum switching speed
according to ITRS. As such, LOP devices
are suitable to implement both the cores
and the cache, yielding higher power effi-
ciency (see Figure 1c). Ultimately, however,
even after optimizing transistor types, peak-
performance CMPs can power less than
35 percent of the cores that could fit on die.
Using an identical analysis, we find the
highest performance design feasible for each
technology generation. Figure 2 plots the
core counts of the peak-performing designs
as well as the theoretical number of cores
that could fit into the die area at the
corresponding technology. Beyond 2013 (32 nm),
HP-based designs become impractical due
to the chip power limits. Although LOP-based
designs offer a way forward, the large
gap that emerges between the LOP designs
and the die-area limit suggests that either
die sizes will shrink or a large portion of
the chip silicon will have to remain dark
(powered down).

Figure 2. Core counts for peak-performance GPP CMPs using HP transistors
for both cores and cache, HP transistors for cores and LOP for cache,
and LOP transistors for both cores and cache. The gap between the LOP
designs and the die-area limit suggests that an increasing fraction of the
die area cannot be utilized by cores.
[Plot omitted. Number of cores (1 to 256) against year (2004 to 2019), with curves for the area limit, HP cores and cache, HP cores with LOP cache, and LOP cores and cache.]

Multicore processors with milliwatt cores
Lean cores deliver high performance at
low power; for example, the UltraSPARC
T1 consumes 2 W per thread.4 However,
embedded cores can deliver reasonable perfor-
mance at orders of magnitude lower power;
for example, 279 mW for ARM1176JZ(F)-S
with an eight-stage out-of-order pipeline
and dynamic branch prediction. Because of
their small size and power efficiency,
CMPs employing EMB cores can fit many
more cores within the physical constraints.
We find that EMB-based multicore CMPs
generally exhibit trends similar to GPP-
based multicore CMPs (see Figure 3).
Both GPP and EMB designs require
similar-sized caches to remain within the
bandwidth envelope. But to reach peak per-
formance, EMB multicores require double
the cores compared to GPP. However, the
high core count provides only a marginal
performance benefit because of Amdahl’s
law and the increased power consumption
of the larger on-chip interconnect. As a
result, the best EMB design trails GPP by
13 percent in absolute performance with a
99-percent parallel workload, achieving a
speedup over GPP with 99.6 percent or
higher workload parallelism.
We also evaluated multithreaded EMB
cores and found they behave similarly to single-
threaded cores due to the increased power and
bandwidth requirements. Ultimately, peak-
performance designs with EMB cores can
power only 12 percent of the cores that fit
in the die (see Figure 4), potentially leaving
a large chip area underutilized.
Specialized multicore processors
Amdahl’s law prohibits large core counts
from delivering high aggregate performance
(except for embarrassingly-parallel applications).
An alternative design is to deliver
higher performance with fewer but more
powerful cores. We evaluated an extreme
application of this approach by considering
specialized computing, where a multicore
chip might contain hundreds of diverse
application-specific cores, activating only
those cores that are most useful to the running
application while leaving the vast majority
of the on-chip cores powered down.
The few cores that are simultaneously powered
up in this design reduce the impact of
Amdahl’s law on aggregate performance,
whereas matching specialized cores to an
application’s requirements enables high
performance at high power efficiency.

Figure 3. Performance of CMPs with GPP cores (a), embedded (EMB) cores
(b), and specialized (SP) cores (c) using LOP transistors at 20 nm. SP designs
significantly outperform GPP and EMB designs because SP cores are highly
power-efficient.
[Plots omitted. Each panel shows performance (1,000 × MIPS, 0 to 2,000) against cache size (Mbytes, 1 to 512), with curves for area (maximum frequency), power (maximum frequency), bandwidth (VFS), area + power (VFS), and peak performance.]

Figure 4 compares the number of cores
for the peak-performing designs across the
studied core types and process generations.
We found that the peak-performance SP
designs employ only 16 to 32 cores, with a
large fraction of the chip die area occupied
by a cache. For our workloads, the observa-
tion of low-core-count SP designs out-
performing alternative designs holds up to
99.9 percent parallelism.
The superior power and performance
characteristics of SP cores push the power
envelope much further than is possible
with other core designs. As a result, SP multicores
attain 2× to 12× speedup over the
GPP and EMB designs and are ultimately
constrained by the limited off-chip band-
width (see Figure 5). The performance im-
provement achieved by SP multicores on
server workloads is in line with prior re-
search on mobile applications.17
Effect of bandwidth-mitigating technologies
Bandwidth considerations push cache
sizes up, reducing the power available to em-
ploy more or faster cores for all core types.
We evaluated 3D-stacked DRAM caches to
observe the trends of future processor designs
in light of technologies that might alleviate
off-chip bandwidth pressure for future pro-
cessors. We expect that other bandwidth-
mitigating technologies (such as photonics)
will exhibit similar trends.
A 3D-stacked memory cache pushes the
bandwidth constraint beyond the power
limits, leading to designs that are only
power-constrained and achieve higher per-
formance (see Figure 5). Eliminating the
off-chip bandwidth bottleneck pushes
designs back to the power-limited regime
where the die area is underutilized due to an
inability to power up all cores (see Figure 4).
GPP and EMB CMPs attain only a mod-
est performance improvement (less than
35 percent).
However, the reduction in off-chip band-
width requirements when combining 3D
memory with specialized cores results in significant
speedup (3× at 20 nm) and relieves
the pressure on the on-chip cache size. As a
result, peak-performance designs with SP
cores can be realized in an increasingly
smaller silicon area, with the otherwise dark
silicon used to implement a large collection
of specialized cores to increase the likelihood
of finding a core suitable for the current
computation (see Figure 6).

Figure 4. Core counts for peak-performance GPP, EMB, and SP CMPs,
with conventional or 3D-stacked memory. GPP and EMB designs scale
up to only a few tens to low hundreds of cores, although an order of
magnitude more cores fit in the die. SP designs require only a handful
of cores to attain peak performance.
[Plot omitted. Number of cores (1 to 1,024) against year (2004 to 2019), with curves for max EMB cores, EMB + 3D memory, EMB, GPP + 3D memory, GPP, SP + 3D memory, and SP.]

As technology scaling continues, power constraints will prevent conventional
multicore designs from scaling beyond a few
tens to low hundreds of cores, leaving an
increasing fraction of the die unused.
Specialized multicores repurpose the unused
dark silicon to implement a large number of
workload-specific cores and power up only
the few cores that most closely match the
requirements of the executing workload.
Although specialized multicores are an
appealing design, further research is needed
to realize them. We must characterize modern
workloads to identify computational segments
that are candidates for off-loading to specia-
lized cores and devise core architectures
suitable to execute them. Moreover, we must
develop the software infrastructure and
runtime environment that will facilitate code
migration at the appropriate granularity. We
plan to continue tackling these important
issues and make specialized computing a
reality.
....................................................................
References
1. Y. Watanabe, J.D. Davis, and D.A. Wood,
‘‘Widget: Wisconsin Decoupled Grid Exe-
cution Tiles,’’ Proc. 37th Int’l Symp. Com-
puter Architecture, IEEE CS Press, 2010,
pp. 2-13.
2. E.S. Chung et al., ‘‘Single-Chip Heteroge-
neous Computing: Does the Future Include
Custom Logic, FPGAs, and GPGPUs?’’ Proc.
43rd IEEE/ACM Int’l Symp. Microarchitec-
ture, IEEE CS Press, 2010, pp. 225-236.
3. H. Esmaeilzadeh et al., ‘‘Dark Silicon and
the End of Multicore Scaling,’’ Proc. 38th
Int’l Symp. Computer Architecture, ACM
Press, 2011.
4. A.S. Leon et al., ‘‘A Power-Efficient High-
Throughput 32-Thread SPARC Processor,’’
IEEE J. Solid-State Circuits, vol. 42, no. 1,
2007, pp. 7-16.
5. N. Hardavellas et al., ‘‘Database Servers on
Chip Multiprocessors: Limitations and
Opportunities,’’ Proc. 3rd Biennial Conf.
Innovative Data Systems Research, 2007,
pp. 79-87; www.cidrdb.org/cidr2007.
6. R. Hameed et al., ‘‘Understanding Sources
of Inefficiency in General-Purpose Chips,’’
Proc. 37th Int’l Symp. Computer Architec-
ture, IEEE CS Press, 2010, pp. 37-47.
7. N. Muralimanohar, R. Balasubramonian,
and N. Jouppi, ‘‘Optimizing NUCA Organiza-
tions and Wiring Alternatives for Large
Caches with CACTI 6.0,’’ Proc. 40th
IEEE/ACM Int’l Symp. Microarchitecture,
IEEE CS Press, 2007, pp. 3-14.
8. C. Kim, D. Burger, and S.W. Keckler, ‘‘An
Adaptive, Non-uniform Cache Structure for
Wire-Delay Dominated On-Chip Caches,’’
ACM SIGPLAN Notices, vol. 37, no. 10,
2002, pp. 211-222.
9. J.D. Davis, J. Laudon, and K. Olukotun,
‘‘Maximizing CMP Throughput with Mediocre
Cores,’’ Proc. 13th Int’l Conf. Parallel Archi-
tectures and Compilation Techniques, IEEE
CS Press, 2005, pp. 51-62.
10. N. Hardavellas, ‘‘Chip Multiprocessors for
Server Workloads,’’ doctoral dissertation,
Dept. of Computer Science, Carnegie Mellon
Univ., 2009.
11. N. Hardavellas et al., ‘‘Power Scaling: The
Ultimate Obstacle to 1K-Core Chips,’’ tech.
report NWU-EECS-10-05, Northwestern
Univ., 2010.
[Figure 5 plot. Y-axis: speedup (log scale, 1 to 64). X-axis: year (2004 to 2019). Series: SP + 3D memory; SP; EMB + 3D memory; EMB; GPP + 3D memory; GPP.]
Figure 5. Speedup of peak-performance GPP, EMB, and SP CMPs
using LOP transistors, with conventional and 3D-stacked memory.
SP outperforms GPP and EMB designs by 2× to 12× across technologies.
14 IEEE MICRO
12. T.F. Wenisch et al., ‘‘SimFlex: Statistical
Sampling of Computer System Simulation,’’
IEEE Micro, vol. 26, no. 4, 2006, pp. 18-31.
13. G.H. Loh, ‘‘3D-Stacked Memory Architec-
tures for Multi-core Processors,’’ Proc.
35th Int’l Symp. Computer Architecture,
IEEE CS Press, 2008, pp. 453-464.
14. T. Burd et al., ‘‘A Dynamic Voltage Scaled
Microprocessor System,’’ Proc. IEEE Int’l
Solid-State Circuits Conf., IEEE CS Press,
2000, pp. 294-295.
15. S. Rodriguez and B. Jacob, ‘‘Energy/Power
Breakdown of Pipelined Nanometer Caches
(90nm/65nm/45nm/32nm),’’ Proc. Int’l Symp.
Low-Power Electronics and Design, ACM
Press, 2006, pp. 25-30.
16. H. Hua et al., ‘‘Performance Trend in Three-
Dimensional Integrated Circuits,’’ Proc. Int’l
Interconnect Technology Conf., 2006,
pp. 45-47.
17. N. Goulding-Hotta et al., ‘‘The GreenDroid
Mobile Application Processor: An Architec-
ture for Silicon’s Dark Future,’’ IEEE Micro,
vol. 31, no. 2, 2011, pp. 86-95.
Nikos Hardavellas is the June and Donald
Brewer Assistant Professor of Electrical En-
gineering and Computer Science at North-
western University, and the director of the
Parallel Architecture Group at Northwestern
(PARAG@N). His research interests are in
hardware and software design for energy-
efficient scalable parallel architectures, mem-
ory systems, and on-chip interconnects.
Hardavellas has a PhD in computer science
from Carnegie Mellon University. He’s a
member of the ACM and IEEE.
Michael Ferdman is a PhD candidate in
electrical and computer engineering at
Carnegie Mellon University. His research
interests include computer architecture with
an emphasis on proactive memory system
design. Ferdman has an MS in electrical and
computer engineering from Carnegie Mellon
University. He’s a student member of the
ACM and IEEE.
Babak Falsafi is a professor of computer and
communication sciences at École Polytechnique
Fédérale de Lausanne, where he directs
the EcoCloud center targeting robust, eco-
nomic, and environmentally friendly cloud
technologies. Falsafi has a PhD in computer
science from the University of Wisconsin-
Madison. He is a senior member of the
ACM and IEEE.
Anastasia Ailamaki is a professor of computer
science at École Polytechnique Fédérale
de Lausanne. Her research interests are in
optimizing database workloads for modern
hardware and disks and in managing large
data sets for scientific applications. Ailamaki
has a PhD in computer science from the
University of Wisconsin-Madison.
Direct questions or comments about this
article to Nikos Hardavellas, Northwestern
University, Technological Institute—EECS,
2145 Sheridan Rd., Evanston, IL 60208;
nikos@northwestern.edu.
[Figure 6 plot. Y-axis: die size in mm² (log scale, 64 to 512). X-axis: year (2004 to 2019). Series: Max die size; OLTP; DSS; Apache; Trendline (exp.).]
Figure 6. Die size of peak-performance SP CMPs with 3D-stacked mem-
ory. The gap between the trendline and the maximum die size indicates
that an increasing fraction of the silicon area is left unutilized (dark). Instead
of wasting it, specialized multicores repurpose it to implement application-
specific cores.
