Exploring the design space for 3D clustered architectures by Balasubramonian, Rajeev & Awasthi, Manu
Exploring the Design Space for 3D Clustered Architectures
Manu Awasthi, Rajeev Balasubramonian 
School of Computing, University of Utah 
{manua, rajeev}@cs.utah.edu*
Abstract
3D die-stacked chips are emerging as intriguing prospects for
the future because of their ability to reduce on-chip wire delays
and power consumption. However, they will likely cause an increase
in chip operating temperature, which is already a major bottleneck
in modern microprocessor design. We believe that 3D will provide
the highest performance benefit for high-ILP cores, where wire
delays for 2D designs can be substantial. A clustered
microarchitecture is an example of a complexity-effective
implementation of a high-ILP core. In this paper, we consider 3D
organizations of a single-threaded clustered microarchitecture to
understand how floorplanning impacts performance and temperature.
We first show that delays between the data cache and ALUs are most
critical to performance. We then present a novel 3D layout that
provides the best balance between temperature and performance. The
best-performing 3D layout has 12% higher performance than the
best-performing 2D layout.
Keywords: 3D die-stacked chips, clustered architectures,
microarchitectural floorplanning, wire delays, cache hierarchies.
1. Introduction
Interconnect performance [25] and power [24] have 
emerged as major bottlenecks. The vertical stacking of 
dies enables low intra-chip distances for signal transmis­
sion and helps alleviate the interconnect bottleneck. How­
ever, 3D chips will likely experience high power densi­
ties and operating temperatures; inter-die insulators within 
a 3D chip also make it harder for heat to escape to the 
heat sink on the chip’s surface. Every time the operating 
temperature exceeds a threshold, processors are forced to 
switch to low-power and low-performance modes. A high 
thermal emergency threshold leads to high packaging and 
cooling costs, while a low thermal emergency threshold 
leads to frequent emergencies and lower performance. Assuming
that 3D chips represent the way of the future, it is important
that architects pursue innovations that allow these 3D chips to
either incur tolerable cooling costs or low performance overheads
when dealing with thermal emergencies.
*This work was supported in part by NSF grant CCF-0430063 and by
an NSF CAREER award.
As of now, it is too early to tell if 3D chips do rep­
resent the way of the future. Any one of many hurdles 
can threaten to be a show-stopper: manufacturing limi­
tations, poor yield, cooling limitations, not enough per­
formance benefit, lack of maturity in EDA tools, alterna­
tive computing technologies and market forces. In spite 
of these hurdles, early-stage architecture results are nec­
essary to determine the potential of pursuing the 3D ap­
proach. Recent papers have indicated that the potential 
for performance and power improvement is non-trivial 
[4, 23, 29, 30, 31, 32, 33, 35, 39]. The most compelling 
arguments in favor of 3D chips are as follows:
• Future chip multiprocessors (CMPs) that accommo­
date tens or hundreds of cores will most likely be 
limited by memory bandwidth. Consider the follow­
ing design: DRAM dies stacked on top of process­
ing dies, with inter-die vias scattered all over each die 
surface. By leveraging the entire chip surface area for 
memory bandwidth (as opposed to part of the chip 
perimeter in conventional 2D chips), this design can 
help meet the memory needs of large-scale CMPs. 
There are indications that Intel designers may con­
sider this possibility [20] and Samsung has already 
developed 3D chips that stack DRAM on top of a pro­
cessing die [36].
• It is well-known that as process technologies shrink, 
wire delays do not scale down as well as logic de­
lays. Future processors will likely be communication- 
bound, with on-chip wire delays of the order of 
tens of cycles [1, 18] and on-chip interconnects ac­
counting for half the dynamic power dissipated on a 
chip [24]. A 3D implementation dramatically reduces 
the distances that signals must travel, thereby reduc­
ing wire delay and wire power consumption.
• Since each die can be independently manufactured, 
heterogeneous technologies can be integrated on a 
single chip.
One approach to take advantage of 3D is to implement
every block of a microprocessor as a 3D circuit, also re­
ferred to as folding. Puttaswamy and Loh characterize 
the delay and power advantage for structures such as the 
data cache [29], register file [31], ALU [32], and issue 
queue [30]. The delays of these structures can typically 
be reduced by less than 10%, implying a potential in­
crease in clock speed, or a potential increase in instruction- 
level parallelism (ILP) by accommodating more regis­
ter/cache/issue queue entries for a fixed cycle time tar­
get. The disadvantage with the folding approach is that 
potential hotspots (e.g., the register file) are stacked verti­
cally, further exacerbating the temperature problem. Much 
design effort will also be invested in translating well- 
established 2D circuits into 3D.
An alternative approach is to leave each circuit block 
untouched (as a 2D planar implementation) and leverage 
3D to stack different microarchitectural structures verti­
cally. The primary advantage of this approach is the ability 
to reduce operating temperature by surrounding hotspots 
with relatively cool structures. A second advantage is 
a reduction in inter-unit wire delay/power and a poten­
tially shorter pipeline. The goal of this paper is to carry 
out a preliminary evaluation of this alternative approach. 
Wire delays between microarchitectural blocks may not 
represent a major bottleneck for small cores. Hence, it is 
unlikely that 3D will yield much benefit for such cores. 
Clearly, larger cores with longer inter-unit distances stand 
to gain more from a 3D implementation. Since we are in­
terested in quantifying the maximum performance poten­
tial of 3D for a single core, as an evaluation platform, we 
employ a large clustered architecture capable of support­
ing a large window of in-flight instructions. Many prior 
papers [1, 22, 28] have shown that a clustered microar­
chitecture represents a complexity-effective implementa­
tion of a large core. In a clustered design, processor re­
sources (registers, issue queue entries, ALUs) are parti­
tioned into small clusters, with an interconnect fabric en­
abling register communication between clusters. Such mi­
croarchitectures have even been adopted by industrial de­
signs [22]. A multi-threaded clustered architecture is also 
capable of simultaneously meeting industrial demands for 
high ILP, high TLP (thread-level parallelism), and clock 
speeds [9, 14].
We first present data on the impact of inter-unit wire 
delays on performance (Section 2). Wire delays between 
integer ALUs and data caches have the most significant 
impact on performance. Also, data cache banks are rel­
atively cool structures, while clusters have higher power 
densities. This implies that the relative placement of clus­
ters and cache banks can have a significant impact on per­
formance and temperature. The baseline 2D clustered ar­
chitecture is described in Section 3 and 3D design options 
are explored in Section 4. Performance and temperature 
results are presented in Section 5. We discuss related work
in Section 6 and draw conclusions in Section 7.
2. Motivation
Floorplanning algorithms typically employ a simulated
annealing process to evaluate a wide range of candidate
floorplans. The objective functions for these algorithms
are usually a combination of peak temperature, silicon
area, and metal wiring overhead. In an effort to reduce
temperature, two frequently communicating blocks may
be placed arbitrarily far apart. As a result, additional
pipeline stages are introduced between these blocks just
for signal transmission. In modern microprocessors, the
delays across global wires can exceed a single cycle. The
Intel Pentium4 [17] has a couple of pipeline stages
exclusively for signal propagation. As wire delays continue
to grow, relative to logic delays and cycle times, we can
expect more examples of multi-cycle wire delays within
a microprocessor. We extended the Simplescalar-3.0 [5]
toolset to model the critical loops in a monolithic
superscalar out-of-order processor. In Table 1, we list the
salient processor parameters as well as details regarding the
critical loops. Figure 1 shows the effect of wire delays between
various pipeline stages on average IPC for the SPEC-2k
benchmark set.
We see that wire delays between the ALU and data
cache degrade IPC the most. Every additional cycle
between the ALU and data cache increases the load-to-use
latency by two cycles: it takes an extra cycle to communicate
the effective address to the cache and an extra cycle
to bypass the result to dependent integer ALU operations.
Further, it takes longer for the issue queue to determine if a
load instruction is a cache hit or miss. Many modern
processors employ load-hit speculation, where it is assumed
that loads will hit in the cache and dependent instructions
are accordingly scheduled. Load-hit speculation imposes
an IPC penalty in two ways: (i) In order to facilitate replay
on a load-hit mis-speculation, dependent instructions
are kept in the issue queue until the load hit/miss outcome
is known; this increases the pressure on the issue queue
(regardless of whether the speculations are correct or not).
(ii) On a load-hit mis-speculation, dependent instructions
are issued twice and end up competing twice for resources.
The introduction of wire delays between the ALU and data
cache increases the time taken to determine if the load is a
hit/miss; correspondingly, there is greater pressure on the
issue queue and more dependents are issued on a load-hit
mis-speculation.
The other noticeable wire delays are between the issue
queue and ALUs. These delays also increase the penalties
imposed by load-hit speculation. Every other 4-cycle
wire delay has less than a 5% impact on IPC. These
experiments confirm that the ALUs and data caches must be
placed as close to each other as possible during the
floorplanning process. Hence, for most of this paper, we will
assume that the rest of the pipeline is distributed over
multiple dies in a 3D chip and we will carefully consider the
relative placement of execution units and cache banks.

Simulation parameters
Parameter                        Value
Fetch queue size                 [...]
Branch predictor                 comb. of bimodal and 2-level
Bimodal predictor size           [...]
Level 1 predictor                16K entries, history 12
Level 2 predictor                [...]
BTB size                         16K sets, 2-way
Branch mispredict penalty        at least 10 cycles
Fetch, Dispatch, Commit width    4
Register file size               80 (Int and FP, each)
FP ALUs/mult-div                 2/1
Memory latency                   300 cycles for the first block
L2 unified cache                 2MB 8-way, 30 cycles
I and D TLB                      128 entries, 8KB page size
[...]                            20 Int, 15 FP
[...]                            4/2
[...]                            32KB 2-way
[...]                            32KB 2-way 2-cycle
[...]                            80/40

Pipeline stages involved in wire delay      How the wire delay affects performance
Branch predictor and L1 I-Cache             Branch mispredict penalty
I-Cache and Decode                          Branch mispredict penalty, penalty to detect control instruction
Decode and Rename                           Branch mispredict penalty
Rename and Issue queue                      Branch mispredict penalty and register occupancy
Issue queue and ALUs                        Branch mispredict penalty, register occupancy, L1 miss penalty, load-hit speculation penalty
Integer ALU and L1 D-Cache                  load-to-use latency, L1 miss penalty, load-hit speculation penalty
FP ALU and L1 D-Cache                       load-to-use latency for floating-point operations
Integer ALU and FP ALU                      dependences between integer and FP operations
L1 caches and L2 cache                      L1 miss penalty
Clusters in a clustered microarchitecture   inter-cluster dependences
Table 1. Simulation parameters for the monolithic superscalar and the effect of wire delays on critical loops.

Figure 1. Impact of inter-unit wire delays on IPC for a monolithic superscalar processor.
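The two-cycle cost per cycle of ALU-to-cache wire delay described in this section can be written as a one-line formula. The sketch below is only illustrative: the function name and the 2-cycle cache access time used in the loop are assumptions made for the example, not values taken from the simulator.

```python
def load_to_use_latency(cache_access_cycles: int, wire_delay_cycles: int) -> int:
    """Cycles between issuing a load and issuing a dependent integer ALU op.

    Each cycle of wire delay between the ALU and the data cache is paid
    twice: once to send the effective address to the cache and once to
    bypass the loaded value back to the dependent operation.
    """
    if cache_access_cycles < 0 or wire_delay_cycles < 0:
        raise ValueError("latencies must be non-negative")
    return cache_access_cycles + 2 * wire_delay_cycles


if __name__ == "__main__":
    # Assumed 2-cycle cache access; the wire delay is swept from 0 to 4 cycles.
    for delay in range(5):
        print(delay, load_to_use_latency(2, delay))
```

The formula covers only the load-to-use component; the extra issue queue pressure and replay cost from load-hit speculation are not modeled here.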
3. Baseline Clustered Architecture
Wire delays play a greater role in large cores with many 
resources. We will therefore consider cores with in-flight 
instruction windows as high as 256. Even a medium-scale 
out-of-order processor such as the Alpha 21264 employs 
a clustered architecture to support an in-flight instruction 
window of 80. As an evaluation platform for this study, 
we adopt a dynamically scheduled clustered microarchi­
tecture.
Centralized Front End
As shown in Figure 2, instruction fetch, decode, and 
dispatch (register rename) are centralized in our processor 
model. During register rename, instructions are assigned 
to one of eight clusters. The instruction steering heuristic 
is based on Canal et al.’s ARMBS algorithm [6] and attempts
to minimize load imbalance and inter-cluster communication.
For every instruction, we assign weights to
each cluster to determine the cluster that is most likely to
minimize communication and issue-related stalls. Weights
are assigned to a cluster if it produces input operands for
the instruction. Additional weights are assigned if that
producer has been on the critical path in the past. A cluster
also receives weights depending on the number of free is­
sue queue entries within the cluster. Each instruction is as­
signed to the cluster that has the highest weight according 
to the above calculations. If that cluster has no free register 
and issue queue resources, the instruction is assigned to a 
neighboring cluster with available resources.
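The steering rules above can be sketched as a small scoring function. This is a simplified illustration, not the simulator's implementation: the weight constants, the `Cluster` type, and the use of index distance to find a "neighboring" cluster are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set

# Hypothetical weights; the text does not give the constants actually used.
PRODUCER_WEIGHT = 10
CRITICAL_PRODUCER_BONUS = 5
FREE_ENTRY_WEIGHT = 1


@dataclass
class Cluster:
    free_regs: int
    free_iq_entries: int

    def has_room(self) -> bool:
        # An instruction needs both a register and an issue queue entry.
        return self.free_regs > 0 and self.free_iq_entries > 0


def steer(clusters: Sequence[Cluster],
          producers: Sequence[int],
          critical: Set[int]) -> Optional[int]:
    """Pick a cluster index for one instruction, or None if all are full.

    producers: clusters that produce the instruction's input operands.
    critical:  producer clusters whose producing instruction has been on
               the critical path in the past.
    """
    weights = [FREE_ENTRY_WEIGHT * c.free_iq_entries for c in clusters]
    for p in producers:
        weights[p] += PRODUCER_WEIGHT
        if p in critical:
            weights[p] += CRITICAL_PRODUCER_BONUS

    best = max(range(len(clusters)), key=lambda i: weights[i])
    if clusters[best].has_room():
        return best

    # Fall back to the nearest cluster (by index distance) with resources.
    for dist in range(1, len(clusters)):
        for cand in (best - dist, best + dist):
            if 0 <= cand < len(clusters) and clusters[cand].has_room():
                return cand
    return None
```

With equal free resources everywhere, an instruction follows its producer; when that cluster is full, it spills to an adjacent one.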
Figure 2. Baseline 2D implementation of the 8-cluster 
system.
Execution Units
Our clustered architecture employs small computation 
units (clusters) that can be easily replicated on the die.
Each cluster consists of a small issue queue, physical reg­
ister file, and a limited number of functional units with a 
single cycle bypass network among them. The clock speed 
and design complexity benefits stem from the small sizes 
of structures within each cluster. Dependence chains can 
execute quickly if they only access values within a clus­
ter. If an instruction sources an operand that resides in 
a remote register file, the register rename stage inserts a 
“copy instruction” [6] in the producing cluster so that the 
value is moved to the consumer's register file as soon as 
it is produced. These register value communications hap­
pen over longer global wires and can take up a few cycles. 
Aggarwal and Franklin [2] show that a crossbar intercon­
nect performs the best when connecting a small number 
of clusters (up to four), while a hierarchical interconnect 
performs better for a large number of clusters.
Cache Organization
In this paper, we consider centralized and distributed 
versions of the L1 data cache. Our implementations are 
based on state-of-the-art proposals in recent papers [3, 16, 
34, 40]. Load and store instructions are assigned to clus­
ters, where effective address computation happens. The 
effective addresses are then sent to the corresponding LSQ 
and L1 data cache bank. For a centralized cache organi­
zation, a single LSQ checks for memory dependences be­
fore issuing the load and returning the word back to the re­
questing cluster. When dispatching load instructions, the 
steering heuristic assigns more weights to clusters that are 
closest to the centralized data cache.
As examples of decentralized cache organizations, we 
consider replicated and word-interleaved caches. In a 
replicated cache, each cache bank maintains a copy of the 
L1 data cache. This ensures that every cluster is relatively 
close to all of the data in the L1 cache. However, in ad­
dition to the high area overhead, every write and cache 
refill must now be sent to every cache bank. An LSQ at 
every cache bank checks for memory dependences before 
issuing loads. A word-interleaved cache distributes every 
cache line among the various cache banks (for example, all 
odd words in one bank and even words in another bank). 
This ensures that every cluster is relatively close to some 
of the data in the L1 cache. Word-interleaved caches have 
larger capacities than replicated caches for a fixed area 
budget. Once the effective address is computed, it is sent to 
the corresponding LSQ and cache bank. Load instructions 
must be steered to clusters that are in close proximity to 
the appropriate cache bank. Since the effective address is 
not known at dispatch time, a predictor is employed and 
the predicted bank is fed as an input to the instruction 
steering algorithm. A mechanism [40] is required to en­
sure that memory dependences are maintained even when 
a store instruction’s bank is mispredicted. Initially, each 
LSQ maintains a dummy entry for every store, preventing
subsequent loads from issuing. Once the store address is
known, only the corresponding LSQ tracks that address,
while other LSQs remove the dummy entry. Thus, both
decentralized caches suffer from the problem that stores
have to be broadcast to all LSQs.
Figure 3. Block diagrams for 8-cluster 3D architectures 1, 2, and 3: (a) Arch-1 (cache-on-cluster), (b) Arch-2 (cluster-on-cluster), (c) Arch-3 (staggered). Legend: cluster, cache bank, intra-die horizontal wire, inter-die vertical wire; the dies are labeled Die 0 and Die 1.
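The word-interleaved bank mapping and the dummy-entry handling of stores can be sketched as follows. This is a minimal model under stated assumptions: an 8-byte word, two banks, and the `BankLSQ` class and its method names are all hypothetical. Store-to-load forwarding and address matching are deliberately left out; only the blocking effect of dummy entries is shown.

```python
WORD_BYTES = 8  # assumed word size
NUM_BANKS = 2   # e.g. even words in bank 0, odd words in bank 1


def bank_of(address: int) -> int:
    """Bank that holds the word at `address` in a word-interleaved cache."""
    return (address // WORD_BYTES) % NUM_BANKS


class BankLSQ:
    """Per-bank load/store queue, reduced to its store bookkeeping."""

    def __init__(self, bank: int) -> None:
        self.bank = bank
        # store id -> resolved address, or None while the entry is a dummy
        self.stores = {}

    def dispatch_store(self, store_id: int) -> None:
        # The address is unknown at dispatch, so every LSQ gets a dummy entry.
        self.stores[store_id] = None

    def resolve_store(self, store_id: int, address: int) -> None:
        # The store address is broadcast to all LSQs once it is computed.
        if bank_of(address) == self.bank:
            self.stores[store_id] = address  # this LSQ keeps tracking it
        else:
            del self.stores[store_id]        # other LSQs drop the dummy

    def load_may_issue(self, older_store_ids) -> bool:
        # A load is held back while any older store is still a dummy here.
        return all(self.stores.get(sid, 0) is not None
                   for sid in older_store_ids)
```

A store dispatched to both LSQs blocks younger loads in both banks until its address resolves; after that, only the owning bank keeps an entry.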
4. 3D Clustered Architectures
We consider the three most interesting relative place­
ments for clusters and cache banks in 3D (shown in Fig­
ure 3). For most of this discussion, we assume that (i) two 
dies are bonded face-to-face (F2F [4]), (ii) each cache bank 
has the same area as a set of four clusters, and (iii) the sys­
tem has eight clusters and two cache banks. The floorplan- 
ning principles can also be extended to greater numbers of 
clusters, cache banks, and dies. The differentiating design 
choices for the three architectures in Figure 3 are: (i) How 
close is a cluster to each cache bank? (ii) Which commu­
nication link exploits the low-latency inter-die via? These 
choices impact both temperature and performance.
Architecture 1 (cache-on-cluster):
In this architecture, all eight clusters are placed on the 
lower device layer (die 0) while the data cache banks are 
placed on the upper device layer (die 1). The heat sink 
and spreader are placed on the upper device layer. The L1 
data cache is decentralized and may either be replicated or 
word-interleaved. The link from each crossbar to the cache 
banks is implemented with inter-die vias. Inter-die vias are 
projected to have extremely low latencies and sufficient 
bandwidth to support communication for 64-bit register 
values [footnote 1]. In such an architecture, communication between
two sets of four clusters can be expensive. Such communication
is especially encountered for programs with poor
register locality or poor bank prediction rates (in the case
of a word-interleaved cache). By placing all (relatively
hot) clusters on a single die, the rate of lateral heat spreading
is negatively impacted. On the other hand, vertical heat
spreading is encouraged by placing (relatively) cool cache
banks upon clusters.
1 Inter-die vias have a length of 10 µm and a pitch of [...] [23].
Architecture 2 (cluster-on-cluster):
This is effectively a rotated variation of Architecture 1. 
Clusters are stacked vertically, and similarly, cache banks 
are also stacked vertically. In terms of performance, com­
munication between sets of four clusters is now on faster 
inter-die vias, while communication between a cluster and 
its closest cache bank is expensive. In terms of thermal 
characteristics, the rate of lateral heat spreading on a die 
is encouraged, while the rate of vertical heat spreading be­
tween dies is discouraged.
Architecture 3 (staggered):
Architecture 3 attempts to surround hot clusters with 
cool cache banks in the horizontal and vertical directions 
with a staggered layout. This promotes the rate of vertical 
and lateral heat spreading. Each set of four clusters has 
a link to a cache bank on the same die and a low-latency 
inter-die link to a cache bank on the other die. Thus, ac­
cess to cache banks is extremely fast. In a word-interleaved 
cache, bank prediction helps guide a load to a cluster that 
can access the predicted cache bank with a vertical inter­
connect. In a replicated cache, a load always employs 
the corresponding vertical interconnect to access the cache 
bank. On the other hand, register communication between 
sets of four clusters may now be more expensive as three 
routers must be navigated. However, there are two equidis­
tant paths available for register communication, leading to 
fewer contention cycles. In our experiments, register trans­
fers are alternately sent on the two available paths.
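The alternation of register transfers over the two equidistant paths in Architecture 3 amounts to round-robin path selection. The sketch below shows one way to express it; the class name and the path labels are hypothetical and stand in for whatever identifiers a simulator would use.

```python
from itertools import cycle


class RegisterTransferRouter:
    """Sends successive inter-cluster register transfers on alternating paths."""

    def __init__(self, paths=("path_a", "path_b")) -> None:
        self._next_path = cycle(paths)

    def route(self, transfer_id: int):
        # Each call consumes the next path in round-robin order.
        return transfer_id, next(self._next_path)
```

Strict alternation splits the transfers evenly between the two paths regardless of their source or destination.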
Sensitivity Study:
Most of our evaluation employs a specific 8-cluster 2-bank
system to understand how 3D organizations impact
performance and temperature characteristics. As future
work, we plan to also quantify these effects as a function
of number of dies, clusters, cache banks, network
characteristics, different resource sizes, etc. For this paper, we
repeat our experiments for one other baseline system with
four clusters. This helps confirm that our overall conclusions
are not unique to a specific processor model.
The second processor model has four clusters and each
cluster is associated with a cache bank (either
word-interleaved or replicated). The clusters are connected with
a ring network. Figure 4 illustrates the two-die organizations
studied for the 4-cluster system.
Figure 4. Block diagrams for 3D organizations of the 4-cluster system: (a) Arch-1 (cache-on-cluster), (b) Arch-2 (cluster-on-cluster), (c) Arch-3 (staggered). Legend: cluster, cache bank, intra-die horizontal wire, inter-die vertical wire.
5. Results
5.1. Methodology
Parameter                      Value
Fetch queue size               64
Branch predictor               comb. of bimodal and 2-level
Bimodal predictor size         2048
Level 1 predictor              1024 entries, history 10
Level 2 predictor              4096 entries
BTB size                       2048 sets, 2-way
Branch mispredict penalty      at least 12 cycles
Fetch width                    8 (across up to 2 basic blocks)
Issue queue size               15 per cluster (int and fp, each)
Register file size             30 per cluster (int and fp, each)
Integer ALUs/mult-div          1/1 (in each cluster)
FP ALUs/mult-div               1/1 (in each cluster)
L1 I-cache                     64KB 2-way
L1 D-cache                     64KB 2-way set-assoc (8-clusters),
                               32KB 2-way set-assoc (4-clusters),
                               6 cycles, 2-way word-inter/replicated
L2 unified cache               8MB 8-way, 25 cycles
I and D TLB                    128 entries, 8KB page size
Memory latency                 300 cycles for the first chunk
Address predictor table size   64KB
Table 2. Simplescalar simulator parameters.
Our simulator is based on Simplescalar-3.0 [5] for the
Alpha AXP ISA. Separate issue queues and physical register
files are modeled for each cluster. Major simulation
parameters are listed in Table 2. Contention for interconnects
and memory hierarchy resources is modeled in detail.
Each cluster is assumed to have a register file size of
30 physical registers (integer and floating point, each) and
15 issue queue entries (integer and floating point, each).
We present results for 23 of the 26 SPEC2k integer and
floating point benchmarks [footnote 2]. The programs are simulated
for 100 million instruction windows identified by the
Simpoint [37] toolkit.
2 Sixtrack, Facerec, and Perlbmk are not compatible with our
simulation environment.
The latencies of interconnects are estimated based on
distances between the centers of microarchitectural blocks
in the floorplan. Intra-die interconnects are implemented
on the 8X metal plane, and a clock speed of 5 GHz at 90nm
technology is assumed. Figures 3 and 4 are representative
of the relative sizes of clusters and cache banks. Each
crossbar router accounts for a single cycle delay. For the
topology in Figure 2, for intra-die interconnects, it takes
four cycles to send data between two crossbars, one cycle
to send data between a crossbar and cluster, and three
cycles to send data between the crossbar and 32KB cache
bank. All vertical inter-die interconnects are assumed to
have a single cycle latency due to their extremely short
length (10 µm [23]). For intra-die interconnects in the
4-cluster organization, the inter-crossbar latency is 2 cycles
and the crossbar-cache latency is 2 cycles (each cache bank
is 8KB).
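The per-link latencies above can be combined into uncontended path latencies for the 2D baseline and for Architectures 1 and 2. The sketch below is a rough illustration only. The link names, the hop sequences, and the assumption that a message pays one router delay per crossbar it traverses are all choices made for the example; Architecture 3 is left out because each of its crossbars has both a horizontal and a vertical cache link. The results are one-way, contention-free figures and are not meant to reproduce measured averages.

```python
# Per-link latencies in cycles for the 8-cluster system, from the text.
INTRA_DIE_CYCLES = {"xbar-xbar": 4, "xbar-cluster": 1, "xbar-cache": 3}
INTER_DIE_CYCLES = 1  # any link implemented with inter-die vias
ROUTER_CYCLES = 1     # each crossbar router traversed

# Links that each organization implements with inter-die vias (Section 4).
VERTICAL_LINKS = {
    "2D": frozenset(),
    "arch1": frozenset({"xbar-cache"}),  # cache-on-cluster
    "arch2": frozenset({"xbar-xbar"}),   # cluster-on-cluster
}

# Assumed hop sequences: (links traversed, routers traversed).
LOCAL_LOAD_REQUEST = (("xbar-cluster", "xbar-cache"), 1)
REMOTE_REGISTER_TRANSFER = (("xbar-cluster", "xbar-xbar", "xbar-cluster"), 2)


def link_cycles(arch: str, link: str) -> int:
    """Latency of one link, depending on whether it is vertical in `arch`."""
    if link in VERTICAL_LINKS[arch]:
        return INTER_DIE_CYCLES
    return INTRA_DIE_CYCLES[link]


def path_cycles(arch: str, path) -> int:
    """Uncontended one-way latency of a path: link delays plus router delays."""
    links, routers = path
    return sum(link_cycles(arch, l) for l in links) + routers * ROUTER_CYCLES


if __name__ == "__main__":
    for arch in VERTICAL_LINKS:
        print(arch,
              path_cycles(arch, LOCAL_LOAD_REQUEST),
              path_cycles(arch, REMOTE_REGISTER_TRANSFER))
```

Under these assumptions, moving a link onto inter-die vias shortens only the paths that use it: Architecture 1 helps local cache requests, while Architecture 2 helps register transfers between the two sets of clusters.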
The bank predictor for the word-interleaved cache or­
ganization is based on a strided address predictor. The 
predictor has an average accuracy of 75%. This predic­
tor performs better than a branch predictor-like two-level 
predictor.
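A strided address predictor used for bank prediction can be sketched as a table indexed by the load's PC that remembers the last address and the last observed stride. This is a generic stride predictor written for illustration. The table size, indexing function, 8-byte word, and two-bank mapping are assumptions, and the paper's predictor may differ in detail (for example, in confidence tracking).

```python
class StridedBankPredictor:
    """Predicts the cache bank of a load from its last address and stride."""

    def __init__(self, num_banks: int = 2, word_bytes: int = 8,
                 table_entries: int = 1024) -> None:
        self.num_banks = num_banks
        self.word_bytes = word_bytes
        self.table_entries = table_entries
        self.table = {}  # table index -> (last_address, stride)

    def _index(self, pc: int) -> int:
        # Drop the low PC bits (assumed 4-byte instructions) and wrap.
        return (pc >> 2) % self.table_entries

    def predict(self, pc: int):
        """Return the predicted bank, or None if the load has no history."""
        entry = self.table.get(self._index(pc))
        if entry is None:
            return None
        last_address, stride = entry
        predicted_address = last_address + stride
        return (predicted_address // self.word_bytes) % self.num_banks

    def update(self, pc: int, address: int) -> None:
        """Record the resolved address and the stride from the previous one."""
        idx = self._index(pc)
        entry = self.table.get(idx)
        stride = 0 if entry is None else address - entry[0]
        self.table[idx] = (address, stride)
```

For a load that walks an array one word at a time, the predicted bank alternates with each access once the stride has been observed.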
We assume a face-to-face (F2F) wafer-bonding technol­
ogy for this study. F2F bonding allows a relatively high 
inter-die via density [23] because of which we assume that 
the inter-die bandwidth is not a limiting constraint for our 
experiments.
Parameter                                 Value
Unit Area (Router+Crossbar)               0.3748 mm^2 [23]
Router+Crossbar Power                     119.55 mW [23]
Wire Power/Unit Length (Data & Control)   1.422 mW/mm [8]
Table 3. Interconnect power modeling parameters.
Parameter                        Value
Specific heat capacity (Si)      [...]
Specific heat capacity (SiO2)    [...]
[...]                            1.69 (W/m/K)^-1
[...]                            40 (W/m/K)^-1
Table 4. Hotspot Parameters [11].
The Wattch power models are employed to compute 
power consumption of each microarchitectural block. The 
contribution of leakage to total chip power is roughly 20%. 
Interconnect power (summarized in Table 3) is based on 
values for 8X minimum-length wires [8] and a generic 
Network-on-Chip router [23]. Even though prior studies 
[13] have shown that inter-die vias consume little power, 
we consider their marginal power contributions (modeled
as wires of length 10 µm).
Temperature characteristics are generated by feeding
the power values to Hotspot-3.0’s [38] grid model with a
500 × 500 grid resolution. Hotspot does not consider
interconnect power for thermal modeling. Hence, consistent
with other recent evaluations [19], the power of each
interconnect is attributed to the units that it connects, in
proportion to their respective areas. Hotspot’s default heat sink
model and a starting ambient temperature of 45 °C are assumed
for all experiments. Table 4 provides more details on the
HotSpot parameters used. Each die is modeled as two 
layers - the active silicon and the bulk silicon. A layer 
of thermal interface material (TIM) is also assumed to be 
present between the bulk silicon of the top die and the heat 
spreader [33].
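The area-proportional attribution of interconnect power described above can be written as a short helper. The function name and the example numbers are illustrative assumptions; only the proportional split itself comes from the text.

```python
def attribute_link_power(link_power_mw: float,
                         area_a_mm2: float,
                         area_b_mm2: float):
    """Split one interconnect's power between the two units it connects.

    Each unit receives a share proportional to its area, so the sum of the
    two shares equals the link power.
    """
    total_area = area_a_mm2 + area_b_mm2
    if total_area <= 0:
        raise ValueError("unit areas must sum to a positive value")
    share_a = link_power_mw * area_a_mm2 / total_area
    return share_a, link_power_mw - share_a


if __name__ == "__main__":
    # Hypothetical link: 10 mW between a 1 mm^2 unit and a 3 mm^2 unit.
    print(attribute_link_power(10.0, 1.0, 3.0))
```

The shares are then added to each unit's own power before the values are passed to the thermal model.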
5.2. IPC Analysis
The primary difference between Architectures 1/2/3
(Figure 3) is the set of links that are implemented as
inter-die vias. Hence, many of our IPC results can be explained
based on the amount of traffic on each set of links. We
observed that for a word-interleaved cache, even with a
75% bank prediction accuracy, loads are often not steered
to the corresponding cluster (because of load imbalance or
other register dependences). Hence, nearly half the cache
accesses are to the remote cache bank through the
inter-crossbar interconnect. Unless bank predictors with
accuracies greater than 95% can be designed, word-interleaved
cache organizations will likely continue to suffer from
[...] organization, all load requests are sent to the local cache bank.
About half as many register transfers are sent on the
inter-crossbar interconnect between clusters. Table 5 shows the
average network latencies experienced by loads and register
transfers in the most relevant 8-cluster architectures.

Figure 5. Relative IPC improvement of word-interleaved architectures over the 2D base case with [...]

Figure 6. Relative IPC improvement of replicated data-cache architectures over 2D centralized cache architecture [...]

                                  Word-interleaved cache (3D)   Replicated cache (3D)
Access type                       arch1    arch2    arch3       arch1    arch2    arch3
Local load accesses               4.50     5.89     4.80        4.12     5.24     4.13
Remote load accesses              9.10     8.90     7.30        0        0        0
Inter-crossbar register traffic   8.30     7.54     5.80        8.15     7.00     5.53
Table 5. Average network latencies (in cycles) for different types of interconnect messages.
For Figures 5 and 6, we fix the 2D 8-cluster system with 
a centralized cache as the baseline. A 2D system with a 
word-interleaved cache performs only 2% better than the 
baseline, mostly because of the poor bank prediction rate. 
A 2D system with a replicated cache performs about 7.7% 
better than the baseline. The replicated cache performs 
better in spite of having half the L1 data cache size -  the 
average increase in the number of L1 misses in moving 
from a 64KB to a 32KB cache was 0.85%. A replicated 
cache allows instructions to not only be close to relevant 
data, but also close to relevant register operands. However, 
store addresses and data are broadcast to both cache banks 
and data is written into both banks (in a word-interleaved 
organization, only store addresses are broadcast to both 
banks).
Figures 5 and 6 show IPC improvements for word- 
interleaved and replicated cache organizations over the 2D 
baseline. The word-interleaved organizations are more 
communication-bound and stand to gain much more from 
3D. The staggered architecture-3 performs especially well 
(20.8% better than the baseline) as every cluster is rela­
tively close to both cache banks, bank mis-predictions are 
not very expensive, and multiple network paths lead to 
fewer contention cycles. Architecture-2 performs better 
than Architecture-1 because it reduces the latency for reg­
ister traffic, while slowing down access for correctly bank- 
predicted loads. The opposite effect is seen for the repli­
cated cache organizations because Architecture-2 slows 
down access for all loads (since every load accesses the 
local bank).
With the replicated cache, architecture-3 is similar to 
architecture-1 as regards cache access, but imposes greater 
link latency for inter-cluster register communication. Be­
cause there are multiple paths for register communication, 
architecture-3 imposes fewer contention cycles. As can 
be seen in Table 5, the average total latency encountered 
by register transfers is lowest for architecture-3, for both 
word-interleaved and replicated organizations. The net result
is that architecture-3 performs best for both cache
organizations. The move to 3D causes only a 5% improvement
[...] an 18% improvement for the word-interleaved organization.
For the architecture-3 model, the word-interleaved
and replicated organizations have similar latencies for
instructions (Table 5), but the word-interleaved organization
has twice as much L1 cache capacity. It is interesting
to note that an organization such as the word-interleaved
cache, which is quite unattractive in 2D, has the best
performance in 3D (arch-3).
Figure 7. On-chip peak temperatures for 8-cluster organizations (°C).
The conclusions from our sensitivity analysis with a 4- 
cluster organization are similar. Compared to a 2D base­
line with a centralized cache, the 3D word-interleaved ar­
chitectures 1, 2, and 3 yield an improvement of 9%, 10%, 
and 16%, respectively. The 3D replicated architectures 1,
2, and 3 yield improvements of 9%, 13%, and 15%, respec­
tively. The move from 2D to 3D yields an improvement of 
9.7% for the word-interleaved and 8% for the replicated 
cache organizations.
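The 12% gap between the best 3D and best 2D organizations can be checked from the improvements quoted in this section, all of which are measured against the same 2D centralized-cache baseline: 20.8% for the 3D word-interleaved Architecture 3, 7.7% for the 2D replicated cache, and 2% for the 2D word-interleaved cache. The sketch below assumes these average improvements can be treated as IPC ratios and divided; the function name is chosen for the example.

```python
def relative_gain(new_pct_over_base: float, old_pct_over_base: float) -> float:
    """Percent gain of `new` over `old`, given each one's gain over a shared baseline."""
    new_ratio = 1.0 + new_pct_over_base / 100.0
    old_ratio = 1.0 + old_pct_over_base / 100.0
    return (new_ratio / old_ratio - 1.0) * 100.0


if __name__ == "__main__":
    # Best 3D (word-interleaved, Architecture 3) vs. best 2D (replicated cache).
    print(round(relative_gain(20.8, 7.7), 1))
    # 3D word-interleaved Architecture 3 vs. 2D word-interleaved.
    print(round(relative_gain(20.8, 2.0), 1))
```

The first ratio comes out at roughly 12% and the second at roughly 18%, consistent with the figures quoted for the best-3D-over-best-2D comparison and for the word-interleaved organization.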
5.3. Thermal Analysis
As shown in the previous sub-section, the best 3D organization
outperforms the best 2D organization by 12%.
The primary benefit of 3D is that cache banks can be
placed close to clusters, allowing high performance even
for word-interleaved caches with poor bank prediction
rates. The architectures that place cache banks close to
clusters also have favorable thermal characteristics. For
the 8-cluster system, Figure 7 shows the peak temperature
attained by each architecture, while Figure 8 shows
the average temperature of the hottest on-chip unit (typically
one of the issue or load/store queues).
Figure 8. Average temperatures of hottest on-chip unit, for 8-cluster organizations (°C)
A similar profile is also observed for the 4-cluster system and the profile
is largely insensitive to the choice of word-interleaved or
replicated banks. Architectures 1 and 3, which stack cache
upon cluster, are roughly 12 °C cooler than architecture-2,
which stacks cluster upon cluster. Thus, the staggered
architecture 3 not only provides the highest performance, but
also limits the increase in temperature when moving to 3D.
The lateral heat spreading effect played a very minor role
in bringing down architecture 3’s temperature; in fact, it
was hotter than architecture 1 because of its higher IPC
and power density. All 3D organizations suffer from significantly
higher temperatures than the 2D chip. Note that
our thermal model assumes HotSpot’s default heat sink and
does not take into account the ability of the inter-die vias
to conduct heat to the heat sink. An advantage of 3D is the
reduction in interconnect power. On average, for the 8-cluster
configurations, we recorded a decrease of 8%, 11%, and
10%, respectively, for architectures 1, 2, and 3.
6. Related Work
While the VLSI community has actively pursued tools
to implement 3D circuits [10, 12], the computer architecture
community is only beginning to understand the implications
of 3D architectures. A research group at Intel
reported [4] a 15% improvement in performance and power
by implementing an IA-32 processor in 3D. As described
in Section 1, Puttaswamy and Loh examine 3D implementations
of specific structures such as the data cache [29],
register file [31], ALU [32], and issue queue [30]. They
also study the temperature profile of an Alpha 21364-like
processor that is implemented across up to four dies and
report a temperature increase of up to 33 Kelvin [33].
For that study, most RAM, CAM, and ALU structures are 
folded across the dies. In this paper, we attempt no fold­





ined the effect of folding the L1 data cache [35, 39]. A 
recent paper [23] explores a CMP with a 3D network of
processing cores and L2 cache banks. Each core in the
CMP is implemented on a single die and the core is
surrounded (horizontally and vertically) by L2 cache banks
to reduce temperature. The L2 cache is implemented as
a Non-Uniform Cache Architecture (NUCA) and the 3D
implementation enables about a 50% reduction in average
L2 access time.
Clustered (partitioned) processors [15, 21, 28] have
received much attention over the past decade. The Alpha
21264 [22] is an example of a commercial design that has
adopted such a microarchitecture. While interest in ILP
has waned in recent years, clustered multi-threaded
architectures [9, 14] may simultaneously provide high ILP, high
TLP, and high clock speeds. This paper is the first study
of a 3D implementation of a clustered architecture.
Temperature studies involving clustered architectures include
those by Chaparro et al. [7], Nelson et al. [27], and
Muralimanohar et al. [26]. The design of distributed data
caches for clustered architectures has been evaluated by
Zyuban and Kogge [40], Gibert et al. [16], and
Balasubramonian [3].
7. Conclusions
3D technology can benefit large high-ILP cores by
reducing the distances that signals must travel. In this paper,
we consider various 3D design options for a clustered
architecture. Placing caches and clusters in close proximity
in 3D enables high performance and relatively low
temperature. We show that a word-interleaved cache with a
staggered 3D placement performs 12% better than the best
2D design (with a replicated cache). For future work, we
plan to study the effect of scaling the system to more dies,
clusters, cache banks, etc. We also plan to design more
accurate bank predictors that may enable better locality of
computation and data, thereby leveraging the benefits of
3D.
References
[1] V. Agarwal, M. Hrishikesh, S. Keckler, and D. Burger.
Clock Rate versus IPC: The End of the Road for Conventional
Microarchitectures. In Proceedings of ISCA-27,
pages 248-259, June 2000.
[2] A. Aggarwal and M. Franklin. An Empirical Study of the
Scalability Aspects of Instruction Distribution Algorithms
for Clustered Processors. In Proceedings of ISPASS, 2001.
[3] R. Balasubramonian. Cluster Prefetch: Tolerating On-Chip
Wire Delays in Clustered Microarchitectures. In Proceedings
of ICS-18, June 2004.
[4] B. Black, D. Nelson, C. Webb, and N. Samra. 3D Processing
Technology and its Impact on IA32 Microprocessors.
In Proceedings of ICCD, October 2004.
[5] D. Burger and T. Austin. The Simplescalar Toolset, Version
2.0. Technical Report TR-97-1342, University of
Wisconsin-Madison, June 1997.
[6] R. Canal, J. M. Parcerisa, and A. Gonzalez. Dynamic Code
Partitioning for Clustered Architectures. International
Journal of Parallel Programming, 29(1):59-79, 2001.
[7] P. Chaparro, J. Gonzalez, and A. Gonzalez. Thermal-effective
Clustered Micro-architectures. In Proceedings
of the 1st Workshop on Temperature Aware Computer Systems,
held in conjunction with ISCA-31, June 2004.
[8] L. Cheng, N. Muralimanohar, K. Ramani, R. Balasubramonian,
and J. Carter. Interconnect-Aware Coherence Protocols
for Chip Multiprocessors. In Proceedings of ISCA-33,
June 2006.
[9] J. Collins and D. Tullsen. Clustered Multithreaded Architectures
- Pursuing Both IPC and Cycle Time. In Proceedings
of the 18th IPDPS, April 2004.
[10] J. Cong and Y. Zhang. Thermal-Driven Multilevel Routing
for 3-D ICs. In Proceedings of ASP-DAC, January 2005.
[11] CRC Press. CRC Handbook of Chemistry. 
http://www.hpcpnetbase.com/.
[12] S. Das, A. Chandrakasan, and R. Reif. Three-Dimensional
Integrated Circuits: Performance, Design Methodology,
and CAD Tools. In Proceedings of ISVLSI, 2003.
[13] W. R. Davis, J. Wilson, S. Mick, J. Xu, H. Hua, C. Mineo,
A. M. Sule, M. Steer, and P. D. Franzon. Demystifying 3D
ICs: The Pros and Cons of Going Vertical. IEEE Design &
Test of Computers, 22(6):498-510, 2005.
[14] A. El-Moursy, R. Garg, D. Albonesi, and S. Dwarkadas.
Partitioning Multi-Threaded Processors with a Large Number
of Threads. In Proceedings of ISPASS, March 2005.
[15] K. Farkas, P. Chow, N. Jouppi, and Z. Vranesic. The Multicluster
Architecture: Reducing Cycle Time through Partitioning.
In Proceedings of MICRO-30, pages 149-159,
1997.
[16] E. Gibert, J. Sanchez, and A. Gonzalez. Effective Instruction
Scheduling Techniques for an Interleaved Cache
Clustered VLIW Processor. In Proceedings of MICRO-35,
pages 123-133, November 2002.
[17] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean,
A. Kyker, and P. Roussel. The Microarchitecture of the 
Pentium 4 Processor. Intel Technology Journal, Q1, 2001.
[18] R. Ho, K. Mai, and M. Horowitz. The Future of Wires.
Proceedings of the IEEE, Vol. 89, No. 4, April 2001.
[19] W.-L. Hung, G. Link, Y. Xie, N. Vijaykrishnan, and M. J.
Irwin. Interconnect and Thermal-aware Floorplanning for 3D
Microprocessors. In Proceedings of ISQED, pages 98-104, 2006.
[20] J. Rattner. Predicting the Future, 2005.
Keynote at Intel Developer Forum (article at
http://www.anandtech.com/tradeshows/showdoc.aspx?i=2367&p=3).
[21] S. Keckler and W. Dally. Processor Coupling: Integrating
Compile Time and Runtime Scheduling for Parallelism. In
Proceedings of ISCA-19, pages 202-213, May 1992.
[22] R. Kessler. The Alpha 21264 Microprocessor. IEEE Micro, 
19(2):24-36, March/April 1999.
[23] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, N. Vijaykrishnan,
and M. Kandemir. Design and Management of 3D
Chip Multiprocessors Using Network-in-Memory. In Proceedings
of ISCA-33, June 2006.
[24] N. Magen, A. Kolodny, U. Weiser, and N. Shamir. Interconnect
Power Dissipation in a Microprocessor. In Proceedings
of System Level Interconnect Prediction, February
2004.
[25] D. Matzke. Will Physical Scalability Sabotage Performance
Gains? IEEE Computer, 30(9):37-39, September
1997.
[26] N. Muralimanohar, K. Ramani, and R. Balasubramonian.
Power Efficient Resource Scaling in Partitioned Architectures
through Dynamic Heterogeneity. In Proceedings of
ISPASS, March 2006.
[27] N. Nelson, G. Briggs, M. Haurylau, G. Chen, H. Chen,
D. Albonesi, E. Friedman, and P. Fauchet. Alleviating
Thermal Constraints while Maintaining Performance Via
Silicon-Based On-Chip Optical Interconnects. In Proceedings
of Workshop on Unique Chips and Systems, March
2005.
[28] S. Palacharla, N. Jouppi, and J. Smith. Complexity-Effective
Superscalar Processors. In Proceedings of ISCA-24,
pages 206-218, June 1997.
[29] K. Puttaswamy and G. Loh. Implementing Caches in a 3D
Technology for High Performance Processors. In Proceedings
of ICCD, October 2005.
[30] K. Puttaswamy and G. Loh. Dynamic Instruction Schedulers
in a 3-Dimensional Integration Technology. In Proceedings
of GLSVLSI, April 2006.
[31] K. Puttaswamy and G. Loh. Implementing Register Files
for High-Performance Microprocessors in a Die-Stacked
(3D) Technology. In Proceedings of ISVLSI, March 2006.
[32] K. Puttaswamy and G. Loh. The Impact of 3-Dimensional
Integration on the Design of Arithmetic Units. In Proceedings
of ISCAS, May 2006.
[33] K. Puttaswamy and G. Loh. Thermal Analysis of a 3D
Die-Stacked High-Performance Microprocessor. In Proceedings
of GLSVLSI, April 2006.
[34] P. Racunas and Y. Patt. Partitioned First-Level Cache Design
for Clustered Microarchitectures. In Proceedings of
ICS-17, June 2003.
[35] P. Reed, G. Yeung, and B. Black. Design Aspects of a Microprocessor
Data Cache using 3D Die Interconnect Technology.
In Proceedings of International Conference on Integrated
Circuit Design and Technology, May 2005.
[36] Samsung Electronics Corporation. Samsung Electronics
Develops World’s First Eight-Die Multi-Chip Package
for Multimedia Cell Phones, 2005. (Press release from
http://www.samsung.com).
[37] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically
Characterizing Large Scale Program Behavior.
In Proceedings of ASPLOS-X, October 2002.
[38] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, and
K. Sankaranarayanan. Temperature-Aware Microarchitecture.
In Proceedings of ISCA-30, pages 2-13, 2003.
[39] Y.-F. Tsai, Y. Xie, N. Vijaykrishnan, and M. Irwin. Three-Dimensional
Cache Design Using 3DCacti. In Proceedings
of ICCD, October 2005.
[40] V. Zyuban and P. Kogge. Inherently Lower-Power High-Performance
Superscalar Architectures. IEEE Transactions
on Computers, March 2001.
