Dynamically managing the communication-parallelism trade-off in future clustered processors by Balasubramonian, Rajeev & Dwarkadas, Sandhya
Dynamically Managing the Communication-Parallelism Trade-off in Future
Clustered Processors *
Rajeev Balasubramonian†, Sandhya Dwarkadas†, and David H. Albonesi‡
†Department of Computer Science, ‡Department of Electrical and Computer Engineering
University of Rochester, Rochester, NY 14627
Abstract
Clustered microarchitectures are an attractive alternative
to large monolithic superscalar designs due to their
potential for higher clock rates in the face of increasingly
wire-delay-constrained process technologies. As increasing
transistor counts allow an increase in the number of clusters,
thereby allowing more aggressive use of instruction-level
parallelism (ILP), the inter-cluster communication increases
as data values get spread across a wider area. As
a result of the emergence of this trade-off between communication
and parallelism, a subset of the total on-chip clusters
is optimal for performance. To match the hardware to
the application's needs, we use a robust algorithm to dynamically
tune the clustered architecture. The algorithm,
which is based on program metrics gathered at periodic intervals,
achieves an 11% performance improvement on average
over the best statically defined architecture. We also
show that the use of additional hardware and reconfiguration
at basic block boundaries can achieve average improvements
of 15%. Our results demonstrate that reconfiguration
provides an effective solution to the communication
and parallelism trade-off inherent in the communication-bound
processors of the future.
1. Introduction
The extraction of large amounts of instruction-level par­
allelism (ILP) from common applications on modern pro­
cessors requires the use of many functional units and 
large on-chip structures such as issue queues, register files, 
caches, and branch predictors. As CMOS process tech­
nologies continue to shrink, wire delays become dominant 
(compared to logic delays) [1, 27, 29]. This, combined 
with the continuing trend towards faster clock speeds, in­
creases the time in cycles to access regular on-chip struc­
tures (caches, register files, etc.). Not only does this degrade 
instructions per cycle (IPC) performance, it also presents 
various design problems in breaking up the access into mul­
tiple pipeline stages. In spite of the growing numbers of
*This work was supported in part by NSF grants EIA-0080124,
CCR-9811929, CCR-9988361, CCR-0219848, and ECS-0225413; by
DARPA/ITO under AFRL contract F29601-00-K-0182; by an IBM Faculty
Partnership Award; by the U.S. Department of Energy Office of Inertial
Confinement Fusion under Cooperative Agreement No. DE-FC03-92SF19460;
and by external research and/or equipment grants from Intel,
IBM, and DEC/Compaq.
transistors available to architects, it is becoming increas­
ingly difficult to design large monolithic structures that aid 
ILP extraction without increasing design complexity, com­
promising clock speed, and limiting scalability in future 
process technologies.
A potential solution to these design challenges is a clus­
tered microarchitecture [17, 29] in which the key processor 
resources are distributed across multiple clusters, each of 
which contains a subset of the issue queues, register files, 
and the functional units. In such a design, at the time of in­
struction rename, each instruction is steered into one of the 
clusters. As a result of decreasing the size and bandwidth 
requirements of the issue queues and register files, the ac­
cess times of these cycle-time critical structures are greatly 
reduced, thereby permitting a faster clock. The simplifica­
tion of these structures also reduces their design complexity.
An attractive feature of a clustered microarchitecture is 
the reduced design effort in producing successive genera­
tions of a processor. Not only is the design of a single clus­
ter greatly simplified, but once a single cluster core has been 
designed, more of these cores can be put into the processor 
for a low design cost (including increasing front-end band­
width) as the transistor budget increases. Adding more clus­
ters could potentially improve IPC performance because 
each program has more resources to work with. There is 
little effect if any on clock speed from doing this as the im­
plementation of each individual cluster does not change. In 
addition, even if the resources in a large clustered processor 
cannot be effectively used by a single thread, the schedul­
ing of multiple threads on a clustered processor can signif­
icantly increase the overall instruction throughput. The rel­
atively low design complexity and the potential to exploit 
thread-level parallelism make a highly-clustered processor 
in the billion transistor era an extremely attractive option.
The primary disadvantage of clustered microarchitec­
tures is their reduced IPC compared to a monolithic design 
with identical resources. Although dependent instructions 
within a single cluster can issue in successive cycles, extra 
inter-cluster bypass delays prevent dependent instructions 
that lie in different clusters from issuing in successive cy­
cles. While monolithic processors might use a potentially 
much slower clock to allow a single-cycle bypass among all 
functional units, a clustered processor allows a faster clock, 
thereby introducing additional latencies in cycles between 
some of the functional units. The clustered design is a viable
option only if the IPC degradation does not offset the
clock speed improvement.
Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA'03)
1063-6897/03 $17.00 © 2003 IEEE
Modern processors like the Alpha 21264 [24] at 0.35µ 
technology already employ a limited clustered design, 
wherein the integer domain, for example, is split into two 
clusters. A number of recent studies [2, 8, 11, 12, 17] 
have explored the design of heuristics to steer instructions 
to clusters. Despite these advances, the results from these 
studies will likely need to be reconsidered in the near future 
for the following reasons:
• Due to the growing dominance of wire delays [27, 29] 
and the trend of increasing clock speeds, the resources 
in each cluster core will need to be significantly re­
duced relative to those assumed in prior studies.
• There will be more clusters on the die than assumed 
in prior studies due to larger transistor budgets and the 
potential for exploiting thread-level parallelism [36].
• The number of cycles to communicate data between 
the furthest two clusters will increase due to the wire 
delay problem [1]. Furthermore, communication de­
lays will be heterogeneous, varying according to the 
position of the producer and consumer nodes.
• The data cache will need to be distributed among clus­
ters, unlike the centralized cache assumed by most 
prior studies, due to increased interconnect costs and 
the desire to scale the cache commensurately with 
other cluster resources.
While the use of a large number of clusters could greatly 
boost overall throughput for a multi-threaded workload, its 
impact on the performance of a single-threaded program is 
not as evident. The cumulative effect of the above trends is 
that clustered processors will be much more communication 
bound than assumed in prior models.
As the number of clusters on the chip increases, the num­
ber of resources available to the thread also increases, sup­
porting a larger window of in-flight instructions and thereby 
allowing more distant instruction-level parallelism (ILP) to 
be exploited. At the same time, the various instructions and 
data of the program get distributed over a larger on-chip 
space. If data has to be communicated across the various 
clusters frequently, the performance penalty from this in­
creased communication can offset any benefit derived from 
the parallelism exploited by additional resources.
In this paper, we present and evaluate a dynamically 
tunable clustered architecture that attempts to optimize the 
communication-parallelism trade-off for improved single­
threaded performance in the face of the above trends. The 
balance is effected by employing only a subset of the to­
tal number of available clusters for the thread. Our results 
show that the performance trend as a function of the num­
ber of clusters varies across different programs depending
on the degree of distant ILP present in them. This motivates 
the need for dynamic algorithms that identify the optimal 
number of clusters for any program phase and match the 
hardware to the program’s requirements. We present algo­
rithms that vary the number of active clusters at any pro­
gram point and show that a simple algorithm that looks at 
performance history over the past few intervals often yields 
most of the available performance improvements. However, 
such an algorithm misses fine-grained opportunities for re­
configuration, and we present alternative techniques that in­
vest more hardware in an attempt to target these missed op­
portunities. The simple interval-based algorithm provides 
overall improvements of 11%, while the fine-grained tech­
niques are able to provide 15% improvements over the best 
static organization.
Disabling a subset of the clusters for a given program 
phase in order to improve single-threaded performance has 
other favorable implications. Entire clusters can turn off 
their supply voltage, thereby greatly saving on leakage en­
ergy, a technique that would not have been possible in a 
monolithic processor. Alternatively, these clusters can be 
used by (partitioned among) other threads, thereby simul­
taneously achieving the goals of optimal single and multi­
threaded throughput.
The rest of the paper is organized as follows. Section 2 
describes the clustered microarchitecture and Section 3 de­
scribes our simulation infrastructure. Section 4 develops 
and evaluates our algorithms for the run-time allocation of 
clusters to each program phase for a centralized cache. Sec­
tion 5 summarizes their performance for a decentralized 
cache model. In Section 6, we evaluate the sensitivity of 
the results to various processor parameters. We describe re­
lated work in Section 7 and conclude in Section 8.
2. The Base Clustered Processor Architecture
We start by describing a baseline clustered processor 
model that has been commonly used in earlier studies 
[2, 8, 11, 12, 17]. Such a model with four clusters is shown 
in Figure 1. The branch predictor and instruction cache are 
centralized structures, just as in a conventional processor. 
At the time of register renaming, each instruction gets as­
signed to a specific cluster. Each cluster has its own issue 
queue, register file, a set of functional units, and its own lo­
cal bypass network. Bypassing of results within a cluster 
does not take additional cycles (in other words, dependent 
instructions in the same cluster can issue in successive cy­
cles). However, if the consuming instruction is not in the 
same cluster as the producer, it has to wait additional cycles 
until the result is communicated across the two clusters.
A conventional clustered processor [2,8,11,12,17] dis­
tributes only the register file, issue queue, and the functional 
units among the clusters. The data cache is centrally lo­
cated. An alternative organization [40] distributes the cache
Figure 1. The base clustered processor (4 clusters) with 
the centralized cache.
among the clusters, thereby making the design more scal­
able, but also increasing the implementation complexity. 
Since both organizations are attractive design options, we 
evaluate the effect of dynamic tuning on both organizations.
2.1. The Centralized Cache
In the traditional clustered designs, once loads and stores 
are ready, they are inserted into a centralized load-store 
queue (LSQ) (Figure 1). From here, stores are sent to the 
centralized L1 cache when they commit and loads are issued
when they are known to not conflict with earlier stores. 
The LSQ is centralized because a load in any cluster could 
conflict with an earlier store from any of the other clusters.
For the aggressive processor models that we are study­
ing, the cache has to service a number of requests every 
cycle. An efficient way to implement a high bandwidth 
cache is to make it word-interleaved. For a 4-way word- 
interleaved cache, the data array is split into four banks and 
each bank can service one request every cycle. Data with 
word addresses of the form 4N are stored in bank 0, of the 
form 4N+1 are stored in bank 1, and so on. Such an orga­
nization supports a maximum bandwidth of four and helps 
minimize conflicts to a bank.
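The bank mapping described above can be sketched in a few lines. This is an illustrative sketch only: the 8-byte word size is an assumption (it matches the per-bank line size listed in Table 2), and the function names are hypothetical rather than part of the simulator.

```python
WORD_BYTES = 8  # assumed word size; not stated explicitly in the text


def bank_of(byte_address: int, num_banks: int = 4, word_bytes: int = WORD_BYTES) -> int:
    """Return the bank that holds the word containing byte_address.

    Word addresses of the form 4N map to bank 0, 4N+1 to bank 1, and so on.
    """
    word_address = byte_address // word_bytes
    return word_address % num_banks


def requests_serviced_this_cycle(byte_addresses, num_banks: int = 4) -> int:
    """Each bank services one request per cycle, so the number of requests
    that can proceed together equals the number of distinct banks touched."""
    return len({bank_of(a, num_banks) for a in byte_addresses})
```

Four accesses to consecutive words touch four different banks and can all proceed in one cycle, whereas two accesses whose word addresses differ by a multiple of four conflict on the same bank.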
In a processor with a centralized cache, the load latency 
depends on the distance between the centralized cache and 
the cluster issuing the load. In our study, we assume that 
the centralized LSQ and cache are co-located with cluster 1.
Hence, a load issuing from cluster 1 does not experience
any communication cost. A load issuing from cluster 2
takes one cycle to send the address to the LSQ and cache 
and another cycle to get the data back (assuming that each 
hop between clusters takes a cycle). Similarly, cluster 3 ex­
periences a total communication cost of four cycles for each 
load. This is in addition to the few cycles required to per­
form the cache RAM look-up.
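The load communication cost above is a round trip over the hops separating the issuing cluster from cluster 1. The sketch below reproduces the costs quoted in the text (0, 2, and 4 cycles for clusters 1, 2, and 3). It assumes shortest-path routing on a bidirectional ring, which the text adopts later for the interconnect; the function names are hypothetical, and the 6-cycle RAM look-up default is the centralized-cache figure from Table 2.

```python
def ring_hops(src: int, dst: int, n_clusters: int) -> int:
    """Shortest hop count between two clusters on a bidirectional ring."""
    d = abs(src - dst) % n_clusters
    return min(d, n_clusters - d)


def load_comm_cycles(cluster: int, cache_cluster: int = 1,
                     n_clusters: int = 4, cycles_per_hop: int = 1) -> int:
    """Round trip: address to the LSQ/cache, then data back to the cluster."""
    return 2 * ring_hops(cluster, cache_cluster, n_clusters) * cycles_per_hop


def load_latency(cluster: int, ram_lookup_cycles: int = 6, **kwargs) -> int:
    """Communication cost plus the cache RAM look-up time."""
    return load_comm_cycles(cluster, **kwargs) + ram_lookup_cycles
```

Under the ring assumption, cluster 4 in a 4-cluster system is one hop from cluster 1; a linear layout would give a different value for that cluster.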
Steering Heuristics: A clustered design allows a faster 
clock, but incurs a noticeable IPC degradation because of 
inter-cluster communication and load imbalance. Minimiz­
ing these penalties with smart instruction steering has been
Figure 2. The clustered processor (4 clusters) with the 
decentralized cache.
the focus of many recent studies [2, 8, 11, 12, 13, 17]. We 
use an effective steering heuristic [11] that steers an instruc­
tion (and its destination register) to the cluster that produces 
most of its operands. In the event of a tie or under cir­
cumstances where an imbalance in issue queue occupancy 
is seen, instructions are steered to the least loaded cluster. 
By picking an appropriate threshold to detect load imbalance,
such an algorithm can also approximate other proposed
steering heuristics like Mod-N and First-Fit [8].
The former minimizes load imbalance by steering N instructions
to one cluster, then steering to its neighbor. The 
latter minimizes communication by filling up one cluster 
before steering instructions to its neighbor. We empirically 
determined the optimal threshold value for load balance. 
Further, our steering heuristic also uses a criticality pre­
dictor [18, 37] to give a higher priority to the cluster that 
produces the critical source operand. Thus, our heuristic 
represents the state-of-the-art in steering mechanisms.
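A minimal sketch of the operand-based steering decision described above follows. It is not the authors' implementation: the criticality predictor is omitted, the function and parameter names are hypothetical, and the imbalance test (preferred cluster's issue-queue occupancy exceeding the least loaded cluster's by more than a threshold) is one plausible reading of "an imbalance in issue queue occupancy".

```python
from collections import Counter


def steer(operand_clusters, occupancy, imbalance_threshold):
    """Pick a cluster index for one instruction.

    operand_clusters: cluster index producing each source operand
    occupancy: issue-queue occupancy of every cluster
    imbalance_threshold: assumed occupancy gap that counts as imbalance
    """
    least_loaded = min(range(len(occupancy)), key=occupancy.__getitem__)
    if not operand_clusters:
        return least_loaded

    # Rank clusters by how many of the instruction's operands they produce.
    ranked = Counter(operand_clusters).most_common()
    best, best_count = ranked[0]
    tie = len(ranked) > 1 and ranked[1][1] == best_count
    imbalanced = occupancy[best] - occupancy[least_loaded] > imbalance_threshold

    # On a tie or a detected imbalance, fall back to the least loaded cluster.
    return least_loaded if (tie or imbalanced) else best
```

In this sketch a small threshold makes the fallback fire often, which favors load balance, while a large threshold keeps instructions near their producers, which favors low communication.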
2.2. The Decentralized Cache
In a highly clustered processor, the centralized cache can 
be a major bottleneck as it has to support a high bandwidth, 
and its average distance to the requesting clusters increases. 
Hence, a distributed cache model [40] represents an attrac­
tive design option.
For an N-cluster system, we assume that the L1 cache 
is broken into N word-interleaved banks. Each bank is as­
sociated with its own cluster. The LSQ is also split across 
the different clusters. The example in Figure 2 shows an 
organization with four clusters. Because they are word- 
interleaved, the various banks cache mutually exclusive data 
and do not require any cache coherence protocol between 
them. The goal of the steering mechanism is to steer a load 
or store to the cluster that caches the corresponding mem­
ory address. We discuss the additional steering complexities 
arising from the distributed nature of the cache in Section 5.
The L2 cache continues to be co-located with cluster 1 
and a miss in any of the L1 cache banks other than that associated
with this cluster incurs additional latency depending
Parameter           Centralized cache   Decentralized cache
                                        each cluster    total
Cache size          32 KB               16 KB           16N KB
Set-associativity   2-way               2-way           2-way
Line size           32 bytes            8 bytes         8N bytes
Bandwidth           4 words/cycle       1 word/cycle    N words/cycle
RAM look-up time    6 cycles            4 cycles        4 cycles
LSQ size            15N                 15              15N
Table 2. Cache parameters for the centralized and decentralized
caches. All the caches are word interleaved. N is the number of clusters.
Fetch queue size              64
Branch predictor              comb. of bimodal and 2-level
Bimodal predictor size        2048
Level 1 predictor             1024 entries, history 10
Level 2 predictor             4096 entries
BTB size                      2048 sets, 2-way
Branch mispredict penalty     at least 12 cycles
Fetch width                   8 (across up to two basic blocks)
Dispatch and commit width     16
Issue queue size              15 in each cluster (int and fp, each)
Register file size            30 in each cluster (int and fp, each)
Re-order Buffer (ROB) size    480
Integer ALUs/mult-div         1/1 (in each cluster)
FP ALUs/mult-div              1/1 (in each cluster)
L1 I-cache                    32KB 2-way
L2 unified cache              2MB 8-way, 25 cycles
TLB                           128 entries, 8KB page size (I and D)
Memory latency                160 cycles for the first chunk
Table 1. Simplescalar simulator parameters.
on the number of hops.
2.3. Interconnects
As process technologies shrink and the number of clus­
ters is increased, attention must be paid to the communi­
cation delays and interconnect topology between clusters. 
Cross-cluster communication occurs at the front-end as well 
as when communicating register values across clusters or 
when accessing the cache. Since the former occurs in every 
cycle, we assume a separate network for this purpose and 
model non-uniform dispatch latencies as well as the addi­
tional latency in communicating a branch mispredict back 
to the front-end. Since the latter two (cache and register-to- 
register communication) involve data transfer to/from reg­
isters, we assume that the same (separate) network is used.
In our study, we focus on a ring interconnect because 
of its low implementation complexity. Each cluster is di­
rectly connected to two other clusters. We assume two uni­
directional rings, implying that a 16-cluster system has 32 
total links (allowing 32 total transfers in a cycle), with the 
maximum number of hops between any two nodes being 8.
In a later section, as part of our sensitivity analysis, we 
also show results for a grid interconnect, which has a higher 
implementation cost but higher performance. The clusters 
are laid out in a two-dimensional array. Each cluster is di­
rectly connected to up to four other clusters. For 16 clusters, 
there are 48 total links, with the maximum number of hops 
being 6, thus reducing the overall communication cost.
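The link counts and worst-case hop counts quoted above can be checked with a short script. This is only a sanity check under stated assumptions: links are counted as directed (one per direction between neighbors), which is the counting that reproduces the 32 and 48 figures, and the function names are hypothetical.

```python
from collections import deque


def ring_links(n):
    """Directed links of two unidirectional rings, one per direction."""
    clockwise = [(i, (i + 1) % n) for i in range(n)]
    counter = [(i, (i - 1) % n) for i in range(n)]
    return clockwise + counter


def grid_links(rows, cols):
    """Directed links of a 2-D grid; each neighbor pair gets two links."""
    links = []
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:  # right neighbor
                links += [(node, node + 1), (node + 1, node)]
            if r + 1 < rows:  # neighbor below
                links += [(node, node + cols), (node + cols, node)]
    return links


def max_hops(n_nodes, links):
    """Largest shortest-path hop count over all node pairs (BFS from each node)."""
    adj = {i: [] for i in range(n_nodes)}
    for a, b in links:
        adj[a].append(b)
    worst = 0
    for start in range(n_nodes):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst


# 16-cluster ring: 32 links, at most 8 hops; 4x4 grid: 48 links, at most 6 hops.
ring = ring_links(16)
grid = grid_links(4, 4)
```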
3. Simulation Methodology
3.1. Simulator Parameters
Our simulator is based on Simplescalar-3.0 [9] for the 
Alpha AXP instruction set. The simulator has been modi­
fied to represent a microarchitecture resembling the Alpha 
21264 [24]. The register update unit (RUU) is decomposed 
into issue queues, physical register files, and the reorder 
buffer (ROB). The issue queue and the physical register file
are further split into integer and floating-point. Thus, each 
cluster in our study is itself decomposed into an integer and 
floating-point cluster. The memory hierarchy is also mod­
eled in detail (including word-interleaved access, bus and 
port contention, writeback buffers, etc).
This base processor structure was modified to model the 
clustered microarchitecture. To represent a wire-delay con­
strained processor at future technologies, each cluster core 
was assumed to have one functional unit of each type, 30 
physical registers (int and fp, each), and 15 issue queue en­
tries (int and fp, each). As many instructions can issue in 
a cycle as the number of available functional units. We as­
sume that each hop on the interconnect takes a single cycle. 
While we did not model a trace cache, we assumed that in­
structions could be fetched from up to two basic blocks at a 
time. The important simulation parameters are summarized 
in Table 1.
The number of resources in each cluster and the latency 
for each hop on the interconnect are critical parameters 
in such a study as they determine the amount and cost of 
inter-cluster communication. These parameters are highly 
technology, layout, and design-dependent, and determining 
them is beyond the scope of this study. Our results include a 
sensitivity analysis to see how the results change as our as­
sumptions on the number of registers, issue queue entries, 
functional units, and cycles per hop are varied.
Our study focuses on wire-limited technologies of the 
future and we pick latencies according to projections for 
0.035µ. We used CACTI-3.0 [34] to estimate access times 
for the cache organizations. We used the methodology in 
[1] to estimate clock speeds and memory latencies, follow­
ing SIA roadmap projections [5]. With Simplescalar, we 
simulated cache organizations with different size and port 
parameters (and hence different latencies) to determine the 
best base cases. These parameters are summarized in Ta­
ble 2. The centralized cache yielded best performance for 
a 4-way word-interleaved 32KB cache. Such a cache has a 
bandwidth of four accesses per cycle and an access time of 
six cycles. The best decentralized cache organization has a 
single-ported four-cycle 16KB bank in each cluster.
3.2. Benchmark Set
As a benchmark set, we used four SPEC2k Integer pro­
grams, three SPEC2k FP programs, and two programs from
Benchmark             Input     Simulation window   Baseline IPC   Mispred branch interval
cjpeg (Mediabench)    testimg   150M-250M           2.06           82
crafty (SPEC2k Int)   ref       2000M-2200M         1.85           118
djpeg (Mediabench)    testimg   30M-180M            4.07           249
galgel (SPEC2k FP)    ref       2000M-2300M         3.43           88
gzip (SPEC2k Int)     ref       2000M-2100M         1.83           87
mgrid (SPEC2k FP)     ref       2000M-2050M         2.28           8977
parser (SPEC2k Int)   ref       2000M-2100M         1.42           88
swim (SPEC2k FP)      ref       2000M-2050M         1.67           22600
vpr (SPEC2k Int)      ref       2000M-2100M         1.20           171
Table 3. Benchmark description. Baseline IPC is for a
monolithic processor with as many resources as the 16-cluster
system. "Mispred branch interval" is the number
of instrs before a branch mispredict is encountered.
Figure 3. IPCs for fixed cluster organizations with 2, 4, 8, and 16 clusters.
the UCLA Mediabench [25]. The details on these programs 
are listed in Table 3. The programs represent a mix of vari­
ous program types, including high and low IPC codes, and 
those limited by memory, branch mispredictions, etc. Most 
of these programs were fast forwarded through the first two 
billion instructions and simulated in detail to warm the var­
ious processor structures before measurements were taken. 
While we are simulating an aggressive processor model, not 
all our benchmark programs have a high IPC. Note that an
aggressive processor design is motivated by the need to run
high IPC codes and by the need to support multiple threads.
In both cases, the quick completion of a single low-IPC
thread is still important, hence the need to include such 
programs in the benchmark set.
4. The Dynamically Tunable Clustered Design
For brevity, we focus our initial analysis on the 16- 
cluster model with the centralized cache and the ring in­
terconnect. Figure 3 shows the effect of statically us­
ing a fixed subset of clusters for a program. Increasing 
the number of clusters increases the average distance of a 
load/store instruction from the centralized cache and the 
worst-case inter-cluster bypass delay, thereby greatly affect­
ing the overall communication cost. Assuming zero inter­
cluster communication cost for loads and stores improved
performance by 31%, while assuming zero cost for register- 
to-register communication improved performance by 11%, 
indicating that increased load/store latency dominates the 
communication overhead. This latency could be reduced by 
steering load/store instructions to the cluster closest to the 
cache, but this would increase load imbalance and register 
communication. The average latency for inter-cluster reg­
ister communication in the 16-cluster system was 4.1 cy­
cles. At the same time, using more clusters also provides 
the program with more functional units, registers, and is­
sue queue entries, thus allowing it to dispatch a larger win­
dow of in-flight instructions. Depending on which of these 
two conflicting forces dominates, performance either im­
proves or worsens as the number of clusters is increased. 
Programs with distant ILP, like djpeg (JPEG decoding from 
Mediabench), swim, mgrid, and galgel (loop-based floating-point
programs from SPEC2K) benefit from using many resources.
On the other hand, most integer programs with low
branch prediction accuracies cannot exploit a large window 
of in-flight instructions. Hence, increasing the resources 
only degrades performance because of the additional com­
munication cost. This is a phenomenon hitherto unobserved 
in a clustered processor (partly because very few studies 
have looked at more than four clusters and partly because 
earlier studies assumed no communication cost in access­
ing a centralized cache).
Our goal is to tune the hardware to the program’s requirements
by dynamically allocating clusters to the program.
This can be achieved simply by modifying the
steering heuristic to disallow instruction dispatch to the disabled
clusters. In other words, disabling is equivalent to 
not assigning any new instructions to the cluster. Instruc­
tions already assigned to the disabled clusters are allowed 
to complete, resulting in a natural draining of the cluster.
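The disable-and-drain behavior described above can be illustrated with a small sketch. It assumes that the first `n_active` clusters stay enabled and that an instruction whose preferred cluster is disabled falls back to the least loaded enabled cluster; both choices, and all names, are assumptions made for illustration.

```python
class Cluster:
    def __init__(self):
        self.enabled = True
        self.in_flight = 0  # instructions assigned but not yet completed


def set_active(clusters, n_active):
    """Enable the first n_active clusters (n_active >= 1) and disable the rest.
    In-flight work on disabled clusters is left untouched."""
    for i, cluster in enumerate(clusters):
        cluster.enabled = i < n_active


def dispatch(clusters, preferred):
    """Assign one instruction; disabled clusters never receive new work."""
    if clusters[preferred].enabled:
        target = preferred
    else:
        active = [i for i, c in enumerate(clusters) if c.enabled]
        target = min(active, key=lambda i: clusters[i].in_flight)
    clusters[target].in_flight += 1
    return target


def complete_one(cluster):
    """Retire one instruction; a disabled cluster drains as its work completes."""
    if cluster.in_flight > 0:
        cluster.in_flight -= 1
```

No flush is modeled: a disabled cluster simply stops receiving instructions and empties as its outstanding instructions complete.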
4.1. Consistency of Behavior Across Intervals
Various recent works [4, 6, 7, 10, 16, 19, 20, 22, 31, 38] 
have proposed run-time algorithms for the dynamic tun­
ing of hardware to a program phase’s requirements. Most 
of these techniques use an interval-based algorithm, where 
measurements over the last few intervals dictate the choice 
of configuration over subsequent intervals, where an inter­
val is a pre-specified number of committed instructions. 
Our dynamic configuration selection mechanism is based
on earlier proposals [7, 16]. At the start of each program 
phase, we run each configuration option for an interval and 
record the IPCs. We then pick the configuration with the 
highest IPC and use it until the next phase change is de­
tected. Such a mechanism is heavily reliant on the pro­
gram’s ability to sustain uniform performance over a num­
ber of intervals. We found that floating-point programs gen­
erally show this behavior, while the integer programs show 
a lot more variability. While earlier studies have assumed 
fixed interval lengths, we found that this would result in
Benchmark   Minimum acceptable interval length   Instability factor for a
            and its instability factor           10K instruction interval
gzip        10K / 4%                             4%
vpr         320K / 5%                            14%
crafty      320K / 4%                            [illegible in source]
parser      40M / 5%                             12%
swim        10K / 0%                             0%
mgrid       10K / 0%                             0%
galgel      10K / 1%                             1%
cjpeg       40K / 4%                             9%
djpeg       1280K / 1%                           31%
Table 4. Instability factors for different interval lengths.
very poor performance for a number of programs. Hence, 
picking an appropriate interval length is fundamental to the 
success of a configuration selection algorithm (and can be 
universally applied to the configuration of other aspects of 
the processor in addition to the number of clusters).
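The exploration step of the interval-based mechanism (run each configuration for one interval, record its IPC, keep the best) can be sketched as follows. `measure_ipc` is a hypothetical callback standing in for running the processor for one interval with a given number of active clusters; the default configuration set is the 2, 4, 8, and 16 cluster options considered in this study.

```python
def choose_configuration(measure_ipc, configs=(2, 4, 8, 16)):
    """Explore every configuration for one interval each.

    measure_ipc: callable taking a cluster count and returning the IPC
                 observed over one interval in that configuration
    Returns (best_config, ipcs), where ipcs maps each config to its IPC.
    """
    ipcs = {n: measure_ipc(n) for n in configs}
    best = max(ipcs, key=ipcs.get)
    return best, ipcs


# Usage with made-up IPC numbers for a single phase.
fake_phase = {2: 1.1, 4: 1.4, 8: 1.3, 16: 0.9}
best, observed = choose_configuration(fake_phase.get)
```

The chosen configuration would then be kept until the next phase change is detected, which is why the approach depends on behavior staying uniform across intervals.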
To study the variability of program behavior over differ­
ent intervals, we ran each of the programs for billions of in­
structions to generate a trace of various statistics at regular 
10K instruction intervals. We used three metrics to define 
a program phase - IPC, branch frequency, and frequency 
of memory references. At the start of each program phase, 
the statistics collected during the first interval were used as 
reference. For each ensuing interval, if the three metrics 
for that interval were similar to the reference points, the in­
terval was termed ‘stable’. If any of the three metrics was 
significantly different, we declared the interval as ‘unstable’ 
and began a new program phase. This analysis was done for 
many interval lengths. The instability factor for an interval 
length is the percentage of intervals that were considered 
‘unstable’, i.e., the frequency of the occurrence of a phase 
change. In our study, we found that it was sufficient to only 
explore a limited subset of the possible configurations (2, 
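4, 8, and 16 clusters) as they covered most of the interesting
cases. An instability factor of 5% ensures that less than
15% of the intervals are in sub-optimal configurations, since
each phase change triggers an exploration of the four
configurations, at most three of which are sub-optimal.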
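The instability factor defined above can be computed from a per-interval trace roughly as follows. This is a sketch under assumptions: the "significantly different" thresholds (10% for IPC, interval_length/100 for branch and memory-reference counts) are borrowed from the run-time algorithm's definitions rather than stated for this offline analysis, and the interval following an unstable one is taken as the new reference point.

```python
def instability_factor(intervals, interval_length):
    """Percentage of intervals flagged 'unstable'.

    intervals: list of (ipc, branches, memrefs) tuples, one per interval
    interval_length: committed instructions per interval
    """
    if not intervals:
        return 0.0
    count_threshold = interval_length / 100  # assumed threshold for counts
    unstable = 0
    reference = None
    for ipc, branches, memrefs in intervals:
        if reference is None:
            # First interval of a phase provides the reference point.
            reference = (ipc, branches, memrefs)
            continue
        ref_ipc, ref_branches, ref_memrefs = reference
        changed = (
            abs(ipc - ref_ipc) > 0.10 * ref_ipc
            or abs(branches - ref_branches) > count_threshold
            or abs(memrefs - ref_memrefs) > count_threshold
        )
        if changed:
            unstable += 1
            reference = None  # a new phase begins with the next interval
    return 100.0 * unstable / len(intervals)
```

Repeating the computation on traces aggregated to longer intervals gives the kind of per-length comparison reported in Table 4.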
Table 4 shows the smallest interval length that affords an
acceptable instability factor of less than 5% for each of our
programs. As can be seen, the interval lengths that emerge
as the best vary from 10K to 40M. We also show the instability
factor for a fixed interval length of 10K instructions.
Clearly, this interval length works poorly for a number of
programs and would result in quite unacceptable performance.
Most programs usually show consistent behavior
across intervals for a coarse enough interval length, making
interval-based schemes very robust and universally applicable.
Even a program like parser, whose behavior varies
dramatically based on the input data, has a low instability
factor for a large 40M instruction interval.
4.2. Variable-Interval Mechanism with Exploration
In order to arrive at the optimal instruction interval 
length at run-time, we use a simple algorithm. We start with 
the minimum instruction interval. If the instability factor is
Initializations and definitions:
  interval_length = 10K; (number of committed instrs before invoking the algo)
  discontinue_algorithm = FALSE; (if this is set, no more reconfigurations are
      attempted until the next macrophase)
  have_reference_point = FALSE; (the first interval in a new phase provides a
      reference point to compare future intervals)
  significant_change_in_ipc; (this is set if the IPC in the current interval differs
      from that in the reference point by more than 10%)
  significant_change_in_memrefs; (this is set if the memory references in the
      current interval differ from the reference point by more than
      interval_length/100)
  significant_change_in_branches; (similar to significant_change_in_memrefs)
  num_ipc_variations = 0; (this indicates the number of times there was a
      significant_change_in_ipc)
  stable_state = FALSE; (this is set only after all configs are explored)
  num_clusters; (the number of active clusters)
  instability = 0; (number indicating phase change frequency)
  THRESH1 = THRESH2 = 5; THRESH3 = 1 billion instructions;

Inspect statistics every 100 billion instructions.
If (new macrophase)
  Initialize all variables;
If (not discontinue_algorithm)
  Execute the following after every interval_length instructions;
  If (have_reference_point)
    If (significant_change_in_memrefs or significant_change_in_branches or
        (significant_change_in_ipc and num_ipc_variations >= THRESH1))
      have_reference_point = stable_state = FALSE;
      num_ipc_variations = 0;
      num_clusters = 4;
      instability = instability + 2;
      if (instability > THRESH2)
        interval_length = interval_length * 2;
        instability = 0;
        if (interval_length > THRESH3)
          Pick most popular configuration; discontinue_algorithm = TRUE;
    else
      if (significant_change_in_ipc)
        if (stable_state) num_ipc_variations = num_ipc_variations + 2;
      else
        num_ipc_variations = MAX(-2, num_ipc_variations - 0.125);
        instability = instability - 0.125;
  else
    have_reference_point = TRUE;
    Record branches and memrefs.
  If (have_reference_point and not stable_state)
    record IPC;
    num_clusters = num_clusters * 2;
    if (num_clusters > 16)
      pick the best performing configuration;
      make its IPC the IPC_reference_point;
      stable_state = TRUE;
Figure 4. Run-time algorithm for dynamic selection of the
number of clusters. The constant increments/decrements
for num_ipc_variations and instability were chosen to allow
about 5% instability. The thresholds were picked to
be reasonable round numbers.
too high, we double the size of the interval and repeat this 
until we either experience a low instability factor or until we
reach a pre-specified limit (say, a billion instructions). If we 
reach the limit, we cease to employ the selection algorithm 
and pick the configuration that was picked most often.
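The doubling policy described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `measure_instability` is a hypothetical callback that returns the fraction of intervals flagging a phase change at a given interval length, whereas the actual mechanism gathers this information online as the program runs. The 5% target is taken from the discussion of the instability factor.

```python
def choose_interval_length(measure_instability, min_len=10_000,
                           max_len=1_000_000_000, max_instability=0.05):
    """Double the interval until instability is low or the limit is passed.

    measure_instability is a hypothetical callback (an assumption of this
    sketch): it maps an interval length to the observed instability factor.
    Returns (interval_length, gave_up).
    """
    length = min_len
    while length <= max_len:
        if measure_instability(length) < max_instability:
            return length, False
        length *= 2
    # Limit passed: the caller would fall back to the most popular configuration.
    return max_len, True


if __name__ == "__main__":
    # Toy program whose behavior only becomes stable at 1.28M-instruction intervals.
    toy = lambda n: 0.50 if n < 1_280_000 else 0.02
    print(choose_interval_length(toy))
```

The second element of the result tells the caller whether the search hit the pre-specified limit, in which case the selection algorithm would be discontinued.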
Once we pick an interval length, we need not remain at 
that interval length forever. The program might move from 
one large macrophase to another that might have a com­
pletely different optimal instruction interval. To deal with 
this, we can continue to hierarchically build phase detection 
algorithms. An algorithm that inspects statistics at a coarse 
granularity (say, every 100 billion instructions) could trig­
ger the detection of a new macrophase, at which point, we 
would restart the selection algorithm with a 10K interval 
length and find the optimal interval length all over again.
For completeness, in Figure 4, we describe our algorithm 
that selects the interval length, detects phases, and selects 
the best configuration at run-time. At the start of a phase, 
the statistics collected in the first interval serve as a refer­
ence point against which to compare future statistics and
Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA'03)
1063-6897/03 $17.00 © 2003 IEEE
detect a phase change. The branch and memory reference 
frequencies are microarchitecture-independent parameters 
and can be used to detect phase changes even during the 
exploration process. After exploration, the best perform­
ing configuration is picked and its IPC is also used as a 
reference. A phase change is signaled if either the num­
ber of branches, the number of memory references, or the 
IPC differs significantly from the reference point. Occa­
sionally, there is a slight change in IPC characteristics dur­
ing an interval (perhaps caused by a burst of branch mispre­
dicts or cache misses), after which, behavior returns to that 
of the previous phase. To discourage needless explorations 
in this scenario, we tolerate some noise in the IPC measurements
(with the num_ipc_variations parameter). In addition,
if phase changes are frequent, the instability variable is in­
cremented and eventually, the interval length is doubled.
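As an illustration of the control flow in Figure 4, the following Python sketch models the per-interval decision. It is a simplified model under stated assumptions, not the authors' code: the class and field names are ours, the IPC comparison is only made once exploration has established an IPC reference, and the `picks` dictionary used to find the most popular configuration is assumed bookkeeping that the figure does not spell out.

```python
from dataclasses import dataclass, field

THRESH1 = THRESH2 = 5
THRESH3 = 1_000_000_000  # interval-length limit, in instructions


@dataclass
class ClusterSelector:
    interval_length: int = 10_000
    discontinue: bool = False
    have_ref: bool = False
    stable: bool = False
    num_ipc_variations: float = 0.0
    instability: float = 0.0
    num_clusters: int = 4
    ref: dict = field(default_factory=dict)       # reference branches/memrefs/IPC
    explored: dict = field(default_factory=dict)  # cluster count -> IPC while exploring
    picks: dict = field(default_factory=dict)     # cluster count -> times chosen (assumed)

    def _phase_change(self):
        self.have_ref = self.stable = False
        self.num_ipc_variations = 0
        self.num_clusters = 4
        self.instability += 2
        if self.instability > THRESH2:
            self.interval_length *= 2
            self.instability = 0
            if self.interval_length > THRESH3:
                # Give up and keep the configuration chosen most often so far.
                self.num_clusters = max(self.picks, key=self.picks.get,
                                        default=self.num_clusters)
                self.discontinue = True

    def end_of_interval(self, ipc, branches, memrefs):
        """Consume one interval's counters; return the cluster count to use next."""
        if self.discontinue:
            return self.num_clusters
        if self.have_ref:
            tol = self.interval_length / 100
            d_mem = abs(memrefs - self.ref["memrefs"]) > tol
            d_br = abs(branches - self.ref["branches"]) > tol
            # Assumption: no IPC comparison until an IPC reference exists.
            d_ipc = self.stable and abs(ipc - self.ref["ipc"]) > 0.10 * self.ref["ipc"]
            if d_mem or d_br or (d_ipc and self.num_ipc_variations >= THRESH1):
                self._phase_change()
            elif d_ipc:
                self.num_ipc_variations += 2
            else:
                self.num_ipc_variations = max(-2, self.num_ipc_variations - 0.125)
                self.instability -= 0.125
        else:
            self.have_ref = True
            self.ref = {"branches": branches, "memrefs": memrefs}
            self.explored = {}
        if self.have_ref and not self.stable:
            # Exploration: record this configuration's IPC, then try the next one.
            self.explored[self.num_clusters] = ipc
            self.num_clusters *= 2
            if self.num_clusters > 16:
                best = max(self.explored, key=self.explored.get)
                self.num_clusters = best
                self.ref["ipc"] = self.explored[best]
                self.stable = True
                self.picks[best] = self.picks.get(best, 0) + 1
        return self.num_clusters
```

A caller would invoke `end_of_interval` once per `interval_length` committed instructions with that interval's counter values, and steer dispatch to the returned number of clusters.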
This entire process of run-time reconfiguration can be 
implemented in software with support from hardware event 
counters. A low-overhead software routine (like that used 
for software TLB miss handling) that inspects various hard­
ware counters before making a decision on the subsequent 
configuration is invoked at every interval. The algorithm 
amounts to about 100 assembly instructions, only a small 
fraction of which are executed at each invocation. Even 
for the minimum interval length of 10K instructions, this 
amounts to an overhead of much less than 1%. Implement­
ing the selection algorithm in software allows greater flex­
ibility and opens up the possibility for application-specific 
algorithms. Algorithms at higher levels that detect changes 
in macrophases have an even lower overhead. Since the 
algorithm runs entirely in software, most program-specific 
state resides in memory as opposed to hardware registers. 
Hence, apart from the event counters, no additional state 
has to be saved and restored on a context switch.
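The overhead claim can be checked with simple arithmetic. The sketch below assumes, pessimistically, that all of the roughly 100 instructions of the routine execute at every invocation; since only a small fraction actually do, the true overhead is lower than this bound. The function name is ours.

```python
ROUTINE_INSTRUCTIONS = 100  # approximate size of the whole routine, from the text


def worst_case_overhead(interval_length, executed=ROUTINE_INSTRUCTIONS):
    """Upper bound on the fraction of extra instructions per interval."""
    return executed / interval_length


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        print(f"{n:>9} instructions per interval: at most {worst_case_overhead(n):.4%}")
```

At the minimum interval length of 10K instructions the bound is 1%, and it shrinks proportionally each time the interval length is doubled.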
Results. In Figure 5, the third bar illustrates the impact
of using the interval-based selection mechanism with exploration
at the start of each program phase. As reference
points, the first two bars show the static organizations with
four and 16 clusters. We see that in almost all cases, the 
dynamic scheme does a very good job in approximating the 
performance of the best static organization. For floating­
point programs with little instability (galgel, mgrid, swim), 
the dynamic scheme easily matches the hardware to the pro­
gram’s requirements. For the integer programs, in most 
cases, there is an initial unstable period when the interval 
size is inappropriate. Consistent with our earlier analysis, 
the interval size is increased until it settles at one that allows 
an instability factor of less than 5%. In parser, the simulation
interval was not long enough to allow the dynamic
scheme to settle at the required 40M instruction interval.
In djpeg, it takes a number of intervals for the interval 
size to be large enough (1.28M instructions) to allow a small 
instability factor. Further, since the interval length is large,
[Figure 5 plot: IPC (y-axis) for cjpeg, crafty, djpeg, galgel, gzip, mgrid, parser, swim, vpr, and HM (x-axis). Legend, in bar order: 4 clusters; 16 clusters; variable-interval with exploration; interval length = 10K, no exploration; interval length = 1K, no exploration; interval length = 100, no exploration.]
Figure 5. IPCs for the base cases and for interval-based 
schemes. The third bar represents the algorithm with 
exploration phases, while the fourth, fifth, and sixth bars 
represent algorithms with no exploration.
many opportunities for reconfiguration are missed. There 
are small phases within each interval where the ILP charac­
teristics are different. For these two reasons, the dynamic 
scheme falls short of the performance of the fixed static or­
ganization with 16 clusters for djpeg.
In the case of gzip, there are a number of prolonged 
phases, some with distant ILP characteristics, and others 
with low amounts of distant ILP. Since the dynamic scheme 
picks the best configuration at any time, its performance is 
better than even the best static fixed organization.
On average, 8.3 of the 16 clusters were disabled at any 
time across the benchmark set. In the absence of any other 
workload, this produces great savings in leakage energy,
provided the supply voltage to these unused clusters can be 
turned off. Likewise, for a multi-threaded workload, even 
after optimizing single-thread performance, more than eight 
clusters still remain for use by the other threads.
Overall, the dynamic interval-based scheme with explo­
ration performs about 11% better than the best static fixed 
organization. It is also very robust -  it applies to every 
program in our benchmark set as there is usually a coarse 
enough interval length such that behavior across those in­
tervals is fairly consistent. However, the downside is the 
inability to target relatively short phases. We experimented 
with smaller initial interval lengths, but found that the dy­
namic scheme encountered great instability at these small 
interval lengths, and hence, the interval lengths were in­
creased to a larger value just as before. This is caused by 
the fact that measurements become noisier as the interval 
size is reduced and it is harder to detect the same program 
metrics across intervals and accordingly identify the best 
configuration for any phase.
4.3. Interval-Based Scheme with no Exploration
To alleviate these problems, we attempted an alternative 
interval-based scheme. Instead of exploring various con­
figurations at the start of each program phase, we used a
16-cluster configuration for an interval and based on the de­
gree of available distant ILP, we selected either a four or 16- 
cluster configuration for subsequent intervals until the next 
phase change (our earlier results indicate that these are the 
two most meaningful configurations and cover most cases). 
An instruction is marked as distant if it is at least 120 instructions
younger than the oldest instruction in the ROB.
At the time of issue, the instruction sets a bit in its ROB 
entry if it is distant. At the time of commit, this bit is used 
to increment the 'degree of distant ILP'. Since each cluster
has 30 physical registers, four clusters are enough to support 
about 120 in-flight instructions. If the number of distant in­
structions issued in an interval exceeds a certain threshold, 
it indicates that 16 clusters would be required to exploit the 
available distant ILP. In our experiments, we use a threshold 
value of 160 for an interval length of 1000. Because there 
is no exploration phase, the hardware reacts quickly to a 
program phase change and reconfiguration at a finer granu­
larity becomes meaningful. Hence, we focus on small fixed 
instruction intervals and do not attempt to increase the inter­
val length at run-time. However, since the decision is based 
on program metrics instead of exploration, some accuracy 
is compromised. Further, the smaller the interval length, 
the faster the reaction to a phase change, but the noisier the 
measurements, resulting in some incorrect decisions.
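The metric-based decision can be sketched in a few lines. This is an illustrative model only: the function names are ours, and the per-instruction boolean list stands in for the ROB bit that hardware would accumulate into a counter at commit.

```python
DISTANT_AGE = 120  # an instruction this far behind the oldest ROB entry is distant


def is_distant(age_gap):
    """age_gap: how many instructions younger than the oldest ROB entry, at issue."""
    return age_gap >= DISTANT_AGE


def choose_clusters(distant_flags, threshold=160):
    """Pick 4 or 16 clusters from one interval's committed instructions.

    distant_flags holds one boolean per committed instruction (the ROB bit).
    The default threshold of 160 is the value quoted for 1000-instruction
    intervals; other interval lengths would need their own threshold.
    """
    degree_of_distant_ilp = sum(1 for flag in distant_flags if flag)
    return 16 if degree_of_distant_ilp > threshold else 4


if __name__ == "__main__":
    interval = [is_distant(gap) for gap in range(1000)]  # toy age gaps 0..999
    print(choose_clusters(interval))
```

The decision applies to the intervals that follow the 16-cluster measurement interval, until the next phase change is signaled.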
Results. Figure 5 also shows results for such a mechanism
for three different fixed interval lengths. An interval
length of 1K instructions provides the best trade-off between
accuracy and fast reactions to phase changes. Overall,
it shows the same 11% improvement over the best static
base case. However, in a program like djpeg, it does much 
better (21%) than the interval-based scheme with explo­
ration because of its ability to target small phases with 
different requirements. Unfortunately, it takes a perfor­
mance hit in programs like galgel and gzip because the 
small interval-length and the noisy measurements result in 
frequent phase changes and inaccurate decision-making.
One of the primary reasons for this is the fact that the 
basic blocks executed in successive 1000 instruction inter­
vals are not always the same. As a result, frequent phase 
changes are signaled and each new phase change results in 
an interval with 16 clusters, to help determine the distant 
ILP. To alleviate this problem, we examine a fine-grain re­
configuration scheme at basic block boundaries.
4.4. Fine-Grain Reconfiguration
To allow reconfiguration at a fine granularity, we look 
upon every branch as a potential phase change. We need 
to determine if a branch is followed by a high degree of 
distant ILP, in which case, dispatch should continue freely, 
else, dispatch should be limited to only the first four clus­
ters. Exploring various configurations is not a feasible op­
tion as there are likely to be many neighboring branches in 
different stages of exploration resulting in noisy measure­
ments for each branch. Hence, until we have enough in­
formation, we assume dispatch to 16 clusters and compute 
the distant ILP characteristics following every branch. This 
is used to update a reconfiguration table so that when the 
same branch is later encountered, it is able to pick the right 
number of clusters. If we encounter a branch with no entry 
in the table, we assume a 16-cluster organization so that we 
can determine its degree of distant ILP.
Assuming that four clusters can support roughly 120 in­
structions, to determine if a branch is followed by distant 
ILP, we need to identify how many of the 360 committed 
instructions following a branch were distant when they is­
sued. Accordingly, either four or 16 clusters would be ap­
propriate. To effect this computation, we keep track of the 
distant ILP nature of the 360 last committed instructions. A 
single counter can be updated by the instructions entering 
and leaving this queue of 360 instructions so that a running 
count of the distant ILP can be maintained. When a branch 
happens to be the oldest of these 360 instructions, its degree 
of distant ILP is indicated by the value in the counter.
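The running count over the last 360 committed instructions can be modeled with a queue and a single counter. The sketch below is a software analogy under our own naming; in hardware the queue would be a small circular buffer of bits rather than a Python deque.

```python
from collections import deque

WINDOW = 360  # number of committed instructions tracked


class DistantIlpWindow:
    def __init__(self, size=WINDOW):
        self.size = size
        self.queue = deque()  # (pc, is_branch, distant) per committed instruction
        self.count = 0        # running number of distant instructions in the queue

    def commit(self, pc, is_branch, distant):
        """Add a committed instruction.

        Returns (branch_pc, degree_of_distant_ilp) when a branch leaves the
        queue, otherwise None.
        """
        sample = None
        if len(self.queue) == self.size:
            old_pc, old_is_branch, old_distant = self.queue[0]
            if old_is_branch:
                # The branch is the oldest entry, so the counter covers the
                # full window that starts at this branch.
                sample = (old_pc, self.count)
            self.queue.popleft()
            self.count -= int(old_distant)
        self.queue.append((pc, is_branch, distant))
        self.count += int(distant)
        return sample
```

Each commit costs one increment and one decrement at most, so the degree of distant ILP for a branch is available without rescanning the window.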
There is likely to still be some interference from neigh­
boring branches. To make the mechanism more robust, we 
sample the behavior for a number of instances of the same 
branch before creating an entry for it in the reconfiguration 
table. Further, we can fine-tune the granularity of reconfig­
uration by attempting changes only for specific branches. 
For example, we found that best performance was achieved 
when we attempted changes for only every fifth branch. We 
also show results for a mechanism that attempts changes 
only at subroutine calls and returns. We formalize the algo­
rithm below:
At every Nth branch, look up the reconfig table.
  If entry found, change to advised configuration.
  Else, use 16 clusters.
While removing a branch from the queue of 360
committed instrs,
  If M samples of this branch have been seen,
    Do not update table.
  Else,
    Record the latest sample.
    If this is the Mth sample,
      compute the advised configuration.
    Else,
      advised configuration is 16 clusters.
The downside of the approach just described is the fact 
that initial measurements dictate future behavior. The na­
ture of the code following a branch could change over the 
course of the program. It might not always be easy to de­
tect such a change, especially if only four clusters are being 
used and the degree of distant ILP is not evident. To deal 
with this situation, we flush the reconfiguration table at peri­
odic intervals. We found that re-constructing the table every 
10M instructions resulted in negligible overheads.
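A possible software model of the reconfiguration table, including the periodic flush, is sketched below. Several details are assumptions of this sketch rather than facts from the text: the advised configuration is computed by comparing the mean of the M samples against a threshold, the threshold value of 58 is a hypothetical placeholder (the text gives no value for the 360-instruction window), and the table is an unbounded dictionary instead of a fixed-size indexed structure.

```python
class ReconfigTable:
    def __init__(self, m_samples=10, n_branch=5, threshold=58,
                 flush_period=10_000_000):
        self.m = m_samples
        self.n = n_branch
        self.threshold = threshold        # hypothetical value, see lead-in
        self.flush_period = flush_period  # instructions between table flushes
        self.samples = {}                 # branch pc -> list of distant-ILP samples
        self.advice = {}                  # branch pc -> 4 or 16
        self.branches_seen = 0
        self.committed = 0

    def on_commit(self, n_instructions=1):
        """Count committed instructions and flush the table periodically."""
        self.committed += n_instructions
        if self.committed >= self.flush_period:
            self.samples.clear()
            self.advice.clear()
            self.committed = 0

    def lookup(self, pc):
        """Called at each branch; returns a cluster count at every Nth branch, else None."""
        self.branches_seen += 1
        if self.branches_seen % self.n != 0:
            return None
        return self.advice.get(pc, 16)  # no entry yet: assume 16 clusters

    def record(self, pc, degree):
        """Called when a branch leaves the queue of 360 committed instructions."""
        seen = self.samples.setdefault(pc, [])
        if len(seen) >= self.m:
            return  # M samples already seen: do not update the table
        seen.append(degree)
        if len(seen) == self.m:
            mean = sum(seen) / self.m  # assumed way of combining the samples
            self.advice[pc] = 16 if mean > self.threshold else 4
```

Until the Mth sample arrives, a branch has no entry and therefore keeps the 16-cluster default, which matches the last line of the pseudocode.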
Results. In Figure 6, in addition to the base cases and 
the interval-based scheme with exploration, we show IPCs
Figure 6. IPCs for the base cases, the interval-based al­
gorithm with exploration, and two fine-grained reconfig­
uration schemes. The first reconfigures at every fifth 
branch, the second at every subroutine call and return.
for two fine-grained reconfiguration schemes. The first 
attempts reconfiguration at every 5th branch and creates 
an entry in the table after collecting 10 samples for each 
branch. To eliminate effects from aliasing, we use a large 
16K-entry table, though, in almost all cases, a much smaller 
table works as well. The second scheme attempts changes 
at every subroutine call and return and uses three samples. 
The figure indicates that the ability to quickly react to phase 
changes results in improved performance in programs like 
djpeg, cjpeg, crafty, parser, and vpr. The maximum num­
ber of changes between configurations was observed for 
crafty (1.5 million). Unlike in the interval-based schemes 
with no exploration, instability is not caused by noisy mea­
surements. However, gzip fails to match the performance 
achieved by the interval-based scheme. This is because 
the nature of the code following a branch changes over the 
course of the program. Hence, our policy of using initial 
measurements to pick a configuration for the future is not 
always accurate. The same behavior is observed to a lesser 
extent in galgel. Overall, the fine-grained schemes yield 
a 15% improvement over the base cases, compared to the
11% improvement seen with the interval-based schemes.
From these results, we conclude that interval-based 
schemes with exploration are easy to implement, robust, 
and provide most of the speedups possible. Because of their 
tendency to pick a coarse interval length, a number of re­
configuration opportunities are missed. Choosing a small 
interval length is not the solution to this because of noisy 
measurements across successive small intervals. To allow 
fine-grained reconfigurations, we pick basic block bound­
aries as reconfiguration points and use initial measurements 
to predict future behavior. Except for gzip, such an ap­
proach does not trade off much accuracy and the hardware 
is able to quickly adapt to the program's needs. However, to 
get this additional 4% improvement, we have to invest some
non-trivial amount of hardware -  a table to keep track of the 
predictions and logic to maintain the distant ILP metric.
5. The Decentralized Cache Model
Clustered LSQ implementation. In the decentralized 
cache model, if an effective address is known when a mem­
ory instruction is renamed, then it can be directed to the 
cluster that caches the corresponding data. However, the 
effective address is generally not known at rename time, re­
quiring that we predict the bank that this memory operation 
is going to access. Based on this prediction, the instruction 
is sent to one of the clusters. Once the effective address is 
computed, appropriate recovery action has to be taken in the 
case of a bank misprediction.
If the operation is a load, recovery is simple - the ef­
fective address is sent to the correct cluster, where memory 
conflicts are resolved in the LSQ, data is fetched from the 
cache bank, and returned to the requesting cluster. If the 
memory operation is a store, the mis-direction could result 
in correctness problems. A load in a different cluster could 
have proceeded while being unaware of the existence of a 
mis-directed store to the same address. To deal with this 
problem, we adopt a policy similar to that in [40]. While 
renaming, a store whose effective address is unknown is 
assigned to a particular cluster (where its effective address 
is computed), but at the same time, a dummy slot is also 
created in the other clusters. Subsequent loads behind the 
dummy slot in other clusters are prevented from proceeding 
because there is an earlier store with an unresolved address 
that could potentially cause conflicts. Once the effective 
address is computed, the information is broadcast to all the 
clusters and the dummy slots in all the LSQs except one are 
removed. The broadcast increases the traffic on the inter­
connect for register and cache data (which we model).
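The dummy-slot policy can be illustrated with a small model of per-cluster load/store queues. This is a deliberately simplified sketch with names of our choosing: it only models the rule that a load waits behind any earlier store whose address is unresolved, and it omits the address comparison against resolved stores, the data forwarding, and the interconnect traffic.

```python
class ClusteredLSQ:
    def __init__(self, num_clusters):
        # One program-ordered queue of entries per cluster.
        self.queues = [[] for _ in range(num_clusters)]

    def rename_store(self, store_id, home_cluster):
        """Place the store in its assigned cluster and a dummy slot everywhere else."""
        for cluster, queue in enumerate(self.queues):
            kind = "store" if cluster == home_cluster else "dummy"
            queue.append({"id": store_id, "kind": kind, "addr": None})

    def rename_load(self, load_id, cluster):
        self.queues[cluster].append({"id": load_id, "kind": "load", "addr": None})

    def load_may_proceed(self, load_id, cluster):
        """A load is held back by any earlier store or dummy slot without an address."""
        for entry in self.queues[cluster]:
            if entry["id"] == load_id:
                return True
            if entry["kind"] in ("store", "dummy") and entry["addr"] is None:
                return False
        raise KeyError(load_id)

    def resolve_store(self, store_id, addr, owner_cluster):
        """Broadcast the computed address; keep the store only in the owning cluster."""
        for cluster, queue in enumerate(self.queues):
            if cluster == owner_cluster:
                for entry in queue:
                    if entry["id"] == store_id:
                        entry["kind"], entry["addr"] = "store", addr
            else:
                queue[:] = [e for e in queue if e["id"] != store_id]
```

For example, a load renamed into cluster 2 after a store assigned to cluster 0 stays blocked until `resolve_store` broadcasts the store's address, at which point the dummy slot in cluster 2 disappears.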
Bank prediction. Earlier work by Yoaz et al. [39] had 
proposed the use of branch-predictor-like tables to predict 
the bank accessed by a load or store. In our simulations, we 
use a two-level bank predictor with 1024 entries in the first 
level and 4096 entries in the second.
Steering heuristics. In a processor with a decentralized 
cache, the steering heuristic has to handle three data de­
pendences for each load or store -  the two source operands 
and the bank that caches the data. Since the transfer of 
cache data involves two communications (the address and 
the data), performance is maximized when a load or store 
is steered to the cluster that is predicted to cache the cor­
responding data (note that unlike in the centralized cache 
model, doing so does not increase load imbalance as the 
cache is not at a single location). Even so, frequent bank 
mispredictions and the increased traffic from store address 
broadcasts seriously impact performance. Ignoring these 
effects improved performance by 29%. At the same time, 
favoring the dependence from the cache bank results in in-
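[Figure 7 plot: IPC (y-axis) for cjpeg, crafty, djpeg, galgel, gzip, mgrid, parser, swim, vpr, and HM (x-axis). Legend, in bar order: 4 clusters; 16 clusters; variable-interval with exploration; interval length = 10K, no exploration; interval length = 1K, no exploration.]
[Figure 8 plot: IPC (y-axis) for cjpeg, crafty, djpeg, galgel, gzip, mgrid, parser, swim, vpr, and HM (x-axis). Legend, in bar order: 4 clusters; 16 clusters; variable-interval with exploration.]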
Figure 7. IPCs for dynamic interval-based mechanisms 
for the processor model with the decentralized cache.
Figure 8. IPCs for the dynamic interval-based mecha­
nism for the processor model with the grid interconnect.
creased register communication. Assuming free register 
communication improved performance by 27%. Thus, reg­
ister and cache traffic contribute equally to the communica­
tion bottleneck in such a system.
Disabling clusters. So far, our results have assumed a 
clustered processor with a centralized cache. Hence, recon­
figuration is only a matter of allowing the steering heuristic 
to dispatch to a subset of the total clusters. With a decentral­
ized cache, each cluster has a cache bank associated with it. 
Data is allocated to these cache banks in a word-interleaved 
manner. In going from 16 to four clusters, the number of 
cache banks and hence, the mapping of data to physical 
cache lines changes. To fix this problem, the least complex 
solution is to stall the processor while the L1 data cache is
flushed to L2. Fortunately, the bank predictor need not be 
flushed. With 16 clusters, the bank predictor produces a 4- 
bit prediction. When four clusters are used, the two lower 
order bits of the prediction indicate the correct bank.
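The reason the bank predictor survives reconfiguration follows from the word-interleaved mapping, as the short sketch below shows. The word size of 8 bytes and the function names are assumptions made for illustration.

```python
WORD_BYTES = 8  # hypothetical word size


def bank_of(addr, num_banks):
    """Word-interleaved mapping: consecutive words go to consecutive banks."""
    return (addr // WORD_BYTES) % num_banks


def narrow_prediction(pred16):
    """Reuse a 4-bit (16-bank) prediction when only four banks are active."""
    return pred16 & 0b11


if __name__ == "__main__":
    for addr in (0x0, 0x8, 0x28, 0x1F8):
        print(hex(addr), bank_of(addr, 16), narrow_prediction(bank_of(addr, 16)),
              bank_of(addr, 4))
```

Because both bank counts are powers of two, the bank index for four banks is exactly the two low-order bits of the index for 16 banks, so a correct 16-bank prediction remains correct after narrowing.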
Results. Because the indexing of data to physical cache 
locations changes, reconfiguration is not as seamless as in 
the centralized cache model. Every reconfiguration requires 
a stall of the processor and a cache flush. Hence, the fine­
grained reconfiguration schemes from the earlier section do 
not apply. Figure 7 shows IPCs for the base cases and the 
interval-based mechanisms. The third bar shows the scheme 
with exploration and a minimum interval length of 10K in­
structions. The fourth and fifth bars show interval-based 
schemes with no exploration and the use of distant ILP met­
rics to pick the best configuration. The simulation parame­
ters for the decentralized cache are summarized in Table 2. 
We find that the results trend is similar to that seen before 
for the centralized cache model. Except in the case of djpeg, 
there is no benefit from reconfiguring using shorter inter­
vals. Overall, the interval-based scheme with exploration 
yielded a 10% speedup over the base cases.
Since the dynamic scheme attempts to minimize recon­
figurations, cache flushes are kept to a minimum. Vpr
encountered the maximum number of writebacks due to 
flushes (400K), which resulted in a 1% IPC slowdown. 
Overall, these flushes resulted in a 0.3% IPC degradation.
6. Sensitivity Analysis
Our results have shown that the communication- 
parallelism trade-off greatly affects the scalability of differ­
ent programs as the number of clusters is increased for two 
important cache organizations. In this section, we confirm 
the applicability of our dynamic reconfiguration algorithms 
to other meaningful base cases. Some of the key param­
eters that affect the degree of communication and the de­
gree of distant ILP are the choice of interconnect between 
the clusters, the latency of communication across a hop, the 
number of functional units in each cluster, and the number 
of instructions that can be supported by each cluster (the 
number of registers and issue queue entries per cluster).
Figure 8 shows the effect of using a grid interconnect 
as described in Section 2.3 with a centralized cache model. 
Because of the better connectivity, the communication is 
less of a bottleneck and the performance of the 16-cluster 
organization is 8% better than that of the 4-cluster system. 
For brevity, we only show results with the interval-based 
scheme with exploration. The trend is as seen before, but 
because the communication penalty is not as pronounced, 
the overall improvement over the best base case is only 7%. 
The use of fine-grained reconfiguration techniques yields 
qualitatively similar results as with the ring interconnect.
We also studied the sensitivity of the results to the sizes 
of various resources within a cluster. We studied the ef­
fect of using fewer (10 issue queue entries and 20 registers 
per cluster) and more resources (20 issue queue entries and 
40 registers per cluster). When there are few resources per 
cluster, more clusters are required, on average, to exploit 
the available parallelism. Hence, the 16-cluster system is 
a favorable base case and the improvement of the interval- 
based dynamic mechanism relative to it is only 8%. When
there are more resources per cluster, using a few clusters 
for low-ILP phases is highly beneficial. As a result, the improvement
over the 16-cluster base is 13%. With more
functional units per cluster, our results were very similar to
those in Section 4.2. Doubling the cost of communication 
across each hop results in a highly communication-bound 
16-cluster system. By employing the dynamic mechanism 
and using fewer clusters for low-ILP phases, a 23% performance
improvement was seen.
These results are qualitatively similar to the improve­
ments seen with the interval-based schemes in the earlier 
subsections, indicating that the dynamically tunable design 
can help improve performance significantly across a wide 
range of processor parameters. Thus, the communication- 
parallelism trade-off and its management are likely to be 
important in most processors of the future.
7. Related Work
A number of proposals based on clustered processors 
have emerged over the past decade [3, 8, 11, 12, 13, 17, 23,
26, 28, 30, 32, 33, 35]. These differ in the kinds of resources
that get allocated, the instruction steering heuristics, and the 
semantics for cross-cluster communication. The cache is a 
centralized structure in all these models. These studies as­
sume a small number of total clusters with modest commu­
nication costs.
Cho et al. [14,15] cluster the cache and LSQ, but not the 
rest of the processor. Stack and frame data are cached in 
a separate bank and loads and stores are steered to one of 
two streams early in the pipeline. Yoaz et al. [39] anticipate 
the importance of splitting accesses across multiple streams 
and propose predictors for the same.
Recently, Zyuban and Kogge [40] incorporated a clus­
tered cache in their study on the power efficiency of a clus­
tered processor. Our implementation of the decentralized 
cache closely resembles theirs. A recent study by Aggarwal 
and Franklin [2] explores the performance of various steering
heuristics as the number of clusters scales up. Theirs is
the only study that looks at as many as 12 clusters and pro­
poses the use of a ring interconnect. They conclude that the 
best steering heuristic varies depending on the number of 
clusters and the processor model. To take this into account, 
each of our clustered organizations was optimized by tuning 
the various thresholds in our steering heuristic.
Many recent bodies of work [4, 6, 7, 10, 16, 19, 20, 22,
31, 38] have looked at hardware units with multiple con­
figuration options and algorithms for picking an appropri­
ate configuration at run-time. Many of these algorithms are 
interval-based, in that, they monitor various statistics over 
a fixed interval of instructions or cycles and make config­
uration decisions based on that information. Ours is the 
first proposal that identifies the importance of a variable-length
instruction interval and incorporates this in the selection
algorithm. We are also the first to look at fine-grained
reconfiguration at branch boundaries and contrast it with 
interval-based schemes. Huang et al. [21] study adaptation 
at subroutine boundaries and also demonstrate that this can 
be more effective than using fixed instruction intervals.
Agarwal et al. [1] show that processors in future generations
are likely to suffer from lower IPCs because of the
high cost of wire delays. Ours is the first study to focus on a 
single process technology and examine the effects of adding 
more resources. The clustered processor model exposes a 
clear trade-off between communication and parallelism, and
it readily lends itself to low-cost reconfiguration.
8. Conclusion
We have presented and evaluated the effects of shrink­
ing process technologies and dominating wire delays on the 
design of future clustered processors. While increasing the 
number of clusters to take advantage of the increasing chip 
densities improves the processor’s ability to support multi­
ple threads, the performance of a single thread can be ad­
versely affected. This is because such processors are bound 
by cross-cluster communication costs. These costs can tend 
to dominate any increased extraction of instruction-level 
parallelism as the processor is scaled to large numbers of 
clusters. We have demonstrated that dynamically choos­
ing the number of clusters using an exploration-based ap­
proach at regular intervals is effective in optimizing the 
communication-parallelism trade-off for a single thread. It 
is applicable to almost every program and yields average 
performance improvements of 11% over our base architecture.
In order to exploit phase changes at a fine grain, additional
hardware has to be invested, allowing overall improvements
of 15%. Since 8.3 clusters, on average, are disabled
by the reconfiguration schemes, there is the potential
to save a great deal of leakage energy in single-threaded 
mode. The throughput of a multi-threaded workload can 
also be improved by avoiding cross-thread interference by 
dynamically dedicating a set of clusters to each thread. We 
have verified the validity of our results for a number of 
interesting processor models, thus highlighting the impor­
tance of the management of the communication-parallelism 
trade-off in future processors.
References
[1] V. Agarwal, M. Hrishikesh, S. Keckler, and D. Burger. Clock
Rate versus IPC: The End of the Road for Conventional
Microarchitectures. In Proceedings of ISCA-27, pages 248-259, 2000.
[2] A. Aggarwal and M. Franklin. An Empirical Study of the
Scalability Aspects of Instruction Distribution Algorithms
for Clustered Processors. In Proceedings of ISPASS, 2001.
[3] H. Akkary and M. Driscoll. A Dynamic Multithreading
Processor. In Proceedings of MICRO-31, 1998.
[4] D. Albonesi. Dynamic IPC/clock rate optimization. Pro­
ceedings of the 25th International Symposium on Computer 
Architecture, pages 282-292, June 1998.
[5] Semiconductor Industry Association. The National Technology
Roadmap for Semiconductors. Technical report, 1999.
[6] R. I. Bahar and S. Manne. Power and Energy Reduction Via 
Pipeline Balancing. In Proceedings of ISCA-28, July 2001.
[7] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and
S. Dwarkadas. Memory Hierarchy Reconfiguration for Energy
and Performance in General-Purpose Processor Architectures.
In Proceedings of MICRO-33, pages 245-257, Dec 2000.
[8] A. Baniasadi and A. Moshovos. Instruction Distribution 
Heuristics for Quad-Cluster, Dynamically-Scheduled, Su­
perscalar Processors. In Proceedings of MICRO-33, pages 
337-347, Dec 2000.
[9] D. Burger and T. Austin. The Simplescalar Toolset, Ver­
sion 2.0. Technical Report TR-97-1342, University of 
Wisconsin-Madison, June 1997.
[10] A. Buyuktosunoglu, S. Schuster, D. Brooks, P. Bose, 
P. Cook, and D. Albonesi. An Adaptive Issue Queue for Re­
duced Power at High Performance. In Workshop on Power- 
Aware Computer Systems (PACS2000, held in conjunction 
with ASPLOS-IX), Nov 2000.
[11] R. Canal, J. M. Parcerisa, and A. Gonzalez. Dynamic Cluster
Assignment Mechanisms. In Proceedings of HPCA-6, 2000.
[12] R. Canal, J. M. Parcerisa, and A. Gonzalez. Dynamic Code 
Partitioning for Clustered Architectures. International Journal
of Parallel Programming, 29(1):59-79, 2001.
[13] A. Capitanio, N. Dutt, and A. Nicolau. Partitioned Register 
Files for VLIWs: A Preliminary Analysis of Trade-offs. In 
Proceedings of MICRO-25, 1992.
[14] S. Cho, P.-C. Yew, and G. Lee. Access Region Locality for 
High-Bandwidth Processor Memory System Design. In Pro­
ceedings of MICRO-32, pages 136-146, 1999.
[15] S. Cho, P.-C. Yew, and G. Lee. Decoupling Local Variable 
Accesses in a Wide-Issue Superscalar Processor. In Proceedings
of ISCA-26, pages 100-110, 1999.
[16] A. Dhodapkar and J. E. Smith. Managing Multi-Configurable
Hardware via Dynamic Working Set Analysis.
In Proceedings of ISCA-29, May 2002.
[17] K. Farkas, P. Chow, N. Jouppi, and Z. Vranesic. The Mul­
ticluster Architecture: Reducing Cycle Time through Parti­
tioning. In Proceedings of MICRO-30, 1997.
[18] B. Fields, S. Rubin, and R. Bodik. Focusing Processor Policies
via Critical-Path Prediction. In Proceedings of ISCA-28,
July 2001.
[19] D. Folegnani and A. Gonzalez. Reducing Power Con­
sumption of the Issue Logic. In Workshop on Complexity- 
Effective Design (WCED2000, held in conjunction with 
ISCA-27), June 2000.
[20] S. Ghiasi, J. Casmira, and D. Grunwald. Using IPC Vari­
ations in Workloads with Externally Specified Rates to Re­
duce Power Consumption. In Workshop on Complexity Ef­
fective Design (WCED2000, held in conjunction with ISCA- 
27), June 2000.
[21] M. Huang, J. Renau, and J. Torrellas. Positional Adapta­
tion of Processors: Applications to Energy Reduction. In 
Proceedings of ISCA-30, June 2003.
[22] M. Huang, J. Renau, S. Yoo, and J. Torrellas. A Framework
for Dynamic Energy Efficiency and Temperature Management.
In Proceedings of MICRO-33, pages 202-213, Dec 2000.
[23] S. Keckler and W. Dally. Processor Coupling: Integrating
Compile Time and Runtime Scheduling for Parallelism. In
Proceedings of ISCA-19, May 1992.
[24] R. Kessler. The Alpha 21264 Microprocessor. IEEE Micro, 
19(2):24-36, March/April 1999.
[25] C. Lee, M. Potkonjak, and W. Mangione-Smith. Mediabench:
A Tool for Evaluating and Synthesizing Multimedia
and Communications Systems. In Proceedings of MICRO- 
30, pages 330-335, 1997.
[26] P. Lowney, S. Freudenberger, T. Karzes, W. Lichtenstein,
R. Nix, J. O’Donnell, and J. Ruttenberg. The Multiflow
Trace Scheduling Compiler. Journal of Supercomputing,
7(1-2):51-142, May 1993.
[27] D. Matzke. Will Physical Scalability Sabotage Performance 
Gains? IEEE Computer, 30(9):37-39, Sept 1997.
[28] R. Nagarajan, K. Sankaralingam, D. Burger, and S. Keckler.
A Design Space Evaluation of Grid Processor Architectures.
In Proceedings of MICRO-34, Dec 2001.
[29] S. Palacharla, N. Jouppi, and J. Smith. Complexity-Effective 
Superscalar Processors. In Proceedings of ISCA-24, 1997.
[30] J.-M. Parcerisa, J. Sahuquillo, A. Gonzalez, and J. Duato. 
Efficient Interconnects for Clustered Microarchitectures. In 
Proceedings of PACT, Sep 2002.
[31] D. Ponomarev, G. Kucuk, and K. Ghose. Reducing Power 
Requirements of Instruction Scheduling Through Dynamic 
Allocation of Multiple Datapath Resources. In Proceedings 
of MICRO-34, Dec 2001.
[32] N. Ranganathan and M. Franklin. An Empirical Study of 
Decentralized ILP Execution Models. In Proceedings of 
ASPLOS-VIII, pages 272-281, 1998.
[33] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith. Trace 
Processors. In Proceedings of MICRO-30, 1997.
[34] P. Shivakumar and N. P. Jouppi. CACTI 3.0: An Integrated 
Cache Timing, Power, and Area Model. Technical Report 
TN-2001/2, Compaq Western Research Laboratory, August 2001.
[35] G. Sohi, S. Breach, and T. Vijaykumar. Multiscalar Proces­
sors. In Proceedings of ISCA-22, 1995.
[36] D. Tullsen, S. Eggers, and H. Levy. Simultaneous Multi­
threading: Maximizing On-Chip Parallelism. In Proceed­
ings of ISCA-22, pages 392-403, 1995.
[37] E. Tune, D. Liang, D. Tullsen, and B. Calder. Dynamic
Prediction of Critical Path Instructions. In Proceedings of
HPCA-7, Jan 2001.
[38] S. Yang, M. Powell, B. Falsafi, K. Roy, and T. Vijaykumar. 
An Integrated Circuit/Architecture Approach to Reducing 
Leakage in Deep Submicron High-Performance I-Caches. In 
Proceedings of HPCA-7, Jan 2001.
[39] A. Yoaz, M. Erez, R. Ronen, and S. Jourdan. Specu­
lation Techniques for Improving Load Related Instruction 
Scheduling. In Proceedings of ISCA-26, pages 42-53, 1999.
[40] V. Zyuban and P. Kogge. Inherently Lower-Power High- 
Performance Superscalar Architectures. IEEE Transactions 
on Computers, Mar 2001.
Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA'03)
1063-6897/03 $17.00 © 2003 IEEE
