Microarchitectural techniques to reduce interconnect power in clustered processors by Balasubramonian, Rajeev & Ramani, Karthik
Microarchitectural Techniques to Reduce Interconnect Power in Clustered
Processors
Karthik Ramani*, Naveen Muralimanohar*, Rajeev Balasubramonian * 
* Department of Electrical and Computer Engineering 
School of Computing 
University of Utah
Abstract
The paper presents a preliminary evaluation of
novel techniques that address a growing problem -
power dissipation in on-chip interconnects. Recent
studies have shown that around 50% of the dynamic
power consumption in modern processors is within
on-chip interconnects. The contribution of interconnect
power to total chip power is expected to be higher
in future communication-bound billion-transistor
architectures. In this paper, we propose the design of
a heterogeneous interconnect, where some wires are
optimized for low latency and others are optimized
for low power. We show that a large fraction of
on-chip communications are latency insensitive. Effecting
these non-critical transfers on low-power long-latency
interconnects can result in significant power savings
without unduly affecting performance. Two primary
techniques are evaluated in this paper: (i) a dynamic
critical path predictor that identifies results that are
not urgently consumed, and (ii) an address prediction
mechanism that requires addresses to be transferred
off the critical path for verification purposes. Our
results demonstrate that 49% of all interconnect
transfers can be effected on power-efficient wires, while
incurring a performance penalty of only 2.5%.
1. Introduction
The shrinking of process technologies has enabled 
huge transistor budgets on a single chip. To exploit 
these transistors for high performance processing, nu­
merous partitioned architectures have been proposed 
[9, 15, 17, 18, 19, 20]. A partitioned architecture em­
ploys small processing cores (also referred to as clus­
ters) with an interconnect fabric and distributes in­
structions of a single application across the processing 
cores. By implementing small cores, fast clock speeds 
and low design complexity can be achieved. Since 
a single application is distributed across the clusters, 
partitioned architectures inevitably entail frequent data 
transfers across the chip. Thus, future processors are 
likely to be extremely communication-bound, from the 
point of view of performance and power.
Studies [1, 14, 19] have shown that wire delays do 
not scale down at the same rate as logic delays. As a 
result, the delay to send a signal across the diameter 
of a chip will soon be of the order of 30 cycles. These 
long wire delays serve as a serious performance limiter 
at future technology generations. Further, it has been 
shown that on-chip interconnects account for roughly 
50% of the total power dissipation in modern proces­
sors [16]. Thus, the design of the interconnect fabric 
has a strong influence on processor performance and 
power.
This paper examines performance and power trade­
offs in the design of the inter-cluster communication 
network. Most evaluations on partitioned or clustered 
architectures have focused on instruction distribution 
algorithms that minimize communication and load im­
balance. However, power optimizations of the inter­
connect at the microarchitectural level have received 
little attention. We propose and evaluate microarchitectural
techniques that can exploit a heterogeneous
network with varying performance and power charac­
teristics.
A performance-centric approach attempts to opti­
mize wires to minimize delay. This entails the use 
of optimally spaced repeaters, large drivers, and wide 
wires, all of which increase power dissipation in the in­
terconnect. By optimizing wires for power efficiency, 
performance is compromised. For example, by elimi­
nating repeaters, wire delay becomes a quadratic func­
tion of wire length. To alleviate power consumption 
bottlenecks in future processor generations, we pro­
pose the design of a heterogeneous network, where 
half the wires are optimized for delay and the other 
half for power. We demonstrate that data transfers 
on the interconnect fabric have varying delay require­
ments. Many transfers are not on the program criti­
cal path and can tolerate longer communication delays. 
By effecting these transfers on the power-efficient net­
work, significant reductions in power are observed 
with minimal impact on performance. We identify two 
major sources of non-critical transfers - (i) values that 
are not urgently sourced by consuming instructions, 
and (ii) values that are being transferred to verify ad­
dress predictions.
The rest of the paper is organized as follows. Sec­
tion 2 describes the communication-bound processor 
that serves as the evaluation framework. We propose 
techniques that can exploit a heterogeneous intercon­
nect in Section 3 and evaluate them in Section 4. Sec­
tion 5 discusses related work and we conclude in Sec­
tion 6.
2. The Base Clustered Processor
Most proposals for billion-transistor processors em­
ploy a partitioned architecture [9, 15, 17, 18, 19, 20]. 
The allocation of instructions to computational units 
can be performed either statically [9, 12, 15, 17, 18, 
23] or dynamically [5, 19, 20]. All of these designs 
experience a large number of data transfers across the 
chip. For the purpose of this study, we focus on one 
example implementation of a partitioned architecture 
-  a dynamically scheduled general-purpose clustered 
processor. The solutions proposed in this paper are ap­
plicable to other architectures as well and are likely to 
be equally effective.
Figure 1. The 16-cluster system with four sets of
four clusters each and a centralized LSQ and data
cache. A crossbar interconnect is used for communication
within a set of clusters and a ring connects
the four crossbar routers.
The clustered processor that serves as an evaluation
platform in this study has been shown to work well for
many classes of applications with little or no compiler
enhancements [2, 5, 8, 11, 12, 26]. In this processor
(shown in Figure 1), the front-end is centralized. During
register renaming, instructions are assigned to one
of 16 clusters. Each cluster has a small issue queue,
physical register file, and a limited number of functional
units with a single cycle bypass network among
them. If an instruction’s source operands are in a dif­
ferent cluster, copy instructions are inserted in the pro­
ducing clusters and the values are copied into physical 
registers in the consuming cluster. To minimize com­
munication cost and load imbalance, we implement in­
struction steering heuristics that represent the state-of- 
the-art. These heuristics incorporate information on 
dependence chains, critical operands, load, physical 
location, etc., and have been extensively covered in 
other papers [5, 11, 24].
The load/store queue (LSQ) and L1 data cache are 
centralized structures. Effective addresses for loads 
and stores are computed in one of the clusters and 
then sent to the centralized LSQ. The LSQ checks for 
memory dependences before issuing loads to the data 
cache and returning data back to the requesting clus­
ter. While distributed LSQ and cache implementations 
have been proposed [13, 26], we employ a centralized 
LSQ and cache because a distributed implementation 
entails significant complexity and offers only modest 
performance improvements [4, 13].
Aggarwal and Franklin [3] point out that a crossbar 
has better performance when connecting a small num­
ber of clusters, while a ring interconnect performs bet­
ter when the number of clusters is increased. To take 
advantage of both characteristics, they propose a hier­
archical interconnect (Figure 1), where a crossbar con­
nects four clusters and a ring connects multiple sets of 
four clusters. This allows low-latency communication 
between nearby clusters. Each link on the interconnect 
has a throughput of two transfers per cycle to deal with




A major bottleneck limiting the performance of any 
partitioned architecture is the communication of data 
between producer and consumer instructions in differ­
ent processing units. An ideal design would employ a 
high speed wire that consumed as little power as pos­
sible. Magen et al. [16] show that 90% of the dy­
namic power consumed in the interconnect is in 10% 
of the wires. The global wires in the partitioned architecture,
such as the crossbar and inter-crossbar wires
that connect the different clusters, are likely to be major
contributors to interconnect power. Data communication
on the interconnect can be categorized as follows:
register values, store data, load data, and effective
addresses for loads and stores. Register value
transfers account for 54% of all inter-cluster commu­
nication, while the transfer of data for store instruc­
tions accounts for around 5%, data produced by load 
instructions accounts for 15%, and the transfer of ef­
fective addresses accounts for 24% of all communica­
tion. In the subsequent subsections, we show that a 
number of transfers in each of these categories is la­
tency insensitive.
3.2 Power-Performance Trade-Offs in Interconnect Design
The premise behind the paper is that there exists a 
trade-off between performance and power consump­
tion in the design of on-chip interconnects. An inter­
connect that is optimized for low power is likely to 
have longer delays and this has a debilitating effect on 
program performance (quantified in Section 4). There­
fore, we propose the design of a heterogeneous in­
terconnect where half of the wires are optimized for 
low delay and the other half are optimized for low 
power. As part of a preliminary evaluation, our fo­
cus has been the identification of latency insensitive 
interconnect transfers and a quantification of the per­
formance impact of effecting them on low power, long 
latency wires. A detailed characterization of the power 
and performance of different interconnect implemen­
tations remains future work. Here, we qualitatively
discuss one of the techniques that can be employed to 
reduce interconnect power, although at a performance 
cost.
Banerjee and Mehrotra [7] developed a methodology to calculate
the repeater size and interconnect length that mini­
mizes the total interconnect power dissipation for any 
given delay penalty. The premise behind their ap­
proach is also based on the fact that not all global 
interconnects are on the critical path and hence, a 
delay penalty can be tolerated on these non-critical 
power efficient interconnects. As process technolo­
gies scale beyond 130nm, leakage power contributes 
significantly to the interconnect power and hence their 
technique provides greater power savings. For a delay
penalty of 20%, the optimal interconnect saves
50% and 70% power in 100 and 50 nm technologies, respectively.
Clearly, with technology scaling, leakage power dissi­
pation becomes the dominating component of the total 
power dissipation and the use of small repeaters can 
help lower the total interconnect power.
3.3 Identifying Critical Register Operands
To identify latency insensitive communications, the 
first group of data transfers that we target are register 
values that get bypassed to consuming instructions as 
soon as they are generated by producing instructions.
Bypass of Register Values
In general, each instruction executed by a processor 
requires one or more input register operands for its ex­
ecution. The instructions producing the operands wake 
up the consumer as soon as they generate the operand 
values. A consumer instruction cannot issue until all 
of the input operands have arrived. The operand that 
is produced last is more critical and needs to arrive 
at the consumer as early as possible, while the other 
operands can arrive as late as the last operand. In some 
cases, even if all the operands are ready, the instruction 
can wait in the issue queue if there is heavy contention 
for the functional units. A table that can classify data 
based on their arrival time and usage can be used to 
send them either through a fast critical network or a 
slow non-critical network.
We begin with a high-level description of how criticality
information is gathered. Each instruction in the
issue queue keeps track of the time difference between 
the arrival of an input operand and its actual execution.
If the time difference is significant, the transfer of the 
input operand is considered latency insensitive. When 
the instruction completes, along with the completion 
signal to the reorder buffer (ROB), a couple of bits are 
transmitted, indicating the criticality of each input
operand. The ROB is augmented to keep track of
a few PC bits for the producer of each input operand. 
These bits are used to index into a Criticality Predictor 
that is a simple array of saturating counters, similar to 
a branch predictor. When an instruction is dispatched 
to a cluster, its criticality prediction is also sent to the 
cluster. When the instruction completes, its result is 
transferred to consumers on the low latency or low 
power network depending on the criticality prediction. 
Thus, some non-trivial hardware overhead has been in­
troduced in the centralized front-end. This overhead 
may be acceptable in an architecture where the power 
consumed by on-chip communications far outweighs 
the power consumed in the criticality predictor, re­
name, and ROB stages. This implementation serves 
as an example design point to evaluate the number of 
non-critical transfers. Other simpler implementations 
of criticality predictors are possible -  for example, in­
structions following low-confidence branches. As fu­
ture work, we plan to evaluate the behavior of other 
criticality predictors.
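The mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the evaluated hardware: the table size (`index_bits`), the 2-bit counter width, the initial counter value, and the slack threshold that counts as a "significant" time difference are all hypothetical choices, since the text does not specify them.

```python
class CriticalityPredictor:
    """Array of saturating counters indexed by a few producer PC bits.

    All sizes and thresholds below are illustrative assumptions.
    """

    def __init__(self, index_bits=10, slack_threshold=4):
        self.index_bits = index_bits            # assumed: 2**10 counters
        self.slack_threshold = slack_threshold  # assumed: cycles of slack deemed significant
        # Assumed policy: start at the maximum value so that unknown
        # producers are treated as critical and use the fast wires.
        self.counters = [3] * (1 << index_bits)

    def _index(self, producer_pc):
        # Alpha instructions are 4 bytes wide, so drop the two low PC bits.
        return (producer_pc >> 2) & ((1 << self.index_bits) - 1)

    def update(self, producer_pc, arrival_cycle, issue_cycle):
        """Train on the gap between operand arrival and consumer issue."""
        slack = issue_cycle - arrival_cycle
        i = self._index(producer_pc)
        if slack >= self.slack_threshold:
            # Operand waited a long time: the transfer was latency insensitive.
            self.counters[i] = max(0, self.counters[i] - 1)
        else:
            # Operand was consumed almost immediately: the transfer was critical.
            self.counters[i] = min(3, self.counters[i] + 1)

    def is_critical(self, producer_pc):
        """Prediction sent to the cluster when the producer is dispatched."""
        return self.counters[self._index(producer_pc)] >= 2


predictor = CriticalityPredictor()
# A producer whose result repeatedly sits idle for 20 cycles at the consumer.
for _ in range(2):
    predictor.update(producer_pc=0x1000, arrival_cycle=10, issue_cycle=30)
network = "low-latency" if predictor.is_critical(0x1000) else "low-power"
```

With these assumed parameters, two consecutive long-slack observations are enough to move the producer's results onto the low-power network, and a single short-slack observation moves them back.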
Transfer of Ready Register Operands
The above discussion targets register values that 
have to be urgently bypassed to dependent instruc­
tions as soon as they are produced. There exists an­
other class of register input operands that are already 
ready when an instruction is dispatched by the front- 
end. Since it might take many cycles for the instruc­
tion to be dispatched to a cluster and for it to begin 
execution, the transfer of its ready register operands 
to the consuming cluster is often latency insensitive. 
We observed that all such transfers can be effected on 
the slow power-efficient network with a minimal im­
pact on performance. Such an implementation is es­
pecially favorable as it does not entail additional hard­
ware overhead for prediction mechanisms.
3.4 Communications for Cache Access
Cache Access in the Base Case
The earlier subsection describes the identification
of register values that are not urgently consumed by 
other instructions. Register values represent a subset 
of the traffic observed on the inter-cluster network. All 
loads and stores compute their effective addresses in 
the clusters and forward them to the centralized LSQ. 
A load in the LSQ waits until its memory dependences 
are resolved before accessing the L1 data cache and 
forwarding the result back to the requesting cluster. 
For store instructions, the data to be stored in the cache 
is computed in one of the clusters and forwarded to the 
centralized LSQ. The LSQ forwards this data to any 
dependent loads and eventually writes it to the data 
cache when the store commits. Thus, in addition to 
the transfer of 64-bit register values, the inter-cluster 
network is also responsible for the transfer of load ef­
fective addresses, store effective addresses, load data, 
and store data.
Non-Critical Load Data
The data produced by a load instruction may be crit­
ical or not, depending on how quickly the consuming 
instructions issue. The criticality predictor described 
in the previous subsection can identify load instruc­
tions that produce critical data. Thus, it is possible to 
employ the techniques described earlier to send load 
data on either the critical or non-critical network. In 
order to not further complicate the design of the criticality
predictor, for the purposes of this study, we assume
that all load data is sent to the requesting cluster
via the fast critical network.
Non-Critical Store Data
Store data, on the other hand, is often non-critical. 
The late arrival of store data at the LSQ may delay exe­
cution in the following ways: (i) dependent loads have 
to wait longer, (ii) the commit process may be stalled if 
the store is at the head of the reorder buffer. As our re­
sults in the next section show, both of these events are 
infrequent and store data can be indiscriminately sent 
on a slower, power-efficient network. In our bench­
mark set, about 5% of all interconnect traffic can be 
attributed to store data, providing ample opportunity 
for power savings.
Parameter                     Value
Fetch queue size              64
Branch predictor              comb. of bimodal and 2-level
Bimodal predictor size        2048
Level 1 predictor             1024 entries, history 10
Level 2 predictor             4096 entries
BTB size                      2048 sets, 2-way
Branch mispredict penalty     at least 12 cycles
Fetch width                   8 (across up to 2 basic blocks)
Dispatch and commit width     16
Issue queue size              15 per cluster (int and fp, each)
Register file size            30 per cluster (int and fp, each)
[label missing in source]     480
[label missing in source]     1/1 (in each cluster)
[label missing in source]     1/1 (in each cluster)
[label missing in source]     32KB 2-way
[label missing in source]     32KB 2-way set-associative, 6 cycles, 4-way word-interleaved
L2 unified cache              2MB 8-way, 25 cycles
I and D TLB                   128 entries, 8KB page size
Memory latency                160 cycles for the first chunk
Table 1. Simplescalar simulator parameters.
Non-Critical Load and Store Effective Addresses
Load and store effective address transfers are usually
on the critical path -  store addresses are urgently
required to resolve memory dependences and load addresses
are required to initiate cache access. For many
programs, accurate address prediction and memory de­
pendence speculation can accelerate cache access [4]. 
For loads, at the time of instruction dispatch, the ef­
fective address and memory dependences can be pre­
dicted. This allows the cache access to be initiated 
without waiting for the clusters to produce load and 
store effective addresses. As soon as the cache is ac­
cessed, data is returned to the cluster that houses the 
load instruction. When the cluster computes the effec­
tive address, it is sent to the centralized LSQ to ver­
ify that the address prediction was correct. The effec­
tive address and memory dependence predictors can be 
tuned to only make high-confidence predictions, caus­
ing the mispredict rate to be much lower than 1% for 
most programs [4]. With such an implementation, the 
transfer of the effective address from the cluster to the 
centralized LSQ is no longer on the critical path -  de­
laying the verification of address predictions does not 
significantly degrade the instruction execution rate.
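The following Python sketch illustrates how confidence-gated address prediction takes the effective address transfer off the critical path. The text does not describe the internals of the predictor from [4], so the last-address-plus-stride scheme, the confidence threshold, and all names used here are assumptions made purely for illustration.

```python
class AddressPredictor:
    """Hypothetical stride-based predictor with a confidence counter per load PC."""

    def __init__(self, confidence_threshold=3, max_confidence=3):
        self.threshold = confidence_threshold
        self.max_confidence = max_confidence
        self.table = {}  # load PC -> (last_address, stride, confidence)

    def predict(self, pc):
        """At dispatch: return a predicted address, or None if confidence is low."""
        entry = self.table.get(pc)
        if entry is None or entry[2] < self.threshold:
            return None
        last_address, stride, _ = entry
        return last_address + stride

    def verify(self, pc, actual, predicted):
        """When the cluster's address reaches the LSQ: train, and report a mispredict."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (actual, 0, 0)
        else:
            last_address, stride, confidence = entry
            new_stride = actual - last_address
            if new_stride == stride:
                confidence = min(self.max_confidence, confidence + 1)
            else:
                confidence = 0
            self.table[pc] = (actual, new_stride, confidence)
        return predicted is not None and predicted != actual


def address_network(predicted):
    # A predicted address only needs verification, so its transfer can be slow.
    return "low-power" if predicted is not None else "low-latency"


predictor = AddressPredictor()
networks = []
mispredicts = 0
for actual in (100, 108, 116, 124, 132, 140):   # a load striding by 8 bytes
    guess = predictor.predict(0x40)
    networks.append(address_network(guess))
    mispredicts += predictor.verify(0x40, actual, guess)
```

In this run the first five addresses travel on the low-latency network while the predictor builds confidence; the sixth is predicted correctly, so its transfer is only a verification and can use the low-power wires.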
3.5 Summary
Thus, we observe that there are two primary sources 
of delay-insensitive transfers in a communication- 
bound processor. (i) Data that is not immediately con­
sumed by dependent instructions: Of these, register 
results produced by instructions, including loads, can 
be identified as non-critical by our criticality predictor. 
Store data is always considered non-critical. Input reg­
ister operands that have already been generated when 
the consumer instruction is dispatched, are also always 
considered non-critical. (ii) Data transfers that verify 
predictions: If load and store effective addresses are 
accurately predicted at the centralized LSQ, the trans­
fer of these addresses from the clusters to the LSQ hap­
pens only for verification purposes and does not lie on 
the critical path for any instruction’s execution.
Section 4 identifies the performance impact of send­




Our simulator is based on Simplescalar-3.0 [10] for 
the Alpha AXP ISA. Separate issue queues and physical
register files are modeled for each cluster. Contention
on the interconnects and for memory hierarchy
resources (ports, banks, buffers, etc.) is modeled in
detail. To model a wire-delay-constrained processor,
each of the 16 clusters is assumed to have 30 physi­
cal registers (int and fp, each), 15 issue queue entries 
(int and fp, each), and one functional unit of each kind. 
While we do not model a trace cache, we fetch instruc­
tions from up to two basic blocks in a cycle. Important 
simulation parameters are listed in Table 1.
The latencies on the interconnects would depend 
greatly on the technology, processor layout, and avail­
able metal area. The estimation of some of these 
parameters is beyond the scope of this study. For 
the base case, we make the following reasonable as­
sumptions: it takes a cycle to send data to the cross­
bar router, a cycle to receive data from the crossbar 
router, and four cycles to send data between cross­
bar routers. Thus, the two most distant clusters on 
the chip are separated by 10 cycles. Considering that 
Agarwal et al. [1] project 30-cycle worst-case on-chip 
latencies at 0.035µ technology, we expect this choice
of latencies to be representative of wire-limited future
microprocessors (it must be noted that the L2 would
account for a large fraction of chip area). We assume
that each communication link is fully pipelined, allowing
the initiation of a new transfer every cycle. Our results
also show the effect of a network that has communication
latencies that are a factor of two higher. By modeling such
a communication-bound processor, we clearly isolate
the performance differences between the different simulated
cases. Similar result trends were also observed
when assuming latencies that were lower by a factor of 
two. As Agarwal et al. predict [1], IPCs observed are 
low due to a very high clock rate and presence of wire 
delay penalties.
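The base-case latency assumptions can be checked with a small Python sketch. The cluster numbering (clusters 0-3 on the first crossbar, 4-7 on the second, and so on) and the assumption that the ring can be traversed in either direction are ours; the latter is what makes the stated 10-cycle worst case work out with four crossbar routers.

```python
CLUSTERS_PER_SET = 4   # clusters sharing one crossbar
NUM_SETS = 4           # crossbar routers on the ring
TO_ROUTER = 1          # cycles from a cluster to its crossbar router
FROM_ROUTER = 1        # cycles from a crossbar router to a cluster
RING_HOP = 4           # cycles between adjacent crossbar routers


def transfer_latency(src, dst, scale=1):
    """Uncontended cycles to move a value from cluster src to cluster dst.

    scale=2 models the wires whose latencies are a factor of two higher.
    """
    if src == dst:
        return 0  # the value stays within the cluster
    distance = abs(src // CLUSTERS_PER_SET - dst // CLUSTERS_PER_SET)
    hops = min(distance, NUM_SETS - distance)  # assumed bidirectional ring
    return scale * (TO_ROUTER + RING_HOP * hops + FROM_ROUTER)


worst_case = max(transfer_latency(s, d) for s in range(16) for d in range(16))
worst_case_slow = max(transfer_latency(s, d, scale=2)
                      for s in range(16) for d in range(16))
```

Under these assumptions, two clusters on the same crossbar are 2 cycles apart, neighbouring sets are 6 cycles apart, and the most distant pair is 1 + 2 x 4 + 1 = 10 cycles apart, or 20 cycles when all latencies are doubled.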
We use 21 of the 26 SPEC-2k programs as a benchmark
set (see footnote 2). The programs were fast-forwarded for two
billion instructions, simulated in detail for a million 
instructions to warm up various structures, and then 
measured over the next 100 million instructions. The 
reference input set was used for all programs.
To understand the inherent advantages of the criticality
predictor, the simulations were performed without
any address prediction. This is because address
prediction adds to the benefits of the existing method­
ology and the combined results would not reflect the 
use of the criticality predictor clearly. At the end of the 
section, potential benefits of the criticality predictor 
with address prediction are discussed. For all simula­
tions, three different interconnect configurations have 
been used. In the base (high-performance) case, the 
interconnects between the clusters are optimized for 
speed and allow two transfers every cycle. To model a 
processor that has interconnects optimized for power, 
we simulate a model (low-power) like the base case, 
but with wire latencies twice as much as those in the 
high-performance case. The difference in the average 
IPC between these two cases was found to be 21%, 
underlining the drawback of an approach that focuses
solely on low power. The criticality-based case
assumes a heterogeneous interconnect with a combination
of both the power optimized and the performance
optimized wires. In every cycle, we can start
one transfer each on the performance-optimized link 
and on the power-optimized link. The latency of the 
performance-optimized link is like that of the high- 
performance case, while the latency of the power- 
optimized link is like that of the low-power case. In 
the criticality-based case, all store data goes through 
the power optimized wires while all load data utilizes 
the fast wires. The ready operands, operands which 
are available for the consuming instruction at the time 
of dispatch, are all sent through the power optimized 
wires. The criticality predictor is used to choose between
the fast and the slow interconnect for bypassed
register values.
Footnote 2: Sixtrack, Facerec, and Perlbmk were not compatible with our
simulation infrastructure, while Ammp and Mcf were too memory-bound
to be affected by processor optimizations.
[Figure 2 plot title: IPC Analysis]
Figure 2. Graph showing performance of the
criticality-based approach, compared to the
high-performance and low-power cases.
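The steering rules of the criticality-based case can be summarized as a small decision function. The Python sketch below is only a restatement of those rules; the enum, function, and parameter names are hypothetical, and the `address_was_predicted` flag reflects the address prediction extension of Section 3.4, which is switched off in the main simulations.

```python
from enum import Enum


class Transfer(Enum):
    BYPASSED_REGISTER = "bypassed register value"
    READY_REGISTER = "ready register operand"
    LOAD_DATA = "load data"
    STORE_DATA = "store data"
    EFFECTIVE_ADDRESS = "effective address"


FAST = "performance-optimized"
SLOW = "power-optimized"


def choose_network(kind, predicted_critical=True, address_was_predicted=False):
    """Pick the wires for one inter-cluster transfer."""
    if kind in (Transfer.STORE_DATA, Transfer.READY_REGISTER):
        return SLOW   # always treated as non-critical
    if kind is Transfer.LOAD_DATA:
        return FAST   # always sent on the fast wires in this study
    if kind is Transfer.BYPASSED_REGISTER:
        # The criticality predictor decides for bypassed register values.
        return FAST if predicted_critical else SLOW
    # Effective addresses: slow only when they merely verify a prediction.
    return SLOW if address_was_predicted else FAST


example = choose_network(Transfer.BYPASSED_REGISTER, predicted_critical=False)
```

Only bypassed register values consult the predictor; every other category is steered statically, which keeps the per-transfer decision simple.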
4.2 Effects on Performance
To understand the performance of the criticality pre­
dictor, we refer to Figures 2 and 3. In Figure 2, the 
grey bars depict IPC for the low-power case, the black 
bars depict the improvements in IPC obtained by mov­
ing to the criticality-based case, and the white bars 
show the difference in IPC between the criticality- 
based and high-performance cases. The graph clearly 
shows that increasing the delay for every single transfer
has a much larger impact on performance than increasing
the delay for selected transfers based on criticality.
Figure 3 shows the IPC gap between the high-performance
and criticality-based cases and the corresponding
percentage of transfers that get sent on the
low-power interconnect. Note that these models as­
sume no address prediction techniques, requiring that 
all effective addresses be sent on the low-latency inter­
connect.
The difference in the average IPC between the high- 
performance and low-power cases is 21%. For the 
criticality-based case, the criticality predictor is em­
ployed only for bypassed register transfers. The per­
centage of bypassed register transfers that get sent on 
the low-power network is roughly 29%. In addition, 
all store data and ready register operands get sent on 
the low-power network. From Figure 3, we see that
about 36.5% of all transfers can happen on the low-power
network. The overall loss in IPC is only 2.5%,
showing that we have identified a very favorable subset
of transfers that are relatively latency insensitive.
[Figure legend: Effective addresses predicted; Store data; Ready Regs; Byp Regs]
Figure 3. Plot showing the percentage of all
transfers that happen on the low-power interconnect
(not including predicted effective addresses)
and the corresponding performance loss.
In eon and vortex, performance increases slightly 
compared to the base case -  the ability of the 
criticality-based interconnect to accommodate more
data reduces contention in the links. Gap is the only
program where the unpredictable nature of the code results
in an 11% IPC loss because of inaccuracies in the
criticality predictor.
This study highlights the importance of consider­
ing the design of wires with different levels of perfor­
mance and power in different parts of the processor. 
Figure 2 shows us that ready register transfers tend not 
to be on the critical path and hence can use the slower 
and power optimized wires. It also tells us that the 
design of bypasses requires careful performance and 
power considerations. The use of a criticality predic­
tor helps us steer data to preserve performance while 
reducing power.
Studies [4] have shown that around 52% of effective 
addresses have high confidence predictions. Trans­
ferring high confidence address predictions on the 
criticality-based interconnect can potentially achieve 
greater power savings. Figure 4 depicts the different 
kinds of transfers that happen on the interconnect as a 
fraction of the total transfers. We see that high confidence
predictions that go through the criticality-based
links account for 12.5% of the total transfers. This is
in addition to the already existing non-critical register
and store data transfers. Overall, we have 49% of the
traffic (including address prediction values) through
the power optimized links.
Figure 4. Graph showing different kinds of
non-critical transfers as a fraction of total
transfers.
In future communication-bound processors, we ex­
pect more than 50% of chip power to be consumed 
within the interconnects. Assuming that our approach 
targets roughly 80% of on-chip interconnects and that 
the low-power wires consume half as much power as 
the high-power wires, overall chip power should re­
duce by about 10% with our optimizations.
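The arithmetic behind this estimate is not spelled out in the text. One reading that reproduces the figure, offered here as an assumption rather than as the authors' exact derivation, multiplies the interconnect share of chip power, the fraction of interconnects targeted, the fraction of transfers off-loaded, and the power saved per off-loaded transfer. The function and parameter names are ours.

```python
def chip_power_saving(interconnect_share, targeted_share,
                      offloaded_share, low_power_ratio):
    """Fraction of total chip power saved, under the stated assumptions."""
    saved_per_transfer = 1.0 - low_power_ratio
    return (interconnect_share * targeted_share
            * offloaded_share * saved_per_transfer)


# 36.5% of transfers without address prediction plus 12.5% verified addresses.
offloaded = 0.365 + 0.125

saving = chip_power_saving(
    interconnect_share=0.50,  # interconnects consume about half of chip power
    targeted_share=0.80,      # fraction of on-chip interconnects targeted
    offloaded_share=offloaded,
    low_power_ratio=0.5,      # low-power wires use half the power
)
```

With these inputs the product is 0.5 x 0.8 x 0.49 x 0.5 = 0.098, which agrees with the reduction of about 10% quoted above.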
5. Related Work
Tune et al. [24] and Srinivasan et al. [22] developed 
several techniques for dynamically predicting the criti­
cality of instructions. Seng et al. [21] used the QOLD 
heuristic suggested by Tune to find critical instruc­
tions and redirect them to different units optimized for 
power and performance. The QOLD heuristic predicts 
an instruction to be critical if it reaches the top of the 
issue queue, i.e. the instruction has been waiting for a 
long time for its operands. They used slow low power 
units for executing the non-critical instructions to limit 
the performance impact. A recent study by Balasubra­
monian et al. [6] evaluates the design of heterogeneous 
cache banks and the use of criticality to assign instruc­
tions and data to each bank.
Magen et al. [16] show that around 50% of a processor’s
dynamic power is dissipated in the interconnect
alone. They characterize interconnect power in a cur­
rent microprocessor designed for power efficiency and 
propose power aware routing algorithms to minimize 
power in the interconnect. They show that tuning in­
terconnects for power optimizations could yield good 
results.
Several microarchitectural power simulators have 
been proposed to date. Recently, Wang et al. [25] pro­
posed an interconnection power-performance simula­
tor to study on-chip interconnects in future processors.
6. Conclusions
The power consumed by on-chip interconnects is al­
ready a major contributor to total chip power [16]. Fu­
ture billion transistor architectures are likely to expend 
significantly more power transferring values between 
the different computational units. Various techniques, 
such as fewer repeaters, low-capacitance drivers, low 
voltage swings, etc., can be employed to reduce the 
power within the interconnects, but at the cost of 
longer wire delays.
Our results show that a large fraction of on-chip 
communications are latency tolerant. This makes the 
case for a heterogeneous interconnect, where some of 
the wires are optimized for high speed and the others 
for low power. Assuming such a heterogeneous inter­
connect is possible, we evaluate its potential to limit 
performance degradation, while effecting a majority of 
transfers on the power-efficient network. Our results 
show that latency tolerant non-critical transfers can be 
classified as follows: (i) register values that are not ur­
gently read by consuming instructions (accounting for 
32.3% of all interconnect transfers), (ii) store data that 
is often not forwarded to subsequent loads (account­
ing for 4.1% of all transfers), (iii) predicted load and 
store effective addresses that are being transferred only 
for verification purposes (12.8% of all transfers). As 
a result, roughly 49% of all inter-cluster communica­
tions can be off-loaded to power-efficient wires while 
incurring a performance loss of only 2.5%. If the inter­
connect is entirely composed of power-efficient long- 
latency wires, the performance degradation is as high 
as 21%. Thus, a heterogeneous interconnect allows 
us to strike a better balance between overall processor 
performance and power.
This paper serves as a preliminary evaluation of the
potential of a heterogeneous interconnect. A more de­
tailed analysis of the power-performance trade-offs in 
interconnect design will help us better quantify the 
performance and power effect of such an approach. 
The techniques proposed here can be further extended 
to apply to other processor structures, such as the 
ALUs, register files, issue queues, etc.
References
[1] V. Agarwal,M. Hrishikesh, S. Keckler, andD. Burger. 
Clock Rate versus IPC: The End of the Road for 
Conventional Microarchitectures. In Proceedings o f 
ISCA-27, pages 248-259, June 2000.
[2] A. Aggarwal and M. Franklin. An Empirical Study 
of the Scalability Aspects of Instruction Distribution 
Algorithms for Clustered Processors. In Proceedings 
of ISPASS, 2001.
[3] A. Aggarwal and M. Franklin. Hierarchical
Interconnects for On-Chip Clustering. In Proceedings of
IPDPS, April 2002.
[4] R. Balasubramonian. Cluster Prefetch: Tolerating 
On-Chip Wire Delays in Clustered Microarchitectures.
In Proceedings of ICS-18, June 2004.
[5] R. Balasubramonian, S. Dwarkadas, and D. Albonesi.
Dynamically Managing the Communication-Parallelism
Trade-Off in Future Clustered Processors.
In Proceedings of ISCA-30, pages 275-286, June
2003.
[6] R. Balasubramonian, V. Srinivasan, and
S. Dwarkadas. Hot-and-Cold: Using Criticality 
in the Design of Energy-Efficient Caches. In 
Workshop on Power-Aware Computer Systems, in 
conjunction with MICRO-36, December 2003.
[7] K. Banerjee and A. Mehrotra. A Power-optimal
Repeater Insertion Methodology for Global
Interconnects in Nanometer Designs. IEEE Transactions
on Electron Devices, 49(11):2001-2007, November
2002.
[8] A. Baniasadi and A. Moshovos. Instruction
Distribution Heuristics for Quad-Cluster, Dynamically-
Scheduled, Superscalar Processors. In Proceedings
of MICRO-33, pages 337-347, December 2000.
[9] R. Barua, W. Lee, S. Amarasinghe, and A. Agarwal.
Maps: A Compiler-Managed Memory System
for Raw Machines. In Proceedings of ISCA-26, May
1999.
[10] D. Burger and T. Austin. The Simplescalar Toolset, 
Version 2.0. Technical Report TR-97-1342,
University of Wisconsin-Madison, June 1997.
[11] R. Canal, J. M. Parcerisa, and A. Gonzalez. Dynamic 
Cluster Assignment Mechanisms. In Proceedings of 
HPCA-6, pages 132-142, January 2000.
[12] K. Farkas, P. Chow, N. Jouppi, and Z. Vranesic. 
The Multicluster Architecture: Reducing Cycle Time 
through Partitioning. In Proceedings of MICRO-30,
pages 149-159, December 1997.
[13] E. Gibert, J. Sanchez, and A. Gonzalez. Effective 
Instruction Scheduling Techniques for an Interleaved 
Cache Clustered VLIW Processor. In Proceedings of
MICRO-35, pages 123-133, November 2002.
[14] R. Ho, K. Mai, and M. Horowitz. The Future of 
Wires. Proceedings of the IEEE, Vol. 89, No. 4, April
2001.
[15] U. Kapasi, W. Dally, S. Rixner, J. Owens, and 
B. Khailany. The Imagine Stream Processor. In
Proceedings of ICCD, September 2002.
[16] N. Magen, A. Kolodny, U. Weiser, and N. Shamir.
Interconnect Power Dissipation in a Microprocessor. In
Proceedings of System Level Interconnect Prediction,
February 2004.
[17] R. Nagarajan, K. Sankaralingam, D. Burger, and 
S. Keckler. A Design Space Evaluation of Grid
Processor Architectures. In Proceedings of MICRO-34,
pages 40-51, December 2001.
[18] K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, 
and K.-Y. Chang. The Case for a Single-Chip
Multiprocessor. In Proceedings of ASPLOS-VII, October
1996.
[19] S. Palacharla, N. Jouppi, and J. Smith. Complexity- 
Effective Superscalar Processors. In Proceedings of 
ISCA-24, pages 206-218, June 1997.
[20] J. Sanchez and A. Gonzalez. Modulo Scheduling for 
a Fully-Distributed Clustered VLIW Architecture. In
Proceedings of MICRO-33, pages 124-133, December
2000.
[21] J. S. Seng, E. S. Tune, and D. M. Tullsen. Reducing
Power with Dynamic Critical Path Information. In
Proceedings of the 34th International Symposium on
Microarchitecture, December 2001.
[22] S. Srinivasan, R. Ju, A. Lebeck, and C. Wilkerson. 
Locality vs. Criticality. In Proceedings of ISCA-28,
pages 132-143, July 2001.
[23] J. Steffan and T. Mowry. The Potential for
Using Thread Level Data-Speculation to Facilitate
Automatic Parallelization. In Proceedings of HPCA-4,
pages 2-13, February 1998.
[24] E. Tune, D. Liang, D. Tullsen, and B. Calder.
Dynamic Prediction of Critical Path Instructions. In
Proceedings of HPCA-7, pages 185-196, January 2001.
[25] H.-S. Wang, X. Zhu, L.-S. Peh, and S. Malik. Orion: 
A Power-Performance Simulator for Interconnection 
Networks. In Proceedings of the 35th International
Symposium on Microarchitecture, November 2002.
[26] V. Zyuban and P. Kogge. Inherently Lower-Power 
High-Performance Superscalar Architectures. IEEE 
Transactions on Computers, March 2001.
