Reducing wire delay penalty through value prediction by Parcerisa Bundó, Joan Manuel & González Colás, Antonio María
Reducing Wire Delay Penalty through Value Prediction 
Joan-Manuel Parcerisa and Antonio González
Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya
c/. Jordi Girona, 1-3 Mòdul C6
08034 Barcelona, Spain 
{jmanel,antonio}@ac.upc.es
Abstract 
In this work we show that value prediction can be used 
to avoid the penalty of long wire delays by predicting the 
data that is communicated through these long wires and 
validating the prediction locally where the value is pro- 
duced. Only in the case of misprediction, the long wire de- 
lay is experienced. 
We apply this concept to a clustered microarchitecture
in order to reduce inter-cluster communication. The predictability
of values provides the dynamic instruction partitioning
hardware with fewer constraints to optimize the
trade-off between communication requirements and work- 
load balance, which is the most critical issue of the parti- 
tioning scheme. We show that value prediction reduces the 
penalties caused by inter-cluster communication by 18% 
on average for a realistic implementation of a 4-cluster mi- 
croarchitecture. 
1. Introduction 
Recent studies point out that two major problems for 
scaling-up current superscalar microarchitectures will be 
the growing impact of wire delays [1, 2, 13], and the increasing
complexity of some critical components, such as
the issue logic, the bypass, the register file and the rename 
logic [16], since they may have a direct influence on the 
clock cycle time. 
One of the proposed solutions to this problem is based 
on clustering. In a clustered microarchitecture some of the 
critical components are partitioned into simpler structures, 
and the impact of wire delays is reduced as far as signals 
are kept local within the clusters. In a clustered architec- 
ture, deciding which instructions are executed in each clus- 
ter becomes a key issue. We will refer to this task as code 
partitioning. A code partitioning scheme determines how 
the dynamic instruction stream is split among the different 
clusters. Data dependences among instructions in different 
partitions correspond to inter-cluster communications, 
which use long wires and have a high associated latency. In
this work we focus on dynamic mechanisms for partitioning
the instruction stream, implemented through a small piece of
hardware that steers instructions to clusters: the steering logic.
0-7695-0924-X/00 $10.00 © 2000 IEEE
Minimizing the impact of inter-cluster communication 
delays is one of the main objectives of any code partition- 
ing scheme. The solution that we propose in this work is to 
eliminate data dependences that cross the partition bound- 
aries by predicting the values that flow among them. Value 
prediction has been largely investigated in the context of 
superscalar processors, and it is not our purpose to design 
another predictor but to investigate its potential to reduce 
slow inter-cluster communications on a clustered architec- 
ture, and to provide a new source of performance improve- 
ments. In this paper we show that value prediction can sig- 
nificantly improve the performance of the steering logic by 
providing a less dense data dependence graph which results 
in less communication requirements and better opportuni- 
ties to balance the workload. 
It is known that the IPC of a clustered architecture is 
lower than that of an equivalent centralized organization 
without the inter-cluster communication delays. It is also 
expected that value prediction may increase the IPC in both 
cases. However, we show that the clustered architecture 
benefits from value prediction more than a centralized one, 
since value prediction removes some inter-cluster commu- 
nications. In particular, we show that the IPC degradation 
caused by inter-cluster communications can be reduced by 
18% through a simple value prediction scheme when the 
steering logic is designed to take advantage of the value 
predictor. 
The rest of this paper is organized as follows. Section 
2 presents the assumed clustered architecture. Section 3 
presents the partitioning heuristic implemented by the 
steering logic, and its specific adaptations to take advan- 
tage of value prediction, and provides a performance eval- 
uation. In Section 4, a sensitivity analysis regarding com- 
munication latency, communication bandwidth and predic- 
tor table size is performed. Section 5 reviews other related 
work. Finally, the main conclusions are summarized in
Section 6. 
2. Microarchitecture 
The target processor microarchitecture is a clustered 
implementation of an 8-way out-of-order issue superscalar
processor with a 6-stage pipeline (fetch, decode, issue, execute,
writeback and commit). The processor front-end
(fetch and decode stages) is a centralized structure, and we 
assume that it has an aggressive instruction fetch mecha- 
nism to stress the instruction issue and execution sub- 
systems. The processor core is divided into N homoge- 
neous clusters: each cluster has its own instruction queue, 
a physical register file, a set of functional units, and the cor- 
responding data bypasses among these functional units. We 
experiment with several configurations (targeting different 
technologies and clock rates) having 1, 2 and 4 clusters. 
While the register file access time and issue time are as- 
sumed to be constant in all cases, structure sizes are scaled 
down with the degree of clustering. Therefore, register files 
have respectively 128, 80 and 56 physical registers per 
cluster, and instruction queue lengths are 64, 32 and 16 entries.
The reorder buffer length (128 entries), the total number
of functional units (8), and the total issue width (8) are
kept constant through all the configurations. The main architectural
parameters are described in Section 2.4.
Local bypasses within a cluster are responsible for for- 
warding result values produced in the cluster to the inputs 
of the functional units in the same cluster. A local bypass 
takes 0 cycles, i.e. a value produced in cycle i can be an input
of a local functional unit in cycle i+1. Inter-cluster bypasses
are responsible for forwarding values among functional
units of different clusters. Since inter-cluster bypasses
require long wires, they will likely take several cycles in
future technologies [1]. Therefore, we have assumed a one-cycle
latency for inter-cluster bypasses in the basic configurations,
although we also evaluate the effects of longer latencies.
Latency is not the only penalty of inter-cluster
communications. Also bandwidth is relevant, since it di- 
rectly affects the number of register file write ports, and the 
complexity of the bypass logic. We first assume an un- 
bounded number of interconnection paths in order to iso- 
late our experiments from the effect of possible bandwidth 
bottlenecks, and then we evaluate the effects of having a 
limited inter-cluster communication bandwidth. 
2.1. Handling register copies 
In a processor with N clusters, instructions are renamed
at the decode stage by means of a register map table
with N fields per logical register, which allows up to N different
mappings of the same register. An additional bit per
field indicates whether the mapping is valid; if so, the field
points to a physical register in the corresponding cluster.
All logical registers must have at least one valid mapping.
Each cluster has a pool of free physical registers, from
which registers are allocated when needed.
Figure 1. Example of renaming 3 instructions: (a) after renaming I1 (Rx ← Ra), (b) after renaming I2 (Rb ← Rx), (c) after renaming I3 (Rx ← Ra). I2 requires copying Rx from cluster n to cluster m.
When an instruction I1 is decoded (see Figure 1(a)),
and it is assigned to cluster n by the steering logic, its 
source operands are renamed by looking at the field n of the 
map table. If the instruction has a destination register Rx, a 
new free physical register is allocated from the free-list of 
cluster n, the new mapping is written in the field n of the 
map table, and the other fields are set invalid, to denote that 
Rx is not currently mapped to any physical register in these 
clusters. 
Let us assume that a subsequent dependent instruction 
I2, that reads register Rx, is decoded and steered to cluster
m, different from n (see Figure 1(b)). When I2 is renamed,
the field m of the map table entry for register Rx is found 
invalid. Normal instructions are not allowed to access the 
register files of remote clusters. Instead, they require re- 
mote operands to be copied from one register file to another 
by means of special copy instructions generated on demand 
during the renaming stage. Therefore, in this example a 
new physical register is allocated in cluster m, to store the 
copy of register Rx for future reuse, and its mapping is 
written in the field m of the map table entry for Rx. This 
field becomes valid, and its mapping is used to rename the 
source operand of I2. Then, a copy instruction is dispatched
to cluster n. This instruction will forward the value of the 
physical register in cluster n to the physical register in clus- 
ter m. This copy instruction will be handled by the issue 
logic as any other instruction, i.e., it will be executed once 
its source operand and the needed resources are available. 
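The renaming steps described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the class name `ClusteredRenamer`, the method names, the tuple layout of a copy, and the policy of taking the lowest-numbered free register are all hypothetical, and register freeing at commit is not modelled.

```python
class ClusteredRenamer:
    """Toy map table with one mapping field per cluster (None = invalid)."""

    def __init__(self, num_clusters, regs_per_cluster):
        self.num_clusters = num_clusters
        # One private free list of physical registers per cluster.
        self.free = [list(range(regs_per_cluster)) for _ in range(num_clusters)]
        self.map_table = {}  # logical register -> list of per-cluster fields

    def _alloc(self, cluster):
        # Assumption: take the lowest-numbered free register.
        return self.free[cluster].pop(0)

    def set_initial(self, logical, cluster):
        """Give a logical register its single initial valid mapping."""
        fields = [None] * self.num_clusters
        fields[cluster] = self._alloc(cluster)
        self.map_table[logical] = fields

    def rename(self, dest, sources, cluster):
        """Rename one instruction steered to `cluster`.

        Returns (renamed sources, destination register, copies), where each
        copy is (from_cluster, from_reg, to_cluster, to_reg).
        """
        copies, renamed = [], []
        for src in sources:
            fields = self.map_table[src]
            if fields[cluster] is None:
                # Operand not mapped locally: allocate a local register and
                # generate a copy instruction for a cluster that holds it.
                home = next(c for c, p in enumerate(fields) if p is not None)
                fields[cluster] = self._alloc(cluster)
                copies.append((home, fields[home], cluster, fields[cluster]))
            renamed.append(fields[cluster])
        dest_phys = None
        if dest is not None:
            # New mapping is valid only in the target cluster.
            dest_phys = self._alloc(cluster)
            fields = [None] * self.num_clusters
            fields[cluster] = dest_phys
            self.map_table[dest] = fields
        return renamed, dest_phys, copies
```

With two clusters, renaming `Rx <- Ra` in cluster 0 and then `Rb <- Rx` in cluster 1 makes the second call return one copy from cluster 0 to cluster 1, mirroring the I1/I2 example in the text.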
Note that during the renaming of an instruction, just 
one physical register for its destination register is allocated. 
Additional physical registers to store copies of it in other 
clusters are only allocated on demand if they are required 
by subsequent instructions that do not execute in the same 
cluster. All these physical registers will be freed by the first 
subsequent instruction that writes to the same logical register,
when it is committed (see instruction I3 in Figure 1(c)).
This scheme requires some degree of register replication,
which dynamically adapts to the program requirements and
is much lower than replicating the whole register file. Compared
with a full replication scheme, it also has lower communication
requirements and thus needs fewer inter-cluster bypass
paths and fewer register file write ports.
If the processor has a limited number of inter-cluster 
bypass paths, they must be reserved by the issue mecha- 
nism like any other resource. Copy instructions provide a 
simple mechanism to allocate the required bypasses and 
schedule inter-cluster communications. They also provide 
a simple method for precise state recovery, since copy in- 
structions are inserted in the reorder buffer like normal in- 
structions. However, since a copy instruction makes the de- 
pendence chain one node longer, it increases by one cycle 
the total effective latency between the producer and the re- 
mote dependent instruction (in addition to the bus latency). 
A particular implementation could optimize this, either by 
shortening the tags propagation delay between clusters or 
by implementing specific hardware that avoids generating 
copy instructions. However, we have not assumed any of 
these optimizations in this work. 
2.2. Value prediction 
The microarchitecture implements a stride value predictor
[8, 9, 19] that predicts the source operands of the instructions.
There is a value prediction table indexed by the
PC and the operand order (left/right). We first assume a
very large table (128K entries) to isolate the results from
the effects of a limited table size, and we later evaluate the
impact of a table with sizes ranging from 1K to 16K entries.
Each entry contains the last value, the last observed stride 
and a 2-bit counter that assigns confidence to the predic- 
tion. Since each prediction involves a table access and an 
addition, we assume that value predictions are available 1 
cycle after the fetch, i.e. at the decode stage. Table updates 
are done at decode time. 
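As an illustration of the table organization just described, the following Python sketch models one possible stride predictor. Only the entry contents (last value, last stride, 2-bit confidence counter) and the confidence test (counter greater than 1) come from the text; the index function, the untagged table, and the saturating increment/decrement update policy are assumptions made for the example, and pipeline timing is ignored.

```python
class StridePredictor:
    """Toy stride value predictor for source operands."""

    def __init__(self, num_entries=1024):
        self.num_entries = num_entries
        self.table = {}  # index -> [last_value, stride, confidence]

    def _index(self, pc, operand):
        # Assumption: operand is 0 (left) or 1 (right); simple modulo index.
        return (pc * 2 + operand) % self.num_entries

    def predict(self, pc, operand):
        """Return (predicted_value, is_confident)."""
        entry = self.table.get(self._index(pc, operand))
        if entry is None:
            return None, False
        last, stride, conf = entry
        return last + stride, conf > 1  # confident if the counter exceeds 1

    def update(self, pc, operand, actual):
        """Train the entry with the observed operand value."""
        idx = self._index(pc, operand)
        entry = self.table.get(idx)
        if entry is None:
            self.table[idx] = [actual, 0, 0]
            return
        last, stride, conf = entry
        if last + stride == actual:
            conf = min(conf + 1, 3)   # saturate at the 2-bit maximum
        else:
            conf = max(conf - 1, 0)   # assumed penalty for a wrong prediction
        entry[:] = [actual, actual - last, conf]
```

Under these assumptions, an operand that takes the values 10, 14, 18, 22 becomes confidently predictable only after the stride of 4 has been confirmed twice.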
When a source operand is not yet available at decode 
time, and its predicted value is confident (the counter value 
is greater than 1), the instruction is dispatched speculatively
and may use the predicted value. The instruction that will
produce this value is identified, and it is assigned the task 
of verifying that its output matches the prediction. The ver- 
ification occurs during the writeback stage of the producer 
instruction, and it takes one cycle. If it fails, the dependent 
misspeculated instruction is invalidated and reissued. 
We have assumed a selective invalidation and reissue 
mechanism [17], i.e. after the mispredicted instruction is
reissued and executed, a new value is produced and propa- 
gated to dependent instructions, which in turn reissue, and 
so on. Only the instructions that depend on the mispredict- 
ed instruction are invalidated. The mechanism is in fact the 
existing issue mechanism, and therefore we have assumed 
no additional penalty for each instruction restart. 
For a clustered architecture, this speculation procedure
is further extended in order to reduce inter-cluster communications.
The extension applies to the case when a source
operand is not currently mapped in the cluster where the
instruction is being dispatched. In this case, the operand is
predicted regardless of whether it is available, the instruction
is dispatched speculatively, and a special verification-copy
instruction is dispatched to the cluster where the operand
is produced. When issued, the verification-copy
compares the operand locally with the predicted value and,
only in case of a mismatch, forwards the correct value
through an inter-cluster bypass, and the remote misspeculated
instruction is reissued.
2.3. Steering logic 
Code partitioning can be done at compile time (static)
or at run time (dynamic). The first method relies on the
compiler, which allocates each static instruction to a cluster,
while the second method is based on specialized hardware
that decides where to distribute each dynamic instruction.
The main advantage of static partitioning is that it requires
minimal hardware support, but its downside is that
applications must be recompiled, because it extends
the ISA to encode the steering information. Furthermore,
applications must be recompiled again for each new microarchitecture
generation that changes the number of clusters.
In contrast, a dynamic partitioning method does not require
recompilation, because it makes clustering transparent
to the compiler. In addition, the information used by the dynamic
steering logic (workload balance, data dependences)
is obtained directly from the actual pipeline state, rather
than from compiler estimates. Therefore, a dynamic
steering scheme is more effective than a static approach because
it is more adaptable to the actual processor state. This
work focuses on this type of steering.
In order to maximize performance, the dynamic steer- 
ing logic must address two main goals: to minimize inter- 
cluster communications (or their associated penalties) and 
to maximize the workload balance. 
On one hand, inter-cluster communications introduce 
delays between dependent instructions, which may result in 
a performance loss if they lie on the critical path of execution.
Determining whether a communication is critical is a
hard problem; therefore, a simpler goal for the steering
heuristic is to minimize the number of communications.
On the other hand, when there are more ready instruc- 
tions in a cluster than functional units to execute them, the 
excess of instructions are forced to wait, incurring an addi- 
tional delay. If at the same time, another cluster has idle 
functional units, this additional delay would have been 
avoided if the steering logic had sent some instructions to a 
different cluster. We refer to this situation as a workload 
imbalance among clusters, and since it may potentially de- 
grade the performance, a major goal of the steering logic is 
to prevent it from happening. 
Intuitively, both goals (reducing communications and 
balancing workload) are sometimes conflicting, and there- 
fore a good steering algorithm must find the optimal trade- 
off between them. We outline below how these two issues 
are addressed by the steering logic from a conceptual 
standpoint. Particular steering techniques are defined in 
Section 3.
2.3.1. Communication. The valid bit associated to each 
field of the map table indicates whether the logical register 
may be directly read in the corresponding cluster without 
requiring a communication. Therefore, the steering logic 
uses this information to minimize communications by 
choosing a cluster where all or most of the source operands 
of an instruction are currently mapped. In some cases, 
when an operand is mapped in more than one cluster due to 
previously dispatched copy instructions, but the value is 
not yet available, the choice of clusters should be narrowed 
to the cluster where the value will be available sooner, to 
avoid the instruction being needlessly delayed by a com- 
munication. 
2.3.2. Workload balance. To improve the workload balance,
the steering logic must detect when there is a workload
imbalance and how large it is, and must
also determine which is the least loaded cluster. There are
many alternatives to determine at run-time the individual 
workloads of the clusters and their relative workload im- 
balance. In other words, there are several figures that can 
be used to measure these features. From the description 
given above, we intuitively define the workload imbalance 
at a given instant of time as the total number of ready in- 
structions that cannot issue, due to having exceeded the is- 
sue width in their respective clusters, but could have issued 
in other clusters since they have idle functional units. This 
figure (we will refer to it as metric NREADY), is what we 
report in our experiments as “workload imbalance”, be- 
cause it corresponds to our definition. However, we also 
experimented with several other imbalance figures to guide the
steering decisions, and found the following scheme (we 
will refer to it as metric DCOUNT) to give the best perfor- 
mance: 
The processor has a signed counter in each of the N clus- 
ters that measures its workload. Its value is initially zero, 
and it is updated in the following way: for every instruc- 
tion dispatched to a cluster, the corresponding counter in 
that cluster is increased by N-1, while the other N-1 
counters are decreased by 1 (i.e. the sum of the counters 
is kept always zero). Therefore, the value stored in the 
counter of a given cluster is N times the difference be- 
tween the total number of instructions dispatched to that 
cluster and the average number of instructions dis- 
patched per cluster. The workload imbalance is calculat- 
ed as the maximum absolute value of the workload 
counters. Note also that in the case of two clusters, a sin- 
gle counter will suffice. 
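The counter update rule above can be made concrete with a short Python sketch. The class and method names are hypothetical; the update rule (add N-1 to the target cluster, subtract 1 from the others) and the imbalance definition (maximum absolute counter value) follow the description.

```python
class DCount:
    """Toy model of the DCOUNT workload counters, one per cluster."""

    def __init__(self, num_clusters):
        self.n = num_clusters
        self.counters = [0] * num_clusters

    def dispatch(self, cluster):
        # +(N-1) for the chosen cluster, -1 for every other cluster,
        # so the counters always sum to zero.
        for c in range(self.n):
            self.counters[c] += (self.n - 1) if c == cluster else -1

    def imbalance(self):
        # Workload imbalance: maximum absolute counter value.
        return max(abs(v) for v in self.counters)

    def least_loaded(self, candidates=None):
        # Cluster with the smallest counter; ties go to the lowest index
        # (the tie-breaking rule is an assumption).
        candidates = range(self.n) if candidates is None else candidates
        return min(candidates, key=lambda c: self.counters[c])
```

For example, with 4 clusters, three dispatches to cluster 0 and one to cluster 1 give counters 8, 0, -4, -4: cluster 0 received 3 instructions against an average of 1, and 4 x (3 - 1) = 8, as the text states.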
The NREADY figure matches more exactly our defi- 
nition of workload balance. However, when it is used by 
the steering logic, the actions taken to compensate a work- 
load imbalance (sending instructions to the least loaded 
cluster) may not update immediately the NREADY figure, 
if some of the steered instructions are not ready. When this 
occurs, the corrective action may be disproportionate,
and cause an imbalance in another direction or some un- 
necessary inter-cluster communications. This does not hap- 
pen with the DCOUNT figure, since it varies instantly and 
in proportion to the steering decisions, which allows the 
steering logic to gauge more accurately the actions to com- 
pensate a workload imbalance. Thus, the steering logic 
uses the DCOUNT figure to determine balancing actions 
and we use the NREADY figure to measure and report 
workload balance. 
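One possible way to compute the NREADY figure for a single cycle is sketched below. It is an interpretation, not the simulator's code: it assumes a uniform per-cluster issue width, ignores instruction types (integer versus floating point), and counts a stalled ready instruction as "could have issued elsewhere" only while idle issue slots remain in other clusters. The function name `nready` is hypothetical.

```python
def nready(ready_per_cluster, issue_width):
    """Ready instructions that cannot issue locally but could issue
    in another cluster with idle issue slots (one-cycle snapshot)."""
    # Ready instructions beyond the local issue width must wait.
    excess = sum(max(r - issue_width, 0) for r in ready_per_cluster)
    # Unused issue slots in clusters with fewer ready instructions.
    idle = sum(max(issue_width - r, 0) for r in ready_per_cluster)
    # Only as many stalled instructions as there are idle slots are counted.
    return min(excess, idle)
```

For instance, with an issue width of 2 and ready counts of 5, 0, 1 and 2, three instructions are stalled in the first cluster while three slots are idle elsewhere, so the figure is 3; if every other cluster were also saturated, it would be 0.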
2.4. Experimental framework 
We perform our microarchitectural timing simulations 
with a modified version of the SimpleScalar tool set [3],
version 3.0. It was extended to include register renaming 
through a physical register file, instruction queues (sepa- 
rate integer and FP), stride value prediction, steering logic, 
and a clustered processor core. 
Three different configurations were simulated, having 
1, 2 and 4 clusters respectively. Each was simulated with 
and without value prediction. The total issue width, number 
| Parameter | 1 Cluster config. | 2 Clusters config. | 4 Clusters config. |
|---|---|---|---|
| Fetch, decode & retire width | 8 instructions | | |
| Branch predictor | Combined predictor of 1K entries with a Gshare with 64K 2-bit counters, 16 bit global history, and a bimodal predictor of 2K entries with 2-bit counters | | |
| ROB size | 128 | | |
| Instruction queue size | 64 | 32 | 16 |
| Functional units | 8 int (4 include mul/div); 4 fp (2 include fp mul/div) | 4 int (2 include mul/div); 2 fp (include fp mul/div) | 2 int (1 includes mul/div); 1 fp (includes fp mul/div) |
| Issue width | 8 int / 4 fp | 4 int / 2 fp | 2 int / 1 fp |
| [row label missing in source] | Out-of-order issue. Loads may execute when prior store addresses are known | | |
| Communications | 1-cycle latency. Communications consume issue width and instruction queue entries | | |
| Register file sizes | 128 | 80 | 56 |
| I-cache L1 | 64KB, 2-way set-associative, 32 byte lines, 1 cycle hit time, 6 cycle miss penalty | | |
| D-cache L1 | 64KB, 2-way set-associative, 32 byte lines, 1 cycle hit time, 6 cycle miss penalty, 3 R/W ports | | |
| I/D-cache L2 | 256 KB, 4-way set associative, 64 byte lines, 6 cycle hit time | | |
| Memory | 18 bytes bus bandwidth to main memory, 18 cycles first chunk, 2 cycles interchunk | | |
Rows with a single value apply to all configurations.
Table 2. The Mediabench benchmark suite
| program | input | instr. count (millions) | description |
|---|---|---|---|
| cjpeg | testimg.ppm | 18.8 | image |
| djpeg | testimg.jpg | 6.0 | image |
| epicdec | test_image.pgm.E | 11.1 | image |
| epicenc | test_image.pgm | 70.6 | image |
| g721enc | clinton.pcm | 440.6 | audio |
| gsmdec | clinton.pcm.gsm | 115.1 | audio |
| gsmenc | clinton.pcm | 307.1 | audio |
| mesamipmap | m.ppm | 75.2 | 3D graphics |
| mesaosdemo | o.ppm | 29.7 | 3D graphics |
| mesatextgen | t.ppm | 129.4 | 3D graphics |
| mpeg2enc | test.par | 222.0 | video |
| pgpdec | pgptext.pgp | 108.6 | encryption |
| pgpenc | pgptest.plain | 130.6 | encryption |
| rasta | ex5_c1.wav | 26.4 | audio |
| rawcaudio | clinton.pcm | 8.7 | audio |
run lengths. All the benchmarks were compiled for the Alpha
AXP using Compaq’s C compiler with the -O4 optimization
level, and they were run to completion.
We define a new metric to evaluate the performance of 
a clustered configuration relative to that of a centralized 
one, with similar characteristics: the normalized N-clusters
IPC Ratio (IPCR_N for short) is the quotient
IPC_{N-clusters} / IPC_{1-cluster}. It indicates the IPC degradation caused by
inter-cluster communication delays on a clustered architecture,
and its maximum value is 1. This metric is useful to
evaluate the impact of a particular technique (e.g. value 
prediction) on a clustered architecture, by comparing the 
IPCR obtained with and without implementing the tech- 
nique. An IPCR increase would indicate that the technique 
produces higher IPC improvements in the clustered archi- 
tecture than in the centralized one, thus measuring the ben- 
efits that are exclusive to the clustered architecture, isolat- 
ed from other more general improvements that affect both 
configurations. 
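The way the metric separates cluster-specific gains from general ones can be seen in a small Python sketch. The IPC values used here are made-up illustrative numbers, not results from this paper, and the function name `ipcr` is hypothetical.

```python
def ipcr(ipc_n_clusters, ipc_1_cluster):
    """Normalized N-clusters IPC ratio: IPC_{N-clusters} / IPC_{1-cluster}."""
    return ipc_n_clusters / ipc_1_cluster


# Hypothetical IPCs without the technique under study.
base = ipcr(ipc_n_clusters=2.8, ipc_1_cluster=4.0)

# Hypothetical IPCs with the technique: both machines speed up, but the
# clustered one gains proportionally more, so the ratio rises.
with_technique = ipcr(ipc_n_clusters=3.36, ipc_1_cluster=4.2)

# Relative IPCR increase: the part of the benefit exclusive to clustering.
ipcr_gain = (with_technique - base) / base
```

In this made-up case the centralized IPC improves by 5% and the clustered IPC by 20%, and the IPCR moves from roughly 0.70 to 0.80; a technique that sped up both machines by the same factor would leave the IPCR unchanged.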
3. A steering scheme for value prediction 
In this section, we first introduce a steering scheme 
that is very effective but does not include any technique to 
leverage value prediction. Then we present a steering 
mechanism that exploits value prediction as a way to re- 
duce communication requirements. 
3.1. The baseline steering algorithm 
We have evaluated several steering strategies described
in previous works [4, 5], and variations of them. The
best performance was obtained with an enhanced
version of the “Advanced RMBS” heuristic [4], generalized
for an arbitrary number of clusters, which is the
Baseline scheme considered in this paper.
Figure 2. IPC of the 1-, 2- and 4-cluster configurations (baseline steering), with and without value prediction.
This algorithm applies the criteria discussed above in the following way:
in most cases, as a primary rule, it gives the highest priority
to the reduction of communication penalties and, as a second
rule, it tries to improve the workload balance. However,
in some cases, when the workload imbalance is considered
too high, the balance criterion takes precedence. The
algorithm is described next in more detail.
1. If the workload imbalance is higher than a given
threshold, the current instruction is sent to the least
loaded cluster.
2. Else, the clusters that will cause minimum communication
penalties are identified:
2.1. If any source operand is not available at dispatch
time, select the cluster(s) where the pending
operand(s) are to be produced.
2.2. If all source operands are available, select the
clusters that have the greatest number of operands
currently mapped.
2.3. If it has no source operands, select all clusters.
3. Finally, choose the least loaded cluster among those
selected in step 2.
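The rules above can be sketched as a single Python function. This is a simplified reading under assumptions: the `Operand` record, its field names, and the tie-breaking toward the lowest cluster index are hypothetical, and `counters` stands for the per-cluster DCOUNT values of Section 2.3.2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operand:
    available: bool        # value already produced at dispatch time
    mapped: frozenset      # clusters holding a valid mapping of the register
    producer: int = -1     # cluster that will produce a pending value


def steer_baseline(operands, counters, threshold):
    """Pick a cluster for one instruction following rules 1 to 3."""
    clusters = list(range(len(counters)))

    def least_loaded(candidates):
        return min(candidates, key=lambda c: counters[c])

    # Rule 1: imbalance too high, balance takes precedence.
    if max(abs(v) for v in counters) > threshold:
        return least_loaded(clusters)

    pending = [op for op in operands if not op.available]
    if pending:
        # Rule 2.1: go where the pending operand(s) will be produced.
        candidates = sorted({op.producer for op in pending})
    elif operands:
        # Rule 2.2: clusters with the most source operands mapped.
        score = {c: sum(c in op.mapped for op in operands) for c in clusters}
        best = max(score.values())
        candidates = [c for c in clusters if score[c] == best]
    else:
        # Rule 2.3: no source operands, every cluster qualifies.
        candidates = clusters

    # Rule 3: least loaded cluster among the candidates.
    return least_loaded(candidates)
```

Keeping rule 3 as a shared final step means balance still acts as a tie-breaker even when the communication criterion leaves more than one candidate.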
The threshold mentioned in rule 1 was set experimentally
to DCOUNT=32 and DCOUNT=16 for the 4-cluster and
2-cluster configurations, respectively.
Figure 2 shows the IPC obtained with the baseline
steering algorithm for 1, 2 and 4 clusters, with value prediction
and without it. The IPCs are higher when value prediction
is implemented, although the improvement is rather
low for the centralized configuration (2% on average, and
negative for several benchmarks). The benefits are higher
for the clustered organization (5% and 16% for the 2- and 4-cluster
configurations respectively).
The two leftmost bars in each group in Figure 3 depict
other interesting figures from the same experiments.
Graph c shows the IPCR increase provided by
value prediction, which is a performance improvement specific
to each clustered architecture, as discussed in Section
2.4. This graph shows a notable increase of the IPCR
when value prediction is implemented (in spite of a slight
increase in the average workload imbalance, see graph a),
which is due to a drastic communication reduction (graph
b), especially for the 4-cluster configuration, where communications
are also higher: IPCR_4 increases by 14%, from
0.65 to 0.74, and the communications rate is reduced by
44%, from 0.22 to 0.12.
3.2. Enhancing the partitioning scheme through value prediction
In this paper we focus on how value prediction may 
improve the performance of the steering logic in a clustered 
processor. We propose some modifications of the Baseline 
steering heuristic, based on the assumption that the predict- 
ed source operands will never cause communications or de- 
lays, and thus the steering may concentrate on improving 
the workload balance. The assumption is true if the predic- 
tion does not fail, but it may not hold otherwise. However, 
as far as the misprediction rate is kept low, these modifica- 
tions may improve significantly the workload balance. The 
first two modifications to the steering strategy are de- 
scribed below, in more detail: 
First, when the source operand of an instruction is predicted
and it is not yet available, the steering algorithm considers
it as available. By doing so, the algorithm is not
forced to steer the instruction to the cluster where the operand
is going to be produced (if it is the only operand, rule
2.1 is not applied).
The second modification consists of considering any
predicted source operand to be mapped in all clusters because,
regardless of the cluster it is sent to, it will not cause
any additional inter-cluster communication (unless the prediction
fails and the operand is remote). In consequence,
communication issues do not impose any restriction on the
choice of clusters (i.e. this operand does not constrain the
set of candidate clusters, if rule 2.2 is applied).
Figure 3. Comparison of 4 configurations: Baseline without and with prediction, VPB with prediction and
VPB with perfect prediction. (a) Workload imbalance (b) Communications/instruction (c) Normalized IPCR
In summary, these two modifications to the baseline 
steering algorithm eliminate in some cases the constraints 
imposed by communications/delays issues (in rule 2), so 
that the algorithm has better opportunities for balancing the 
workload (since rule 3 selects one cluster from a wider 
choice of clusters). 
We evaluated the impact of these two modifications on
a 4-cluster configuration, and found that they produce a
negligible average performance improvement over the
baseline scheme. The average workload imbalance is reduced
by 31% and, since imbalance correction actions (which ignore
communication issues) are less frequent, one would
also expect fewer communications. However, the
communications ratio (which mostly determines the IPC)
remains constant because there is also a communications
increase due to an indiscriminate use of the optimistic initial
assumptions. More specifically, if an instruction that
uses a predicted source operand is sent to a cluster where the
operand is not mapped, and the prediction fails, then this instruction
will be re-issued non-speculatively, and a communication
will be required to read the correct operand from a remote
cluster.
3.3. The VPB steering scheme 
In consequence, to minimize the above mentioned 
communications increase, the second modification to the 
Baseline steering scheme should only apply to those cases 
in which there is a potential for improving the workload 
balance. In particular, we propose that the steering logic 
considers predicted source operands to be mapped in all 
clusters only when the workload imbalance is higher than a 
given threshold (that we set empirically to DCOUNT=16
and DCOUNT=8 for the 4-cluster and 2-cluster configurations,
respectively). In other words, if the workload is very
well balanced, the steering does not rely on value predic- 
tion to improve workload balance, since it may increase the 
communication requirements. We refer to this technique as 
the Value Prediction Based scheme (VPB). 
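A minimal Python sketch of how the VPB rule might sit on top of the steering rules of Section 3.1 is given below. It is a sketch under assumptions, not the simulated hardware: the `Operand` record, the `predicted` flag (standing for a confident prediction), the function names, and the tie-breaking are hypothetical. The first modification is always applied; the second is gated by the VPB threshold, as described above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Operand:
    available: bool
    mapped: frozenset          # clusters with a valid mapping
    producer: int = -1         # cluster producing a pending value
    predicted: bool = False    # a confident prediction exists


def vpb_adjust(operands, counters, vpb_threshold):
    """Operand view seen by the steering rules under VPB."""
    everywhere = frozenset(range(len(counters)))
    imbalance = max(abs(v) for v in counters)
    view = []
    for op in operands:
        if op.predicted:
            # Modification 1: a predicted operand counts as available.
            op = replace(op, available=True)
            # Modification 2, only under high imbalance (VPB): treat the
            # operand as mapped in every cluster.
            if imbalance > vpb_threshold:
                op = replace(op, mapped=everywhere)
        view.append(op)
    return view


def steer_vpb(operands, counters, balance_threshold, vpb_threshold):
    clusters = list(range(len(counters)))

    def least_loaded(candidates):
        return min(candidates, key=lambda c: counters[c])

    if max(abs(v) for v in counters) > balance_threshold:    # rule 1
        return least_loaded(clusters)
    ops = vpb_adjust(operands, counters, vpb_threshold)
    pending = [op for op in ops if not op.available]
    if pending:                                               # rule 2.1
        candidates = sorted({op.producer for op in pending})
    elif ops:                                                 # rule 2.2
        score = {c: sum(c in op.mapped for op in ops) for c in clusters}
        best = max(score.values())
        candidates = [c for c in clusters if score[c] == best]
    else:                                                     # rule 2.3
        candidates = clusters
    return least_loaded(candidates)                           # rule 3
```

With thresholds of 32 and 16, a predicted operand mapped only in cluster 0 keeps the instruction in cluster 0 while the imbalance stays at or below 16, and frees the choice to the least loaded cluster once the imbalance exceeds 16.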
Figure 3 compares workload imbalance, communication
rate and IPCR for 4 different configurations: the Baseline
without and with value prediction, the VPB scheme,
and VPB with perfect prediction. Comparing the results for
a 4-cluster configuration with value prediction, the VPB
scheme has 12% fewer communications than the Baseline
and a 10% lower workload imbalance, which results in a
significant performance improvement (IPCR_4 increases
from 0.74 to 0.77).
The rightmost bar in each group in Figure 3 shows an
upper bound for the VPB scheme, assuming a perfect predictor.
Communications are not zero because of fp values,
which are not considered by our predictor. The IPCR values are
0.90 and 0.96 for the 4- and 2-cluster configurations respectively,
which suggests that the performance of the VPB
scheme may be significantly improved by a more effective
predictor.
So far, all the reported experiments assumed that the
rename/steering logic takes a single cycle. However, due to
the additional complexity introduced by the steering logic,
it might require 2 cycles in some technologies. We simulated
a 2-cycle rename/steer stage and found that, for a 4-cluster
configuration with VPB, the IPC is degraded by less than 2%.
In summary, we observe that value prediction produces
significant performance improvements for a clustered
organization, which are higher than those observed for a
centralized one, especially when adequate steering
techniques are implemented. In particular, we have found
that for a 4-cluster configuration the IPCR ratio increases
on average by 18%, from 0.65 to 0.77, and for a 2-cluster
configuration IPCR increases by 5%, from 0.85 to 0.89. This
is due to the drastic 50% reduction of the communication
rate (from 0.22 to 0.11 for 4 clusters, and from 0.12 to
0.06 for 2 clusters). We can thus conclude that value
prediction is a very effective technique to reduce the
communication requirements of clustered processors.
Figure 4. Impact of (a) communication latency (1, 2 and 4
cycles) and (b) communication bandwidth (1 path per cluster
vs. unlimited) on the IPC, for 2 and 4 clusters, with and
without value prediction.
The overall benefits of value prediction translate into 
an increase in IPC of 21% on average (from 2.96 to 3.59) 
for a 4-cluster architecture, and a smaller increase for a 2- 
cluster configuration (8%, from 3.84 to 4.14), whereas for 
a centralized processor the benefits are almost negligible 
(2%, from 4.54 to 4.63). Note that we assumed a simple 
value predictor and the results will likely be better with 
more complex and effective predictors.
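As a quick consistency check, the percentages quoted in this section can be recomputed from the before/after pairs given in the text. The helper name and the dictionary labels below are illustrative only.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a percentage."""
    return 100.0 * (after - before) / before

# (before, after) pairs quoted in the text.
reported = {
    "IPCR, 4 clusters": (0.65, 0.77),
    "IPCR, 2 clusters": (0.85, 0.89),
    "communication rate, 4 clusters": (0.22, 0.11),
    "communication rate, 2 clusters": (0.12, 0.06),
    "IPC, 4 clusters": (2.96, 3.59),
    "IPC, 2 clusters": (3.84, 4.14),
    "IPC, centralized": (4.54, 4.63),
}

for name, (before, after) in reported.items():
    print(f"{name}: {relative_gain(before, after):+.1f}%")
```

Rounded to whole percentages, the results agree with the figures in the text: 18% and 5% for IPCR, a 50% drop in communication rate, and 21%, 8% and 2% for IPC.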
4. Sensitivity analysis 
In future technologies the widening gap between the
relative speeds of gates and wires will dramatically
decrease the percentage of on-chip transistors that a signal
can reach in a single clock cycle [1]. Using high clock
rates will require not only reducing the capacity of many
components, such as register files and issue windows, but
also pipelining more deeply the access to other structures.
In this work we focus on the inter-cluster communica- 
tion bypasses. In the previous sections we have assumed 
that these communications take 1 cycle (there is a 1 cycle 
“bubble” between the copy instruction and the dependent 
instruction, in another cluster). In this section, we study the 
sensitivity of clustered architectures to the communication 
latency, measured by the IPC degradation caused by a
communication latency of 1, 2, and 4 cycles. In all cases we
assume that communications are fully pipelined, that is, for a
given bypass path, one communication may begin per cy- 
cle regardless of its total latency. We also analyze the im- 
pact of the communication bandwidth and value predictor 
table size on the performance of the processor. 
4.1. Communication latency
Figure 4(a) shows that there is a significant performance
degradation when the communication latency increases from 1
to 4 cycles. For instance, on a 4-cluster configuration, the
IPC decreases by 17% (and by 20% without prediction, because
of its higher communication requirements). Similar trends
are observed on a 2-cluster configuration, although the
performance degradation is slightly smaller (16% with
prediction and 17% without prediction).
4.2. Communication bandwidth 
The inter-cluster communication bandwidth has a direct
impact on the complexity and delay of the register files
[7, 16] and the bypass network, since it determines the
number of register file write ports devoted to remote
accesses, the number of bypass multiplexer inputs coming
from remote clusters, and the number of outputs from the
bypass network to the interconnection network. Furthermore,
the inter-cluster communication bandwidth also determines
the number of tags that are broadcast to the instruction
queues of remote clusters. Therefore, it has a direct impact
on the complexity and delay of the wake-up logic, which
depends quadratically on the total number of tags crossing
its CAM cells [15, 16].
So far, we have assumed an unbounded bandwidth for 
the interconnection network to isolate our results from pos- 
sible communication bandwidth bottlenecks. Here we 
study the impact of having a limited bandwidth. For an
N-cluster configuration, we assume a simplified model with
N×B independent paths. Each path is implemented through a
pipelined bus on which any cluster can send a value, and
each bus is connected to the write port of a single cluster
register file. Therefore, we assume that each register file
has B write ports for inter-cluster communications. Any
cluster may allocate one of these paths to write a value to
a remote register file, and holds it for a single cycle,
since the communication is fully pipelined. Obviously, this
model is somewhat idealized, since it omits the complexities
due to pipelining, arbitration, or variable latencies
dependent on the topology, but it provides a first-order
approximation for evaluating the problem.
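The following sketch shows one way to express this idealized N×B model in software. It is an illustration under assumptions that go beyond the text: the function name, the request encoding and the first-come-first-served arbitration are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def schedule_transfers(requests: List[Tuple[int, int]],
                       paths_per_cluster: int,
                       latency: int = 1) -> List[int]:
    """Return the arrival cycle of each (ready_cycle, dest_cluster) request.

    Each destination cluster accepts at most `paths_per_cluster` (B) new
    transfers per cycle. A path is held for one cycle only, because the
    communication is fully pipelined. Requests are served in list order
    (an assumed first-come-first-served arbitration).
    """
    if paths_per_cluster < 1:
        raise ValueError("at least one path per cluster is required")
    # (dest_cluster, cycle) -> number of paths already allocated
    used: Dict[Tuple[int, int], int] = defaultdict(int)
    arrivals = []
    for ready, dest in requests:
        cycle = ready
        while used[(dest, cycle)] >= paths_per_cluster:
            cycle += 1  # all B paths to this cluster are busy: retry next cycle
        used[(dest, cycle)] += 1
        arrivals.append(cycle + latency)
    return arrivals
```

For example, two values that become ready in the same cycle for the same destination arrive one cycle apart when B=1, and in the same cycle when B=2. Values sent to different clusters never conflict in this model.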
Figure 4(b) shows that when the communication bandwidth is
limited to a single path per cluster there is very little
performance degradation compared to the unbounded model. For
instance, on a 4-cluster configuration, the IPC decreases
only by 1% (1.4% without value prediction), and a small IPC
decrement is also observed for a 2-cluster configuration
(0.2% and 1.8% respectively). Consequently, for
inter-cluster communications in a cost-effective
architecture, a single write port in each register file, a
single incoming tag per issue window, and a single remote
bypass attached to the input multiplexers of the functional
units may suffice.
Figure 5. Impact of value predictor table size (1K to 128K
entries) for 4 clusters on (a) IPC and (b) predictor
accuracy.
4.3. Value predictor table size 
The predictor table size determines the prediction ac- 
curacy, which has a significant influence on the perfor- 
mance. We have evaluated the impact of the predictor table 
size on a clustered architecture. Figure 5(a) shows that on 
average, for a 4-cluster configuration, there is less than 
4.5% IPC degradation when the predictor table size is re- 
duced from 128K to just 1K entries. 
Figure 5(b) shows the predictor accuracy for the same 
range of predictor sizes. We can observe that for 42% of the
values the predicted value was not used, because the
prediction was not confident. The percentage of
non-confident predictions is rather high because we chose a
rather simple value predictor.
In addition, the hit ratio (correctly predicted values over 
predicted values) decreases from 93.4% to 90.9% when the 
predictor size is reduced from 128K to 1K. 
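Combining the two figures above gives a rough estimate of the fraction of all values that are both predicted and predicted correctly. The sketch assumes that the 42% non-confident rate applies at both table sizes, which the text does not state explicitly. The function name is illustrative.

```python
def useful_prediction_rate(non_confident: float, hit_ratio: float) -> float:
    """Fraction of all values that are predicted and predicted correctly.

    non_confident -- fraction of values whose prediction is not used
    hit_ratio     -- correctly predicted values over predicted values
    """
    return (1.0 - non_confident) * hit_ratio

# Assumption: the 42% non-confident rate holds for both table sizes.
print(f"128K entries: {useful_prediction_rate(0.42, 0.934):.3f}")
print(f"1K entries:   {useful_prediction_rate(0.42, 0.909):.3f}")
```

Under this assumption, slightly more than half of the values are correctly predicted at either table size, which is in line with the small IPC degradation reported for the smaller table.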
5. Related work 
The main contribution of this work is realizing that 
value prediction can eliminate many of the long wire com- 
munication penalties in the context of a clustered architec- 
ture. We have also presented a new steering algorithm, the 
VPB scheme, that takes advantage of value prediction to 
further reduce inter-cluster communications. Moreover, 
this paper extends the techniques used in previous works
[4, 5] (for register renaming, dynamic steering of
instructions and forwarding values among different clusters) from
a configuration with 2 heterogeneous clusters to a more 
general design with an arbitrary number of homogeneous 
clusters. 
Other relevant works on dynamically scheduled clustered
processors are the Dependence-based, the Multicluster and
the Pews architectures. In the Dependence-based paradigm
[15, 16], instructions are steered to several FIFO queues
instead of a conventional issue window, according to a
heuristic that ensures that two dependent instructions are
only queued in the same FIFO if there is no other
instruction in between. This heuristic lacks any explicit
mechanism to balance the workload, which is instead adjusted
implicitly by the allocation algorithm of new free FIFO
queues. This allocation algorithm generates many
communications when it assigns a FIFO to a non-ready
instruction, since it does not consider in which cluster the
operands are to be produced [5].
The Multicluster architecture [6] also used run-time 
generated copy instructions for inter-cluster communica- 
tion. In that architecture the register name space is parti- 
tioned into two subsets, and program partitioning is done at 
compile time without any ISA modification, by the appro- 
priate logical register assignment for the result of each in- 
struction. Both the workload balance and inter-cluster com- 
munication are estimated at compile time. The same au- 
thors proposed a dynamic scheme [7] that adjusts run-time 
excess workload by re-mapping logical registers. However,
they found most heuristics to be largely ineffective, since
the re-mapping introduces communication overheads that
offset almost any balance benefit.
Kemp and Franklin proposed the Pews clustered architecture
[11], where instructions are assigned to clusters based on
register dependences. However, since they assume a
centralized register file, the steering scheme only needs to
group two dependent instructions in the same cluster when
the value from the producer is not yet available at the time
the consumer is decoded. This simple scheme is not suitable
for our “distributed” register file, and in addition, it
does not address the load balancing problem.
The Alpha 21264 [10] is also a 2-cluster organization that
duplicates the integer register file, with one copy in each
cluster. The two register file copies are kept consistent by
writing any result in both clusters. Instructions are
dynamically steered at issue time by a central instruction
queue, which sends an instruction to the cluster where its operands
will be available earlier. This organization does not reduce 
the number of register write ports nor the number of regis- 
ters per cluster. Besides, it does not reduce the complexity 
of the issue logic, although it requires a simpler partitioning 
scheme. 
The Trace Processors [17, 20] dynamically partition the
code sequence into chunks of consecutive instructions,
called traces. Instruction steering to clusters is then
performed at run-time on a per-trace basis. This
partitioning may result in an acceptable workload balance,
since traces have similar sizes, but it is likely to result
in many inter-cluster communications, since they are not
taken into account by the partitioning scheme.
Sastry, Palacharla and Smith proposed a static code
partitioning technique [18]. The partitioning scheme is
constrained to dispatching loads, stores and complex integer
instructions to the same cluster. In addition, it requires
some extensions to the ISA in order to specify to the
hardware the target cluster of each instruction. Moreover, their
scheme is less flexible and less effective than a dynamic 
approach as shown elsewhere [4], since all dynamic in- 
stances of the same static instruction are executed in the 
same cluster regardless of run-time conditions, such as the 
workload balance, that are difficult to estimate at compile 
time. 
6. Conclusions 
Future microprocessors are likely to be communica- 
tion bound due to the increasing penalty of wire delays. In 
this paper we show that value prediction can be an effective 
instrument to improve communication locality. In particu- 
lar, we have presented an approach to reduce inter-cluster 
communication by means of a dynamic steering logic that 
leverages value prediction. Values produced in a cluster 
and consumed in another one may not require long wire de- 
lays to propagate from the producer to the consumer if the 
consumer can correctly predict the value. The validation 
required by the prediction is locally performed in the pro- 
ducer cluster. 
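The effect of local validation on long-wire traffic can be summarized with a small behavioural sketch. It is only a model of the idea stated above: the names are hypothetical, and the way the producer cluster obtains the predicted value for comparison is abstracted away.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Outcome:
    speculative_use: bool     # consumer started with the predicted value
    long_wire_transfer: bool  # the real value had to cross clusters

def remote_operand(actual: int, predicted: Optional[int]) -> Outcome:
    """Model one value produced in one cluster and consumed in another.

    `predicted` is None when the predictor is not confident.
    """
    if predicted is None:
        # No confident prediction: ordinary inter-cluster communication.
        return Outcome(speculative_use=False, long_wire_transfer=True)
    # The producer cluster compares its result with the prediction locally.
    mispredicted = predicted != actual
    # Only a misprediction forces the actual value over the long wires.
    return Outcome(speculative_use=True, long_wire_transfer=mispredicted)
```

In this model a correct confident prediction avoids the transfer entirely, whereas a misprediction or a non-confident prediction pays the full communication cost.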
We have shown that value prediction removes communications
even for previously proposed steering schemes that were not
specifically designed to exploit value prediction.
However, performance is higher if the steering logic ex- 
ploits the predictability of values. We have presented a 
novel steering scheme (VPB), and we have shown that it 
outperforms previous proposals. This benefit mainly 
comes from a 50% reduction in the amount of communica- 
tions. We observed that value prediction reduces the penal- 
ties caused by inter-cluster communications by 18% on av- 
erage. Moreover, whereas value prediction increases the 
IPC of a centralized architecture by just 2%, the same pre- 
dictor increases the performance of a 4-cluster microarchi- 
tecture with VPB steering by 21%. 
Acknowledgements 
We thank the anonymous referees for their valuable 
comments. This work was developed using the resources of 
the CEPBA, and is supported by the Ministry of Education 
of Spain under contract CYCIT TIC98-0511.
References 
[1] V.Agarwal, M.S.Hrishikesh, S.W.Keckler and D.Burger.
“Clock Rate versus IPC: The End of the Road for
Conventional Microarchitectures”, in Proc. of the 27th
Annual Int. Symp. on Comp. Architecture, June 2000.
[2] M.T.Bohr. “Interconnect Scaling - The Real Limiter to
High Performance ULSI”, in Proc. of the 1995 IEEE Int.
Electron Devices Meeting, pp. 241-244, 1995.
[3] D.Burger, T.M.Austin, S.Bennett. “Evaluating Future
Microprocessors: The Simplescalar Tool Set”, Tech. Report
CS-TR-96-1308, Univ. Wisconsin-Madison, 1996.
[4] R.Canal, J-M.Parcerisa, A.González. “A Cost-Effective
Clustered Architecture”, in Proc. of the Int. Conf. on
Parallel Architectures and Compilation Techniques (PACT 99),
Newport Beach, CA, pp. 160-168, Oct. 1999.
[5] R.Canal, J-M.Parcerisa, A.González. “Dynamic Cluster
Assignment Mechanisms”, in Proc. of the 6th Int. Symp. on
High-Performance Computer Architecture, pp. 132-142,
Jan. 2000.
[6] K.I.Farkas, P.Chow, N.P.Jouppi, Z.Vranesic. “The
Multicluster Architecture: Reducing Cycle Time Through
Partitioning”, in Proc. of the 30th Ann. Symp. on
Microarchitecture, pp. 149-159, December 1997.
[7] K.I.Farkas. “Memory-system Design Considerations for
Dynamically-scheduled Microprocessors”, Ph.D. thesis,
Department of Electrical and Computer Engineering, Univ.
of Toronto, Canada, January 1997.
[8] F.Gabbay and A.Mendelson. “Speculative Execution Based
on Value Prediction”, TR. #1080, Technion, 1996.
[9] J.González and A.González. “Memory Address Prediction
for Data Speculation”, Tech. Report UPC-DAC-1996-50,
Univ. Politècnica de Catalunya, Spain, 1996.
[10] L.Gwennap. “Digital 21264 Sets New Standard”,
Microprocessor Report, 10 (14), Oct. 1996.
[11] G.A.Kemp, M.Franklin. “PEWS: A Decentralized Dynamic
Scheduler for ILP Processing”, in Proc. of Int. Conf. on
Parallel Processing, pp. 239-246, August 1996.
[12] C.Lee, M.Potkonjak and W.H.Mangione-Smith.
“Mediabench: A Tool for Evaluating and Synthesizing
Multimedia and Communications Systems”, in Proc. of the Int.
Symp. on Microarchitecture (Micro 30), pp. 330-335,
Dec. 1997.
[13] D.Matzke. “Will Physical Scalability Sabotage
Performance Gains”, IEEE Computer 30(9): 37-39, Sept. 1997.
[14] Mediabench Home Page. URL:
http://www.cs.ucla.edu/~leec/mediabench/
[15] S.Palacharla, N.P.Jouppi, and J.E.Smith.
“Complexity-Effective Superscalar Processors”, in Proc. of
the 24th Int. Symp. on Comp. Architecture, pp. 1-13,
June 1997.
[16] S.Palacharla. “Complexity-Effective Superscalar
Processors”, Ph.D. thesis, Univ. of Wisconsin-Madison, 1998.
[17] E.Rotenberg, Q.Jacobson, Y.Sazeides and J.E.Smith.
“Trace Processors”, in Proc. of the 30th Ann. Symp. on
Microarchitecture, pp. 138-148, December 1997.
[18] S.S.Sastry, S.Palacharla and J.E.Smith. “Exploiting
Idle Floating-point Resources For Integer Execution”, in
Proc. of the Int. Conf. on Programming Lang. Design and
Implementation, pp. 118-129, June 1998.
[19] Y.Sazeides, S.Vassiliadis, J.E.Smith. “The Performance
Potential of Data Dependence Speculation & Collapsing”, in
Proc. of Int. Symp. on Microarchitecture, pp. 238-247, 1996.
[20] S.Vajapeyam and T.Mitra. “Improving Superscalar
Instruction Dispatch and Issue by Exploiting Dynamic Code
Sequences”, in Proc. of the Int. Symp. on Computer
Architecture, pp. 1-12, June 1997.
