Dynamically Matching ILP Characteristics Via a Heterogeneous Clustered Microarchitecture by Chen, Lei et al.
Dynamically Matching ILP Characteristics Via a Heterogeneous Clustered
Microarchitecture
Lei Chen, David H. Albonesi, Steven Dropsho†
Department of Electrical and Computer Engineering
†Department of Computer Science
University of Rochester
Rochester, NY 14627
Abstract
Applications vary in the degree of instruction level par-
allelism (ILP) available to be exploited by a superscalar
processor. The ILP can also vary significantly within an
application. On one end of the microarchitecture space
are monolithic superscalar designs that exploit parallelism
within an application. At another end of the spectrum are
clustered architectures having many simple cores that can
be clocked at a higher frequency than a comparable monolithic
design. A disadvantage of the clustered design is the
cost of transmitting results between clusters, which can
limit performance even in high-ILP applications if
instructions are not mapped carefully to minimize inter-cluster
communication.
In this paper, we propose an approach that incorporates
the strengths of the prior two by clustering multiple integer
ALUs in an asymmetric fashion. In one cluster, a 6-way
out-of-order pipeline efficiently executes code having high
ILP. In another cluster, a simpler, but deeper, 2-way pipeline
running at twice the frequency speeds up regions of code
having little ILP. We use the parallelism metrics gathered
from the dynamic Data Dependence Tracking (DDT) mech-
anism to dynamically steer instruction windows to a cluster.
We demonstrate that the heterogeneous cluster design can
improve performance by up to 30% over a monolithic 8-
way superscalar processor.
1 Introduction
A fundamental tradeoff in processor microarchitecture is
in achieving the best balance between maximizing instruc-
tions per cycle (IPC) performance and clock frequency. Due
to the diverse nature of application programs, in particular
with regard to their instruction-level parallelism (ILP)
and branch and memory behavior, the effectiveness of com-
plex hardware for exploiting ILP varies considerably. Re-
cent adaptive approaches attempt to exploit this variability
by dynamically adjusting the complexity of major hardware
structures (issue queue, caches, etc.) to match these vary-
ing demands. In this paper, we investigate an alternative
approach, that of providing two static pipeline designs that
are optimized for different ILP characteristics and steering
windows of instructions to one pipeline or the other.
Although prior clustered designs have been proposed
and even implemented (e.g., in the Alpha 21264 micropro-
cessor [8]), these are homogeneous in nature and are used
primarily to increase frequency at the expense of reducing
IPC. The latter stems from the inter-cluster communica-
tion delays that are incurred when register values are passed
among clusters. Our heterogeneous clustering approach is
rather to tailor each cluster to particular application ILP
characteristics, and to use only one cluster at a time. By
switching between clusters at a coarse grain, and sending
values to the disabled cluster’s register file throughout exe-
cution to keep it current, we largely avoid the inter-cluster
communication penalties of the homogeneous cluster de-
sign, and are able to tailor the pipeline to the application.
The rest of this paper is organized as follows. In the
next section, we describe the heterogeneous cluster design.
We provide details of the pipeline design by which the fre-
quency is increased at the expense of additional pipeline
stages. Our methodology is described in Section 3, and our
results in Section 4. We discuss related work in Section 5,
and conclude in Section 6.
2 Heterogeneous Cluster Microarchitecture
To improve performance, general-purpose processor mi-
croarchitectures dedicate considerable resources in order to
exploit ILP. The added circuit complexity can potentially
impact the cycle time of the processor, or increase latency
in datapaths, thereby mitigating some of the obtained IPC
improvement. A balanced design makes trade-offs between
the increased complexity and cycle time/latency to maxi-
mize overall performance.
Unfortunately, application phases with low ILP must pay
a higher delay cost than what may be achieved with a sim-
pler execution core matching the lower ILP. The microar-
chitecture we describe here has two execution cores, one
wide and one narrow. The wide core runs at a base fre-
quency and improves performance through exploiting par-
allelism. The narrow core runs at twice the frequency of
the wide core (but is more deeply pipelined) and improves
performance by executing code having little ILP faster (due
to its higher throughput on narrow sustained streams of
instructions) than could be achieved in the wide core.
Figure 1. Architecture organizations: monolithic superscalar (8-way), homogeneous cluster (two 4-way cores), and heterogeneous cluster (one 6-way core and one 2-way core).
The two types of cores can be considered clusters and
share some similarity with clustered microarchitectures.
The design philosophy in a basic clustered organization is
to develop simple cores that run at a high frequency and
instantiate multiple copies on the chip for parallelism. We
refer to traditional cluster design as a homogeneous clus-
tered architecture because all the cores are identical. A
performance bottleneck in such an architecture is often the
cost of transmitting results between the cores. In the design
we propose here, the cores can be considered “specialists”
with one supporting high ILP computations and the other
supporting low ILP computations. We refer to this type of
design as a heterogeneous clustered architecture.
Figure 1 compares the three basic architectures: mono-
lithic superscalar, our heterogeneous cluster design, and
a traditional homogeneous cluster design. In the figure,
larger blocks indicate a more complex design and slower
clock frequency (though relative differences are not propor-
tional to block size). Inter-block crossings imply a delay
penalty. Intuitively, while the homogeneous cluster may
permit the cores to run at the highest possible clock rate,
phases with high degrees of parallelism may suffer a perfor-
mance penalty due to frequent inter-cluster communication.
Conversely, the monolithic superscalar design has nomi-
nal inter-ALU communication overhead to support high de-
grees of parallelism efficiently, but due to its lower clock
frequency it is less efficient on instruction streams having
little parallelism.
The proposed heterogeneous cluster architecture at-
tempts to strike a balance between the two extremes by
having cores target particular types of computation. By
matching windows of instructions to the best core type, a
heterogeneous cluster architecture can improve overall per-
formance through dynamically exploiting either parallelism
or high sustained throughput on narrow instruction streams.
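The trade-off described above can be illustrated with a first-order model in which delivered performance is the product of IPC and clock frequency. The sketch below rests on that assumption only: it ignores the deeper pipeline of the narrow core, and the function name and sample IPC values are hypothetical rather than taken from our results.

```python
def preferred_core(ipc_wide: float, ipc_narrow: float, freq_ratio: float = 2.0) -> str:
    """Pick the core with the higher instruction throughput per unit time.

    ipc_wide   -- instructions per wide-core cycle (base clock)
    ipc_narrow -- instructions per narrow-core cycle (fast clock)
    freq_ratio -- narrow-core clock divided by wide-core clock (2.0 in the text)
    """
    wide_rate = ipc_wide                   # instructions per base-clock period
    narrow_rate = ipc_narrow * freq_ratio  # freq_ratio fast cycles fit in one base period
    return "narrow" if narrow_rate > wide_rate else "wide"


# Hypothetical low-ILP phase: 0.9 * 2 = 1.8 beats 1.2.
print(preferred_core(ipc_wide=1.2, ipc_narrow=0.9))
# Hypothetical high-ILP phase: 1.6 * 2 = 3.2 does not beat 4.0.
print(preferred_core(ipc_wide=4.0, ipc_narrow=1.6))
```

Under this simplified model, the narrow core wins whenever its IPC exceeds half of the IPC the wide core can extract from the same code.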
2.1 Pipeline stages in the heterogeneous clusters
The narrower pipeline is simpler than the wider pipeline
and can run at a higher clock rate. Shown in Figure 2(a)
is a basic pipeline based on the SimpleScalar toolset [2].
Shown in Figure 2(b) is the heterogeneous cluster organi-
zation. The clusters share the first three stages and the final
stage. The middle stages (Issue, Register Read, and Execute)
are the same in the wide (6-way) pipeline as in the base
pipeline. In the narrow (2-way) pipeline, however, each of
these three stages is split into two stages that
run at twice the frequency. Steering logic
streams a window of instructions to either the wide pipeline
or the narrow pipeline.
Each pipeline has a copy of the register file. Results gen-
erated in one pipeline are copied to the other register file
after a delay to transport the data to the other pipeline.
To double the clock rate, the stages can be partitioned
in the narrow pipeline in the following manner. The is-
sue logic can be partitioned into a wake up stage and a se-
lect stage as described in [7]. The register file access is
pipelined into two stages. In the execute stage, the ALUs
are designed to run at twice the base frequency. This would
require fast ALU circuit structures such as those used in the
Pentium 4 microprocessor. Since simple operations such as
addition and logical operations require only a single cycle,
stage 5b performs no useful work other than to realign
the instruction with the shared pipeline clock. No dedicated
synchronization circuitry is required, however, because both
clocks are derived from the same source and one is an
integral multiple of the other. Note that we present here one
possible example of a heterogeneous design. We have made
the simplifying assumption that doubling the clock of these
pipeline stages requires doubling the number of stages, even
with the pipeline width reduced by two-thirds. In actuality,
fewer, or perhaps more, stages may be required. With our
assumption of twice the number of stages at twice the fre-
quency, the branch mispredict penalty is the same (in abso-
lute time, not “cycles”) in both pipelines.
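The equal-latency claim can be checked with a small calculation. The sketch below assumes, as stated above, that only the three middle stages differ between the pipelines and that the narrow pipeline doubles both their count and their clock; the helper name is hypothetical.

```python
from fractions import Fraction


def stage_time(num_stages: int, clock_multiplier: int) -> Fraction:
    """Time to traverse num_stages, measured in base-clock periods."""
    return Fraction(num_stages, clock_multiplier)


# Wide pipeline: Issue, Reg Read, Execute at the base clock.
wide_middle = stage_time(3, 1)
# Narrow pipeline: stages 3a/3b, 4a/4b, 5a/5b at twice the base clock.
narrow_middle = stage_time(6, 2)

print(wide_middle, narrow_middle)  # both are 3 base-clock periods
```

Because the shared front-end and final stages are identical, equal time through the middle stages implies an equal mispredict penalty in absolute time.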
Figure 2. Pipeline structures. (a) Basic architecture pipeline: stages 0 through 6 (Fetch, Rename, Issue, Reg Read, Execute, Memory). (b) Heterogeneous cluster pipeline: shared Fetch, Rename, and Memory stages (0, 1, 2, and 6); wide (slower) pipeline with Issue (3), Reg Read (4), and Execute (5); narrow (faster) pipeline with Issue (3a, 3b), Reg Read (4a, 4b), and Execute (5a, 5b).
2.2 Instruction steering
In this study, we use two mechanisms to steer instruc-
tions to one cluster or the other. In typical cluster architec-
tures, all clusters may be used simultaneously to increase
parallelism so steering decisions are made on an instruction
by instruction basis. In our heterogeneous design, we select
a cluster to best match the parallelism across a window of
instructions. Thus, steering decisions are made for groups
of instructions. As a group of instructions executes on
one cluster, its results are written to that cluster's register file
and, after a delay, to that of the other cluster. Thus, delays due
to inter-cluster register communication can occur only on
window boundaries. Even a multi-cycle delay in updating
results between the pipelines has a tolerable performance
degradation since the overhead is amortized over the time
to execute the instruction window. We test window sizes
with as few as 16 and as many as 512 instructions.
To select the best pipeline for each instruction group, we
profile the application on each of the heterogeneous clusters
individually and record the performance to execute each
window of committed instructions on each pipeline. Dur-
ing the actual run, the recorded performance numbers are
compared for each window and instructions are steered to
the cluster providing the best performance during profil-
ing. This method is imperfect because toggling the steering
between clusters affects instruction timing. Perturbing the
timing affects branch prediction history updates and how
much speculative execution is performed. However, this
technique removes to a large degree the uncertainty in how
to steer so the results reflect much of the potential of the
heterogeneous architecture.
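The profile-based selection can be sketched as a per-window comparison of two recorded timings. This is a minimal illustration, assuming both profiles are expressed in the same time unit (for example, base-clock periods) and that ties go to the wide cluster; the function name and sample numbers are hypothetical.

```python
def profile_steering(wide_time, narrow_time):
    """Return the cluster chosen for each window from two profiling runs.

    wide_time[i] and narrow_time[i] are the recorded times to execute
    window i on the wide and narrow clusters, in a common time unit.
    """
    if len(wide_time) != len(narrow_time):
        raise ValueError("profiles must cover the same windows")
    # Assumption of this sketch: a tie is resolved in favor of the wide cluster.
    return ["narrow" if n < w else "wide" for w, n in zip(wide_time, narrow_time)]


# Hypothetical per-window timings for four windows.
decisions = profile_steering([10, 8, 12, 9], [12, 6, 12, 7])
print(decisions)
```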
Our second method utilizes the dynamic Data Depen-
dence Tracking (DDT) mechanism proposed in [5]. As
shown in Figure 3, the IDC (Instruction Dependences
Counter) filter counts the number of instructions depen-
dent on each instruction. This is an important ILP met-
ric. Figure 3(a) shows the IDC values for a group of highly
data-dependent instructions. A narrow, fast pipeline is more
suitable for this workload with low ILP. By contrast, Fig-
ure 3(b) shows a group of instructions with high ILP. A wide
pipeline is necessary to exploit this high ILP. As a measure
of ILP, we calculate the average IDC value of a given group
of instructions. For Figure 3(a), the IDC value average is
3.5 while it is 1.8 for the instructions in Figure 3(b). Thus,
the larger the average IDC value, the lower the ILP. There-
fore, it is more suitable to use the narrow but fast pipeline
for instructions with a large average IDC value. Figure 3(c)
shows another group of instructions. The ILP of this group
of instructions is larger than those in Figure 3(a) but smaller
than the group in Figure 3(b). As a result, the performance
difference between running this group of instructions on a
wide pipeline and on a narrow pipeline is small. In other
words, the performance of the group of instructions in Figure 3(c)
is less sensitive to the choice between the wide and narrow
pipelines than that of the groups in Figures 3(a) and 3(b).
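To make the IDC metric concrete, the sketch below counts dependents for two short instruction sequences. It is an illustration rather than a description of the hardware in [5]: it assumes that dependences are tracked transitively and that each instruction is included in its own count, an interpretation chosen because it is consistent with the averages quoted in the text. The register names and both sequences are hypothetical examples, not transcriptions of Figure 3.

```python
def idc_values(instrs):
    """Count, for each instruction, the instructions that depend on it.

    instrs is a list of (dest_register, [source_registers]) in program order.
    Assumptions of this sketch: dependences are transitive, and every
    instruction counts itself once.
    """
    producer = {}  # register -> index of its most recent producer
    deps = []      # deps[i] = set of instructions that i depends on (including i)
    for i, (dest, srcs) in enumerate(instrs):
        d = {i}
        for s in srcs:
            if s in producer:
                d |= deps[producer[s]]
        deps.append(d)
        producer[dest] = i
    counts = [0] * len(instrs)
    for d in deps:
        for j in d:
            counts[j] += 1
    return counts


# Hypothetical serial chain: every instruction feeds the next one.
chain = [("p1", ["p2"]), ("p4", ["p1", "p3"]), ("p5", ["p4", "p1"]),
         ("p6", ["p5", "p4"]), ("p7", ["p6"]), ("p8", ["p4", "p7"])]
# Hypothetical parallel group: all instructions depend only on the load.
fanout = [("p1", ["p2"]), ("p4", ["p1", "p3"]), ("p5", ["p3", "p1"]),
          ("p6", ["p1"]), ("p7", ["p1", "p3"]), ("p8", ["p3", "p1"])]

print(idc_values(chain), sum(idc_values(chain)) / 6)    # serial: large average
print(idc_values(fanout), sum(idc_values(fanout)) / 6)  # parallel: small average
```

The serial chain yields a large average count and the fan-out group a small one, matching the trend that a larger average IDC value indicates lower ILP.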
However, the average IDC value of Figure 3(c) (1.7) is
almost the same as that of Figure 3(b) (1.8), so the average
alone cannot distinguish the two groups. We therefore introduce
the average “deviation” of the IDC values, calculated as:
\[
\mathrm{Deviation}_{avg} = \frac{1}{\mathrm{Size}_{IDC}} \sum_{i=a}^{b} \left| \mathrm{IDC}_i - \mathrm{IDC}_{avg} \right|,
\]
where a and b are the instruction boundaries of the valid
IDC region.
Figure 3. Dynamic Data Dependence Tracking (DDT) and Instruction Dependences Counter (IDC) filter. Each panel shows the DDT matrix indexed by register number, the Valid bits, and the IDC value of each instruction in a short sequence. (a) IDC values for a group of instructions with low ILP. (b) IDC values for a group of instructions with high ILP. (c) IDC values for a group of instructions with small IDC value average and small deviation average.
Therefore, the average deviation for Figure 3(b) is:
\[
\frac{(6 - 1.8) + (1.8 - 1) \times 5}{6} \approx 1.4.
\]
The average deviation for Figure 3(c) is:
\[
\frac{(3 - 1.7) \times 2 + (1.7 - 1) \times 4}{6} = 0.9.
\]
These results show that for similar average IDC values, the
larger the average deviation, the larger the ILP.
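The two worked calculations can be reproduced with a few lines of code. In this sketch, the IDC value lists are read off the arithmetic above (one value of 6 and five values of 1 for Figure 3(b); two values of 3 and four values of 1 for Figure 3(c)); the function names are hypothetical.

```python
def idc_average(idc):
    """Mean IDC value over the valid region."""
    return sum(idc) / len(idc)


def idc_average_deviation(idc):
    """Mean absolute deviation of the IDC values from their average."""
    avg = idc_average(idc)
    return sum(abs(v - avg) for v in idc) / len(idc)


fig3b = [6, 1, 1, 1, 1, 1]  # values implied by the Figure 3(b) calculation
fig3c = [3, 3, 1, 1, 1, 1]  # values implied by the Figure 3(c) calculation

print(round(idc_average(fig3b), 1), round(idc_average_deviation(fig3b), 1))  # 1.8 1.4
print(round(idc_average(fig3c), 1), round(idc_average_deviation(fig3c), 1))  # 1.7 0.9
```

The code uses the unrounded averages, whereas the text rounds them to one decimal place first; both routes give the same deviations after rounding.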
The DDT and IDC are updated on a cycle-by-cycle ba-
sis. When each instruction commits, the associated DDT
and IDC entries are released. In order to calculate the devi-
ation, we maintain the IDC values in an IDC value buffer as
shown in Figure 4. The buffer size is the same as the steer-
ing instruction window size. After the average IDC value
is produced, each IDC value deviation is obtained using the
average IDC value and each IDC value. Since we choose
the steering instruction window to be a power of two, the
average can be obtained using a simple shift right operation.
Note that instead of calculating the average deviation, the
total deviation can be compared against the threshold,
provided the threshold is likewise expressed as a total deviation (average
deviation times the steering instruction window size). We
estimate that the above steps take four cycles. According to our
simulation results, this delay has little impact on
system performance. It is completely hidden if the
steering decision for the next instruction window is the same
as the current one.
Figure 4. Calculating the average IDC value and deviation. The IDC value buffer (IDC values 0 through 7) feeds an ADD tree that produces the total IDC value; a SHIFT yields the average IDC value; the deviation calculation followed by a second ADD tree produces the total deviation.
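The shift-based average and the total-deviation comparison can be sketched in integer arithmetic. This is a minimal software analogue rather than the circuit itself: it assumes the shift truncates the average to an integer, and the function names and sample buffer contents are hypothetical.

```python
def window_metrics(idc_buffer):
    """Return (average, total_deviation) for a power-of-two sized window."""
    n = len(idc_buffer)
    if n == 0 or n & (n - 1):
        raise ValueError("window size must be a power of two")
    shift = n.bit_length() - 1    # log2(n)
    total = sum(idc_buffer)       # first ADD tree
    average = total >> shift      # SHIFT; truncation is an assumption of this sketch
    total_dev = sum(abs(v - average) for v in idc_buffer)  # second ADD tree
    return average, total_dev


def exceeds_threshold(total_dev, avg_dev_threshold, n):
    """Compare totals directly, avoiding a division by the window size."""
    return total_dev > avg_dev_threshold * n


avg, dev = window_metrics([6, 1, 1, 1, 1, 1, 1, 4])
print(avg, dev, exceeds_threshold(dev, 1, 8))
```

Scaling the threshold by the window size once, ahead of time, removes the need for a second shift on the deviation path.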
3 Methodology
Our simulation environment is based on the Sim-
pleScalar toolset [2]. We have modified the simulator to
support clustering of the integer ALU units with different
clock frequencies for each cluster. The DDT and the IDC
filter are added to collect the dependence information on a
cycle-by-cycle basis. A summary of our simulation param-
eters appears in Table 1.
In the experiments, our baseline is a pipeline having
eight integer ALUs. We compare this baseline to a six
ALU design at the same frequency; a two ALU design with
the issue and execution stages running at twice the base-
line frequency; a homogeneous clustered microarchitecture
with two identical four ALU clusters; and a heterogeneous
cluster architecture having a six ALU cluster at the baseline
frequency and a two ALU cluster at the 2X frequency. The
instruction steering mechanism for the homogeneous clus-
ters is based on load balancing and instruction criticality as
described in [1].
For the heterogeneous cluster architecture, we simulated
two instruction steering mechanisms. The first mechanism
is based on the IPC profile information of the 6-way and
2-way pipelines. The instruction stream is grouped into
windows of 16 committed instructions, each of which is
directed to one of the two clusters in the heterogeneous design.
Table 1. Architectural parameters for simulated processor.
Parameter               Value
Fetch Queue             16 entries
Branch predictor        2Bc-gskew
BTB                     512 sets, 4-way
Branch Miss Penalty     7 cycles
Decode Width            8 instructions
Issue Width             8 instructions
Retire Width            16 instructions
L1 D/I-Cache            64KB, 4-way set associative
L2 Unified Cache        512KB, 4-way set associative
L1 D/I-Cache Latency    1 cycle
L2 Cache Latency        6 cycles
Memory Latency          60 cycles (first), 2 cycles (subsequent)
FP ALUs                 4 + 1 mult/div/sqrt unit
Issue Queue             32 entries
Load/Store Queue        32 entries
Reorder Buffer          256 entries
The second instruction steering mechanism is based on
the DDT and the IDC filter as discussed earlier. The aver-
age IDC value and deviation are calculated at run time for
each group of 16 committed instructions. Note that the tail
end execution of a window may overlap with the start up of
the next window running on a different cluster. A 1-cycle
communication cost is assumed when one cluster needs a
value from the other cluster.
The steering decision of the next 16 committed instruc-
tions is made by comparing the calculated values against
the pre-selected thresholds.
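One way to turn the two metrics into a steering decision is sketched below. The exact comparison rule and threshold values are not specified here, so everything in this block is an assumption: the two-threshold policy, the choice to stay on the current cluster when neither test fires, and the numeric thresholds are all hypothetical.

```python
def steer(avg_idc, avg_dev, current, avg_threshold, dev_threshold):
    """Choose a cluster for the next window (hypothetical policy).

    A large average IDC value indicates low ILP, so the narrow cluster is
    chosen. Otherwise a large deviation indicates high ILP, so the wide
    cluster is chosen. If neither test fires, the window is treated as
    insensitive and the current cluster is kept to avoid a switch.
    """
    if avg_idc > avg_threshold:
        return "narrow"
    if avg_dev > dev_threshold:
        return "wide"
    return current


# Hypothetical thresholds, applied to the metrics quoted for Figure 3.
T_AVG, T_DEV = 2.5, 1.2
print(steer(3.5, 1.5, "wide", T_AVG, T_DEV))    # high average, like Figure 3(a)
print(steer(1.8, 1.4, "narrow", T_AVG, T_DEV))  # Figure 3(b) metrics
print(steer(1.7, 0.9, "narrow", T_AVG, T_DEV))  # Figure 3(c) metrics
```

Keeping the current cluster in the insensitive case also matches the observation that the metric calculation delay is hidden when consecutive windows receive the same decision.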
We select a mix of applications from the MediaBench
and SPEC2000 benchmark suites. Tables 2 and 3 specify the
benchmarks along with the input data set and simulation
window for the simulation runs.
4 Results
Figures 5 and 6 show the normalized IPC with different
architectures.
The difference between the IPC-profile based and the
DDT-IDC based instruction steering mechanisms is clearly
demonstrated through the examples of adpcm and adpcm2.
The relative performance of the IPC-profile based scheme
(1.06 and 1.03, respectively) is worse than using the 2-way
pipeline alone for these applications (1.13 and 1.07, respec-
tively). By contrast, the DDT-IDC scheme measures the
dynamic average IDC value and deviation and selects the
2-way pipeline most of the time. The resulting performance is
similar to that of using the 2-way pipeline alone.
The performance potential of the heterogeneous archi-
tecture is best demonstrated by gsm2. We collected the
number of cycles spent on each 16-instruction window for
the 6-way and 2-way monolithic architectures. We then
took the smaller of the two cycle counts for
each instruction window and calculated the “combined”
IPC. The result is a 29.7% improvement over the 8-way
machine, which is closely matched by the heterogeneous
approach. Using the heterogeneous clustered architecture
effectively reduces the fraction of the time when the fetch
queue is full (from 46.1%, 46.7%, and 37.7% for the 8-way,
6-way, and 2-way superscalar, respectively, to 16.4% for the
heterogeneous cluster).
Figure 5. Normalized IPC for MediaBench. Two bar-chart panels covering the MediaBench runs (adpcm through gsm2, and ghostscript through pegwit3, plus HM); y-axis from 0 to 1.4; series: 8 Way, 6 Way, 2 Way Fast, 2 4-Way, IPC Profile Steer, DDT Commit Steer.
Figure 6. Normalized IPC for SPEC2000. Bar chart covering bzip2, cc1, gzip, mcf, parser, vortex, vpr, and HM; y-axis from 0 to 1.4; same series as Figure 5.
Table 2. MediaBench benchmark applications.
Benchmark     Datasets   Simulation window
adpcm         ref        encode (6.6M), decode (5.5M)
epic          ref        encode (53M), decode (6.7M)
jpeg          ref        compress (15.5M), decompress (4.6M)
g721          ref        encode (0–200M), decode (0–200M)
gsm           ref        encode (0–200M), decode (0–74M)
ghostscript   ref        0–200M
mesa          ref        mipmap (44.7M), osdemo (7.6M), texgen (75.8M)
mpeg2         ref        encode (0–200M), decode (0–171M)
pegwit        ref        encrypt key (12.3M), encrypt (32.4M), decrypt (17.7M)
Table 3. SPEC2000 integer benchmark applications.
Benchmark   Datasets    Simulation window
bzip2       source 58   1000M–1100M
gcc         166.i       2000M–2100M
gzip        source 60   1000M–1100M
mcf         ref         1000M–1100M
parser      ref         1000M–1100M
vortex      ref         1000M–1100M
vpr         ref         1000M–1100M
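The “combined” IPC estimate computed for gsm2 can be sketched as follows. The sketch assumes that both per-window cycle counts are expressed in the same clock units so that they are directly comparable; the function name and the sample cycle counts are hypothetical and do not reproduce the measured data.

```python
def combined_ipc(wide_cycles, narrow_cycles, window_size=16):
    """IPC obtained by taking, per window, the faster of two pipelines.

    wide_cycles[i] and narrow_cycles[i] are the cycles spent on window i
    by each monolithic pipeline, in a common clock unit (an assumption).
    """
    if len(wide_cycles) != len(narrow_cycles):
        raise ValueError("both runs must cover the same windows")
    best = [min(w, n) for w, n in zip(wide_cycles, narrow_cycles)]
    return window_size * len(best) / sum(best)


# Hypothetical cycle counts for four 16-instruction windows.
wide = [8, 4, 10, 5]
narrow = [6, 7, 6, 9]
ipc = combined_ipc(wide, narrow)
print(ipc, 64 / sum(wide), 64 / sum(narrow))
```

Because the minimum is taken independently for each window, the combined figure is never lower than the IPC of either pipeline used alone.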
Overall, the DDT-IDC instruction steering scheme
achieves a performance improvement of 4.3%, slightly bet-
ter than the improvement of 3.9% from the IPC-profile
based instruction steering mechanism.
The homogeneous cluster of two 4-way pipelines does
not perform very well mainly due to the communication
cost. Figures 7 and 8 show the performance differences be-
tween the same homogeneous clustered architecture with
and without a 1-cycle communication cost for different ap-
plications. The abnormal case of mpeg22 is due to the
branch predictor performance change (prediction accuracy
decreased from 95.6% to 93.3% when removing the com-
munication cost). The block based instruction steering
schemes effectively remove most communication cost.
Figure 9 shows the utilization of the 6-way and 2-way
clusters using different steering mechanisms. For the SPEC2000
integer applications shown in Figure 6, the 6-way cluster
shows the same performance as the baseline 8-way architecture,
which reflects the relatively low ILP of these benchmarks.
However, this low ILP can be exploited by the 2-way
fast cluster. Using the 2-way cluster alone achieves the
best overall performance (2.9% performance improvement
over the baseline). The DDT-IDC based instruction steering
scheme uses the 2-way cluster most of the time, resulting
in performance similar to that of the 2-way cluster (2.8%
improvement over the baseline). The IPC-profile
based steering scheme performs slightly worse (2.2% over
the baseline) due to its more frequent use of the 6-way cluster.
In summary, the DDT can be used to derive an effective
steering mechanism for heterogeneous clustered microarchitectures,
with an overall performance improvement of
4.3% over a variety of integer benchmarks.
Figure 7. Homogeneous cluster performance difference for MediaBench. Two bar-chart panels covering the MediaBench runs plus HM; y-axis from 0 to 1.4; series: 2 4-Way, 2 4-Way No Comm.
Figure 8. Homogeneous cluster performance difference for SPEC2000. Bar chart covering bzip2, cc1, gzip, mcf, parser, vortex, vpr, and HM; y-axis from 0 to 1.4; series: 2 4-Way, 2 4-Way No Comm.
5 Related Work
There has been a significant body of work in homoge-
neous clustered microarchitectures. We summarize those
efforts that are most related to our heterogeneous design.
The multicluster architecture [6] distributes the register
files, dispatch queue, and functional units to multiple clus-
ters. It applies a static instruction scheduling algorithm to
assign instructions to each cluster. A conventional super-
scalar processor can be considered as two clusters, i.e., an
integer cluster and a floating-point cluster. Canal et al. pro-
pose several run-time mechanisms to distribute simple inte-
ger instructions to both integer and FP clusters [3]. In later
work, Canal et al. propose Simple Register Mapping Based
Steering (RMBS), Balanced RMBS, and Advanced RMBS
for instruction steering between two clusters [4].
Parcerisa et al. propose point-to-point interconnects for
clustered microarchitectures [11]. A topology-aware steer-
ing heuristic is designed to minimize communication cost
while keeping cluster load balanced.
Ranganathan et al. categorize decentralized execution
models into execution unit dependence based, control
dependence based, and data dependence based [12]. They
show that with a ring-type interconnect, data dependence
based decentralization performs best with a moderate
number of processing elements. When the number
of processing elements gets larger, control dependence
based decentralization performs better.
Balasubramonian et al. apply dynamic techniques to
tune the clustered architecture to better suit application
needs [1]. The techniques are based on program metrics
gathered at periodic intervals. It is also shown that
reconfiguration at basic block boundaries can further improve
system performance.
Kumar et al. integrate heterogeneous processor cores
in [9, 10]. The cores are complete designs from the Alpha family
of processors, including L1 data caches. The cores range
from simple, low-power embedded designs to the EV8 design.
In [9], the work explores reducing energy by matching
applications to the simplest (and lowest-energy) core
that offers sufficient performance. In [10], a multi-core
architecture is used to improve the performance of multithreaded
workloads. Because the caches are included with
each core, switching cores necessitates flushing dirty data
from the active cache. This overhead restricts the steering to
be very coarse-grained. Our proposed design, in contrast,
targets performance and relies on the ability to make fine-grained
steering decisions in order to exploit differences in
ILP in sections of the code.
Figure 9. Wide versus narrow core usage breakdown. Each panel plots, per benchmark, the fraction (0 to 1) of execution using the narrow pipeline versus the wide pipeline. (a) MediaBench 1, IPC-profile based steering. (b) MediaBench 1, DDT-IDC based steering. (c) MediaBench 2, IPC-profile based steering. (d) MediaBench 2, DDT-IDC based steering. (e) SPEC2K, IPC-profile based steering. (f) SPEC2K, DDT-IDC based steering.
6 Conclusion
The static microarchitecture of a conventional superscalar
processor may be poorly matched
to the ILP characteristics of the application at hand. The
result may be lower performance than can be achieved on
a pipeline that achieves high throughput for narrow streams
of instructions. We propose a heterogeneous clustered mi-
croarchitecture that attempts to bridge the gap between
these two design styles. In our design, windows of instruc-
tions are steered to either a wide pipeline optimized for high
ILP or a narrower pipeline optimized for narrow streams of
instructions. This steering approach largely hides the inter-
cluster communication penalties of prior, homogeneous,
clustered architectures. The result is up to a 30% perfor-
mance improvement compared to a conventional approach.
This paper represents an initial step in our research in
this area. For future work, we plan to investigate the dy-
namic choice of instruction window sizes, and mechanisms
to identify code regions where simultaneous use of both the
wide and narrow pipelines would yield additional perfor-
mance benefits.
References
[1] R. Balasubramonian, S. Dwarkadas, and D. Albonesi. Dy-
namically Managing the Communication-Parallelism Trade-
off in Future Clustered Processors. Proceedings of the 30th
International Symposium on Computer Architecture, pages
275–286, 2003.
[2] D. Burger and T. Austin. The Simplescalar Tool Set, Ver-
sion 2.0. Technical Report CS-TR-97-1342, University of
Wisconsin, Madison, Wisconsin, June 1997.
[3] R. Canal, J. M. Parcerisa, and A. Gonzalez. Dynamic Clus-
ter Assignment Mechanisms. Proceedings of the 6th Inter-
national Symposium on High-Performance Computer Archi-
tecture, pages 133–142, 2000.
[4] R. Canal, J. M. Parcerisa, and A. Gonzalez. Dynamic Code
Partitioning for Clustered Architectures. International Jour-
nal of Parallel Programming, pages 59–79, 2001.
[5] L. Chen, S. Dropsho, and D. Albonesi. Dynamic data depen-
dence tracking and its application to branch prediction. In
Ninth International Symposium on High-Performance Com-
puter Architecture, pages 65–76, Feb. 2003.
[6] K. Farkas, P. Chow, N. Jouppi, and Z. Vranesic. The multi-
cluster architecture: Reducing cycle time through partition-
ing. 30th International Symposium on Microarchitecture,
pages 149–159, December 1997.
[7] M. Goshima, K. Nishino, Y. Nakashima, S. ichiro Mori,
T. Kitamura, and S. Tomita. A High-Speed Dynamic In-
struction Scheduling Scheme for Superscalar Processors.
IPSJ Transactions on High Performance Computing Sys-
tems, pages 225–236, December 2001.
[8] R. E. Kessler, E. J. McLellan, and D. A. Webb. The Al-
pha 21264 Microprocessor Architecture. 1998 International
Conference on Computer Design, pages 24–36, October
1998.
[9] R. Kumar, K. Farkas, N. P. Jouppi, P. Ranganathan, and
D. M. Tullsen. Processor Power Reduction via Single-ISA
Heterogeneous Multi-Core Architectures. IEEE Computer
Architecture Letters, March 2003.
[10] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and
K. Farkas. Single-ISA Heterogeneous Multi-Core Architec-
tures for Multithreaded Workload Performance. Proceedings
of the 31st International Symposium on Computer Architec-
ture, 2004.
[11] J. M. Parcerisa, J. Sahuquillo, A. Gonzalez, and J. Duato. Ef-
ficient Interconnects for Clustered Microarchitectures. Pro-
ceedings of the 2002 International Conference on Parallel
Architectures and Compilation Techniques, pages 291–300,
2002.
[12] N. Ranganathan and M. Franklin. An Empirical Study of De-
centralized ILP Execution Models. Proceedings of ASPLOS-
VIII, pages 272–281, 1998.
