A performance comparison of several superscalar processsor [sic] models with a VLIW processor by Lenell, John & Bagherzadeh, Nader
UC Irvine
ICS Technical Reports
Title
A performance comparison of several superscalar processsor [sic] models with a VLIW 
processor
Permalink
https://escholarship.org/uc/item/1kg8b61b
Authors
Lenell, John
Bagherzadeh, Nader
Publication Date
1992
 
Peer reviewed
University of California
A Performance Comparison of Several Superscalar
Processor Models with a VLIW Processor
John Lenell and Nader Bagherzadeh
Department of Electrical and Computer Engineering
Department of Information and Computer Science
University of California, Irvine
Irvine, California 92717
Technical Report No. 92-92
A Performance Comparison of Several
Superscalar Processor Models with a VLIW
Processor
John Lenell and Nader Bagherzadeh
Department of Electrical and Computer Engineering
University of California, Irvine
Irvine, CA 92717
Abstract
Superscalar and VLIW processors can both execute multiple instructions
each cycle. Each employs a different instruction scheduling method to
achieve multiple instruction execution. Superscalar processors schedule
instructions dynamically, and VLIW processors execute statically
scheduled instructions. This paper quantitatively compares various
superscalar processor architectures with a Very Long Instruction Word
architecture developed at the University of California, Irvine. An
architectural overview and performance analysis of the superscalar
processor models and VIPER, a VLIW processor designed to take advantage
of the parallelizing capabilities of Percolation Scheduling, are
presented. The motivation for this comparison is to study the capability
of a dynamically scheduled processor to obtain the same performance
achieved by a statically scheduled processor, and to examine the hardware
resources required by each.
1 Introduction
RISC microprocessors achieve high performance by executing close to one
operation per cycle, and employing aggressive technology-dependent hardware
techniques. These techniques decrease the processor cycle time, thereby
reducing the time to perform a task. As the rate of performance improvement
due to technological advances subsides, other methods for improving processor
performance must be developed. Exploiting parallelism within the instruction
stream is one method for improving processor performance. Instruction
parallelism is exploited by executing multiple instructions concurrently.
A microprocessor executing more than one instruction each cycle improves
performance by reducing the number of cycles required to execute a program.
VLIW and superscalar processors are two promising design techniques
for executing more than one instruction each cycle. Both architectures are
RISC-like, with multiple pipelined execution units for executing instructions
in parallel. However, each architecture exploits the instruction parallelism
in a different manner.
The VLIW architecture requires statically scheduled code for instruction
parallelization. Trace or Percolation Scheduling (PS) compilers perform global
code optimization for the VLIW and schedule the optimized code into groups
of operations which are fetched simultaneously as one instruction [5, 9]. Each
operation in the instruction word controls a single execution unit. All of the
operations in a VLIW word are executed in parallel, and results are written
to a globally shared register file [3].
Alternatively, a superscalar processor provides complex hardware resources
to detect and issue parallel instructions dynamically as they are fetched from
a linear sequence of instructions. All instructions issued in a cycle are
executed in parallel, and a global register file is updated with the
instruction results. Some superscalar architectures maintain the order of the
instruction stream at issue, while more elaborate hardware can be added to
allow out-of-order instruction issue. Additionally, superscalar processors
execute speculatively to increase the number of instructions available to
issue, and to reduce the delays associated with conditional branches.
Performance of the superscalar processor generally increases with the
complexity of the hardware as it attempts to look farther ahead into the
instruction stream to issue and execute independent instructions, out of
order, and speculatively.
The performance of both processor architectures is limited by data hazards,
control hazards, and available resources. An instruction scheduler attempts to
free an instruction from hazards and resource conflicts so it can be issued to
an execution unit in parallel with other instructions, yet independent of the
operation and result of the other instructions. The performance of a processor
is dependent upon the performance of the instruction scheduler. Instructions
can be scheduled statically (during compile time) or dynamically (at run time).
1.1 Instruction Scheduling
Static scheduling requires a complex compiler to exploit a large amount of
instruction parallelism by scheduling beyond basic blocks. The compiler
attempts to compact the output code for the target VLIW architecture so that
each instruction word field is occupied by an operation from the original
program sequence. In so doing, the compiler is constrained by data
dependencies, control dependencies, resource conflicts, and register usage. As
a result, the compiler is forced to schedule no-ops in fields for which it has
been unable to schedule valid operations. The use of no-ops in VLIWs reduces
the average number of operations executed per cycle, and can greatly increase
the size of the compiled code.
Dynamic scheduling in superscalar processors refers to the ability of the
hardware to detect and issue multiple instructions at run time. An effective
dynamic scheduler is important for a superscalar processor to maintain
parallel instruction execution. Numerous dynamic scheduling techniques have
been explored [13, 12, 14, 11]. The advantages of dynamic scheduling are the
following [6]:
• Efficient scheduling of dependencies unknown at compile time.
• Simplifies compiler design.
• Maintains code compatibility between generations of processors.
• No code expansion due to scheduling.
Unfortunately, these advantages are gained at significant hardware expense.
This expense is mitigated by limiting the number of instructions which can
be considered for scheduling each cycle. As a result, dynamic scheduling is
disadvantaged by its inability to perform global code scheduling as is done by
static scheduling.
The dynamic scheduler may be designed to issue instructions in their correct
program sequence, or to look ahead into the instruction stream and issue
instructions out of their original order. These issue policies are known as
in-order issue and out-of-order issue, respectively [7]. An in-order issue
policy is the simplest to design, but its performance suffers in the presence
of hazards and resource conflicts. Instructions are issued in program order
from the decoder until an instruction has a hazard or resource conflict with a
preceding instruction.
Instruction execution bandwidth can be increased by allowing the processor
to continue to issue instructions which follow stalled instructions. This
implies an out-of-order issue policy because the original program sequence
will not be maintained. An instruction window is used to hold instructions
after they have been decoded and are waiting to execute. All of the
instructions in the instruction window are available to the issue unit. This
allows the issue unit to look ahead in the instruction sequence and issue the
maximum number of instructions to the execution units.
The purpose of this research is to compare these two processor design
alternatives by analyzing performance and hardware requirements. For this
purpose, a scalable instruction-level processor simulator has been developed
to evaluate the performance of superscalar models. The simulator has been
designed to explore the performance of both in-order issue and out-of-order
issue policies, as well as the influence of the size of critical hardware
elements on the performance of the processor. These superscalar models are
compared with VIPER, a VLIW processor which has been developed at the
University of California, Irvine [1]. An overview of the VIPER architecture is
given in Section 2. The superscalar model is presented in Section 3, and the
following section explains the simulation methods. Section 5 discusses the
simulation results of the VIPER and superscalar models.
2 The VIPER Processor
VIPER is an integer VLIW processor which fetches a single long instruction
specifying four operations each cycle. The operations are independent and
execute in parallel on four functional units. A functional unit consists of
one or more execution units. Each functional unit includes an arithmetic/logic
unit and either a load/store or a control transfer execution unit. The
arithmetic/logic execution unit (ALU) is capable of executing all simple
integer operations, including shifts and compares. The load/store execution
unit (LS) provides the off-chip memory interface. The control transfer unit
(CT) provides the function of altering the program counter address as a result
of control transfer operations. Two ALU/LS functional units and two ALU/CT
functional units are used on the processor. Each functional unit can read two
32-bit operands from a multi-ported global register file containing 32
registers. The operation set and pipeline structure are discussed below. A
more detailed description of the VIPER architecture can be found in
Reference [1].
2.1 Operation Set
VIPER implements a RISC-type operation set. An operation is specified by
a 32-bit field of the instruction word. A total of 29 operations are defined for
the processor. These operations are divided into arithmetic/logic, load/store
and control transfer categories to facilitate the assignment of an operation to
a functional unit. The complete operation set is shown in Table 1.
The arithmetic/logic category defines the arithmetic, logic, comparison,
and shift operations. The operations are executed by the ALU execution
units.
Two instructions, load word (LDW) and store word (STW), are defined in
the load/store category. These operations perform register indirect loads and
stores, and are executed by the LS units.
Both conditional and unconditional branches are defined in the control
transfer operation category. Two types of unconditional branches are specified,
CALL and JUMP. These instructions cause the program counter to be loaded
with a target address. The target can be specified with a register or immediate
value. CALL instructions are used for procedure calling. They write the return
address into register 31. JUMP operations are a more general unconditional
branch and only cause the program counter to change to the specified address.
Conditional branches are performed with advanced conditioning. Conditions
are set with explicit compare instructions. The compare instructions set
the least significant bit of a specified register, which becomes the branch
condition code. VIPER can perform multi-way branches by testing two condition
codes per branch operation, and executing branches on up to two functional
units simultaneously [2]. The branch operations have the following form:
BRc1c2 cc1,cc2,offset
where c1 and c2 are conditions having a true or false value. The condition
codes are the least significant bits of cc1 and cc2, which each specify one
general purpose register. The offset is added to the value of the program
counter to form the target address. Four conditional branch operations are
available for testing the four possible conditions. Three-way branches can be
effected by executing two branch operations together. Three targets can be
generated in this case. If the first branch test succeeds, then the target
becomes the program counter plus the offset of the first branch. Otherwise, if
the second branch succeeds, the target becomes the program counter plus the
offset of the second branch. If both branch operations fail, then the program
counter is incremented to the next address.
Operation  Operands         Description
ADD        src1,src2,dest   integer add
SUB        src1,src2,dest   integer subtract
AND        src1,src2,dest   logical AND
OR         src1,src2,dest   logical OR
XOR        src1,src2,dest   logical exclusive-OR
NOT        src1,dest        bitwise complement
LSL        src1,src2,dest   logical shift left (variable)
LSLI       src1,#SA,dest    logical shift left (constant)
LSR        src1,src2,dest   logical shift right (variable)
LSRI       src1,#SA,dest    logical shift right (constant)
ASR        src1,src2,dest   arithmetic shift right (variable)
ASRI       src1,#SA,dest    arithmetic shift right (constant)
SEQ        src1,src2,dest   set if equal
SNE        src1,src2,dest   set if not equal
SLT        src1,src2,dest   set if less than
SLTU       src1,src2,dest   set if less than, unsigned
SGE        src1,src2,dest   set if greater than or equal
SGEU       src1,src2,dest   set if greater than or equal, unsigned
ADDI       src1,#I,dest     integer add immediate
SUBI       src1,#I,dest     integer subtract immediate
LUI        #I,dest          load upper immediate
LDW        src1,dest        load data word from M[src1]
STW        src1,src2        store data word to M[src1]
BRFF       src1,src2,#O     branch if c1=false, c2=false
BRFT       src1,src2,#O     branch if c1=false, c2=true
BRTF       src1,src2,#O     branch if c1=true, c2=false
BRTT       src1,src2,#O     branch if c1=true, c2=true
CALL       src1 or #LI      unconditional procedure call
JUMP       src1 or #LI      unconditional jump
Table 1: VIPER Arithmetic/Logic Operation Set
Pipe Stage  Description
IF          Instruction fetch.
ID          Instruction decode and operand fetch.
EX          Operation execute.
WB          Result write back.
Table 2: VIPER Pipeline Stages
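The three-way branch resolution described in Section 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tuple layout, and the word-addressed program counter (next address = pc + 1) are hypothetical and not taken from the VIPER design, and the one-cycle branch delay is ignored.

```python
def branch_test(regs, c1, c2, cc1, cc2):
    """Return True if a BRc1c2 test succeeds.

    c1 and c2 are the required truth values; cc1 and cc2 are register
    numbers whose least significant bits act as the condition codes.
    """
    return (regs[cc1] & 1) == int(c1) and (regs[cc2] & 1) == int(c2)


def next_pc(pc, regs, branches):
    """Resolve up to two branch operations from one long instruction.

    branches is a list of (c1, c2, cc1, cc2, offset) tuples in priority
    order (hypothetical layout). The first branch whose test succeeds
    supplies the target; if every test fails, execution falls through.
    """
    for c1, c2, cc1, cc2, offset in branches:
        if branch_test(regs, c1, c2, cc1, cc2):
            return pc + offset
    return pc + 1  # assumption: the next long instruction is at pc + 1


# Example: a BRTT and a BRTF issued together give three possible targets.
regs = [0] * 32
regs[1] = 1                       # condition code 1 is true
pair = [(True, True, 1, 2, 10),   # BRTT r1,r2,#10
        (True, False, 1, 2, 20)]  # BRTF r1,r2,#20
target = next_pc(100, regs, pair)  # r2 is false, so the second branch wins
```

With `regs[2]` set to 1 the first branch would be taken instead, and with `regs[1]` cleared both tests fail and the program counter simply advances.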
2.2 Pipeline Structure
VIPER has a four stage instruction pipeline. The stages are shown in Table 2.
Each stage completes in one cycle. The instruction fetch stage reads one long
instruction word each cycle and latches the instruction into the decoder at
the end of the cycle. An instruction cache miss will stall the instruction fetch
mechanism.
Operands for the operations are obtained during the instruction decode
stage from either the register file or from a functional unit through a bypassing
network. Operations are distributed to their corresponding functional unit at
the end of the cycle.
The execute stage performs all operations in a single cycle, and result
bypassing between all functional units can occur during this stage to eliminate
stalls due to data hazards. Control hazards are handled with a delayed branch
of one cycle.
Results of the functional units are written to the register file during the
write back stage. The write occurs during the first phase of the cycle, so a
read can be made to the register during the second phase in the instruction
decode stage.
2.3 The PS Compiler
A component of VIPER is the Percolation Scheduling compiler [10]. The compiler
attempts to increase the parallelism available to the processor by compacting
across basic block boundaries, performing loop pipelining, and register
renaming. The compaction process attempts to move operations as high as
possible in the program by extending the instructions horizontally. The
program is scanned in a top-down manner and instructions are moved up the
program graph if the original semantics of the program can be maintained. The
compiler also performs loop pipelining with a method called Perfect
Pipelining. Perfect Pipelining is an algorithm which pipelines general loops,
including loops with conditional jumps inside the loop body. Finally, the
compiler eliminates false dependencies due to reusing registers by employing
register renaming during the compaction process.
3 The Superscalar Processor Model
The superscalar model performs 32-bit integer operations. Multiple
instructions are fetched each cycle, and the processor is able to issue and
complete up to four instructions per cycle, requiring a register file with
eight read ports and four write ports. The processor model has 32 general
purpose registers. Memory is accessed through explicit load/store operations.
Additionally, the following architectural features are defined for the
processor model:
• Instruction Set
• Instruction Fetch Mechanism
• Branch Prediction
3.1 The Instruction Set
With the exception of the control transfer instructions, the instruction set of
the superscalar model is identical to that of the VIPER processor. The VIPER
processor is capable of performing multi-way branches, a feature not supported
by the superscalar model. Instead, the four branch operations of VIPER are
replaced by two two-way branch instructions. The two branch instructions
defined for the superscalar model are the following:
BRT src1,#O
BRF src1,#O
BRT and BRF perform branch if true and branch if false operations,
respectively, on the least significant bit of the register designated by the
src1 field. The value O is an offset added to the current program counter to
compute the branch target address.
3.2 Instruction Fetch Mechanism
Instructions are fetched n at a time (n is 2 or 4, depending on the
simulation). The program counter is always a multiple of n and contains the
address of the first instruction to be fetched; it is used to access a cache
of line size n. A line of instructions is fetched from the cache and latched
into the instruction decoder at the end of the instruction fetch stage. If a
control instruction transfers instruction flow to an instruction other than
the first of a line, the whole line containing the target instruction is
fetched. The misalignment is compensated for during decode by masking out the
instructions preceding the target of the control transfer.
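The alignment and masking step above can be illustrated with a short Python sketch. The function name and the boolean-list mask representation are assumptions made for this example; the report does not describe how the simulator encodes the mask.

```python
def fetch_line(target, n):
    """Return (line_address, valid_mask) for a fetch of n instructions.

    target is the instruction address that control transferred to.
    The fetch address is rounded down to a multiple of n, and slots
    that precede the target within the line are masked out.
    """
    line_address = target - (target % n)
    first_valid = target - line_address
    valid_mask = [slot >= first_valid for slot in range(n)]
    return line_address, valid_mask


# A branch to address 10 with n = 4 fetches the line at address 8 and
# masks the two instructions (addresses 8 and 9) that precede the target.
line, mask = fetch_line(10, 4)
```

A consequence visible in the sketch is that a misaligned branch target delivers fewer than n useful instructions in that fetch cycle.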
3.3 Branch Prediction
A dynamic branch predictor is utilized to reduce the number of branch delay
cycles and maintain instruction fetch bandwidth. Branch prediction is
implemented with a branch target buffer (BTB) [4, 8]. The program address,
branch target, and predicted direction of all control transfer instructions
are stored in the BTB.
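A minimal Python sketch of a branch target buffer holding the three fields named above follows. The report does not state the BTB size, its replacement policy, or how the predicted direction is updated, so this sketch assumes an unbounded table, a last-outcome (one-bit) direction predictor, and a fall-through prediction on a BTB miss. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy BTB: program address -> (branch target, predicted direction)."""

    def __init__(self):
        self.entries = {}

    def predict(self, pc, fall_through):
        """Return the predicted next fetch address for the branch at pc."""
        entry = self.entries.get(pc)
        if entry is None:
            return fall_through          # assumption: predict not taken on a miss
        target, predicted_taken = entry
        return target if predicted_taken else fall_through

    def update(self, pc, target, taken):
        """Record the resolved outcome (assumption: last-outcome predictor)."""
        self.entries[pc] = (target, taken)


btb = BranchTargetBuffer()
first = btb.predict(pc=40, fall_through=44)   # miss: falls through
btb.update(pc=40, target=16, taken=True)      # branch resolved as taken
second = btb.predict(pc=40, fall_through=44)  # hit: predicts the stored target
```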
3.4 Machine Configurations
To complete the description of the processor model for simulation, a machine
configuration is specified by:
• Set of Functional Units
• Dynamic Scheduling Technique
• Instruction Cache Interface
3.4.1 Functional Units
Like VIPER, each processor uses a combination of three types of execution
units for executing instructions after issue. They are ALU, LS, and CT, as
described previously in Section 2. Each execution unit can begin and complete
one instruction per cycle. Several different configurations of these execution
units will be examined during simulation, so that the configuration with the
best cost/performance ratio can be determined.
3.4.2 Dynamic Scheduling Technique
One of the following three scheduling techniques can be selected for the
machine configuration:
I-D This notation specifies a scheduler performing in-order issue from the
    instruction decoder. The decoder size is limited to the number of
    instructions fetched each cycle. Instruction fetch is stalled until all of
    the instructions have been issued from the decoder.
I-W In-order issue from a central instruction window is specified by this
    notation. The instruction decoder dynamically performs register renaming,
    and moves the instructions into the window. When the instruction window is
    full, the decoder stalls, and the instruction fetch is stalled until all
    of the instructions have been moved out of the decoder. Instructions issue
    from the window in the original program order. The number of instructions
    issued each cycle is determined by the number of available functional
    units.
O-W This notation specifies an out-of-order issue policy from a central
    instruction window. It also performs register renaming during instruction
    decode, but when the instructions issue from the window, they can be
    issued in any order.
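The difference between in-order and out-of-order issue from a window can be seen in a deliberately simplified Python model. This is not the simulator described in this report: it assumes single-cycle operations, a result usable one cycle after its producer issues, no limit on fetch bandwidth or functional unit types, and dependencies that refer only to earlier instructions. The function name and its parameters are hypothetical.

```python
def cycles_to_issue(deps, window_size, issue_width, out_of_order):
    """Count cycles needed to issue every instruction in a toy model.

    deps[i] lists the earlier instructions whose results instruction i
    needs. An instruction is ready once all of its producers issued in
    an earlier cycle.
    """
    issued_at = {}   # instruction index -> cycle in which it issued
    window = []      # decoded but unissued instructions, oldest first
    next_fetch = 0
    cycle = 0
    while len(issued_at) < len(deps):
        # Refill the window in program order.
        while next_fetch < len(deps) and len(window) < window_size:
            window.append(next_fetch)
            next_fetch += 1
        issued_now = []
        for i in window:
            if len(issued_now) == issue_width:
                break
            ready = all(p in issued_at and issued_at[p] < cycle
                        for p in deps[i])
            if ready:
                issued_now.append(i)
            elif not out_of_order:
                break  # in-order issue stops at the first stalled instruction
        for i in issued_now:
            issued_at[i] = cycle
            window.remove(i)
        cycle += 1
    return cycle


# Two independent three-instruction dependence chains, placed back to back.
chains = [[], [0], [1], [], [3], [4]]
in_order = cycles_to_issue(chains, window_size=4, issue_width=2,
                           out_of_order=False)
out_of_order = cycles_to_issue(chains, window_size=4, issue_width=2,
                               out_of_order=True)
```

In this example the in-order policy needs 5 cycles because each stalled instruction blocks the independent chain behind it, while the out-of-order policy overlaps the two chains and finishes in 3 cycles.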
3.4.3 Instruction Cache Interface
The instruction cache can be explicitly modeled by defining its miss ratio and
miss penalty. Cache misses randomly occur at the rate specified by the miss
ratio. A miss causes the instruction fetch to stall for the number of cycles
specified by the miss penalty. A 100% hit ratio is achieved by setting the miss
ratio to zero.
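The statistical cache model above can be sketched as follows. The random number source, the seed parameter, and the function name are assumptions made for this illustration; the report does not specify how the simulator draws its random misses.

```python
import random


def fetch_stall_cycles(accesses, miss_ratio, miss_penalty, seed=0):
    """Total fetch stall cycles for a number of instruction cache accesses.

    Each access misses independently with probability miss_ratio, and
    every miss stalls instruction fetch for miss_penalty cycles.
    """
    rng = random.Random(seed)  # seeded so that a run is repeatable
    misses = sum(1 for _ in range(accesses) if rng.random() < miss_ratio)
    return misses * miss_penalty


# A miss ratio of zero models a cache with a 100% hit ratio.
ideal = fetch_stall_cycles(accesses=1000, miss_ratio=0.0, miss_penalty=3)
# With a 5% miss ratio the total is close to 1000 * 0.05 * 3 on average.
typical = fetch_stall_cycles(accesses=1000, miss_ratio=0.05, miss_penalty=3)
```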
In the following sections, the performance of several different machine
configurations will be presented. The notation used for identifying the
configuration is to give the scheduling technique, and the size of the
instruction window if applicable. All other parameters will be explicitly
presented.
4 Simulation Methods
Two simulators were used for comparing the performance of the superscalar
and VIPER processors. This section describes the two simulators and the
benchmarks used during simulation.
4.1 The VLIW Simulator
A VLIW instruction-level simulator is included with the Percolation Scheduling
compiler to evaluate architectural alternatives. The simulation path for the
VIPER processor is shown in Figure 1. Two files are specified as inputs to the
simulation process, the benchmark program and the hardware configuration
file. The target architecture, VIPER, is defined in the hardware configuration
file. Each benchmark is compiled by GCC into an intermediate code which
is independent of any machine architecture. The intermediate code is
represented by a Control/Data Flow Graph (CDFG) with each node containing
one operation. The fourth step compacts the operations of the original CDFG
into a CDFG with multiple operations in each node. The target architecture
is specified for this process to provide resource-constrained scheduling. Code
generation uses the hardware configuration file and produces a machine
program which can be executed by the target architecture. An assembly language
program is the final output of the code generator. Finally, the hardware
configuration and the assembly language program are used by the simulator to
compute the execution time of the benchmark. The VLIW simulator provides
the following outputs:
• The number of cycles taken to execute the program.
• The frequency of individual operations.
• The number of no-ops executed dynamically.
[Figure 1 is a flow diagram. Legible stage labels: Configuration, GNU C Front
End, Intermediate Code, Data Flow Graph, Simulation.]
Figure 1: Simulation Path for VIPER
4.2 The Superscalar Simulator
A scalable and reconfigurable simulator has been developed to evaluate the
performance of the superscalar models described in the previous section. The
simulator executes the input program at the instruction level. The path for
generating the simulation input is shown in Figure 2. The benchmark programs
and a hardware configuration file are required inputs to the simulation
path. GCC compiles the benchmark source code into a machine-independent
intermediate code. The code generation step translates the intermediate code
into the instruction set defined for the superscalar architectures, and outputs
sequential code which is independent of the target machine's hardware
configuration. The sequential code and the hardware configuration are inputs
to the scalable simulator, and the number of cycles taken to execute the
program is the output.
4.3 Benchmarks
Ten benchmark programs have been chosen to compare the performance of
the VIPER and superscalar processors. Table 3 gives a listing and description
of the benchmarks. These benchmarks are integer programs which implement
a variety of basic algorithms.
[Figure 2 is a flow diagram. Legible labels: Benchmark Programs, Hardware
Configuration, GNU C Front End, Simulation.]
Figure 2: Superscalar Simulation Path
Benchmark  Description
binsearch  Binary search algorithm
bubble     Bubble sorting algorithm
chain      Optimal chained matrix multiplication sequence finder
factorial  Computes the factorial of numbers from 1 to n
fibonacci  Fibonacci number sequence generator
floyd      Locates shortest path in a graph using Floyd's algorithm
matrix     Matrix multiplication routine
merge      Merge sort algorithm
quicksort  Basic quicksort algorithm
sp         Locates shortest path with Dijkstra's algorithm
Table 3: Benchmark Programs
The dynamic frequency of each instruction class and the run length
distribution for the ten benchmarks are shown in Figure 3 and Figure 4,
respectively. ALU operations represent 63% of the total instructions.
Load/store instructions are 18%, and the remaining 19% of the instructions are
control transfer.
[Figure 3 is a bar chart. Vertical axis ticks run from 0.0 to 80.0; the
horizontal axis lists the benchmarks (Binsearch, Bubble, Chain, Factorial,
Fibonacci, Floyd, Matrix, Merge, Quicksort, Sp); the series are Control
Transfer, Load/Store, and ALU.]
Figure 3: Distribution of Instruction Types in the Benchmark Programs
[Figure 4 is a bar chart. Vertical axis ticks run from 0.0 to 50.0; the
horizontal axis is the Number of Instructions between Taken Branches.]
Figure 4: Run Length Distribution of Benchmark Programs
Benchmark      Speedup
binsearch      2.62
bubble         1.54
chain          2.25
factorial      2.37
fibonacci      2.61
floyd          1.94
matrix         2.33
merge          2.13
quicksort      2.04
sp             2.19
Harmonic Mean  2.15
Table 4: Benchmark Performance of VIPER
5 Results
This section presents the results of the simulations for the processors and
benchmarks described in the previous sections. For VIPER and the superscalar
processors, performance is presented as speedup over the execution time of a
scalar processor implementation.
5.1 Comparing Performance
The objective of the simulations is to find a superscalar configuration which
achieves comparable performance to VIPER, yet requires the least amount of
hardware complexity. The performance of the VIPER processor is given in
Table 4 for each of the ten benchmarks.
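The harmonic mean reported in Table 4 can be checked directly from the per-benchmark speedups. The short Python sketch below uses the standard library's `statistics.harmonic_mean`; the dictionary name is chosen only for this example.

```python
from statistics import harmonic_mean

# Per-benchmark VIPER speedups, copied from Table 4.
viper_speedup = {
    "binsearch": 2.62, "bubble": 1.54, "chain": 2.25, "factorial": 2.37,
    "fibonacci": 2.61, "floyd": 1.94, "matrix": 2.33, "merge": 2.13,
    "quicksort": 2.04, "sp": 2.19,
}

# Harmonic mean = number of benchmarks / sum of reciprocal speedups.
mean_speedup = harmonic_mean(list(viper_speedup.values()))
print(f"{mean_speedup:.2f}")  # prints 2.15
```

The harmonic mean weights the slowest benchmark (bubble) more heavily than an arithmetic mean would, which is why it is the customary way to average speedups over a fixed set of programs.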
The parameters which should be minimized to reduce hardware complexity
are the following:
• The number of instructions fetched each cycle.
• The number of execution units.
• The size of the instruction window.
Two graphs have been generated to evaluate the above parameters. The graphs
show the speedup achieved by several configurations of the superscalar model.
The speedup presented in the graphs is the harmonic mean of the speedups
for each benchmark. For these simulations, the instruction cache has a 100%
hit ratio, the instruction issue is limited to four instructions per cycle, and the
functional units contain only one execution unit.
The first graph, Figure 5, shows the speedup of superscalar processors which
fetch two instructions each cycle. Configurations with different numbers
of execution units are distributed along the horizontal axis. Speedup ranges
from a low of 1.22 to a high of 1.78. Comparing Table 4 and Figure 5 shows
the VIPER processor has at least a 17% performance margin over any of the
superscalar configurations.
Thus, fetching two instructions per cycle does not supply the superscalar
processors with adequate instruction fetch bandwidth, so this parameter
is raised to a limit of four instructions per cycle to improve performance.
Figure 6 graphs the speedup achieved by the superscalar configurations which
fetch four instructions each cycle. Points along the horizontal axis represent
configurations with different numbers of execution units. Peak performance is
achieved by the out-of-order issue scheduling model with a sixteen-entry
instruction window; however, the performance of the same scheduling technique
with an eight-entry instruction window is nearly the same. The slight
performance improvement gained by doubling the instruction window size from
8 to 16 is not justifiable, so the graph shows that performance close to
that of VIPER is achieved by a superscalar model with an out-of-order
scheduler and an eight-entry instruction window. This is an interesting point
because a VLIW is designed to exploit a large amount of parallelism by
performing global code optimizations, yet the superscalar model can achieve
nearly the same performance with a very small amount of lookahead ability.
The results have shown that the superscalar model must fetch four
instructions each cycle, perform out-of-order instruction issue, and use an
eight-entry instruction window to achieve near equal performance to VIPER.
Next, the best execution unit configuration can be determined. From Figure 6,
four points along the horizontal axis are shown to have approximately the same
peak performance value for the O-W8 curve. Table 5 summarizes these
configurations and the speedup achieved by each. This table shows the same
performance can be achieved with three or four ALUs. Also, the performance
gained by an additional control transfer unit is less than 1% in either case.
Considering these tradeoffs, the superscalar model should be configured with
three ALUs, two load/store units, and one control transfer unit.
[Figure 5 is a line graph. Vertical axis: speedup, with ticks from 1.00 to
1.80; horizontal axis: Execution Units (ALU.LS.CT); curves: I-D, I-W4, I-W8,
O-W8, O-W16, O-W32.]
Figure 5: Speedup of Superscalar Processors Fetching 2 Instructions/Cycle
ALUs  LS units  CT units  Speedup
3     2         1         2.10
3     2         2         2.11
4     2         1         2.10
4     2         2         2.11
Table 5: Performance Summary for the O-W8 Scheduling Model
[Figure 6 is a line graph. Vertical axis: speedup, with ticks from 1.0 to
2.2; horizontal axis: Execution Units (ALU.LS.CT), with configurations 111,
211, 212, 221, 222, 311, 312, 321, 322, 411, 412, 421, 422; curves: I-D,
I-W4, I-W8, O-W8, O-W16, O-W32.]
Figure 6: Speedup of Superscalar Processors Fetching 4 Instructions/Cycle
Instruction Window Size         8
Instructions Fetched per Cycle  4
Instructions Issued per Cycle   3
Instruction Issue Policy        Out-of-order
Arithmetic/Logic Units          3
Load/Store Units                2
Control Transfer Units          1
Table 6: A Superscalar Configuration with Performance Comparable to VIPER
For the previous results, each execution unit was considered to be one
functional unit, which requires two operand busses from the instruction window
to each execution unit. The number of operand busses can be reduced by
combining multiple execution units into functional units. The number of
functional units must be at least as great as the number of instructions which
can be issued each cycle. Currently, the number of instructions issued each
cycle is four, so the number of functional units must be at least four. A
significant reduction in hardware can be achieved by grouping the six
execution units into three functional units, and limiting instruction issue to
three instructions. This would reduce the number of ports to the instruction
window, the number of busses for distributing instructions and operands to
functional units, and the number of bypassing networks to the functional
units. The functional units can each be configured with one ALU, and one LS or
CT execution unit. Performing the benchmark simulations with this
configuration results in a mean speedup of 2.07. This is a performance loss of
only 1%, which makes this configuration a good design alternative. The
resulting superscalar configuration is listed in Table 6. VIPER has a 4%
performance advantage over this superscalar model.
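The arithmetic behind this tradeoff can be written out as a short Python sketch. The operand bus counts are derived here from the statement that each functional unit needs two operand busses; they are not figures reported in the text, and all names are chosen for this example.

```python
OPERAND_BUSSES_PER_FUNCTIONAL_UNIT = 2

# Before grouping: each of the six execution units (3 ALU, 2 LS, 1 CT)
# is its own functional unit. After grouping: three functional units.
busses_before = (3 + 2 + 1) * OPERAND_BUSSES_PER_FUNCTIONAL_UNIT
busses_after = 3 * OPERAND_BUSSES_PER_FUNCTIONAL_UNIT

# Mean speedups taken from Table 4, Table 5, and the grouped configuration.
viper, ungrouped, grouped = 2.15, 2.10, 2.07

loss_pct = (ungrouped - grouped) / ungrouped * 100   # cost of grouping
viper_advantage_pct = (viper - grouped) / grouped * 100

print(busses_before, busses_after)                          # prints 12 6
print(round(loss_pct, 1), round(viper_advantage_pct, 1))    # prints 1.4 3.9
```

The unrounded values, about 1.4% and 3.9%, correspond to the 1% loss and 4% advantage quoted above.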
5.2 The Instruction Cache Penalty
Superscalar and VLIW processors respond very differently to an instruction
cache miss. Due to dynamic scheduling, when instruction fetching stops as
a result of a cache miss, the superscalar processor may take several cycles
to issue all of the instructions in the instruction window. If the cache miss
penalty can be paid before the window is emptied, then the performance of
the superscalar processor will not be affected by the cache miss. In the case
of VIPER, it executes one long instruction each cycle, and must fetch one
instruction each cycle to sustain execution. A cache miss causes VIPER to
stall until instruction fetch can continue. Therefore, the VIPER processor
must pay every cache miss penalty, but the superscalar processor will pay only
a fraction of the cache miss penalty.
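One way to picture this overlap is a back-of-the-envelope model in which the instruction window drains at a constant rate while the miss is serviced. The constant issue rate, the function name, and its parameters are assumptions introduced for this sketch; the report obtains superscalar results by simulation, not from a formula of this kind.

```python
import math


def exposed_stall(miss_penalty, window_occupancy, issue_rate):
    """Stall cycles visible to execution after one instruction cache miss.

    window_occupancy instructions are waiting when the miss occurs and
    issue_rate of them (assumed constant and positive) issue per cycle.
    Only the part of the penalty that outlasts the drain is exposed.
    """
    drain_cycles = math.ceil(window_occupancy / issue_rate)
    return max(0, miss_penalty - drain_cycles)


# Superscalar: eight buffered instructions hide a three-cycle miss.
hidden = exposed_stall(miss_penalty=3, window_occupancy=8, issue_rate=2)
# VLIW-like case: nothing is buffered, so the full penalty is paid.
full = exposed_stall(miss_penalty=3, window_occupancy=0, issue_rate=2)
```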
The benchmark performance of VIPER with a real instruction cache is
computed by adding the number of instruction cache stall cycles to the total
number of cycles (A/p) required to execute the benchmark without an instruc
tion cache. This sum is used to calculate the new speedup value {speedup')
of the processor for each benchmark. The number of instruction cache stall
cycles is calculated as follows:
Instruction cache stall cycles = A •R • P
where
A = Instruction cache accesses per program
R = Instruction cache miss ratio
P = Instruction cache miss penalty
The VIPER processor accesses the cache every cycle, so A is equal to $N_p$,
and the total number of cycles required to execute the benchmark including
instruction cache stall cycles can be expressed as:
$$N'_p = N_p + N_p \cdot R \cdot P = N_p \cdot (1 + R \cdot P)$$
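The cycle accounting above can be sketched in a few lines of Python. This is a minimal illustration, not code from the report: the function names are hypothetical, and the sample values (1000 base cycles, a 5% miss ratio, a 3-cycle penalty) are assumptions chosen only to exercise the formulas.

```python
def icache_stall_cycles(accesses, miss_ratio, miss_penalty):
    """Instruction cache stall cycles = A * R * P."""
    return accesses * miss_ratio * miss_penalty


def viper_total_cycles(base_cycles, miss_ratio, miss_penalty):
    """Total cycles N'_p = N_p * (1 + R * P).

    VIPER accesses the cache every cycle, so the access count A
    equals the base cycle count N_p.
    """
    return base_cycles + icache_stall_cycles(base_cycles, miss_ratio, miss_penalty)


# Hypothetical example values, not taken from the report.
stalls = icache_stall_cycles(1000, 0.05, 3)
total = viper_total_cycles(1000, 0.05, 3)
print(stalls, total)
```

Because A is tied to the cycle count, the stall term scales with the length of the program, which is why the penalty can be folded into a single factor of (1 + R * P).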
The new speedup value is computed as:
$$\text{speedup}' = \frac{N}{N'_p} = \frac{N}{N_p \cdot (1 + R \cdot P)}$$
where $N$ is the number of cycles required to execute the benchmark on a scalar
processor. This equation can be written in terms of the speedup computed
without the instruction cache by substituting speedup for $N / N_p$. The resulting
equation is:
$$\text{speedup}' = \frac{\text{speedup}}{1 + R \cdot P}$$
which shows the performance of the VLIW processor degrades linearly as a
function of R and P. Table 7 lists the resulting speedup for some typical cache
penalties and miss ratios.

Miss Ratio                            3%     4%     5%     6%
Speedup with 2 cycle miss penalty    2.03   1.99   1.95   1.92
Speedup with 3 cycle miss penalty    1.97   1.92   1.87   1.82
Speedup with 4 cycle miss penalty    1.92   1.85   1.79   1.73
Table 7: Speedup of VIPER with Instruction Cache Penalty
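As a rough check, the cached speedup formula can be evaluated over the miss ratios and penalties of Table 7. The sketch below is illustrative only. The base speedup of 2.15 is an assumption, inferred from the 4% advantage over the 2.07 superscalar speedup stated in Section 5.1; the report's exact cache-free figure may differ. The function and constant names are hypothetical.

```python
def cached_speedup(speedup, miss_ratio, miss_penalty):
    """speedup' = speedup / (1 + R * P)."""
    return speedup / (1.0 + miss_ratio * miss_penalty)


# ASSUMPTION: cache-free VIPER speedup, inferred as roughly 2.07 * 1.04.
BASE_SPEEDUP = 2.15
MISS_RATIOS = (0.03, 0.04, 0.05, 0.06)
MISS_PENALTIES = (2, 3, 4)

# One row per miss penalty, one entry per miss ratio, rounded to two places.
table = {
    penalty: [round(cached_speedup(BASE_SPEEDUP, ratio, penalty), 2)
              for ratio in MISS_RATIOS]
    for penalty in MISS_PENALTIES
}

for penalty, row in table.items():
    print(f"{penalty}-cycle miss penalty:", row)
```

With that assumed base value, the rounded results agree with the entries of Table 7.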
The performance of the superscalar processor with an instruction cache
must be found through simulation because it cannot be computed from the
above equations. The two factors, A and P, in the equation given above for
calculating instruction cache stall cycles cannot be determined statically for the
superscalar processor. The number of instruction cache accesses per program
is not equal to the number of cycles required to execute the program as it
was for the VLIW processor. A superscalar processor might stall instruction
fetch as a result of a non-empty decoder, and these stall cycles cause the
number of instruction cache accesses to be less than the number of execution
cycles. Also, the average instruction cache miss penalty should be less than
the maximum instruction cache miss penalty for the superscalar processor. If
the superscalar processor is kept busy issuing instructions from the instruction
window during the instruction cache miss cycles, then the miss penalty is zero.
The performance of the superscalar processor will only be adversely affected
by the instruction cache miss when the instructions in the window are depleted
before the end of the miss penalty cycles.
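The window-draining argument can be illustrated with a toy model. This is a rough sketch and not the simulator used for the report: it assumes the window drains at a steady issue rate and ignores data dependences, and the function name and parameters are hypothetical.

```python
import math


def exposed_miss_penalty(miss_penalty, window_occupancy, issue_rate):
    """Cycles of a cache miss that are NOT hidden by issuing from the window.

    Simplifying ASSUMPTION: the instructions already in the window issue
    at a constant `issue_rate` per cycle, with no dependence stalls.
    """
    if issue_rate <= 0:
        raise ValueError("issue_rate must be positive")
    # Number of cycles the processor can stay busy without fetching.
    busy_cycles = math.ceil(window_occupancy / issue_rate)
    return max(0, miss_penalty - busy_cycles)


# Hypothetical cases: 8 buffered instructions, 3 issued per cycle.
print(exposed_miss_penalty(2, 8, 3))  # miss fully hidden
print(exposed_miss_penalty(4, 8, 3))  # window empties before the miss ends
print(exposed_miss_penalty(4, 0, 3))  # nothing buffered: full penalty is paid
```

The last case corresponds to the VLIW situation described earlier, where no buffered work exists and every miss cycle is a stall cycle.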
The results of the cached simulations are shown in Figures 7, 8, and 9
along with the values computed for VIPER from Table 7. The values used
for the miss ratios range from 3-6% and are shown along the horizontal axis.
Figures 7-9 show the results for a cache with two-, three-, and four-cycle
miss penalties, respectively. These graphs show that the superscalar processor
model described in Table 6 outperforms VIPER when the product of the miss
ratio and miss penalty exceeds 0.12. However, if the instruction cache
can be designed with a small miss ratio and miss penalty, then the VIPER
processor will continue to perform slightly better than the superscalar model.
6 Conclusion
The simulation results show that VIPER and a superscalar model which performs
out-of-order issue can achieve similar performance on the selected benchmarks.
It has also been shown that the superscalar processor requires only a
small lookahead window to exploit the same amount of fine grain parallelism
as the VIPER processor. After taking into account the instruction cache interface,
the superscalar demonstrated an ability to continue instruction issue and
execution despite instruction cache misses, whereas the performance of VIPER
degraded linearly with an increasing miss ratio and miss penalty product. As
the miss ratio and miss penalty increase, the performance of the superscalar
model exceeds the performance of VIPER.
References
[1] A. Abnous. Architectural design and analysis ofa VLIW processor. Mas
ter's thesis. University of California, Irvine, 1991.
[2] A. Abnous, R. Potasman, N. Bagherzadeh, and A. Nicolau. A percola
tion based VLIW architecture. In Proceedings of the 1991 International
Conference on Parallel Processing^ pages 144-148, 1991.
[3] Robert Colwell, Robert Nix, John O'Donnell, Cavid Papworth, and Paul
Rodman. A VLIW architecture for a trace scheduling compiler. IEEE
Transactions on computers, 37:967-979, 1988.
[4] Pradeep Dubey and Michael Flynn. Branch strategies: Modeling and op
timization. IEEE Transactions on Computers, 40(10):1159—1167, October
1991.
[5] J. A. Fisher. Trace scheduling: A technique for global microcode com
paction. IEEE Transactions on Computers, C-30:478-490, July 1991.
[6] John Hennessey and David Patterson. Computer Architecture A Quanti
tative Approach. Morgan Kaufmann Publishers, Inc., San Mateo, 1990.
23
A V
k
[7] Mike Johnson. Superscalar Microprocessor Design. Prentice Hall, Engle-
wood Cliffs, 1991.
[8] Johnny Lee and Alan Smith. Branch prediction strategies and branch tar
get buffer design. IEEE Computer Magazine, 17(l):6-22, January 1984.
[9] A. Nicolau. Percolation scheduling: A parallel compilation technique.
Technical Report 85-678, Cornell Universisty, 1985.
[10] R. PotcLsman. Percolation-Based Compiling for Evaluation of Parallelism
and Hardware Design Trade-Offs. PhD thesis, Universisty of California,
Irvine, 1991.
[11] James Smith and Andrew Pleszkun. Implementation of precise interrupts
in pipelined processors. In Proceedings of the 12th Annual International
Symposium on Computer Architecture, pages 36-44, June 1985.
[12] Gurndar Sohi. Instruction issue logic for high-performance, interruptible,
multiple functional unit, pipelined computers. IEEE Transactions on
Computers, 39(3):349-359, March 1990.
[13] R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic
units. IBM Journal of research and Development, January 1967.
[14] Shlomo Weiss and James Smith. Instruction issue logic in pipelined su
percomputers. IEEE Transactions on Computers, c-33(ll):1013-1022,
November 1984.
[Plot: speedup versus Miss Ratio (%); series 0-W8 and VIPER]
Figure 7: Speedup with Instruction Cache Miss Penalty = 2
[Plot: speedup versus Miss Ratio (%); series 0-W8 and VIPER]
Figure 8: Speedup with Instruction Cache Miss Penalty = 3
[Plot: speedup versus Miss Ratio (%); series 0-W8 and VIPER]
Figure 9: Speedup with Instruction Cache Miss Penalty = 4
