CycleCounter: an Efficient and Accurate UltraSPARC III CPU Simulation Module by Strazdins, Peter
TR-CS-05-01
March 2005
Joint Computer Science Technical Report Series
Department of Computer Science
Faculty of Engineering and Information Technology
Computer Sciences Laboratory
Research School of Information Sciences and Engineering
This technical report series is published jointly by the Department of
Computer Science, Faculty of Engineering and Information Technology,
and the Computer Sciences Laboratory, Research School of Information
Sciences and Engineering, The Australian National University.
CycleCounter: an Efficient and Accurate UltraSPARC III
CPU Simulation Module
Peter Strazdins
May 13, 2005
Abstract
This paper presents a novel technique for cycle-accurate simulation of the Central Processing Unit
(CPU) of a modern superscalar processor, the UltraSPARC III Cu processor. The technique is based on
adding a module to an existing fetch-decode-execute style of CPU simulator, rather than the traditional
method of fully implementing the CPU pipeline and microarchitecture. The main functions of the module
are the simulation of instruction grouping, register interlocks and the store buffer; it has a simple
table-driven implementation which permits easy modification for exploring microarchitectural variations. The
technique results in a 15–30% loss of simulation speed, instead of the 10× or greater performance loss incurred by
fully implementing the detailed microarchitecture. The accuracy of the technique is validated against an
actual UltraSPARC III Cu processor, and it achieves high levels of accuracy in cases of interest.
1 Introduction
Architectural performance analysis is an increasingly important technique in modern computer systems
design [2]. Its main component is called simulation, where a model of the system is made; usually this
model can reproduce the functional and, in the case of what is termed execution-driven simulation, the
intended timing behaviour of the system. For symmetric multiprocessors (SMPs), the timing accuracy of
the simulation is particularly important because the relative timing of events on the different processors
affects program behaviour as well as performance.
Thus, in order to perform detailed simulation of a single CPU, or of an SMP system, accurate
modelling of the timing characteristics of the CPU is required. This means that the simulated CPU’s notion
of time, as represented by its clock, must accurately reflect that of the actual processor being simulated.
An example of a project requiring detailed SMP system simulation is the CC-NUMA Project [1].
Here accurate performance evaluation of threaded computational chemistry computations is required. As
these are memory-intensive, detailed memory system simulation (including NUMA effects), down to the
explicit modelling of pipelined shared memory transactions, is performed. If the agent injecting events into
the memory system (in this case, the simulated CPU) is not accurate, the timing accuracy of the simulated
memory system would be wasted.
The traditional technique used is simulation of the full CPU pipeline. In a modern, post-RISC CPU,
this involves explicit modelling of all stages of the main execution pipeline, that of any sub-pipelines,
instruction grouping and (possibly) reordering, and branch prediction logic, together with the resolution
of any dependencies between instructions. While this offers potentially the most accurate model, it is
substantially more complex to implement and results in a 10× or greater loss of simulation speed [7, 8], as
compared with the fetch-decode-execute style of CPU simulator.
However, it is possible to accurately predict performance, in terms of the number of clock cycles required
for the execution of a sequence of machine instructions, using techniques which model only as much
of the CPU's execution pipeline as is necessary for this purpose. In particular, the resource allocation of
the functional units, the register interlocks and the store buffer need to be modelled.
For this purpose, 100% accuracy is not necessary; approximations and simplifications can be tolerated,
for the sake of performance, provided they have only a small impact on accuracy for situations of interest.
In this paper, we will show how accurate CPU simulation can be implemented as a module which can
be added to an existing fetch-decode-execute simulator, called Sparc-Sulima [4], of the UltraSPARC III Cu
processor [11]. The implementation is table-driven, and hence can be easily changed to explore the effects
of varying instruction execution characteristics.
Related work is discussed in Section 2. Section 3 gives background on the aspects of the UltraSPARC III Cu
execution pipeline relevant to this work. Section 4 describes the design of the cycle counting module
and its incorporation into the framework of Sparc-Sulima, with the calibration and performance of the module being given in
Section 5. Extensions and outstanding issues for the module are given in Section 6, with conclusions being
given in Section 7.
2 Related Work
Related work includes a number of cycle-accurate CPU simulators which simulate the full CPU pipeline,
of which the MIPS-based SimOS/MXS [8], SPARC-based RSIM [7] and SimpleScalar [3] are well known
examples. The cost of fully pipelined cycle-accurate simulation in these cases is a slowdown of the order
of a thousand in the case of SimpleScalar, and several thousand in the case of SimOS/MXS, as compared
with several hundred for fetch-decode-execute simulators such as SimOS/Mipsy [8].
An interesting approach to speeding up cycle-accurate CPU simulation is the technique of memoization
in conjunction with direct execution [9]. The resulting simulator, FastSim, is reported to be approximately
10× faster than SimpleScalar [9]. While this is an impressive result, there are several drawbacks to
this approach. Firstly, it is a very complex technique that must be added to an already complex (fully
pipelined) simulator¹. Secondly, for memoization to be fully effective, a host processor with a very large
memory is required. Thirdly, and most importantly, these techniques would lose accuracy if applied in an
SMP context. This is because both direct execution and memoization result in the simulation effectively
skipping over a number of cycles. While this permits large speedups to be achieved, it means that
accurate interleaving of memory events is not possible.
While such simulators are termed cycle-accurate, most studies on simulator accuracy compare simulators,
or equivalent techniques, with each other, and there are very few studies that actually calibrate these
simulators against real architectures. Indeed, in the case of RSIM and FastSim, the simulator executes
a SPARC-like instruction set with a MIPS-based micro-architecture and thus cannot even be calibrated
against any existing system. A study on SimOS/MXS configured for the FLASH system against the actual
FLASH system has revealed that a cycle accurate simulator may not be accurate at all unless it is calibrated
and debugged against an actual existing system [5].
Recently, an approach more similar to ours has been taken with the Sam CMT Simulator kit [6], a series
of modules which can be plugged into the SimICS simulator [12] in order to simulate UltraSPARC-based
chip-multithreaded processors such as the Niagara. While simulation of chip-multithreading seems to be of
principal interest, it claims to permit accurate CPU simulation by means of an instruction timer module,
which can vary the latency of a particular instruction type. However, no details are given, and it is unclear
whether the module takes into account instruction grouping and register dependency effects.
3 UltraSPARC III Cu Instruction Execution Characteristics
The UltraSPARC III Cu is a 4-way superscalar processor, in which instructions are dispatched, but do not
necessarily complete, in program order [11]. Its functional units include 2 ALU pipelines (for the execution
of simple integer instructions), a floating-point/graphics multiply (FGM) and add pipeline (FGA), a branch
prediction pipeline (BP) and a memory/special instruction (MS) pipeline. Each instruction within a group
consumes at least one of the pipelines, and no two instructions can share a pipeline.
The MS pipeline is used for complex instructions such as integer multiply and instructions with variable
latencies such as load instructions; in this case, the instruction is recirculated into the main execution
pipeline until the operation completes. These instructions thus have a blocking latency Lb ≥ 0 which
¹ As a result of this, the authors have begun work on providing automatic support for this approach [10].
    FetchInstr(pc);            // pc = value of the Program Counter
    di = DecodeInstr(CI);      // di is a decoded instruction structure
    if (cycleCount)            // a cycle counter module is installed
        clock += cycleCount->InstrCountCycles(di.opcode, &di.operands, clock);
    else
        clock += 1;
    if (itracer)               // instruction tracing module is installed
        itracer->TraceInstr(clock, pc, CI, di);
    clock += (*evalInstrFunctionTable[di.opcode])(&di.operands);
Figure 1: Simplified body of Sparc-Sulima run loop, indicating the incorporation of a cycle counting
module
blocks the execution of all future instructions for Lb cycles. With the exception of load and store opera-
tions, most instructions using the MS pipeline cause the breaking of an instruction group (before or after
the instruction), or must execute in a group of their own.
ALU instructions have a latency of one cycle; FGM and FGA instructions generally have latencies
of 3 or 4 cycles. These latencies, termed locking latencies, only delay the dispatch of future instructions
using the output register operand of the current instruction. MS instructions can have a locking latency
as well as a blocking latency.
The UltraSPARC III Cu has 32 general purpose registers (available in the current register window). It
also has 32 double precision floating point registers; the first 16 of these are also used to implement 32
single precision floating point registers.
Load instructions to floating point registers can be serviced from a special prefetch cache (P-cache)
instead of the normal top-level data cache. As the P-cache is virtually indexed and tagged, a hit requires
no address translation; thus a load floating point instruction can be serviced by one of the ALU pipelines,
rather than the MS pipeline. This permits two floating point loads to execute within the same group. The
CPU may steer such an instruction if a hit to the P-cache is predicted, i.e. the load hit the P-cache on its
previous execution. However, details such as whether the CPU will always steer such an instruction, and
what occurs on a P-cache miss, are not clear from the documentation [11].
4 Design
This section describes the design of an efficient cycle counter module for the UltraSPARC III Cu.
4.1 Incorporation in the Fetch-Decode-Execute Loop
Sparc-Sulima simulates the execution of a program using a fetch-decode-execute ‘run loop’, as indicated
in Figure 1. An instruction evaluation function table, indexed by the opcode, is used to perform the execute
stage. To improve the speed of the call DecodeInstr(CI), a cache of recently decoded instruction
structures is maintained [4]. The variable itracer is a pointer to an instruction tracing module; when
this is non-null, this module can be used to print the current instruction.
The CPU’s clock is simulated by the variable clock; this can be updated from a memory system
stall upon instruction fetch (not shown in Figure 1) or execution. The clock is also nominally updated by
L = 1 cycle upon executing each instruction. However, if a cycle counter module is installed (the pointer
cycleCount is non-null), this latency L can be varied: if the current instruction is the first in its group,
L = Lb + 1, where Lb is the blocking latency of the previous group. If it is not the first, then L = 0 if no
dependencies arise through input register availability; otherwise L > 0 will be the minimum time for that
dependency to be resolved.
Thus, the value of clock when TraceInstr() (and the instruction evaluation function) is called
represents the time when the instruction is actually dispatched for execution. If the instruction is a load or
store, this corresponds to the time the memory system first ‘sees’ the instruction.
clock: pc: opcode: operands
297621: 0x19298: ldd [%g1],%f2
297621: 0x1929c: fmuld %f4,%f18,%f6
297621: 0x192a0: faddd %f14,%f8,%f14
297621: 0x192a4: add %g1,%o4,%g1
297622: 0x192a8: ldd [%g2+%i1],%f18
297622: 0x192ac: fmuld %f4,%f20,%f8
297622: 0x192b0: faddd %f16,%f10,%f16
297623: 0x192b4: ldd [%g2+%i3],%f20
297623: 0x192b8: fmuld %f4,%f24,%f10
297623: 0x192bc: faddd %f2,%f12,%f2
297624: 0x192c0: ldd [%g2+%i0],%f24
297624: 0x192c4: fmuld %f4,%f26,%f12
297625: 0x192c8: faddd %f4,%f6,%f4
297626: 0x192cc: ldd [%i2],%f26
...
Figure 2: Instruction trace from an optimized matrix-vector multiply program, showing the effect of the
cycle counter
Figure 2 shows an example instruction trace, where the cycle counter has been installed to update the
processor clock. Empty lines separate instruction groups. As the code is highly optimized, a load, floating
point multiply and add operation occur on almost every cycle; an exception is at pc=0x192c8, where a
dependency on register %f6 (the destination of the instruction at pc=0x1929c) causes a 1-cycle delay. Note that this
delays the instruction with the dependency, and all later instructions in the group [11, section 4.4.1].
One approximation arises so far in the design: with respect to the memory system, the differences in
the value of clock between FetchInstr(pc) and the instruction evaluation function would be smaller
than on a real machine. However, for memory systems having top-level instruction and data caches with
independent pathways, this can only make a difference when misses occur to both. In section 6.5, we will
describe how this can be overcome, at the cost of introducing some complexity.
An alternative design not requiring a cycle counting module would be for each of the instruction eval-
uation functions to calculate L and incorporate this in their return values. While this has potential efficien-
cies in that much of the information required would be more readily available from these functions, this
presents software engineering difficulties as many functions would have to be modified (over 300, in the
case of Sparc-Sulima), and common data structures for the purpose of cycle counting would still have to
be maintained by each one. Furthermore, modifying instruction execution characteristics (e.g. for future
versions of UltraSPARC) would be difficult.
All latencies associated with the execution of the current instruction are accumulated into the return
value of InstrCountCycles(). The following sections describe how these latencies may be efficiently
computed.
4.2 Table-Driven Design
The following data associated with each instruction needs to be recorded for the purpose of cycle counting:
the functional units (pipelines) used by the instruction, whether the instruction causes a group break (before
and/or after its execution), its blocking latency (Lb), the latency for its destination register, and the types of
the register operands (two source and one destination²).
It is natural to encode this information in tabular form. Noting the large number of instructions (opcodes)
for the case of Sparc-Sulima, and observing that many of them share the same characteristics with
respect to the above information, our design uses two tables: one mapping opcodes to opcode categories,
and another mapping opcode categories to the above information. While this creates an extra lookup
step per instruction, the extra overhead is small compared with the total required work. Moreover, as
there are only some 40 opcode categories, the overall space requirement is reduced, and so the extra overhead
may be offset by reduced cache misses.
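The two-level lookup described above can be sketched as follows. This is a minimal illustration only: the type names, field names, opcodes and categories are hypothetical and are not taken from Sparc-Sulima, and the latency values are placeholders chosen to agree with Section 3 (one cycle for ALU instructions, four cycles assumed for FGM and FGA instructions).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bit assignments for the functional units (pipelines).
constexpr std::uint8_t PIPE_ALU = 1, PIPE_FGM = 2, PIPE_FGA = 4,
                       PIPE_BP = 8, PIPE_MS = 16;

// Register operand types; None stands for a missing operand.
enum class RegType : std::uint8_t { None, GPR, FpSingle, FpDouble };

// One row of the category table: everything the cycle counter needs to know.
struct CategoryInfo {
    std::uint8_t pipes;            // bitmask of functional units used
    bool breakBefore;              // group break before the instruction
    bool breakAfter;               // group break after the instruction
    std::uint8_t blockingLatency;  // Lb
    std::uint8_t lockingLatency;   // latency of the destination register
    RegType src1, src2, dest;      // operand register types
};

// Illustrative categories and opcodes (the real tables are far larger).
enum Category : std::uint8_t { CAT_INT_ALU, CAT_FP_MUL, CAT_FP_ADD, NUM_CATEGORIES };
enum Opcode : std::uint16_t { OP_ADD, OP_SUB, OP_FMULD, OP_FADDD, NUM_OPCODES };

// Table 2: opcode category -> instruction characteristics.
constexpr std::array<CategoryInfo, NUM_CATEGORIES> kCategoryTable = {{
    {PIPE_ALU, false, false, 0, 1, RegType::GPR, RegType::GPR, RegType::GPR},
    {PIPE_FGM, false, false, 0, 4, RegType::FpDouble, RegType::FpDouble, RegType::FpDouble},
    {PIPE_FGA, false, false, 0, 4, RegType::FpDouble, RegType::FpDouble, RegType::FpDouble},
}};

// Table 1: opcode -> opcode category (many opcodes share one category).
constexpr std::array<Category, NUM_OPCODES> kOpcodeCategory = {{
    CAT_INT_ALU, CAT_INT_ALU, CAT_FP_MUL, CAT_FP_ADD,
}};

// The extra lookup step: two array indexations per instruction.
inline const CategoryInfo& lookup(Opcode op) {
    assert(op < NUM_OPCODES);
    return kCategoryTable[kOpcodeCategory[op]];
}
```

Under this layout, exploring a microarchitectural variation (say, a 3-cycle floating point multiply) would amount to changing a single row of the category table rather than touching any per-opcode code.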
However, the real advantage of this design is that it enables easy modification of instruction characteristics
for the purpose of microarchitecture exploration, and to an extent makes the cycle counter executable
code more microarchitecture-independent.
4.3 Instruction Grouping
The modelling of the instruction grouping by InstrCountCycles() amounts to deciding whether the
current instruction can be included in the current group.
This involves firstly determining whether all the functional unit (pipeline) resources required by the
instruction are available. A bitmask of all functional units used so far by the current group is
maintained by the cycle counter; a simple ‘logical or’ of this mask with the current instruction’s functional
units mask entry in the table determines this. The only complication is that there are two ALU pipelines.
This is resolved by having a single ALU bit field in the current group’s bitmask; this field is only set by the
second instruction requiring an ALU pipeline, through the use of a flag that is set by the first.
Secondly, the group breaking characteristics of the current instruction are examined to determine if a
break of a group should occur before and/or after the current instruction.
After a group breakage occurs, the maximum of all blocking latencies for the group (plus 1) is returned
by the function.
For each instruction, typically only 11 load/store operations, 4 comparisons and 10 arithmetic/logic
operations are required.
The bitmask implementation implicitly assumes each instruction locks its required functional units for
at most one cycle, unless it has a blocking latency of Lb > 0, in which case it locks all functional units for
another Lb cycles. This is valid for all UltraSPARC III Cu instructions except for the floating point divide
and square root, which only lock the FGM pipeline for Lb cycles. This limitation could be overcome, at
the loss of some efficiency, by using timestamps for each functional unit, in a similar way to that used for
register resources, as will be described in the next section.
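The grouping decision of this section can be sketched as below. The class and member names are hypothetical. The sketch tests for a pipeline conflict with a bitwise AND and then merges the masks with a bitwise OR, which is one possible reading of the description above; it also assumes that the first instruction ever seen opens a group, and it does not model the 4-instruction group limit or register interlocks.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical bit assignments for the functional units (pipelines).
constexpr std::uint8_t PIPE_ALU = 1, PIPE_FGM = 2, PIPE_FGA = 4,
                       PIPE_BP = 8, PIPE_MS = 16;

// Hypothetical subset of a category-table entry.
struct InstrInfo {
    std::uint8_t pipes;        // functional units required
    bool breakBefore;          // forces a group break before the instruction
    bool breakAfter;           // forces a group break after the instruction
    unsigned blockingLatency;  // Lb
};

class GroupTracker {
    std::uint8_t used_ = 0;       // pipelines claimed by the current group
    bool firstAluTaken_ = false;  // set by the first ALU user in the group
    bool forceBreak_ = true;      // next instruction must start a new group
    unsigned maxBlocking_ = 0;    // largest Lb seen in the current group

public:
    // Returns the cycles to add before dispatching `in`: 0 if it joins the
    // current group, otherwise 1 + the maximum blocking latency of the group
    // that has just been closed.
    unsigned add(const InstrInfo& in) {
        const bool conflict = (used_ & in.pipes) != 0;
        unsigned latency = 0;
        if (forceBreak_ || in.breakBefore || conflict) {
            latency = maxBlocking_ + 1;
            used_ = 0;
            firstAluTaken_ = false;
            maxBlocking_ = 0;
        }
        std::uint8_t claim = in.pipes;
        if ((claim & PIPE_ALU) && !firstAluTaken_) {
            // Two ALU pipelines: the first user only sets the flag, so the
            // ALU bit in the group mask is set by the second user.
            firstAluTaken_ = true;
            claim = static_cast<std::uint8_t>(claim & ~PIPE_ALU);
        }
        used_ = static_cast<std::uint8_t>(used_ | claim);
        maxBlocking_ = std::max(maxBlocking_, in.blockingLatency);
        forceBreak_ = in.breakAfter;
        return latency;
    }
};
```

For example, two ALU instructions followed by a floating point multiply would share one group, while a third ALU instruction would find the ALU bit set and open a new group one cycle later.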
4.4 Register Interlocks
Register interlocks [11] are the extra latency cycles that arise from inter-instruction dependencies on
register resources; the cycle counter must determine these for each instruction.
Our design needs to be efficient on highly optimized floating point codes (see Figure 2). In such cases,
with up to three instructions per group that may be updating a floating point register with a maximum
latency of Lmax = 4 cycles, there could be as many as 12 floating point registers at any time which are
interlocked, i.e. unavailable due to pending operations.
Under these circumstances, maintaining a list of registers currently interlocked (together with a list of
timestamps when the interlocks expire) would be expensive, requiring traversal and compaction of this list.
A preliminary solution, described in detail in Appendix A, was to use bitmasks to represent the interlocks;
since registers can be locked for up to Lmax = 4 cycles in the future, a circular array of length Lmax
of two 64-bit bitmasks suffices. This requires some care, with the structure needing to be updated with the
introduction of any latencies.
A simpler and more general solution (suitable for large Lmax) is to have individual timestamps for each
register; the timestamp signifies the time when the register will first become available. These timestamps
² A special ‘no register’ type can be used to represent a missing operand.
are organized as an array indexed by the register file type (GPR, single and double floating point), and the
index of the register within that file³.
The cycle counter uses these timestamps as follows. Let t denote the current time (inside
InstrCountCycles(), this will be the value of the clock parameter, with any latencies from the
previous group added). If a source register of the current instruction has timestamp t′ > t, t is updated to
t′ (thus introducing a latency of t′ − t). The destination register’s timestamp is then set to t + Lb + L
(using the updated value of t), where Lb is the current instruction’s blocking latency and L is its locking latency.
The worst-case overhead (realized in the case of a floating point arithmetic instruction) involves
approximately 10 load, 10 comparison and 2 store operations.
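The per-register timestamp scheme of this section can be sketched as follows. The names are hypothetical, and the sketch treats the three register files as separate arrays of 32 entries, as in the simplification described in the footnote; it is an illustration of the rule above rather than the module's actual code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>

// Register file types; None represents a missing operand.
enum class RegFile { None, GPR, FpSingle, FpDouble };
struct Reg { RegFile file; std::size_t index; };

class RegisterInterlocks {
    // ready_[file][index]: first cycle at which the register is available.
    std::array<std::array<std::uint64_t, 32>, 3> ready_{};

    std::uint64_t& slot(Reg r) {
        assert(r.file != RegFile::None && r.index < 32);
        return ready_[static_cast<std::size_t>(r.file) - 1][r.index];
    }

public:
    // t is the current time; Lb and L are the blocking and locking latencies
    // of the instruction. Returns the stall (in cycles) caused by source
    // registers that are not yet available.
    std::uint64_t dispatch(std::uint64_t t, Reg src1, Reg src2, Reg dest,
                           unsigned Lb, unsigned L) {
        const std::uint64_t start = t;
        for (Reg s : {src1, src2})
            if (s.file != RegFile::None)
                t = std::max(t, slot(s));  // wait for the latest source
        if (dest.file != RegFile::None)
            slot(dest) = t + Lb + L;       // result available from this cycle
        return t - start;
    }
};
```

With a locking latency of 4 cycles assumed for fmuld, replaying the dependency of Figure 2 (an fmuld writing %f6 at cycle 297621, then a faddd reading %f6 that would otherwise dispatch at 297624) yields a stall of one cycle, in agreement with the trace.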
4.5 Store Buffer
While its documentation suggests that the UltraSPARC III Cu processor can sustain one store operation per
cycle [11], at least under ideal conditions (e.g. for computations with working sets fitting in the top-level
caches and having unit stride memory access patterns), a series of store-intensive benchmarks with double
precision data indicated that the maximum sustained rate was one store every three cycles. Thus, to gain
accuracy for these computations, a store buffer having a draining rate of one store per ∆tS = 3 cycles was
required.
Such a buffer can be efficiently modelled as follows. The values tS, the time the buffer last emitted a
store operation, and lS, the number of pending stores in the buffer, are maintained. The maximum depth of
the store buffer is lmaxS = 8 [11]. Upon a store operation at time t, the following is performed. If tS < t, the store
buffer can be brought up-to-date by draining nS = ⌊(t − tS)/∆tS⌋ stores from the buffer (lS is decremented by
nS) and incrementing tS by nS∆tS.
If the store buffer is full (lS = lmaxS), tS is brought up to the next emission time (tS ← tS + ∆tS), and
a latency of δ = tS − t is introduced (δ is added to the return value of InstrCountCycles()).
Otherwise, the store is merely added to the buffer (by incrementing lS), and no other action need be taken.
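The store buffer model above can be sketched as follows, with tS, lS, ∆tS and lmaxS mapped to the hypothetical names lastEmit_, pending_, kDrainInterval and kMaxDepth. One detail is an assumption of this sketch rather than something stated in the text: the number of drained stores is capped at lS so that the count never becomes negative.

```cpp
#include <algorithm>
#include <cstdint>

constexpr std::uint64_t kDrainInterval = 3;  // ∆tS: cycles per emitted store
constexpr unsigned kMaxDepth = 8;            // lmaxS: store buffer depth

class StoreBuffer {
    std::uint64_t lastEmit_ = 0;  // tS: time the buffer last emitted a store
    unsigned pending_ = 0;        // lS: stores currently in the buffer

public:
    // Called for a store dispatched at time t; returns the stall δ in cycles.
    std::uint64_t store(std::uint64_t t) {
        if (lastEmit_ < t) {
            // Bring the buffer up to date: nS = floor((t - tS) / ∆tS).
            const std::uint64_t n = (t - lastEmit_) / kDrainInterval;
            // Assumption: never drain more stores than are pending.
            pending_ -= static_cast<unsigned>(
                std::min<std::uint64_t>(n, pending_));
            lastEmit_ += n * kDrainInterval;
        }
        if (pending_ == kMaxDepth) {
            // Full: wait for the next emission, which frees one slot that
            // the new store then occupies, so pending_ is unchanged.
            lastEmit_ += kDrainInterval;
            return lastEmit_ - t;
        }
        ++pending_;
        return 0;
    }

    unsigned pending() const { return pending_; }
};
```

Once the buffer is full, back-to-back stores each incur a stall that spaces them ∆tS = 3 cycles apart, which reproduces the sustained rate of one store every three cycles observed on the real machine.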
4.6 Counting CPU-Specific Events
The UltraSPARC III Cu has hardware support for counting events for the purposes of performance analysis.
For the purposes of calibrating the simulator, it is useful to implement these as well. The cycle counter
module has sufficient information to monitor CPU-related events such as counts of instructions executed,
elapsed cycles, and FGM and FGA instructions executed (using the bitmasks for functional units usage in
the instruction category table). Stall cycles from GPR dependencies, floating point register dependencies
and a full store buffer can also be easily counted by the module.
5 Evaluation
In this section, we will evaluate the accuracy of the cycle counter module and its effect on simulation
performance, using optimized numerical programs as benchmarks.
5.1 Calibration
The first stage of calibration is to manually examine instruction traces such as that in Figure 2 and see if
the timestamps are consistent with the rules provided in the UltraSPARC III Cu manual [11]. This tests
whether the module is accurate with respect to the CPU, as defined by its documentation.
However, the real implementation of the CPU may behave in some cases differently to what is specified
in its documentation (usually slower); so the next stage involves comparison of benchmark performance
³ Treating the single and double precision registers as being in separate files is a simplification introduced here, deemed acceptable
since codes using single and double data within the same registers over very short time intervals are rare. Exact modelling could be
achieved by having individual timestamps for the upper and lower halves of each double precision register, at the expense of extra
implementation overhead.
                                      simulated
computation                   no CC    CC-SB    CC+SB    actual
y ← x (8× ldd; 8× std)          309      445      298       297
y ← x (1× ldd; 1× std)          150      298      298       224
y ← 2x                          125      216      148       136
y ← y + x                       170      285      298       254
y ↔ x                           170      219      149       118
a ← Σ_{i=0}^{n−1} |x_i|         179      179      179       179
x ← 2x                          210      446      298       256
x ← 0                           497      866      296       292
y ← Ax                          475     1142     1142      1309
y ← A^{−1}x                     436      984      984      1062
Table 1: Speed in MOPs for vector-vector computations (length 4000) and in MFLOPs for matrix-vector
computations (A is 64 × 64 and row-major) on a 900 MHz UltraSPARC III Cu (CC = cycle counter, SB
= store buffer; CC-SB and CC+SB denote the cycle counter without and with the store buffer)
when run on the simulator and on the real machine. The benchmark codes were compiled without prefetch
instructions, but otherwise high levels of compiler optimizations were performed.
Table 1 gives results for vector operations; the computations were repeated a sufficient number of times
(normally 100–10000×) for accurate timings with a high resolution timer, after both the instruction and
data caches were warmed. The vector operation results are expressed in terms of MOPs, which is millions
of vector elements processed per second. Note that the working sets fit within the top-level caches in all
cases. The results show that for store-intensive computations, modelling the store buffer substantially
improved accuracy. With the store buffer, the results are always optimistic and accurate to within 30%. Manual
inspection of the traces verified that the simulator was counting cycles according to the UltraSPARC III
Cu grouping rules and the assumptions made in Section 4.5. We conclude that the real machine must be more
complex. For example, in the second row of the table, a single floating point register is used in a tight
loop; it may be that the load of the register has to be delayed an extra cycle while the previous store is
transmitting its result to the store buffer.
The table also gives results for matrix-vector based computations; the algorithm used for y ← A−1x
was an iterative solver based on matrix-vector multiply. Here, the simulator results are slightly pessimistic;
manual inspection of the traces indicates the inner matrix-vector multiply loop contained 24 multiply
instructions, and was completed within 36 cycles (corresponding to 1200 MFLOPs); the cycles ‘lost’ were
again consistent with UltraSPARC III Cu grouping rules, assuming all loads were being steered to the MS
pipeline. On the real machine, the loop was being completed in 32 cycles; this could be explained by the
steering of some of the loads to the ALU pipelines.
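The MFLOPs figures quoted for the inner loop can be checked with the following arithmetic. It assumes, as is usual for matrix-vector multiply but not stated explicitly above, that each of the 24 multiplies is paired with one add, giving 48 floating point operations per loop iteration on the 900 MHz processor.

```latex
\[
\text{simulated: } \frac{48\ \text{flops}}{36\ \text{cycles}} \times 900\ \text{MHz} = 1200\ \text{MFLOPs},
\qquad
\text{real machine: } \frac{48\ \text{flops}}{32\ \text{cycles}} \times 900\ \text{MHz} = 1350\ \text{MFLOPs}.
\]
```

These inner-loop rates are upper bounds on the whole-computation rates of 1142 and 1309 MFLOPs reported for y ← Ax in Table 1, since the latter also include work outside the inner loop.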
5.2 Performance
Table 2 indicates the overheads of the cycle counter module. These measurements were based on execution
time (process user time) of the simulator running the whole test program, with the indicated computation
being repeated 1000 times in order to ensure other parts of the program (e.g. data initialization and result
checking) made a negligible contribution. The column “no CC: CC-RIL” indicates the slowdown of the
simulator with a cycle counter module with register interlock checking suppressed over that of the simulator
with no module. The column “no CC: CC+RIL” indicates the slowdown of the simulator with a complete cycle
counter module over that of the simulator with no module.
The column “host: no CC” indicates the slowdown of the simulator program (without a cycle counter
module) over the same test program being run natively on an UltraSPARC III Cu. This gives an indication
of simulator efficiency. However, it is profoundly influenced by the speed (degree of instruction-level
parallelism) of the computation when run natively. For example, the simulator’s slowdown for a LINPACK
Benchmark program with n = 500, which ran at 640 MFLOPs, was 544; thus the matrix-vector multiply,
computation                   host: no CC    no CC: CC-RIL    no CC: CC+RIL
y ← x                                 680             1.11             1.19
y ← y + x                             993             1.10             1.20
a ← Σ_{i=0}^{n−1} |x_i|               408             1.16             1.28
y ← Ax                               1070             1.15             1.28
Table 2: Slowdowns for selected computations of Table 1 indicating overall speed of simulator and added
overheads due to the cycle counter module, for a 900 MHz UltraSPARC III Cu (CC = cycle counter, RIL
= register interlocks)
running at twice as many MFLOPs, has approximately twice as large a slowdown.
These results indicate that computing instruction groupings adds 10–15% overhead, and computing
register interlocks adds a further 10–15% overhead. For the LINPACK computation mentioned above,
the overheads were similar to those of y ← Ax.
FastSim achieves slowdowns of between 150 and 300 on the SPEC95 Benchmarks [9]. In terms of CPI and
the frequency of memory system events, the a ← Σ_{i=0}^{n−1} |x_i| computation above would be the closest match
to these. Taking this into account, FastSim seems to be faster by a factor of at least two. However, the
results are not directly comparable: while both simulators are SPARC-based, Sparc-Sulima simulates
a much larger ISA, which adds a significant degree of overhead to the basic simulator [4], and the FastSim
slowdowns are based on a much earlier host system (a 167 MHz UltraSPARC I). As simulation is
memory-intensive, it is advantageous in terms of slowdowns to use a host with a lower memory latency with respect
to CPU speed.
6 Extensions to the Design
This section indicates how the design of section 4 may be extended for more accurate and complete mod-
elling of the CPU’s timing characteristics.
6.1 Prefetching and Floating Point Loads
As mentioned in Section 3, floating point loads can be steered to an ALU pipeline if a hit to the P-cache is
predicted. This can affect the performance of floating point-intensive codes significantly.
This effect can be modelled by using the decoded operands structure to record whether a hit to the P-cache
occurred, for floating point load instructions. When such an instruction is passed to InstrCountCycles(),
it overrides the default pipeline (MS) with one of the ALU pipelines. It also records the address of the
decoded operands structure. The simulated P-cache must then record whether a hit occurs when the
instruction is evaluated. Upon the next call to InstrCountCycles(), this result is passed to it, and it
then records this in the load’s decoded operands structure, from the address recorded previously.
6.2 Branch Prediction
The current design assumes perfect branch prediction, since branch misprediction is unlikely to be a significant
performance factor in the applications of interest [1]. This section outlines how it might be implemented
in a modular cycle counter design, using similar techniques to those in Section 6.1.
On the UltraSPARC III Cu, a branch misprediction occurs when a (conditional) branch instruction
computes a different target address to what is predicted (the target address of its previous execution, or, if
fresh to the I-cache, the initial prediction bit of the instruction). Misprediction causes a group breakage
(after the delay slot instruction) and adds an extra blocking latency (8 cycles) to the group containing the
branch.
The decoded instruction operands structure can be used to record a branch instruction’s prediction
bit. When a branch instruction is passed to InstrCountCycles(), its operands structure’s address is
recorded. Whether the branch is taken or not is then recorded in the main execute loop, and this information
is passed to InstrCountCycles() with the next instruction (the branch’s ‘delay slot’). At this point,
the cycle counter can deal with a misprediction (if one occurred), and stores whether the branch was taken
in the branch’s operands structure for future use.
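As a rough sketch of this scheme, the following records a pending branch and resolves it when the delay slot instruction arrives. The class and method names are hypothetical; only the 8-cycle penalty comes from the text, and the group breakage after the delay slot is not modelled here.

```python
from dataclasses import dataclass

MISPREDICT_PENALTY = 8   # extra blocking latency in cycles, as stated in the text


@dataclass
class BranchOperands:
    """Hypothetical decoded operands structure for a branch."""
    predicted_taken: bool   # initially the prediction bit of the instruction


class BranchTracker:
    def __init__(self):
        self.pending = None   # operands of the branch awaiting its outcome

    def on_branch(self, ops):
        """Called when a branch is passed to the cycle counter."""
        self.pending = ops

    def on_delay_slot(self, was_taken):
        """Called with the next instruction; returns the extra latency in cycles."""
        ops, self.pending = self.pending, None
        if ops is None:
            return 0
        penalty = MISPREDICT_PENALTY if was_taken != ops.predicted_taken else 0
        ops.predicted_taken = was_taken   # stored for the branch's next execution
        return penalty
```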
6.3 Other UltraSPARC III Features not Currently Modelled
Currently, block load and store instructions are not recognized (these are encoded as floating point loads
and stores with special values in their ASI (Address Space Identifier) fields). In general, memory
operations with special ASI values are not treated differently. In any case, there is scant information on their
performance characteristics. It may also be better for the memory system to handle their latencies.
The latencies of floating-point compare instructions are not currently modelled.
6.4 Performance Enhancement via Caching
The caching of expensive computations, such as the decoding of instructions, has been an important tech-
nique in improving the performance of Sparc-Sulima [4]. In the case of the cycle counter, one way
of applying this idea would be to record the result (the accumulated latency) of the previous call to
InstrCountCycles() with a given instruction in a lookup table. The table would be tagged and
indexed by the instruction’s address, with the pc now needing to be passed to the function as well.
This result could be re-used on the next execution of this instruction if it were known that the conditions
over the last Lmax cycles were identical on the previous execution. A sufficient condition for this would be
whether the instruction groupings were identical over the last Lmax cycles. In turn, this can be determined
from whether the values of the pc were the same over the last Lmax cycles (if an instruction caused a
multi-cycle stall, the corresponding pc value would be repeatedly recorded; if multiple instructions were
issued in the one cycle, the first could be used).
This information would be maintained by InstrCountCycles() for the current instruction, and also
stored in the lookup table. To save space and improve lookup speed, a 64-bit hash value (e.g. for the case
Lmax = 4, bits 17:2 of each of the pc values) could be used.
On a hit, the lookup table’s latency would be returned, without having to maintain the cycle counter’s
data structures for determining instruction groupings and register interlocks. The overhead should be quite
small.
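The lookup could be sketched as below, for the case Lmax = 4. The LatencyCache class and its methods are hypothetical, and a Python dictionary stands in for the tagged table. Since only bits 17:2 of each pc are kept, two different histories can alias to the same hash, so in this sketch a match is a strong hint rather than a proof that the groupings were identical.

```python
L_MAX = 4   # history depth in cycles, matching the example in the text


def history_hash(pcs):
    """Pack bits 17:2 of each of the last L_MAX per-cycle pc values into 64 bits."""
    assert len(pcs) == L_MAX
    h = 0
    for pc in pcs:
        h = (h << 16) | ((pc >> 2) & 0xFFFF)
    return h


class LatencyCache:
    """Hypothetical lookup table, keyed by instruction address."""

    def __init__(self):
        self.table = {}   # pc -> (history hash, accumulated latency)

    def lookup(self, pc, pcs):
        """Return the cached latency, or None if the history differs."""
        entry = self.table.get(pc)
        if entry is not None and entry[0] == history_hash(pcs):
            return entry[1]
        return None

    def store(self, pc, pcs, latency):
        self.table[pc] = (history_hash(pcs), latency)
```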
The difficulty arises upon the first miss, as these data structures have to be reconstructed. In principle
this would be possible, provided the cycle counter also maintained a list of all parameters (clock values, op-
code and decoded operand structure addresses) over the last Lmax cycles, and ‘replayed’ these instructions
in order to perform the reconstruction.
However, the fact that the cycle counter must maintain state data persisting over a considerable number
of invocations makes it less amenable to the computation caching technique. Furthermore, branch and
P-cache load mis-predictions (Sections 6.2 and 6.1) would raise further complications. For these reasons,
and the fact that the cycle counter’s overhead has been acceptable (within 30%), this has not yet been
implemented.
6.5 Improving Fetch-Execute Timing Accuracy and Extensions to Out-of-order
Execution
As mentioned in Section 4, a minor source of inaccuracy in the current design is that the timestamps, as
could be seen by the memory system, between the fetch and execute stage of a load or store instruction
would appear closer than on the real machine, due to the number of pipeline stages (5 in the case of the
UltraSPARC III) between these stages.
Furthermore, while UltraSPARC-specific behaviour is mainly confined to the tables, one implicit as-
sumption is that instructions always execute in-order. This prevents the design from easily being adapted
to other post-RISC processors, such as the MIPS and Alpha.
These limitations could be overcome by a cycle counter design that ‘buffered’ a number of instructions.
Upon each invocation, it would first choose which instruction in its buffer is to be executed, then calculate
its latency contribution as before, and then emit the instruction (corresponding pc value, opcode and decoded
operands) for the instruction evaluation function and the rest of the main run-loop.
This of course would add complexity, particularly if the reordering rules are complex, and considerable
overhead, particularly if a large instruction buffer is needed. However, with this extension, a modular
cycle counter design could still be used to capture CPU timing behaviour without needing to resort to full
pipeline simulation.
6.6 Integration into a SimICS-like Simulator
SimICS is a complete-machine simulator that has been used in many places to simulate UltraSPARC-based
systems (see e.g. [6]). The core SimICS simulator provides ‘hooks’ for user-defined modules to be plugged
in [12] and called upon certain events, such as the execution of each instruction.
The cycle counter design presented so far is independent of the other aspects of Sparc-Sulima, except
in the decoded instruction operand structure. Its efficiency also relies heavily on having the decoding
available. Thus, a ‘wrapper’ function would need to be called from SimICS instead; this function
would take the undecoded instruction as its parameter, maintain its own instruction decode cache [4], and
then call InstrCountCycles() with the result of the decoding.
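Such a wrapper might look like the following sketch. The decode and count_cycles callables are hypothetical stand-ins for an instruction decoder and for InstrCountCycles(); no actual SimICS interface is shown. The cache here is keyed by the raw instruction word, which is an assumption; a cache keyed by instruction address would be needed if per-instruction state were kept in the decoded structure, as in Sections 6.1 and 6.2.

```python
class DecodeCachingWrapper:
    def __init__(self, decode, count_cycles):
        self.decode = decode               # hypothetical: raw word -> decoded operands
        self.count_cycles = count_cycles   # stand-in for InstrCountCycles()
        self.cache = {}                    # raw instruction word -> decoded operands

    def on_instruction(self, clock, raw_instr):
        """Entry point to be called by the host simulator for each instruction."""
        ops = self.cache.get(raw_instr)
        if ops is None:
            # Decode once, then reuse the result on later executions.
            ops = self.cache[raw_instr] = self.decode(raw_instr)
        return self.count_cycles(clock, ops)
```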
7 Conclusions
The cycle counter presented above substantially improved the timing accuracy of CPU aspects of an Ultra-
SPARC simulator with only modest implementation overhead (< 30%) and complexity of implementation,
compared with the traditional approach of fully modelling the CPU pipelines. The two main features of the
implementation, determination of instruction groups and that of stall cycles due to register dependencies,
contributed approximately equally to the overhead.
Its modular design permits it to be activated only in the parts of the simulation where timing is impor-
tant. Most platform-specific information is recorded in tables, permitting the easy exploration of varying
timing-related characteristics of the micro-architecture in order to determine its effect on overall application
performance. The algorithms and methods used can also be reasonably easily adapted to other post-RISC
architectures, except that modelling out-of-order execution would require considerable extension.
Manual inspection of the timings made by the cycle counter demonstrates that it closely implements
the main documented timing characteristics of the UltraSPARC III Cu. However, in order to model store-
intensive applications (even when the working set can be kept within the top-level cache), undocumented
information on the store buffer had to be reverse engineered by experiments.
In a few cases, the cycle counter is somewhat inaccurate compared with the real machine. This is partly
due to the fact that some aspects of the CPU are not currently modelled. Given the resources to do so, this
could be done, and it would be expected that each new feature would add a small amount of overhead.
However, the real limiting factor in improving simulation accuracy is the lack of complete and accurate
performance-related information available on the real, as opposed to the documented, version of the CPU.
Acknowledgments
The author thanks the other members of the Sparc-Sulima team for their support, and in particular Bill
Clarke and Andrew Over for helpful discussions on modeling CPU performance.
References
[1] Australian National University. The CC-NUMA project: Computational Chemistry on Non-Uniform
Memory-access Architectures. http://cs.anu.edu.au/CC-NUMA.
[2] Pradip Bose and Thomas M. Conte. Performance Analysis and its Impact on Design. IEEE Computer,
pages 41–49, May 1998.
[3] D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report TR-1342, Uni-
versity of Wisconsin-Madison Computer Sciences Department, 1997.
[4] Bill Clarke, Adam Czezowski, and Peter Strazdins. Implementation aspects of a SPARC V9 complete
machine simulator. In Michael Oudshoorn, editor, Computer Science 2002, volume 4 of Conferences
in Research and Practice in Information Technology, pages 23–32, Monash University, Melbourne,
January 2002. Australian Computer Society.
[5] Jeff Gibson, Robert Kunz, David Ofelt, Mark Horowitz, John Hennessy, and Mark Heinrich. FLASH
vs. (Simulated) FLASH: Closing the Simulation Loop. In Proceedings of the Ninth International
Conference on Architectural Support for Programming Languages and Operating Systems, pages
49–58. ACM Press, 2000.
[6] Daniel Nussbaum, Alexandra Fedorova, and Christopher Small. An overview of the Sam CMT sim-
ulator kit. Technical Report TR-2004-133, Sun Microsystems Research Labs, March 2004.
[7] Vijay S. Pai, Parthasarathy Ranganathan, and Sarita V. Adve. RSIM: An Execution-Driven Sim-
ulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors. In Proceedings of the
Third Workshop on Computer Architecture Education, February 1997. Also appears in IEEE TCCA
Newsletter, October 1997.
[8] Mendel Rosenblum, Edouard Bugnion, Scott Devine, and Steve Herrod. Using the SimOS Machine
Simulator to Study Complex Computer Systems. ACM TOMACS Special Issue on Computer Simula-
tion, 1997.
[9] Eric Schnarr and James R. Larus. Fast out-of-order processor simulation using memoization. In Pro-
ceedings of the 8th International Conference on Architectural Support for Programming Languages
and Operating Systems, pages 283–294. ACM Press, New York, New York, 1998.
[10] Eric C. Schnarr, Mark D. Hill, and James R. Larus. Facile: a language and compiler for high-
performance processor simulators. In PLDI ’01: Proceedings of the ACM SIGPLAN 2001 conference
on Programming language design and implementation, pages 321–331. ACM Press, 2001.
[11] Sun Microsystems, Palo Alto, California. UltraSPARC III Cu User’s Manual, May 2002.
[12] Virtutech. Simics 1.0. http://www.simics.com/.
A Appendix: Implementation of Register Interlocks using Circular
Arrays of Bitmaps
In this section we describe a preliminary design for register interlocking, involving a circular array of length
Lmax, where each entry in the array is a bitmask with a bit field for each register that may be interlocked.
If t is the current time (note that the first parameter to InstrCountCycles() is the CPU’s clock), the
entry v(t + i) records the interlocks i cycles in advance, for 0 ≤ i < Lmax, where v(t) = t mod Lmax.
For the UltraSPARC, two 64-bit words (one word for general purpose registers and one word for floating
point registers) are required per entry. In order to model single precision registers being implemented in
the upper and lower halves of a double precision register, two bits are allocated to each double precision
register.
For an instruction which updates a register with a locking latency of L ≤ Lmax cycles, the corresponding
bit is set in the array for entries v(t), . . . , v(t + L − 1).
To check whether an input register for the current instruction is available, entries v(t), v(t + 1), . . . ,
v(t + δ) are scanned, where 0 ≤ δ < Lmax and t + δ gives the earliest time the bit field for that register is clear.
δ is then included in the extra latency cycles returned by the function. Note that for instructions with two
register operands, they will be both either general-purpose or both floating point; thus only one of the words
of the bitmasks need be examined, and a combined bitmask for both registers can be used.
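The setting and scanning of interlock bits can be sketched as follows. This is an illustration under simplifying assumptions: a single Python integer per entry stands in for the two 64-bit words, Lmax = 8 is an arbitrary choice, and the class and method names are hypothetical.

```python
L_MAX = 8   # assumed value; a power of two keeps t mod L_MAX cheap


class InterlockArray:
    def __init__(self):
        # entries[t % L_MAX] holds a bitmask of the registers locked at cycle t.
        self.entries = [0] * L_MAX

    def lock(self, t, reg_mask, latency):
        """Mark the registers in reg_mask as locked for cycles t .. t + latency - 1."""
        assert latency <= L_MAX
        for i in range(latency):
            self.entries[(t + i) % L_MAX] |= reg_mask

    def delay(self, t, reg_mask):
        """Return the smallest delta such that all of reg_mask is free at t + delta."""
        for delta in range(L_MAX):
            if (self.entries[(t + delta) % L_MAX] & reg_mask) == 0:
                return delta
        return L_MAX   # every entry is locked
```

For an instruction with two input registers, reg_mask would be the combined mask of both, so that a single scan yields the delay for the pair.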
InstrCountCycles() must also maintain the time tR reflected by the current state of the register
interlock array. Let t be the value of the clock parameter, with the previous group’s blocking latency
added if the current instruction is the first of a new group. If t > tR, it must be brought up-to-date by
clearing the entries v(tR), . . . , v(t − 1) and then setting tR to t. Note that latencies from the memory
system will cause t > tR. At this point, register dependencies can now be checked.
If there is a delay δ > 0 due to register dependencies, then similarly the array (and tR) must be brought
up-to-date by δ cycles. If the current instruction has a blocking latency Lb > 0, the array (and tR) must
again be brought up-to-date by that amount. At this point, the register interlock array is ready to be updated
according to the current instruction’s output register.
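Bringing the array up-to-date amounts to clearing the entries for the cycles that have elapsed. A minimal sketch follows, with hypothetical names and an assumed Lmax = 8. Capping the loop at Lmax iterations is an addition not stated in the text; it relies on every entry being stale once Lmax or more cycles have passed.

```python
L_MAX = 8   # assumed value


class InterlockClock:
    def __init__(self):
        self.entries = [0] * L_MAX   # circular array of interlock bitmasks
        self.t_r = 0                 # time reflected by the array's current state

    def advance_to(self, t):
        """Clear the entries for cycles t_r .. t - 1, then set t_r to t."""
        if t <= self.t_r:
            return
        # At most L_MAX entries exist, so a long stall clears each one only once.
        for u in range(self.t_r, min(t, self.t_r + L_MAX)):
            self.entries[u % L_MAX] = 0
        self.t_r = t
```

The same routine would serve for the δ and Lb adjustments, by calling advance_to(t_r + δ) and advance_to(t_r + Lb) respectively.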
For a typical floating point instruction, approximately 20 load/store operations, 25 comparisons and
30 arithmetic/logic operations are required. The cost is proportional to Lmax, and considerably more
arithmetic is required if Lmax is not a power of two. This is considerably more than the timestamp method;
however it requires only 2Lmax double words of storage, whereas the timestamp method requires as many
double words (timestamps are double words) as lockable registers (48 on UltraSPARC III Cu).
