PCL - The Performance Counter Library: A Common Interface to Access Hardware Performance Counters on Microprocessors by Berrendorf, Rudolf & Ziegler, Heinz
FORSCHUNGSZENTRUM JÜLICH GmbH
Zentralinstitut für Angewandte Mathematik
D-52425 Jülich, Tel. (02461) 61-6402
Internal Report
PCL - The Performance Counter Library:
A Common Interface to Access Hardware
Performance Counters on Microprocessors
(Version 1.2)
Rudolf Berrendorf, Heinz Ziegler
FZJ-ZAM-IB-9816
October 1998
(last change: 19.08.99)
Central Institute for Applied Mathematics
Research Centre Juelich GmbH
D-52425 Juelich, Germany
r.berrendorf@fz-juelich.de
Abstract
A performance counter is that part of a microprocessor that measures and gathers performance-relevant
events on the microprocessor. The number and type of available events differ significantly between existing
microprocessors, because there is no commonly accepted specification, and because each manufacturer has
different priorities on analyzing the performance of architectures and programs. Looking at the supported
events on the different microprocessors, it can be observed that the functionality of these events differs from
the requirements of an expert application programmer or a performance tool writer.
PCL, the Performance Counter Library, establishes a common platform for performance measurements
on a wide range of computer systems. With a common interface on all systems and a set of application-
oriented events defined, the application programmer is able to do program optimization in a portable way
and the performance tool writer is able to rely on a common interface on different systems.
PCL has functions to query the functionality, to start and to stop counters, and to read the values of
counters. Performance counter values are returned as 64-bit integers on all systems. PCL supports nested
calls to PCL functions, thus allowing hierarchical performance measurements. Counting may be done either
in system or in user mode. All interface functions are callable from C, C++, Fortran, and Java.
Contents
1 Introduction
2 Requirements of Application Programmers
    2.1 Memory Hierarchy
    2.2 Instructions
    2.3 Status of Functional Units
    2.4 Rates and Ratios
3 PCL – The Performance Counter Library
    3.1 Countable Events
    3.2 Interface Functions
        3.2.1 PCLquery
        3.2.2 PCLstart
        3.2.3 PCLread
        3.2.4 PCLstop
    3.3 Programming Aspects
    3.4 Supported Systems
    3.5 Examples
        3.5.1 Simple Example
        3.5.2 Example with Nested Calls
        3.5.3 Example in Java
4 Related Projects
5 Summary
6 Acknowledgments
A Performance Counters on Microprocessors
    A.1 DEC Alpha
        A.1.1 DEC Alpha 21164
        A.1.2 DEC Alpha 21264
    A.2 MIPS R10000/R12000
    A.3 SUN ULTRASparc I/II
    A.4 IBM PowerPC 604e
    A.5 Intel Pentium Family
        A.5.1 Intel Pentium
        A.5.2 Intel PentiumPro/Pentium II/Pentium III
Chapter 1
Introduction
This report describes performance counters on five microprocessor families and introduces a common
interface to access these counters. With performance counters, performance-critical events can be counted.
This includes all aspects concerning the memory hierarchy (loads/stores, misses/hits, different cache levels,
etc.), functional units or pipelines (operation counts, stalls, issues), duration of requests, etc.
As will be shown, the number of, type of, and access to events differ significantly between the
processors, and the type of supported events might not be very helpful to the application programmer or tool
builder, who might have different demands on countable events.
To overcome this lack of a common platform, we developed PCL, the Performance Counter Library. We
first defined a set of events useful to the application programmer and tool builder, and second, established
a set of access functions to control and access the performance counters on different platforms. PCL is
implemented on many of today's machines, ranging from a PC running Linux to an SGI/CRAY T3E with
hundreds of Gigaflops, and it is callable from application programs as well as from tools.
The Performance Counter Library PCL is available at
http://www.fz-juelich.de/zam/PCL/.
Chapter 2
Requirements of Application
Programmers
People from different areas of computer science and electrical engineering may see different events as
most useful for their optimization purposes. Most of the events described so far in the description of the
microprocessors are likely most useful to the computer architect, hardware engineer, or low-level device
driver writer.
Application programmers optimizing their programs or performance tool writers wish to get perfor-
mance relevant information related to their programs rather than counting signal switches on certain pins
of a chip module. Therefore, those parts of the microprocessor which have appropriate counterparts in a
program are most likely to be used by the application programmer to optimize programs. The memory
hierarchy in a computer system corresponds directly to program variables and the functional units execute
the operations specified in a program. Therefore, we concentrate on those aspects of a computer system.
Our impression is that taking the union of all available events of all microprocessors is not the right way
to define an application interface for an application programmer or tool writer. Our approach is to define
a set of events relevant to the user. If microprocessor architecture or programming methodology proceeds
in a different direction (we don't see that for the near future!), the set of events might then be extended or
changed.
Although hardware counters give numbers for a processor, performance numbers should be related to a
process (representing the program). Therefore, either the executing process should be bound to a processor,
or migrating a process to another processor should be transparent to the process (with respect to performance
counting). The second approach requires support from the operating system.
We have grouped the useful events into the categories described in the following sections.
2.1 Memory Hierarchy
Currently, most computer systems support four levels in the memory hierarchy: registers, 1st level cache,
2nd level cache, and main memory. Registers are directly controlled by the compiler, so information such as
how many registers hold live values is better handled by the compiler. Although main memory
statistics could be quite useful in performance analysis (e.g. bank conflicts), performance counters in
microprocessors mostly see the main memory as a black box. Therefore, we concentrate on 1st and 2nd level
caches.
Accesses to caches can be distinguished by read or write accesses, instruction loads and instruction
stores (fetches from a higher level in the hierarchy), or data load/stores. An important performance aspect
is the hit and miss rate, which can be calculated from the total number of accesses and either the number
of misses or hits. Most microprocessors use (small) translation look-aside buffers (TLB) to speed up the
translation of virtual to physical addresses. As misses in the TLB are time consuming, this number (and its
relation to the number of hits or the total number of address lookups) is a relevant number for performance
optimization.
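The relation between accesses, hits, and misses described above can be illustrated with a small C sketch. The function names miss_rate and hits_from_misses are hypothetical helpers chosen for this illustration and are not part of PCL; the sketch only assumes that two raw counts are already available as 64-bit values.

```c
#include <stdint.h>

/* Hypothetical helper (not a PCL function): miss rate as misses divided
   by the total number of accesses. Returns -1.0 when nothing was counted,
   so that the caller can tell "no data" apart from a rate of zero. */
double miss_rate(uint64_t accesses, uint64_t misses)
{
    if (accesses == 0)
        return -1.0;
    return (double)misses / (double)accesses;
}

/* Hypothetical helper: derive the hit count on a processor that can
   count only accesses and misses. Clamps at zero if the two counts are
   inconsistent (e.g. measured in separate runs). */
uint64_t hits_from_misses(uint64_t accesses, uint64_t misses)
{
    return accesses >= misses ? accesses - misses : 0;
}
```

The same arithmetic applies to the TLB: with the number of lookups and the number of misses, the hit count and the miss rate follow directly.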
We distinguish between instruction and data caches on each level. For unified caches (i.e. instruction and
data are buffered in the same cache), it is often possible to distinguish instruction and data loads. Therefore
on those caches, PCL_LxICACHE_xxx and PCL_LxDCACHE_xxx refer to events concerning instruction and
data accesses, respectively.
The available events concerning memory hierarchy are given in table 2.1.
By definition, the sum of cache reads and cache writes should be equal to cache read/writes, and
the sum of cache hits and cache misses should be equal to cache read/writes, too. Additionally, if two
first level caches exist (instruction and data), the sum of instruction cache reads and data cache reads should
be equal to cache reads (and so on).

cache
  PCL_LxCACHE_READ        number of level-x cache reads
  PCL_LxCACHE_WRITE       number of level-x cache writes
  PCL_LxCACHE_READWRITE   number of level-x cache reads or writes
  PCL_LxCACHE_HIT         number of level-x cache hits
  PCL_LxCACHE_MISS        number of level-x cache misses
data cache
  PCL_LxDCACHE_READ       number of level-x data cache reads
  PCL_LxDCACHE_WRITE      number of level-x data cache writes
  PCL_LxDCACHE_READWRITE  number of level-x data cache reads or writes
  PCL_LxDCACHE_HIT        number of level-x data cache hits
  PCL_LxDCACHE_MISS       number of level-x data cache misses
instruction cache
  PCL_LxICACHE_READ       number of level-x instruction cache reads
  PCL_LxICACHE_WRITE      number of level-x instruction cache writes
  PCL_LxICACHE_READWRITE  number of level-x instruction cache reads or writes
  PCL_LxICACHE_HIT        number of level-x instruction cache hits
  PCL_LxICACHE_MISS       number of level-x instruction cache misses
TLB
  PCL_TLB_HIT             number of hits in TLB
  PCL_TLB_MISS            number of misses in TLB
Instruction TLB
  PCL_ITLB_HIT            number of hits in instruction TLB
  PCL_ITLB_MISS           number of misses in instruction TLB
Data TLB
  PCL_DTLB_HIT            number of hits in data TLB
  PCL_DTLB_MISS           number of misses in data TLB
Table 2.1: Events concerning memory hierarchy (x=1 or 2 for 1st or 2nd level cache)
2.2 Instructions
Instructions correspond to the operations and flow control specified in a program. There are several categories
of operations (e.g. integer, logical, floating point), which might be executed by different functional units in
the microprocessor. Another aspect (in multiprocessor systems) is atomic operations (e.g. a test-and-set
primitive), which can execute successfully (the lock could be set) or unsuccessfully (the lock
could not be acquired because it was already set). We distinguish between the instruction categories shown in
table 2.2.
Additionally, we have included a cycle count which gives the number of cycles spent in this process or
on behalf of the process/thread (when counting in user-and-system mode). Note that the cycle count should
not be used to count the number of elapsed cycles, because on multiprogramming systems other processes
might be scheduled on the same processor. To count the number of elapsed cycles,
an additional event is available (PCL_ELAPSED_CYCLES).
On some systems, the number of issued instructions may differ from the number of completed
instructions due to error conditions. We have chosen completed instructions, as they correspond more
closely to the operations the programmer specified in the program.
Getting the number of operations out of the number of instructions is difficult. For example, on some
systems a floating-point add and a floating-point multiply can be initiated by a single add-and-multiply
instruction. In that case, 1 floating point instruction is counted but 2 floating point operations are executed.
With PCL (and with most hardware performance counter implementations) it is not possible to count the
number of floating point operations or numbers derived from it.
2.3 Status of Functional Units
Functional units might be stalled due to blocked resources, missing operands, etc. Table 2.3 gives the events
defined for stalls. In contrast to all other events, measuring such an event yields not the number of stalls
but the number of cycles that all stalls of this event type have taken.

PCL_CYCLES            cycles spent in the process/thread (and possibly in system calls)
PCL_ELAPSED_CYCLES    elapsed cycles
PCL_INTEGER_INSTR     number of completed integer (or logical) instructions
PCL_FP_INSTR          number of completed floating point instructions
PCL_LOAD_INSTR        number of completed load instructions
PCL_STORE_INSTR       number of completed store instructions
PCL_LOADSTORE_INSTR   number of completed load or store instructions
PCL_INSTR             sum of all completed instructions
PCL_JUMP_SUCCESS      number of correctly predicted branches
PCL_JUMP_UNSUCCESS    number of mispredicted branches
PCL_JUMP              sum of all branches
PCL_ATOMIC_SUCCESS    number of successful atomic instructions
PCL_ATOMIC_UNSUCCESS  number of unsuccessful atomic instructions
PCL_ATOMIC            sum of all instructions concerning atomic operations
Table 2.2: Events concerning instruction categories

PCL_STALL_INTEGER     number of cycles the integer/logical unit is stalled
PCL_STALL_FP          number of cycles the floating point unit is stalled
PCL_STALL_JUMP        number of cycles the branch unit is stalled
PCL_STALL_LOAD        number of cycles the load unit is stalled
PCL_STALL_STORE       number of cycles the store unit is stalled (write buffer)
PCL_STALL             sum of all cycles a unit is stalled
Table 2.3: Events concerning functional unit stalls (numbers given in cycles)
2.4 Rates and Ratios
Often, it is useful to get a ratio or rate rather than an absolute number. Good examples are cache miss rates
or floating point operations per second. Table 2.4 gives the events defined for such rates and ratios.
Measuring these events will mostly be done by deriving the values from other performance numbers
(see [1]). The definitions are as follows:
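• PCL_MFLOPS: $\frac{\mathrm{PCL\_FP\_INSTR}}{\mathrm{PCL\_CYCLES}} \times \mathrm{MHz\,rate}$
• PCL_IPC: $\frac{\mathrm{PCL\_INSTR}}{\mathrm{PCL\_CYCLES}}$
• PCL_L1DCACHE_MISSRATE: $\frac{\mathrm{PCL\_L1DCACHE\_MISS}}{\mathrm{PCL\_LOADSTORE\_INSTR}}$
• PCL_L2DCACHE_MISSRATE: $\frac{\mathrm{PCL\_L2DCACHE\_MISS}}{\mathrm{PCL\_L1DCACHE\_MISS}}$
• PCL_MEM_FP_RATIO: $\frac{\mathrm{PCL\_LOADSTORE\_INSTR}}{\mathrm{PCL\_FP\_INSTR}}$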

PCL_MFLOPS             number of million floating point instructions per second
PCL_IPC                number of completed instructions per cycle
PCL_L1DCACHE_MISSRATE  miss rate of L1 data cache
PCL_L2DCACHE_MISSRATE  miss rate of L2 data cache
PCL_MEM_FP_RATIO       ratio of memory references to floating point instructions
Table 2.4: Events concerning rates and ratios (numbers are floating point values)
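The definitions of the rates and ratios can be written down as a short C sketch. The struct and function names (raw_counts, derived_rates, derive_rates) are hypothetical and only serve this illustration; the sketch assumes the six raw counts have already been measured and that the clock rate in MHz is known. Divisions by a zero count yield 0.0 here, which is a choice made for the sketch, not something the report specifies.

```c
#include <stdint.h>

/* Raw event counts, named after the PCL events they would come from. */
struct raw_counts {
    uint64_t cycles;        /* PCL_CYCLES          */
    uint64_t instr;         /* PCL_INSTR           */
    uint64_t fp_instr;      /* PCL_FP_INSTR        */
    uint64_t loadstore;     /* PCL_LOADSTORE_INSTR */
    uint64_t l1d_miss;      /* PCL_L1DCACHE_MISS   */
    uint64_t l2d_miss;      /* PCL_L2DCACHE_MISS   */
};

/* Derived values in the order of table 2.4. */
struct derived_rates {
    double mflops;
    double ipc;
    double l1d_missrate;
    double l2d_missrate;
    double mem_fp_ratio;
};

/* Quotient of two counts; 0.0 if the denominator is zero. */
static double ratio(uint64_t num, uint64_t den)
{
    return den != 0 ? (double)num / (double)den : 0.0;
}

/* Hypothetical helper: apply the definitions of section 2.4. */
struct derived_rates derive_rates(const struct raw_counts *c, double clock_mhz)
{
    struct derived_rates r;
    r.mflops       = ratio(c->fp_instr, c->cycles) * clock_mhz;
    r.ipc          = ratio(c->instr, c->cycles);
    r.l1d_missrate = ratio(c->l1d_miss, c->loadstore);
    r.l2d_missrate = ratio(c->l2d_miss, c->l1d_miss);
    r.mem_fp_ratio = ratio(c->loadstore, c->fp_instr);
    return r;
}
```

For example, 500 floating point instructions in 1000 cycles on a 400 MHz processor give 0.5 instructions per cycle, i.e. 200 million floating point instructions per second.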
Chapter 3
PCL – The Performance Counter
Library
The Performance Counter Library has a programming interface to access a set of performance counters
with a defined set of countable events. In section 3.1, we specify which of the events defined in chapter 2
are available on which systems, and in section 3.2 we define the programming interface.
3.1 Countable Events
In the following tables we compare the events defined in the previous chapter (tables 2.1 to 2.3) with the
events available on the microprocessors currently supported by PCL.
The tables are organized as follows. The first column gives the event family, followed by the
precise event, one in each row. Each additional column contains entries for one microprocessor (family).
The entry names correspond to the event names in the description of the microprocessors (see appendix A).
Empty entries signal that such an event is not available on that microprocessor. Entries marked with a star
are indirect events, i.e. a combination of several other events directly countable by a (hardware) performance
counter. The combinations for these indirect events are discussed below. Counters used for indirect events
cannot be used at the same time to measure their own events.
Table 3.1 shows events relevant to the 1st level cache (instruction, data, instruction and data), table
3.2 shows events relevant to the 2nd level cache (instruction, data, instruction and data). If there is a
unified cache for data and instructions (as it is on most systems), events defined for 2nd level instruction
cache refer to cache references done by instruction fetches, and for the data cache accordingly. Table 3.3
shows events for the translation look-aside buffers (instruction, data, instruction and data). Table 3.4 shows
events relevant to instructions and functional units. Table 3.5 shows events concerning units which are
blocked/stalled. Instead of counting the number of events, the number in this table gives the number of
cycles for the event type. Table 3.6 shows the events concerning rates and ratios. Indirect events are given
in italics.
Footnotes to tables 3.1 to 3.6:
1 It seems that on CRAY T3Es the additional logic built around the L2 cache (E-registers, back-map, stream
  buffers) may lead to wrong L2-cache numbers.
2 read Processor Cycle Counter
3 read Processor Cycle Counter
4 read Tick Counter
5 read Time Stamp Counter
6 read Time Stamp Counter
7 Floating point operations instead of instructions are counted.
8 See comments on PE0PE1_30.
9 Issued instructions are counted instead of completed instructions.
10 Integer multiplication and division increment the counter by two.
11 only on Pentium MMX
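category                      | event         | event name             | DEC Alpha 21164 | DEC Alpha 21264 | MIPS R10000   | SUN ULTRA     | IBM PPC604e   | Intel Pentium-MMX | Intel PPro/PII/PIII
------------------------------+---------------+------------------------+-----------------+-----------------+---------------+---------------+---------------+-------------------+--------------------
1st level cache               | read          | PCL_L1CACHE_READ       |                 |                 |               |               |               |                   |
                              | write         | PCL_L1CACHE_WRITE      |                 |                 |               |               |               |                   |
                              | read or write | PCL_L1CACHE_READWRITE  |                 |                 |               |               |               |                   |
                              | hit           | PCL_L1CACHE_HIT        |                 |                 |               |               |               |                   |
                              | miss          | PCL_L1CACHE_MISS       |                 |                 | MI1_9 + MI0_9 |               | IB0_5 + IB1_6 |                   |
1st level cache (data)        | read          | PCL_L1DCACHE_READ      |                 |                 |               | SU0_5         |               | PE0PE1_0          |
                              | write         | PCL_L1DCACHE_WRITE     |                 |                 |               | SU0_6         |               | PE0PE1_1          |
                              | read or write | PCL_L1DCACHE_READWRITE | AL1_14          |                 |               |               |               |                   |
                              | hit           | PCL_L1DCACHE_HIT       | AL1_14 - AL2_5  |                 |               |               |               |                   |
                              | miss          | PCL_L1DCACHE_MISS      | AL2_5           |                 | MI1_9         | SU0_11        | IB1_6         | PE0PE1_37         | PP0PP1_1
1st level cache (instruction) | read          | PCL_L1ICACHE_READ      |                 |                 |               |               |               | PE0PE1_12         | PP0PP1_5
                              | write         | PCL_L1ICACHE_WRITE     |                 |                 |               |               |               |                   |
                              | read or write | PCL_L1ICACHE_READWRITE | AL1_13          |                 |               | SU0_4         |               |                   |
                              | hit           | PCL_L1ICACHE_HIT       | AL1_13 - AL2_3  |                 |               | SU1_4         |               |                   |
                              | miss          | PCL_L1ICACHE_MISS      | AL2_3           |                 | MI0_9         | SU0_4 - SU1_4 | IB0_5         | PE0PE1_14         | PP0PP1_6
Table 3.1: 1st level cache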
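category                      | event         | event name             | DEC Alpha 21164 | DEC Alpha 21264 | MIPS R10000     | SUN ULTRA | IBM PPC604e | Intel Pentium-MMX | Intel PPro/PII/PIII
------------------------------+---------------+------------------------+-----------------+-----------------+-----------------+-----------+-------------+-------------------+----------------------
2nd level cache               | read          | PCL_L2CACHE_READ       | AL1_16          |                 |                 |           |             |                   |
                              | write         | PCL_L2CACHE_WRITE      | AL1_17          |                 |                 |           |             |                   |
                              | read or write | PCL_L2CACHE_READWRITE  | AL1_15 (fn. 1)  |                 |                 | SU0_8     |             |                   | PP0PP1_17
                              | hit           | PCL_L2CACHE_HIT        | AL1_15 - AL2_14 |                 |                 | SU1_8     |             |                   |
                              | miss          | PCL_L2CACHE_MISS       | AL2_14          |                 | MI1_10 + MI0_10 | SU1_9     |             |                   | PP0PP1_13
2nd level cache (data)        | read          | PCL_L2DCACHE_READ      |                 |                 |                 |           |             |                   | PP0PP1_11
                              | write         | PCL_L2DCACHE_WRITE     |                 |                 |                 |           |             |                   | PP0PP1_12
                              | read or write | PCL_L2DCACHE_READWRITE |                 |                 |                 |           |             |                   | PP0PP1_11 + PP0PP1_12
                              | hit           | PCL_L2DCACHE_HIT       |                 |                 |                 |           |             |                   |
                              | miss          | PCL_L2DCACHE_MISS      |                 |                 | MI1_10          |           |             |                   |
2nd level cache (instruction) | read          | PCL_L2ICACHE_READ      |                 |                 |                 |           |             |                   |
                              | write         | PCL_L2ICACHE_WRITE     |                 |                 |                 |           |             |                   |
                              | read or write | PCL_L2ICACHE_READWRITE |                 |                 |                 |           |             |                   |
                              | hit           | PCL_L2ICACHE_HIT       |                 |                 |                 |           |             |                   |
                              | miss          | PCL_L2ICACHE_MISS      |                 |                 | MI0_10          |           |             |                   |
Table 3.2: Level-2-Cache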
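category          | event | event name    | DEC Alpha 21164 | DEC Alpha 21264 | MIPS R10000 | SUN ULTRA | IBM PPC604e   | Intel Pentium-MMX | Intel PPro/PII/PIII
------------------+-------+---------------+-----------------+-----------------+-------------+-----------+---------------+-------------------+--------------------
TLB               | hit   | PCL_TLB_HIT   |                 |                 |             |           |               |                   |
                  | miss  | PCL_TLB_MISS  |                 |                 | MI1_7       |           | IB1_7 + IB0_6 |                   |
TLB (instruction) | hit   | PCL_ITLB_HIT  |                 |                 |             |           |               |                   |
                  | miss  | PCL_ITLB_MISS | AL2_4           | AL264_1_5       |             |           | IB1_7         | PE0PE1_13         | PP0PP1_7
TLB (data)        | hit   | PCL_DTLB_HIT  |                 |                 |             |           |               |                   |
                  | miss  | PCL_DTLB_MISS | AL2_6           |                 |             |           | IB0_6         | PE0PE1_2          |
Table 3.3: Translation-Look-aside-Buffer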
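category               | event           | event name           | DEC Alpha 21164 | DEC Alpha 21264 | MIPS R10000     | SUN ULTRA  | IBM PPC604e | Intel Pentium-MMX | Intel PPro/PII/PIII
-----------------------+-----------------+----------------------+-----------------+-----------------+-----------------+------------+-------------+-------------------+--------------------
cycles                 |                 | PCL_CYCLES           | AL0_0           | AL264_0_0       | MI0_0           | SU0_0      | IB3_1       | PE0_4             | PP0PP1_61
elapsed cycles         |                 | PCL_ELAPSED_CYCLES   | PCC (fn. 2)     | PCC (fn. 3)     |                 | TC (fn. 4) |             | TSC (fn. 5)       | TSC (fn. 6)
completed instructions | integer         | PCL_INTEGER_INSTR    | AL1_9           |                 |                 |            | IB0_14      |                   |
                       | floating-point  | PCL_FP_INSTR         | AL1_10 (fn. 7)  |                 | MI1_5           |            | IB0_15      | PE0PE1_30 (fn. 8) | PP0_0
                       | load            | PCL_LOAD_INSTR       | AL1_11          |                 | MI1_2           |            | IB0_16      |                   |
                       | store           | PCL_STORE_INSTR      | AL1_12          |                 | MI1_3           |            |             |                   |
                       | load or store   | PCL_LOADSTORE_INSTR  |                 |                 |                 |            |             | PE0PE1_36         | PP0PP1_0
                       | sum             | PCL_INSTR            | AL0_1 (fn. 9)   | AL264_0_1       | MI0_15 (fn. 10) | SU1_1      | IB0_2       | PE0PE1_20         | PP0PP1_44
branch instructions    | succ. predicted | PCL_JUMP_SUCCESS     |                 |                 | MI0_6 - MI1_8   |            |             | PE1_4 (fn. 11)    | PP0PP1_52
                       | wrong predicted | PCL_JUMP_UNSUCCESS   | AL2_2           | AL264_1_2       | MI1_8           |            |             | PE0PE1_16 - PE1_4 | PP0PP1_51
                       | sum             | PCL_JUMP             |                 | AL264_1_1       | MI0_6           |            | IB1_16      | PE0PE1_16         | PP0PP1_50
atomic instructions    | with success    | PCL_ATOMIC_SUCCESS   | AL2_13          |                 | MI1_4 - MI0_5   |            | IB1_9       |                   |
                       | without success | PCL_ATOMIC_UNSUCCESS |                 |                 | MI0_5           |            |             |                   |
                       | sum             | PCL_ATOMIC           |                 |                 | MI1_4           |            |             |                   |
Table 3.4: Instructions and functional units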
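category                 | event          | event name        | DEC Alpha 21164 | DEC Alpha 21264 | MIPS R10000 | SUN ULTRA | IBM PPC604e | Intel Pentium-MMX | Intel PPro/PII/PIII
-------------------------+----------------+-------------------+-----------------+-----------------+-------------+-----------+-------------+-------------------+--------------------
blocked functional units | integer        | PCL_STALL_INTEGER |                 |                 |             |           |             |                   |
                         | floating-point | PCL_STALL_FP      |                 |                 |             |           | IB2_19      |                   |
                         | branch         | PCL_STALL_JUMP    |                 |                 |             |           | IB2_12      |                   |
                         | load           | PCL_STALL_LOAD    |                 |                 |             |           |             | PE0PE1_24         |
                         | store          | PCL_STALL_STORE   |                 |                 |             |           |             | PE0PE1_23         |
                         | sum            | PCL_STALL         |                 |                 |             |           |             |                   | PP0PP1_58
Table 3.5: Blocked units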
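event              | event name            | DEC Alpha 21164     | DEC Alpha 21264 | MIPS R10000       | SUN ULTRA    | IBM PPC604e        | Intel Pentium-MMX     | Intel PPro/PII/PIII
-------------------+-----------------------+---------------------+-----------------+-------------------+--------------+--------------------+-----------------------+----------------------
MFLOPS             | PCL_MFLOPS            | AL1_10/AL2_11 * MHz |                 | MI1_5/MI0_0 * MHz |              | IB0_15/IB1_1 * MHz | PE0PE1_30/PE0_4 * MHz | PP0_0/PP0PP1_61 * MHz
instr./cycle       | PCL_IPC               | AL0_1/AL2_11        |                 |                   | SU0_0/SU1_0  | IB0_15/IB1_1       | PE0PE1_20/PE0_4       | PP0PP1_44/PP0PP1_61
L1 Dcache missrate | PCL_L1DCACHE_MISSRATE | AL2_5/AL1_14        |                 |                   |              |                    | PE0PE1_37/PE0PE1_36   | PP0PP1_1/PP0PP1_0
L2 Dcache missrate | PCL_L2DCACHE_MISSRATE |                     |                 |                   | SU1_9/SU0_11 |                    |                       |
memory-ops/FP-ops  | PCL_MEM_FP_RATIO      |                     |                 |                   |              |                    | PE0PE1_36/PE0PE1_30   |
Table 3.6: Rates and Ratios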
3.2 Interface Functions
The interface functions to control the performance counters are given below. All functions are callable from
C, C++, Fortran, and Java. All functions return status codes with the following meaning:
PCL_SUCCESS             function finished successfully
PCL_NOT_SUPPORTED       requested event is not supported on this hardware
PCL_TOO_MANY_EVENTS     more events requested than performance counters are available
PCL_TOO_MANY_NESTINGS   there are more nested calls than allowed (PCL_MAX_NESTING_LEVEL)
PCL_TOO_ILL_NESTING     either a different number or different types of events are requested in nested
                        calls
PCL_ILL_EVENT           event identifier illegal
PCL_MODE_NOT_SUPPORTED  performance counting for that mode is not supported
PCL_FAILURE             failure for some unspecified reason
3.2.1 PCLquery
This function queries whether a certain functionality is available on this machine. The user supplies
in counter_list an array of size ncounter of event names (of type integer). Event names are any
of those introduced in tables 3.1 to 3.5 in the previous section. In mode, the user specifies the execution
mode for which performance data should be gathered: PCL_MODE_USER specifies counting in user mode,
PCL_MODE_SYSTEM specifies counting in system mode, and PCL_MODE_USER_SYSTEM specifies counting
in both modes. The function returns PCL_SUCCESS if the requested functionality is possible (i.e. if
the requested events can be counted in parallel); otherwise an error code is returned that indicates why the
requested events are not supported on this system. No resources are allocated by this call.
int PCLquery(
    int *counter_list,    /* I: requested event counters */
    int ncounter,         /* I: number of counters */
    unsigned int mode     /* I: mode flags (PCL_MODE_xxx) */
);
3.2.2 PCLstart
With PCLstart, performance counting is started (if it is possible). The user supplies in counter_list an array
of size ncounter of event names. Event names are any of those introduced in tables 3.1 to 3.5 in the
previous section. mode has the same meaning as in the description of PCLquery. If the requested functionality
is available, the appropriate performance counters are cleared and started. On success, PCL_SUCCESS is
returned, otherwise an error code is returned.
int PCLstart(
    int *counter_list,    /* I: events to be counted */
    int ncounter,         /* I: number of counters */
    unsigned int mode     /* I: mode flags (PCL_MODE_xxx) */
);
3.2.3 PCLread
Reads out the performance counters and returns the counter values. Each result value is written either
into the (user-supplied) integer-typed buffer i_result_list or into the (user-supplied) floating-point-typed
buffer fp_result_list, both of size ncounter. PCL_CNT_TYPE is a 64-bit integer type, PCL_FP_CNT_TYPE
is a 64-bit floating point type. Which of the buffers is used for the i-th result depends on the requested i-th
event type. If the i-th event type is less than PCL_MFLOPS, the result is an integer value which is stored in
i_result_list[i]. If the i-th event type is greater than or equal to PCL_MFLOPS (i.e. belongs to the category
rates and ratios), the result is a floating point value stored in fp_result_list[i]. If the i-th result is stored in
i_result_list[i], the content of fp_result_list[i] is undefined, and vice versa.

Processor                            | OS                | software used | counters saved on context switches
-------------------------------------+-------------------+---------------+-----------------------------------
Alpha 21164                          | Digital Unix 4.0x |               | yes (fn. 14)
Alpha 21264                          | Digital Unix 4.0e |               | yes (fn. 15)
Alpha 21164                          | CRAY Unicos/mk    |               | not necessary (fn. 16)
R10000                               | SGI IRIX 6.x      |               | yes
UltraSPARC I/II                      | Solaris 2.x       | perfmon       | no
PowerPC 604e                         | AIX 4.1, 4.2      | PMapi         | yes
Pentium/PPro/Pentium II/Pentium III  | Linux 2.x         | msr           | no
Table 3.7: Supported systems

The arguments supplied with the call to PCLread must correspond to the latest call to PCLstart, i.e. the
number of requested performance counters must be equal. If no error occurs, PCL_SUCCESS is returned,
otherwise an error code. The performance counters are (logically) not stopped.
int PCLread(
    PCL_CNT_TYPE *i_result_list,      /* O: int counter values */
    PCL_FP_CNT_TYPE *fp_result_list,  /* O: fp counter values */
    int ncounter                      /* I: number of events */
);
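The rule for choosing between the two result buffers can be sketched in C as follows. To keep the sketch compilable without pcl.h, it uses stand-ins: cnt_t and fp_cnt_t play the roles of PCL_CNT_TYPE and PCL_FP_CNT_TYPE, and FIRST_RATE_EVENT plays the role of PCL_MFLOPS. The stand-in names and the value 100 are invented for this illustration (including the assumption of a signed integer type); the helper result_as_double is hypothetical and not part of PCL.

```c
#include <stdint.h>

/* Stand-ins for declarations that pcl.h would provide (assumptions). */
typedef int64_t cnt_t;              /* role of PCL_CNT_TYPE    */
typedef double  fp_cnt_t;           /* role of PCL_FP_CNT_TYPE */
enum { FIRST_RATE_EVENT = 100 };    /* role of PCL_MFLOPS; value invented */

/* Events at or above the threshold belong to "rates and ratios". */
int is_rate_event(int event)
{
    return event >= FIRST_RATE_EVENT;
}

/* Hypothetical helper: fetch the i-th result from whichever buffer is
   defined for the i-th event. The other buffer's i-th entry is never
   read, because its content is undefined. */
double result_as_double(int event, const cnt_t *i_result_list,
                        const fp_cnt_t *fp_result_list, int i)
{
    if (is_rate_event(event))
        return fp_result_list[i];
    return (double)i_result_list[i];
}
```

With real code, the comparison would be made against PCL_MFLOPS itself, as the Java example in section 3.5.3 does.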
3.2.4 PCLstop
Stops performance counting and returns the counter values. Result values are written into the (user-supplied)
buffers i_result_list or fp_result_list, both of size ncounter. See PCLread for a description of how the results
are stored in the two arrays. The arguments supplied with the call to PCLstop must correspond to the latest
call to PCLstart, i.e. the number of requested performance counters must be equal. If no error occurs,
PCL_SUCCESS is returned, otherwise an error code.
int PCLstop(
    PCL_CNT_TYPE *i_result_list,      /* O: int counter values */
    PCL_FP_CNT_TYPE *fp_result_list,  /* O: fp counter values */
    int ncounter                      /* I: number of events */
);
3.3 Programming Aspects
The allowed calling sequence is one call to PCLstart, followed by zero or more calls to PCLread, followed
by one call to PCLstop. Between a call to PCLstart and the matching call to PCLstop (and possible calls to
PCLread), further allowed calling sequences may be nested, provided they use the same number of events and
the same event types.
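The calling-sequence rules can be made concrete with a small checker, sketched here in C. This is not how PCL is implemented; it is a hypothetical model (all names and the limits MAX_DEPTH and MAX_EVENTS are invented) that tracks the nesting depth and rejects a nested start with a different event list, as well as a read or stop without a preceding start.

```c
#include <string.h>

enum { MAX_DEPTH = 8, MAX_EVENTS = 4 };   /* invented limits */

enum seq_status {
    SEQ_OK,
    SEQ_BAD_COUNT,     /* event count out of range or not matching     */
    SEQ_TOO_DEEP,      /* more nested sequences than MAX_DEPTH         */
    SEQ_ILL_NESTING,   /* nested start with different events           */
    SEQ_NOT_STARTED    /* read or stop without a matching start        */
};

struct seq_checker {
    int depth;                 /* number of currently open sequences */
    int nevents;               /* event count of the outermost start */
    int events[MAX_EVENTS];    /* event list of the outermost start  */
};

enum seq_status seq_start(struct seq_checker *s, const int *events, int n)
{
    if (n < 1 || n > MAX_EVENTS)
        return SEQ_BAD_COUNT;
    if (s->depth == MAX_DEPTH)
        return SEQ_TOO_DEEP;
    if (s->depth == 0) {
        /* Outermost start: remember what is being measured. */
        s->nevents = n;
        memcpy(s->events, events, (size_t)n * sizeof events[0]);
    } else if (n != s->nevents ||
               memcmp(events, s->events, (size_t)n * sizeof events[0]) != 0) {
        return SEQ_ILL_NESTING;
    }
    s->depth++;
    return SEQ_OK;
}

enum seq_status seq_read(const struct seq_checker *s, int n)
{
    if (s->depth == 0)
        return SEQ_NOT_STARTED;
    return n == s->nevents ? SEQ_OK : SEQ_BAD_COUNT;
}

enum seq_status seq_stop(struct seq_checker *s, int n)
{
    enum seq_status st = seq_read(s, n);
    if (st == SEQ_OK)
        s->depth--;            /* closes the innermost open sequence */
    return st;
}
```

A zero-initialized struct seq_checker represents the state before any start. The nested example in section 3.5.2 corresponds to one outer seq_start, then a seq_start/seq_stop pair per iteration, then the outer seq_stop.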
On systems with virtual (low level) performance counters, migrating a process to another processor
is possible (SGI, AIX). On the other systems, we bind the executing process to a processor (DEC,
SOLARIS) (see footnote 12), or the process cannot migrate (CRAY). On Solaris systems, if the process is not
bound to a specific processor, the process gets bound to processor 0 when executing the PCLstart function. On
DEC systems, the process gets bound to the processor it is currently running on.
Currently, performance counters are not saved on context switches on Solaris and Linux systems by our
library, and therefore performance measurements should be done only on a lightly loaded system.
Currently, we do not check whether any other process uses the performance counters as well (see footnote 13).
Therefore, on certain systems, if two distinct processes use performance counters in parallel, they may disturb
each other.
To avoid overflow, e.g. on systems with 32-bit hardware counters, an interval timer is used on these
systems (Solaris, AIX, Linux) which interrupts the process every second. Programs which use the setitimer
system call (or the SIGALRM signal) may be in conflict with PCL.
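How periodic sampling can extend a 32-bit hardware counter to 64 bits is sketched below in C. The report does not describe PCL's internal mechanism, so this is only one plausible technique under stated assumptions: the names wide_counter, wide_counter_init, and wide_counter_sample are hypothetical, and the sketch assumes the counter wraps modulo 2^32 and is sampled often enough that fewer than 2^32 events occur between two samples.

```c
#include <stdint.h>

/* 64-bit software total built from a wrapping 32-bit hardware counter. */
struct wide_counter {
    uint64_t total;   /* events accumulated so far            */
    uint32_t last;    /* hardware reading at the last sample  */
};

/* Start accumulating from the current hardware reading. */
void wide_counter_init(struct wide_counter *w, uint32_t now)
{
    w->total = 0;
    w->last = now;
}

/* Fold a new hardware reading into the total and return the total.
   Storing the difference in a uint32_t reduces it modulo 2^32, so the
   delta is correct even if the hardware counter wrapped once since the
   previous sample. */
uint64_t wide_counter_sample(struct wide_counter *w, uint32_t now)
{
    uint32_t delta = now - w->last;
    w->total += delta;
    w->last = now;
    return w->total;
}
```

In such a scheme, the sampling function would be called from the timer handler and once more when the counters are read or stopped; the one-second interval then only has to be short compared with the time the hardware counter needs to wrap.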
3.4 Supported Systems
Currently, the Performance Counter Library is available on the systems listed in table 3.7.
Footnotes:
12 On Linux systems, it is currently not possible to bind a process to a processor.
13 This may be a program using the performance counters directly, or through a different application interface.
3.5 Examples
3.5.1 Simple Example
Figure 3.1 shows a simple example program that uses the Performance Counter Library. First, the list of
requested events (PCL_CYCLES for cycles, and PCL_INSTR for completed instructions) is put into the
array counter_list. With the call to PCLquery we test whether it is possible to serve
these two requested events simultaneously on the computer system where the program is executed. If this
is possible, event counting is started with the call to PCLstart. After that follows the code to be measured
and a call to PCLstop to stop performance counting and to read out the performance counter values. Then,
the results are printed.
3.5.2 Example with Nested Calls
Figure 3.2 shows an example of how to use nested calls. In this example, the number of cycles spent is
measured for the whole loop as well as for each iteration.
3.5.3 Example in Java
Figure 3.3 shows an example of how to use PCL in Java.
Footnotes to table 3.7:
14 Only one process can open the pfm-device, but spawned children have access to this device as well.
15 Only one process can open the pfm-device, but spawned children have access to this device as well.
16 There is no multiprogramming on application nodes.
#include <stdio.h>
#include <pcl.h>

void do_work(void);    /* code to be measured, defined elsewhere */

int main(int argc, char **argv)
{
    int counter_list[2];
    int ncounter, res;
    unsigned int mode;
    PCL_CNT_TYPE i_result_list[2];
    PCL_FP_CNT_TYPE fp_result_list[2];

    /* Define what we want to measure. */
    ncounter = 2;
    counter_list[0] = PCL_CYCLES;
    counter_list[1] = PCL_INSTR;
    /* define count mode */
    mode = PCL_MODE_USER;
    /* Check if this is possible on the machine. */
    if( PCLquery(counter_list, ncounter, mode) != PCL_SUCCESS) {
        printf("requested events not possible\n");
        return 1;
    }
    /* Start performance counting.
       We have checked already the requested functionality
       with PCLquery, so no error check would be necessary. */
    res = PCLstart(counter_list, ncounter, mode);
    if(res != PCL_SUCCESS) {
        printf("something went wrong\n");
        return 1;
    }
    /* Here comes the work to be measured. */
    do_work();
    /* Stop performance counting and get the counter values. */
    if( PCLstop(i_result_list, fp_result_list, ncounter) != PCL_SUCCESS)
        printf("problems with stopping counters\n");
    /* print out results */
    printf("%f instructions in %f cycles\n",
           (double)i_result_list[1], (double)i_result_list[0]);
    return 0;
}
Figure 3.1: Example program on how to use PCL
#include <stdio.h>
#include <pcl.h>
#define NITER 4

void do_work(void);    /* code to be measured, defined elsewhere */

int main(int argc, char **argv)
{
    int counter_list[1];
    int ncounter, res, iter;
    unsigned int mode;
    PCL_CNT_TYPE i_all_result_list, i_result_list[NITER];
    PCL_FP_CNT_TYPE fp_all_result_list, fp_result_list[NITER];

    /* Define what we want to measure. */
    ncounter = 1;
    counter_list[0] = PCL_CYCLES;
    /* define count mode */
    mode = PCL_MODE_USER;
    /* Start performance counting for the whole loop. */
    res = PCLstart(counter_list, ncounter, mode);
    for(iter = 0; iter < NITER; ++iter) {
        /* Start (nested) performance counting for this iteration. */
        res = PCLstart(counter_list, ncounter, mode);
        /* Here comes the work to be measured. */
        do_work();
        /* Stop performance counting and get counter values. */
        res = PCLstop(&i_result_list[iter], &fp_result_list[iter], ncounter);
    }
    /* Stop performance counting and get the counter values. */
    res = PCLstop(&i_all_result_list, &fp_all_result_list, ncounter);
    /* print out results */
    printf("used cycles: %f %f %f %f, total: %f\n",
           (double)i_result_list[0], (double)i_result_list[1],
           (double)i_result_list[2], (double)i_result_list[3],
           (double)i_all_result_list);
    return 0;
}
Figure 3.2: Example program on how to use nested calls to PCL
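As a small follow-up to the nested measurement in Figure 3.2, the part of the outer measurement that is not covered by any inner measurement (loop overhead and the nested PCL calls themselves) can be estimated by subtracting the inner results from the total. The following is only a sketch and not part of PCL: the helper name unattributed_count is hypothetical, and uint64_t stands in for PCL_CNT_TYPE under the assumption that counter values are unsigned 64-bit integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCL function): the part of an outer counter
   value that is not covered by the nested inner measurements. */
uint64_t unattributed_count(uint64_t total, const uint64_t *inner, size_t n)
{
    uint64_t sum = 0;
    size_t i;

    assert(inner != NULL || n == 0);
    for (i = 0; i < n; ++i)
        sum += inner[i];

    /* Measurement noise could make the sum exceed the total; clamp to 0. */
    return (sum > total) ? 0 : total - sum;
}
```

With the variables of Figure 3.2, a call could look like unattributed_count(i_all_result_list, i_result_list, NITER), provided the PCL counter type converts to uint64_t on the platform in question.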
// import PCL class description
import PCL;

public class pcl_jtest
{
    static final int N = 200; // matrix dimension
    static double[][] a = new double[N][N];
    static double[][] b = new double[N][N];
    static double[][] c = new double[N][N];

    // test method
    static void matadd(double[][] a, double[][] b, double[][] c)
    {
        int i, j;
        for (i = 0; i < N; ++i)
            for (j = 0; j < N; ++j)
                a[i][j] = b[i][j] + c[i][j];
    }

    // main program
    public static void main(String[] args)
    {
        int event;
        PCL pcl = new PCL();                  // instantiate PCL
        int mode = pcl.PCL_MODE_USER_SYSTEM;  // count mode
        int[] events = new int[1];            // events; array required
        long[] i_result = new long[1];        // int results; array required
        double[] fp_result = new double[1];   // fp results

        // test supported events
        for(event = 0; event < pcl.PCL_MAX_EVENT; ++event)
        {
            events[0] = event;
            if( pcl.PCLquery(events, 1, mode) == pcl.PCL_SUCCESS)
            {
                // start counting
                if( pcl.PCLstart(events, 1, mode) != pcl.PCL_SUCCESS)
                    System.out.println("problem with starting event");
                // test program
                matadd(a, b, c);
                // stop counting
                if( pcl.PCLstop(i_result, fp_result, 1) != pcl.PCL_SUCCESS)
                    System.out.println("problem with stopping event");
                // print result for this event
                if(event < pcl.PCL_MFLOPS)
                    // integer result
                    System.out.println(pcl.PCLeventname(event)+":"+i_result[0]);
                else
                    // floating point result
                    System.out.println(pcl.PCLeventname(event)+":"+fp_result[0]);
            }
        }
    }
}
Figure 3.3: Example program in Java.
Chapter 4
Related Projects
The Parallel Tools Consortium has defined a subproject called PerfAPI. Its main goal is to define
an API to access all system-specific hardware performance counters, i.e. to start, read out, and stop all hardware
performance counters on a microprocessor with all events available on that system. This approach differs
from ours, as we focus on a single framework on all systems, i.e. a uniform application interface as
well as a well-defined set of events accessible with uniform names on all systems. For the PerfAPI project,
see http://www.cs.utk.edu/~mucci/pdsa/.
There are many interfaces to access performance counters on one specific system, e.g. libperfex on SGI
systems with the R10000 processor or the pfm-device on Digital Unix systems (21064 or 21164 processors).
To establish a common platform for performance counting on all POWER and PowerPC microprocessors,
IBM has defined an application interface called PMapi. Their approach, too, is to define the set of
possible events as the union of all possible events on all POWER and PowerPC microprocessors. On Linux
systems, libpperf supports all Pentium, PentiumPro, and Pentium II processors through a common interface.
Chapter 5
Summary
PCL – the Performance Counter Library – is a common interface for portable performance counting on
modern microprocessors. It is intended to be used by the expert application programmer who wishes to do
detailed analysis on program performance, and it is intended to be used by tool writers who need a common
platform to base their work on.
The application interface supports querying for functionality, starting and stopping performance counting, and
reading out the values of the performance counters. Nested calls to the functions are possible (with the
same events), which allows hierarchical performance measurements on sections and subsections
of a program. Further, performance counting in user mode, system mode, and user-or-system mode can be
distinguished. Language bindings are available for C, C++, Fortran, and Java.
PCL is available at http://www.fz-juelich.de/zam/PCL/.
Chapter 6
Acknowledgments
We would like to thank the people who have written the software we based our work on, namely Richard
Enbody for perfmon on UltraSPARC systems, and M. Patrick Goda and Michael S. Warren for libpperf,
which itself is based on the msr device implemented by Stephan Meyer for Linux versions 2.0.x, 2.1.x, and
2.2.x.
Bibliography
[1] Kirk W. Cameron and Yong Luo. Performance evaluation using hardware performance counters.
http://www.c3.lanl.gov/~kirk/isca99/.
[2] Digital Equipment Corporation, Maynard, Massachusetts. man 7 pfm.
[3] Digital Equipment Corporation, Maynard, Massachusetts. Alpha AXP Architecture Handbook, version
2 edition, 1994.
[4] Silicon Graphics Inc. man libperfex.
[5] MIPS Technologies Inc., Mountain View, California. Definition of MIPS R12000 Performance-
counter.
[6] Marco Zagha et al. Performance Analysis using the MIPS R10000 Performance Counters. In
Supercomputing 96. IEEE Computer Society, 1996.
[7] Sun Microsystems, Palo Alto, California. UltraSPARC User’s Manual, 1997.
[8] SPARC International, Inc. The SPARC Architecture Manual, Version 9, 1997.
[9] Motorola Inc., IBM. The PowerPC Family: The Bus Interface for 32-Bit Microprocessors, March 1997.
[10] James E. Smith and Shlomo Weiss. POWER and PowerPC. Morgan Kaufmann Publishers, Inc., 1994.
[11] Motorola Inc., IBM. PowerPC 604e RISC Microprocessor User’s Manual, March 1998.
[12] http://developer.intel.com/drg/mmx/AppNotes/perfmon.htm.
[13] Intel Corp. Pentium Pro Family Developers Manual 1-3, 1997.
Appendix A
Performance Counters on
Microprocessors
This chapter introduces performance counting aspects of commonly used microprocessors. Each section
introduces a microprocessor family and is divided into three subsections: base information on the
microprocessor, performance counter events sorted by performance counter, and
additional comments and references to existing implementations that access the performance counters on that
specific microprocessor. In the second part of each section, the description of the performance counters,
each event is given as follows. The first line contains an internal identifier: 2 letters corresponding to the
name of the microprocessor, the number of the performance counter, and, after an underscore, another number
giving the event number (for example, AL1_8 is event 8 on counter 1 of the DEC Alpha 21164). We will
refer to the whole name as a unique identifier in subsequent chapters.
The next line contains a manufacturer-specific name or definition (in italics) of the event as found in the
manufacturer’s literature. After that, a description of the event follows.
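The identifier scheme described above (two letters, counter number, underscore, event number) is simple enough to be split mechanically. The following sketch shows one way to do so; it is an illustration only, the names event_id and parse_event_id are hypothetical and not part of PCL, and it covers only the two-letter scheme described in the text (identifiers with a longer prefix are rejected).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper type, not part of PCL. */
struct event_id {
    char family[3];   /* two-letter processor code, e.g. "AL" */
    int  counter;     /* performance counter number */
    int  event;       /* event number on that counter */
};

/* Parse an identifier such as "AL1_8".
   Returns 0 on success and -1 if the string is malformed. */
int parse_event_id(const char *s, struct event_id *out)
{
    char *end;
    long counter, event;

    assert(s != NULL && out != NULL);

    /* two letters, then at least one digit */
    if (!isalpha((unsigned char)s[0]) || !isalpha((unsigned char)s[1]) ||
        !isdigit((unsigned char)s[2]))
        return -1;
    counter = strtol(s + 2, &end, 10);

    /* an underscore, then at least one digit, then end of string */
    if (end[0] != '_' || !isdigit((unsigned char)end[1]))
        return -1;
    event = strtol(end + 1, &end, 10);
    if (*end != '\0')
        return -1;

    out->family[0] = s[0];
    out->family[1] = s[1];
    out->family[2] = '\0';
    out->counter = (int)counter;
    out->event = (int)event;
    return 0;
}
```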
A.1 DEC Alpha
To use performance counters on DEC Alpha microprocessors, additional software support is necessary as
the low-level interface is given in PAL-Code. Tru64 (formerly Digital Unix) has the pseudo device pfm [2]
which has a high-level interface based on ioctl-calls to access the performance counters. The pfm-device
distinguishes between user and system mode event counting. Only one process per CPU can open
the device, but child processes can be spawned which influence the performance counters as well.
On the CRAY T3E, which uses the 21164 microprocessor too, there is no software interface published
to access the performance counters.
A.1.1 DEC Alpha 21164
The RISC processor DEC Alpha 21164 has 3 performance counters. First, let’s have a closer look at
the architecture of the microprocessor. The first cache level contains an instruction cache (ICACHE) and a
data cache (DCACHE), each having a size of 8 KB. The second level cache (SCACHE) has a size of 96
KB and buffers instructions and data. An additional option is an external third level cache (BCACHE). The
memory hierarchy is given in figure A.1. A detailed description of the Alpha architecture can be found in
[3].
The 21164 contains pipelines of the following types:
• 7-stage integer pipelines
• 9-stage floating point pipelines
• 13-stage memory reference pipeline
The performance counter part on the DEC Alpha 21164 contains 3 counters with distinct purposes.
Roughly speaking, counter 0 counts machine cycles or issued instructions, counter 1 counts successful
operations, and counter 2 counts unsuccessful operations. For the counters, 2, 24, and 23 different events
are defined, respectively, and the counters can operate in parallel. There is one restriction: when certain
events are counted on counter 2, counter 1 gathers special events.
[Figure A.1: Principal memory architecture of the DEC Alpha 21164. Levels shown: registers (32 * 64 bit); 1st level cache with I-Cache (8 KByte) and D-Cache (8 KByte); 2nd level cache (96 KByte); optional 3rd level cache (0...64 MByte); main memory.]
Events countable on the DEC Alpha 21164 are:
• Counter 0:
– AL0_0
CYCLES
machine cycles
– AL0_1
ISSUES
issued instructions
• Counter 1:
– AL1_0
NON_ISSUE_CYCLES
Number of cycles in which either no instruction was issued to the pipeline or the pipeline
was stalled.
– AL1_1
SPLIT_ISSUE_CYCLES
Not all startable instructions have been included into the instruction pipeline.
– AL1_2
PIPELINE_DRY
A parallel execution of instructions was not possible.
– AL1_3
REPLAY_TRAP
If a started instruction could not be further processed, the instruction is issued again in the
instruction pipeline, which is called a replay trap.
– AL1_4
SINGLE_ISSUE_CYCLES
Exactly 1 instruction was issued in a cycle.
– AL1_5
DUAL_ISSUE_CYCLES
Exactly 2 instructions were issued in a cycle.
– AL1_6
TRIPLE_ISSUE_CYCLES
Exactly 3 instructions were issued in a cycle.
– AL1_7
QUAD_ISSUE_CYCLES
Exactly 4 instructions were issued in a cycle.
– AL1_8
FLOW_CHANGE
A jump instruction was executed. Conditional and unconditional jumps are distinguished.
Remark:
∗ If counter 2 counts branch mispredictions, then branches are counted.
∗ If counter 2 counts pc mispredictions, then jsr (subroutine calls, returns) are counted.
– AL1_9
INTEGER_OPERATE
Executed operations in the integer pipelines.
– AL1_10
FP_INSTRUCTIONS
Executed operations in the floating point pipelines.
– AL1_11
LOAD_INSTRUCTIONS
Executed load instructions.
– AL1_12
STORE_INSTRUCTIONS
Executed store instructions.
– AL1_13
ICACHE_ACCESS
Accesses to the 1st level instruction cache (ICACHE).
– AL1_14
DCACHE_ACCESS
Accesses to the 1st level data cache (DCACHE).
– AL1_15-AL1_21
"CBOX1"
Accesses to 2nd or 3rd level cache. Additional options need to be defined [3]:
∗ AL1_15
SCACHE_ACCESS
Accesses to 2nd level cache (SCACHE).
∗ AL1_16
SCACHE_READ
Read accesses to 2nd level cache (SCACHE).
∗ AL1_17
SCACHE_WRITE
Write accesses to 2nd level cache (SCACHE).
∗ AL1_18
SCACHE_VICTIM
Number of non-completed memory frees in 2nd level cache (SCACHE).
∗ AL1_19
BCACHE_HIT
Hits in 3rd level cache (BCACHE).
∗ AL1_20
BCACHE_VICTIM
Number of non-completed memory frees in 3rd level cache (BCACHE).
∗ AL1_21
SYS_REQ
Requests of additional hardware (multiprocessor system).
• Counter 2:
– AL2_0
LONG_STALLS
Number of times the instruction pipeline was blocked for more than 12 cycles.
– AL2_1
PC_MISPR
Program counter mispredictions.
– AL2_2
BRANCH_MISPREDICTS
Branch mispredictions.
– AL2_3
ICACHE_MISSES
Misses in the 1st level instruction cache (ICACHE).
– AL2_4
ITB_MISSES
Misses in instruction TLB.
– AL2_5
DCACHE_MISSES
Misses in 1st level data cache (DCACHE).
– AL2_6
DTB_MISS
Misses in data TLB.
– AL2_7
LOADS_MERGED
An entry in the Miss-Address-File corresponds to a memory request.
– AL2_8
LDU_REPLAYS
A replay trap was triggered by a missed load operation.
– AL2_9
WB_MAF_FULL_REPLAYS
A replay trap was triggered by a missed write-back operation or by an inconsistency in the
miss-address-file.
– AL2_10
EXTERNAL
A signal change at the pin "perf_mon_h" occurred.
– AL2_11
CYCLES
Number of cycles.
– AL2_12
MEM_BARRIER
Executed memory barrier instructions.
– AL2_13
LOAD_LOCKED
A locked load instruction was executed.
– AL2_14-AL2_21
"CBOX2"
Accesses to 2nd or 3rd level cache. Additional options need to be defined [3]:
∗ AL2_14
SCACHE_MISS
Misses on 2nd level cache.
∗ AL2_15
SCACHE_READ_MISS
Read misses on 2nd level cache.
∗ AL2_16
SCACHE_WRITE_MISS
Write misses on 2nd level cache.
∗ AL2_17
SCACHE_SH_WRITE
Number of write operations which go to caches other than the processor-specific 2nd level
cache.
∗ AL2_18
SCACHE_WRITE
Write accesses to 2nd level cache.
∗ AL2_19
BCACHE_MISS
Misses in 3rd level cache.
∗ AL2_20
SYS_INV
Requests of additional hardware to invalidate a cache line (multiprocessor).
∗ AL2_21
SYS_READ_REQ
Requests of additional hardware to read-copy a cache line (multiprocessor).
A.1.2 DEC Alpha 21264
The DEC Alpha 21264 is a four-way out-of-order-issue microprocessor that performs dynamic scheduling,
register renaming, and speculative execution. There are 4 integer execution units and 2 floating-point
execution units. The processor includes a 64 KB 1st level instruction cache and a 64 KB 1st level data cache.
The 21264 has 2 performance counters of 20 bit width each. Counter 0 is capable of counting one of 2
different events, and counter 1 is capable of counting one of 7 different events. Therefore, the ability to do
a detailed performance analysis on the 21264 is significantly reduced compared to the 21164.
Events countable on the DEC Alpha 21264 are:
• Counter 0:
– AL264_0_0
machine cycles
– AL264_0_1
retired instructions
• Counter 1:
– AL264_1_0
machine cycles
– AL264_1_1
retired conditional branches
– AL264_1_2
retired branch mispredicts
– AL264_1_3
retired DTB single misses * 2
– AL264_1_4
retired DTB double double misses
– AL264_1_5
retired ITB misses
– AL264_1_6
retired unaligned traps
– AL264_1_7
replay traps
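Because the 21264 counters are only 20 bits wide, a raw counter value wraps around after 2^20 events. The following is a minimal sketch of how successive raw readings could be accumulated into a 64-bit total. It assumes that the counter is sampled often enough to wrap at most once between two readings; how a raw value is obtained is platform specific and not shown. The names al264_delta and al264_accumulate are hypothetical and belong neither to PCL nor to any vendor interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AL264_COUNTER_BITS 20u
#define AL264_COUNTER_MASK ((UINT32_C(1) << AL264_COUNTER_BITS) - 1u)

/* Events counted between two raw 20-bit readings, modulo 2^20. */
uint32_t al264_delta(uint32_t before, uint32_t after)
{
    return (after - before) & AL264_COUNTER_MASK;
}

/* Add the events since the last reading to a 64-bit running total
   and remember the new reading. */
void al264_accumulate(uint64_t *total, uint32_t *last, uint32_t now)
{
    assert(total != NULL && last != NULL);
    *total += al264_delta(*last, now);
    *last = now & AL264_COUNTER_MASK;
}
```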
A.2 MIPS R10000/R12000
The R10000 and R12000 microprocessors of MIPS are 64-bit RISC microprocessors with integrated
performance counters. The differences between the two processors concerning performance counting are discussed
at the end of this section. The R10000 processor has 64 physical registers and 32 logical registers. The
1st level cache is split into a data cache and an instruction cache, both of size 32 KB. The 2nd level
cache can be between 512 KB and 16 MB and is unified, as it caches data as well as
instructions. The main memory can be up to 1 TB. Figure A.2 shows the memory architecture
of the processor.
[Figure A.2: Memory hierarchy of the MIPS R10000. Levels shown: registers (64 * 64 bit); 1st level cache with I-Cache (32 KByte) and D-Cache (32 KByte); 2nd level cache (512 KByte to 16 MByte); main memory.]
The R10000 microprocessor has 2 performance counters (a description can be found at
http://www.sgi.com/processors/r10k/performance.html) each capable of counting one of 16 different events.
The R10000 has 5 execution pipelines executing decoded instructions. There are 2 integer pipelines (ALU1,
ALU2), 2 floating point pipelines (FPU1, FPU2), and 1 address pipeline (LOAD/STORE). The integer and
floating point pipelines can operate in parallel. For a better understanding we define the two following
terms:
• issued: An instruction was decoded and supplied to the executing unit.
• graduated: The execution of an instruction has finished and all instructions issued before it
have finished, too.
Another term to be defined is the SCTP-Logic, the Secondary Cache Transaction Processing Logic,
whose task is to store up to 4 internally generated or 1 externally generated 2nd level cache transactions.
• Counter 0:
– MI0_0
Cycles
Machine cycles.
– MI0_1
Instructions issued
The counter is incremented by the sum of the following events:
∗ integer operations completed in this cycle. There can be 0-2 operations each cycle.
∗ floating-point operations completed in this cycle. There can be 0-2 operations each cycle.
∗ load/store operations which have been delivered in the last cycle to the address pipeline.
There can be 0 or 1 each cycle.
– MI0_2
Load/prefetch/sync/CacheOp issued
Each of these instructions is counted when started.
– MI0_3
Stores (including store-conditional) issued
Each time a store operation is delivered to the address calculation unit, the counter is
incremented.
– MI0_4
Store conditional issued
Each time a conditional store operation is delivered to the address calculation unit, the counter
is incremented.
– MI0_5
Failed store conditional
The counter is incremented each time a conditional store failed.
– MI0_6
Conditional Branch resolved
Count all resolved conditional branches.
– MI0_7
Quadwords written back from secondary cache
Counter is incremented each time a quadword is written from the 2nd level cache to the output
buffer.
– MI0_8
Correctable ECC errors on secondary cache data
A correctable 1-bit ECC error occurred while reading a quadword from the 2nd level cache.
– MI0_9
Instruction cache misses
Misses in the instruction cache.
– MI0_10
Secondary cache misses (instruction)
Instruction misses in the 2nd level cache.
– MI0_11
Secondary cache way mispredicted (instruction)
An attempt was made to load an instruction from the 2nd level cache and the entry is marked as
invalid.
– MI0_12
External intervention requests
Number of requests to the SCTP-Logic from outside of the processor (I/O devices,
multiprocessor etc.) for a copy of a cache line marked as shared.
– MI0_13
External invalidate requests
Number of requests to the SCTP-Logic from outside of the processor (I/O devices,
multiprocessor etc.) for invalidation of a cache line marked.
– MI0_14
Functional unit completion cycles
The counter is incremented if at least one of the functional units has completed an operation in
this cycle.
– MI0_15
Instruction graduated
The counter is incremented by the number of instructions which have been completed in the
last cycle. An integer multiplication or division increments the counter by 2.
• Counter 1:
– MI1_0
Cycles
Machine cycles.
– MI1_1
Instructions graduated
The counter is incremented by the number of instructions which have been completed in the last
cycle. An integer multiplication or division increments the counter by 2.
– MI1_2
Load/prefetch/sync/CacheOp graduated
Every completed instruction of this type is counted.
– MI1_3
Stores (including store-conditionals) graduated
Every completed store operation is counted.
– MI1_4
Store conditionals graduated
Every conditional store is counted independently of success. This is possible at most once a
cycle.
– MI1_5
Floating-point instructions graduated
Floating point instructions completed in the last cycle (0-4 each cycle).
– MI1_6
Quadwords written back from primary cache
The counter is incremented by 1, if in a cycle at least one quadword is written back from the 1st
level cache to the 2nd level cache.
– MI1_7
TLB refill exceptions
TLB misses are counted in the cycle after they occur.
– MI1_8
Branches mispredicted
The counter is incremented on every mispredicted branch.
– MI1_9
Primary data cache misses
Miss in the primary data cache.
– MI1_10
Secondary cache misses (data)
Miss in the secondary cache caused by a data access.
– MI1_11
Secondary cache way mispredicted (data)
The counter is incremented if the 2nd level cache controller tries to access the 2nd level cache
after a previous access failed.
– MI1_12
External intervention request is determined to have hit in secondary cache
The processor got an external request for a copy of a 2nd level cache block.
– MI1_13
External invalidate request is determined to have hit in secondary cache
The processor got an external request to invalidate a 2nd level cache block.
– MI1_14
Stores/prefetches with store hint to CleanExclusive secondary cache blocks
The SCTP-Logic got a request for status change of a cache line from CleanExclusive to
DirtyExclusive.
– MI1_15
Stores/prefetches with store hint to Shared secondary cache blocks
The status of a cache line was changed from Shared to DirtyExclusive.
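Since the R10000 has 2 counters with 16 events each, two events can be measured in the same run only if they belong to different counters. For example, MI0_6 (resolved conditional branches) and MI1_8 (mispredicted branches) can be counted together, whereas MI1_2 and MI1_9 cannot. The sketch below checks such a selection; it is an illustration only, and the names r10k_event and r10k_selection_valid are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define R10K_NUM_COUNTERS       2
#define R10K_EVENTS_PER_COUNTER 16

/* Hypothetical description of an event, e.g. {1, 8} for MI1_8. */
struct r10k_event {
    int counter;
    int event;
};

/* Returns true if all n events exist and no two of them need the same counter. */
bool r10k_selection_valid(const struct r10k_event *events, size_t n)
{
    bool used[R10K_NUM_COUNTERS] = { false, false };
    size_t i;

    assert(events != NULL || n == 0);
    for (i = 0; i < n; ++i) {
        int c = events[i].counter;
        int e = events[i].event;

        if (c < 0 || c >= R10K_NUM_COUNTERS)
            return false;
        if (e < 0 || e >= R10K_EVENTS_PER_COUNTER)
            return false;
        if (used[c])
            return false;   /* counter already taken by another event */
        used[c] = true;
    }
    return true;
}
```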
Software support for the performance counters on R10000 processors is available either on a lower
level in IRIX 6.x through the /proc file system or on a higher level through the perfex library [4]. The
kernel maintains data structures for 32 virtual performance counters with a size of 64 bits each. It is
possible to distinguish between counting in user mode, system mode, or both. When running in user mode,
performance counters are saved on context switches. For the perfex library, the routine start_counters zeroes
out the internal counters, and read_counters stops the counters after reading them.
Unlike the R10000, the R12000 has 4 counters each capable of counting one of 32 events. For
counter 1, a trigger mechanism was included such that an event is counted by counter 1 if any of the
other counters has reached a certain value. Additionally, conditional counting is possible. For example, it is
possible to count the number of cycles in which 4 instructions have been completed. Also, some semantic
inaccuracies concerning the definition of events have been clarified [5]. An introduction to measurement
and interpretation of events can be found in [6].
A.3 SUN ULTRASparc I/II
The UltraSPARC I/II 64-bit microprocessors of SUN can count performance-relevant
events. A detailed description of the SPARC V9 architecture can be found in [7]. Both variants have
8 times 24 64-bit registers which are organized in so-called windows to optimize argument passing on
subroutine calls without time-consuming copying of registers to memory. The 1st level cache has a 16 KB
data cache (D-cache) and a 16 KB instruction cache (I-cache). The 2nd level cache (E-cache) has a size of 512
KB up to 4 MB on UltraSPARC I, and 512 KB up to 16 MB on UltraSPARC II. The main memory can be
as large as 2 TB (see figure A.3).
[Figure A.3: Memory hierarchy of the SUN ULTRASparc II. Levels shown: registers (8 * 24 * 64 bit); 1st level cache with I-Cache (16 KByte) and D-Cache (16 KByte); 2nd level cache (512 KByte to 16 MByte); main memory.]
Another important component of the supporting logic is the UPA, the Universal Port Architecture, which
connects several processors over a high-speed crossbar-switch.
The microprocessor contains two performance counters (PIC0, PIC1), which are able to count different
events. Each counter can count one of 12 different events; two events can be counted on both counters,
which sums up to a total of 22 different events [8]. Additionally, there exists an elapsed cycle counter.
• Counter PIC0:
– SU0_0
Cycle_cnt
Machine cycles.
– SU0_1
Instr_cnt
Instructions graduated.
– SU0_2
Dispatch0_IC_miss
Number of cycles waiting after a miss in the 1st level instruction cache (including handling of a
follow-on E-cache miss).
– SU0_3
Dispatch0_storeBuf
Number of cycles a write buffer could not store new values (next instruction is a store
instruction).
– SU0_4
IC_ref
1st level instruction cache references.
– SU0_5
DC_rd
1st level data cache read references.
– SU0_6
DC_wr
1st level data cache write references.
– SU0_7
Load_use
Number of cycles instructions are waiting on a previous load operation.
– SU0_8
EC_ref
Number of 2nd level cache references.
– SU0_9
EC_write_hit_RDO
Number of hits on 2nd level cache read accesses in a read-for-ownership UPA transaction.
– SU0_10
EC_snoop_inv
Number of cache line invalidations due to UPA transactions.
– SU0_11
EC_rd_hit
Number of E-cache read hits caused by 1st level data cache miss.
• Counter PIC1 counts:
– SU1_0
Cycle_cnt
Machine cycles.
– SU1_1
Instr_cnt
Instructions graduated.
– SU1_2
Dispatch0_mispred
Number of cycles waiting with an empty instruction buffer after a wrong branch prediction.
– SU1_3
Dispatch0_FP_use
Number of cycles the first instruction in a group waits because the result of a previous
floating-point operation is not available.
– SU1_4
IC_hit
Number of 1st level instruction cache hits.
– SU1_5
DC_rd_hit
Number of 1st level data cache read hits.
– SU1_6
DC_wr_hit
Number of 1st level data cache write hits.
– SU1_7
Load_use_RAW
Number of cycles load operations spent in the instruction pipeline while at the same time a
read-write inconsistency exists because of a not-completed load operation.
– SU1_8
EC_hit
Number of 2nd level cache hits.
– SU1_9
EC_wb
Number of 2nd level cache misses causing a write-back operation.
– SU1_10
EC_snoop_cb
Number of UPA transactions which caused a copy-back of a 2nd level cache line.
– SU1_11
EC_ic_hit
Number of 2nd level cache read hits caused by a 1st level instruction cache miss.
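Several events in the two lists above form reference/hit pairs that sit on different counters, for example IC_ref (SU0_4) and IC_hit (SU1_4), DC_rd (SU0_5) and DC_rd_hit (SU1_5), or EC_ref (SU0_8) and EC_hit (SU1_8). Such a pair can therefore be counted in the same run, and a hit ratio follows directly. The sketch below shows the arithmetic only; the function names are hypothetical, and the assertion that hits never exceed references is an assumption about consistent readings.

```c
#include <assert.h>
#include <stdint.h>

/* Fraction of references that hit; 0.0 if there were no references. */
double hit_ratio(uint64_t refs, uint64_t hits)
{
    assert(hits <= refs);
    if (refs == 0)
        return 0.0;
    return (double)hits / (double)refs;
}

/* Number of references that missed. */
uint64_t miss_count(uint64_t refs, uint64_t hits)
{
    assert(hits <= refs);
    return refs - hits;
}
```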
The performance registers are controlled by the Performance Control Register (PCR) which can be
accessed only in privileged mode. Accesses to the PIC registers may be either in user or privileged mode,
depending on a bit in the PCR which can be changed in privileged mode. Event counting can be done either
for user mode, system mode, or both. Counter overflow is silent. For accurate measurements, an event
count should be taken as the difference between two reads of a performance counter.
The current version 2.6 of the Solaris operating system has no support for the performance counters in
the form of a programming interface. A software library to access the performance counters in a convenient
way is perfmon from Richard Enbody. A drawback of this package is that neither process migration to
another CPU on a multiprocessor machine nor a context switch to another process on the same CPU is
handled.
[Figure A.4: Memory hierarchy of the IBM PowerPC 604e. Levels shown: registers (32 * 32 bit); 1st level cache with I-cache (32 KByte) and D-cache (32 KByte); optional 2nd level cache; main memory.]
A.4 IBM PowerPC 604e
The PowerPC 604e is a 32-bit microprocessor with 32 32-bit integer and 32 32-bit floating point registers.
The 1st level cache consists of a 32 KB data cache (D-cache) and a 32 KB instruction cache (I-cache).
Unlike other microprocessors, the PowerPC 604e has no on-chip logic to control a 2nd level cache, but
signals are available for additional cache logic [9]. In figure A.4, the additional logic has been included as
most of the non-embedded uses of the PowerPC 604e use a 2nd level cache. Additionally, there exist
performance counter events concerning the 2nd level cache. A detailed description of the PowerPC architecture
can be found in [10].
The pipelines of the PowerPC 604e consist of:
• a 5-stage branch unit (BPU/CRU)
• a 6-stage integer unit (SCIU1/SCIU2/MCIU)
• a 7-stage load/store unit (LSU)
• an 8-stage floating-point unit (FPU)
Sub-unit names are:
• BPU branch prediction unit
• CRU control register unit
• SCIUx single-cycle integer unit
• MCIU multiple-cycle integer unit
The PowerPC 604e has 4 performance counters (PMC1/PMC2/PMC3/PMC4) capable of counting 116
different events [11].
• Counter PMC1 counts:
– IB0_0
000 0000 Nothing. Register counter holds current value.
The counter keeps its current value.
– IB0_1
000 0001 Processor cycles 0b1. Count every cycle.
Number of processor cycles (the counter is incremented by 1 every cycle).
– IB0_2
000 0010 Number of instructions completed every cycle.
Number of instructions completed each cycle.
– IB0_3
000 0011 RTCSELECT bit transition. 0 = 47, 1 = 51, 2 = 55, 3 = 63 (bits from the time base
lower register).
Bit transitions on the RTCSELECT pin.
– IB0_4
000 0100 Number of instructions dispatched.
Number of instructions arrived at the 3rd stage of the instruction pipeline.
– IB0_5
000 0101 Instruction cache misses.
Number of 1st level instruction cache misses.
– IB0_6
000 0110 Data TLB misses (in order).
Number of misses in the translation look-aside buffer for data.
– IB0_7
000 0111 Branch misprediction correction from execute stage.
Number of correctable branch misses in the execution phase of the 4th stage of the pipeline.
– IB0_8
000 1000 Number of reservations requested. The lwarx instruction is ready for execution in the
LSU.
Number of reservations for an atomic load instruction in the LSU.
– IB0_9-IB0_10
000 1001 Number of data cache load misses exceeding the threshold value with lateral L2 cache
intervention.
000 1010 Number of data cache store misses exceeding the threshold value with lateral L2
cache intervention.
Number of 1st level data cache misses which exceeded a limit value while, additionally, the L2_INT
signal was active.
– IB0_11
000 1011 Number of mtspr instructions dispatched.
Number of mtspr instructions arrived at the 3rd stage of the pipeline.
– IB0_12-IB0_15
000 1100 Number of sync instructions completed.
000 1101 Number of eieio instructions completed.
000 1110 Number of integer instructions completed every cycle (no loads or stores).
000 1111 Number of floating-point instructions completed every cycle (no loads or stores).
Number of completed sync/eieio/integer/floating-point instructions.
– IB0_16-IB0_18
001 0000 LSU produced result.
001 0001 SCIU1 produced result for an add, subtract, compare, rotate, shift, or logical
instruction.
001 0010 FPU produced result.
Number of results generated at the LSU/SCIU1/FPU units.
– IB0_19-IB0_21
001 0011 Number of instructions dispatched to the LSU.
001 0100 Number of instructions dispatched to the SCIU1.
001 0101 Number of instructions dispatched to the FPU.
Number of instructions issued from the 3rd stage of the instruction pipeline to the
LSU/SCIU1/FPU unit.
– IB0_22
001 0110 Valid snoop requests received from outside the 604e. Does not distinguish hits or
misses.
Number of snoop requests.
– IB0_23-IB0_24
001 0111 Number of data cache load misses exceeding the threshold value without lateral L2
intervention.
001 1000 Number of data cache store misses exceeding the threshold value without lateral L2
intervention.
Number of 1st level data cache misses which exceeded a limit value while, additionally, the L2_INT
signal was not active.
– IB0_25-IB0_27
001 1001 Number of cycles the branch unit is idle.
001 1010 Number of cycles MCIU0 is idle.
001 1011 Number of cycles the LSU is idle. No new instructions are executing; however, active
loads or stores may be in the queues.
Number of cycles the BPU/MCIU0/LSU units were idle.
– IB0_28
001 1100 Number of times the L2_INT is asserted (regardless of TA state).
Number of times the L2_INT signal was asserted.
– IB0_29
001 1101 Number of unaligned loads.
Number of unaligned loads.
– IB0_30
001 1110 Number of entries in the load queue each cycle (maximum of five). Although the load
queue has four entries, a load miss latch may hold a load waiting for data from memory.
Number of load queue entries per cycle (max. of 5).
– IB0_31
001 1111 Number of instruction breakpoint hits.
Number of times instructions hit a breakpoint.
• Counter PMC2 counts:
– IB1 0
00 0000 Nothing. Register counter holds current value.
The counter keeps its current value.
– IB1 1
00 0001 Processor cycles 0b1. Count every cycle.
Number of cycles the processor executes ”0b1”.
– IB1 2
00 0010 Number of instructions completed every cycle.
Number of instructions completed every cycle.
– IB1 3
00 0011 RTCSELECT bit transition. 0 = 47, 1 = 51, 2 = 55, 3 = 63 (bits from the time base
lower register).
Number of bit transitions on the RTCSELECT-pin.
– IB1 4
00 0100 Number of instructions dispatched.
Number of instructions dispatched to the 3rd stage of the instruction pipeline.
– IB1 5
00 0101 Number of cycles a load miss takes.
Number of load miss cycles.
– IB1 6
00 0110 Data cache misses (in order).
Number of 1st level data cache misses.
– IB1 7
00 0111 Number of instruction TLB misses.
Number of misses in the translation look-aside buffer for instructions.
35
– IB1 8
00 1000 Number of branches completed. Indicates the number of branch instructions being
completed every cycle (00 = none, 10 = one, 11 = two, 01 is an illegal value).
Number of completed branch instructions every cycle (max. of 2).
– IB1 9
00 1001 Number of reservations successfully obtained (stwcx. operation completed success-
fully).
Number of successfully completed atomic store instructions.
– IB1 10
00 1010 Number of mfspr instructions dispatched (in order).
Number of mfspr-instructions arrived at the 3rd stage of the instruction pipeline.
– IB1 11
00 1011 Number of icbi instructions. It may not hit in the cache.
Number of icbi-instructions without necessary hitting the cache.
– IB1 12
00 1100 Number of pipeline ”flushing” instructions (sc, isync, mtspr (XER), mcrxr, floating-
point operation with divide by 0 or invalid operand and MSR[FE0, FE1] = 00, branch with
MSR[BE] = 1, load string indexed with XER = 0, and SO bit getting set)
Number of instructions flushing the pipeline.
– IB1 13-IB1 15
00 1101 BPU produced result.
00 1110 SCIU0 produced result (of an add, subtract, compare, rotate, shift, or logical instruc-
tion).
00 1111 MCIU produced result (of a multiply/divide or SPR instruction).
Number of results produced by the BPU/SCIU0/MCIU-units.
– IB1 16-IB1 17
01 0000 Number of instructions dispatched to the branch unit.
01 0001 Number of instructions dispatched to the SCIU0.
Number of instructions issued from the 3rd stage of the instruction pipeline to the BPU/SCIU0-
units.
– IB1 18
01 0010 Number of loads completed. These include all cache operations and tlbie, tlbsync,
sync, eieio and icbi instructions.
Number of completed load instructions.
– IB1 19
01 0011 Number of instructions dispatched to the MCIU.
Number of instructions issued from the 3rd stage of the instruction pipeline to the MCIU-unit.
– IB1 20
01 0100 Number of snoop hits occurred.
Number of snoop hits.
– IB1 21
01 0101 Number of cycles during which the MSR[EE] bit is cleared.
Number of cycles during which the MSR[EE] bit is cleared.
– IB1 22-IB1 24
01 0110 Number of cycles the MCIU is idle.
01 0111 Number of cycles SCIU1 is idle.
01 1000 Number of cycles the FPU is idle.
Number of cycles the SCIU1/MCIU/FPU-unit is idle.
– IB1 25
01 1001 Number of cycles the L2 INT signal is active (regardless of TA state).
Number of cycles the L2 INT-pin had an active level.
– IB1 26-IB1 30
01 1010 Number of times four instructions were dispatched.
01 1011 Number of times three instructions were dispatched.
01 1100 Number of times two instructions were dispatched.
01 1101 Number of times one instruction was dispatched.
Number of times 1/2/3/4 instructions arrived at the 3rd stage of the instruction pipeline.
– IB1 31
01 1110 Number of unaligned stores.
Number of unaligned stores.
– IB1 32
01 1111 Number of entries in the store queue each cycle (maximum of six).
Number of entries in the store-queue every cycle (max. of 6).
• Counter PMC3 counts:
– IB2 0
0 0000 Nothing. Register counter holds current value.
The counter keeps its current value.
– IB2 1
0 0001 Processor cycles 0b1. Count every cycle.
Number of processor cycles (the counter is incremented every cycle).
– IB2 2
0 0010 Number of instructions completed every cycle.
Number of instructions completed every cycle.
– IB2 3
0 0011 RTCSELECT bit transition. 0 = 47, 1 = 51, 2 = 55, 3 = 63 (bits from the time base
lower register).
Number of bit-transitions on the RTCSELECT-pin.
– IB2 4
0 0100 Number of instructions dispatched.
Number of instructions arrived at the 3rd stage of the instruction pipeline.
– IB2 5-IB2 7
0 0101 Number of cycles the LSU stalls due to BIU or cache busy. Counts cycles between when
a load or store request is made and a response was expected. For example, when a store is
retried, there are four cycles before the same instruction is presented to the cache again. Cycles
in between are not counted.
0 0110 Number of cycles the LSU stalls due to a full store queue.
0 0111 Number of cycles the LSU stalls due to operands not available in the reservation station.
Number of cycles the LSU was blocked because the BIU or the cache was busy, the store queue
was full, or an operand was not available.
– IB2 8
0 1000 Number of instructions written into the load queue. Misaligned loads are split into two
transactions with the first part always written into the load queue. If both parts are cache hits,
data is returned to the rename registers and the first part is flushed from the load queue. To count
the instructions that enter the load queue to stay, the misaligned load hits must be subtracted.
Number of instructions written into the load queue.
– IB2 9
0 1001 Number of cycles that completion stalls for a store instruction.
Number of cycles that completion stalls for a store instruction.
– IB2 10
0 1010 Number of cycles that completion stalls for an unfinished instruction.
Number of cycles that completion stalls for an unfinished instruction.
– IB2 11
0 1011 Number of system calls.
Number of system calls.
– IB2 12
0 1100 Number of cycles the BPU stalled as branch waits for its operand.
Number of cycles the BPU waits for an operand.
– IB2 13
0 1101 Number of fetch corrections made at the dispatch stage. Prioritized behind the execute
stage.
Number of fetch corrections made at the 3rd stage of the instruction pipeline.
– IB2 14
0 1110 Number of cycles the dispatch stalls waiting for instructions.
Number of cycles the 1st stage of the instruction pipeline waited for instructions.
– IB2 15
0 1111 Number of cycles the dispatch stalls due to unavailability of reorder buffer (ROB) entry.
No ROB entry was available for the first non-dispatched instruction.
Number of cycles the 1st stage of the instruction pipeline waited because the reorder buffer was
not available.
– IB2 16
1 0000 Number of cycles the dispatch unit stalls due to no FPR rename buffer available. First
non-dispatched instruction required a floating-point reorder buffer and none was available.
Number of cycles the 1st stage of the instruction pipeline waited because the FPR-rename buffer
was not available.
– IB2 17-IB2 18
1 0001 Number of instruction table search operations.
1 0010 Number of data table search operations. Completion could result from a page fault or a
PTE match.
Number of search operations in the data/instruction table.
– IB2 19-IB2 20
1 0011 Number of cycles the FPU stalled.
1 0100 Number of cycles the SCIU1 stalled.
Number of cycles the FPU-/SCIU1-unit was blocked.
– IB2 21
1 0101 Number of times the BIU forwards non-critical data from the line-fill buffer.
Number of transfers of non-critical data from the line-fill buffer forwarded by the bus interface
unit (BIU).
– IB2 22
1 0110 Number of data bus transactions completed with pipelining one deep with no additional
bus transactions queued behind it.
Number of completed data bus transactions without additional bus transactions queued.
– IB2 23
1 0111 Number of data bus transactions completed with two data bus transactions queued
behind.
Number of completed data bus transactions with two additional bus transactions queued.
– IB2 24
1 1000 Counts pairs of back-to-back burst reads streamed without a dead cycle between them
in data streaming mode
Number of paired back-to-back-burst-read accesses without intervening idle cycles.
– IB2 25
1 1001 Counts non-ARTRY d processor kill transactions caused by a write-hit-on-shared con-
dition
Number of invalidated cache lines caused by a write hit to a shared line.
– IB2 26
1 1010 This event counts non-ARTRY d write-with-kill address operations that originate from
the three castout buffers. These include high-priority write-with-kill transactions caused by a
snoop hit on modified data in one of the BIU’s three copy-back buffers. When the cache block
on a data cache miss is modified, it is queued in one of three copy-back buffers. The miss is
serviced before the copy-back buffer is written back to memory as a write-with-kill transaction.
Number of Write-with-kill-address operations.
– IB2 27
1 1011 Number of cycles when exactly two castout buffers are occupied.
Number of cycles when exactly two castout buffers are occupied. Castout buffers are used to
write 1st level data cache lines to memory.
– IB2 28
1 1100 Number of data cache accesses retried due to occupied castout buffers.
Number of retried 1st level data cache accesses due to occupied castout buffers.
– IB2 29
1 1101 Number of read transactions from load misses brought into the cache in a shared state.
Number of read transactions which (after a miss) brought a 1st level cache line into the cache
with a status of shared.
– IB2 30
1 1110 CRU Indicates that a CR logical instruction is being finished.
Number of logical instructions completed in the CRU.
• Counter PMC4 counts:
– IB3 0
0 0000 Nothing. Register counter holds current value.
The counter keeps its current value.
– IB3 1
0 0001 Processor cycles 0b1. Count every cycle.
Number of processor cycles (the counter is incremented every cycle).
– IB3 2
0 0010 Number of instructions completed every cycle.
Number of instructions completed every cycle.
– IB3 4
0 0011 RTCSELECT bit transition. 0 = 47, 1 = 51, 2 = 55, 3 = 63 (bits from the time base
lower register).
Number of bit-transitions on the RTCSELECT-pin.
– IB3 5
0 0100 Number of instructions dispatched.
Number of instructions arrived at the 3rd stage of the instruction pipeline.
– IB3 6-IB3 8
0 0101 Number of cycles the LSU stalls due to busy MMU.
0 0110 Number of cycles the LSU stalls due to the load queue full.
0 0111 Number of cycles the LSU stalls due to address collision.
Number of cycles the LSU stalled because of a busy MMU, full load queue, or address collision.
– IB3 9
0 1000 Number of misaligned loads that are cache hits for both the first and second accesses.
Number of misaligned loads that are cache hits for both the first and second accesses.
– IB3 10
0 1001 Number of instructions written into the store queue.
Number of instructions written into the store queue.
– IB3 11
0 1010 Number of cycles that completion stalls for a load instruction.
Number of cycles the completion of an instruction is stalled because of a load instruction.
– IB3 12
0 1011 Number of hits in the BTAC. Warning: if decode buffers cannot accept new instructions,
the processor re-fetches the same address multiple times.
Number of hits in the Branch Target Address Cache.
– IB3 13
0 1100 Number of times the four basic blocks in the completion buffer from which instructions
can be retired were used
Number of times the four basic blocks in the completion buffer from which instructions can be
retired were used.
– IB3 14
0 1101 Number of fetch corrections made at decode stage.
Number of corrections made between the 1st and 2nd stage of the instruction pipeline.
– IB3 15-IB3 18
0 1110 Number of cycles the dispatch unit stalls due to no unit available. First non-dispatched
instruction requires an execution unit that is either full or a previous instruction is being dis-
patched to that unit.
0 1111 Number of cycles the dispatch unit stalls due to unavailability of GPR rename buffer.
First non-dispatched instruction requires a GPR reorder buffer and none are available.
1 0000 Number of cycles the dispatch unit stalls due to no CR rename buffer available. First
non-dispatched instruction requires a CR rename buffer and none is available.
1 0001 Number of cycles the dispatch unit stalls due to CTR/LR interlock. First non-dispatched
instruction could not dispatch due to CTR/LR/mtcrf interlock.
Number of cycles spent at the 3rd stage of the instruction pipeline waiting because of any of
the following conditions:
∗ no execution unit (MCIU, SCIU0, SCIU1, ...) was available in the 4th stage of the pipeline
∗ no GPR-Rename-Buffer was available
∗ no CR-Rename-Buffer was available
∗ the Counter- or Link-Register was locked
– IB3 19-IB3 20
1 0010 Number of cycles spent doing instruction table search operations.
1 0011 Number of cycles spent doing data table search operations.
Number of cycles spent searching in the data/instruction table.
– IB3 21-IB3 22
1 0100 Number of cycles SCIU0 was stalled.
1 0101 Number of cycles MCIU was stalled.
Number of cycles the MCIU/SCIU0 was stalled.
– IB3 23
1 0110 Number of bus cycles after an internal bus request without a qualified bus grant.
Number of bus-cycles after an internal bus request without a qualified bus grant.
– IB3 24
1 0111 Number of data bus transactions completed with one data bus transaction queued behind
Number of completed data-bus transactions with one data bus transaction queued behind.
– IB3 25
1 1000 Number of write data transactions that have been reordered before a previous read data
transaction using the DBWO feature
Number of write data transactions that have been reordered before a previous read data transac-
tion.
– IB3 26
1 1001 Number of ARTRY d processor address bus transactions.
Number of address bus transactions caused by a signal change at the ARTRY d-pin.
– IB3 27
1 1010 Number of high-priority snoop pushes. Snoop transactions, except for write-with-kill,
that hit modified data in the data cache cause a high-priority write (snoop push) of that modified
cache block to memory. This operation has a transaction type of write-with-kill. This event
counts the number of non-ARTRY d processor write-with-kill transactions that were caused
by a snoop hit on modified data in the data cache. It does not count high-priority write-with-kill
transactions caused by snoop hits on modified data in one of the BIU’s three copy-back buffers.
Number of high-priority snoop pushes.
– IB3 28-IB3 29
1 1011 Number of cycles for which exactly one castout buffer is occupied
1 1100 Number of cycles for which exactly three castout buffers are occupied
Number of cycles for which exactly one/three castout buffer is/are occupied.
– IB3 30
1 1101 Number of read transactions from load misses brought into the cache in an exclusive (E)
state
Number of read transactions caused by a load miss and which got brought into the cache in
exclusive state.
– IB3 31
1 1110 Number of un-dispatched instructions beyond branch
Number of undispatched instructions beyond branch.
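The bit patterns printed in front of each event description above (for example 00 1000 or 1 1110) are the selector values for the respective counter. As a minimal sketch, the following Python helper converts such a pattern into an integer; the function name selector_value is hypothetical and not part of PCL.

```python
def selector_value(bits: str) -> int:
    """Convert a selector bit pattern such as '01 0011' into its integer value."""
    digits = bits.replace(" ", "")
    if not digits or set(digits) - {"0", "1"}:
        raise ValueError(f"not a bit pattern: {bits!r}")
    return int(digits, 2)


# Bit patterns copied from the lists above.
branches_completed = selector_value("00 1000")   # PMC2, listed as IB1 8
dispatched_to_mciu = selector_value("01 0011")   # PMC2, listed as IB1 19
cr_logical_finished = selector_value("1 1110")   # PMC3, listed as IB2 30

print(branches_completed, dispatched_to_mciu, cr_logical_finished)
```

Note that the event index printed in the lists does not always equal the selector value: in the PMC4 list, for instance, the pattern 0 0011 appears under IB3 4. The index should therefore not be assumed to be interchangeable with the selector value.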
IBM provides the PMapi library, which supports access to the performance counters on different PowerPC
and POWER chips. PMapi can restrict counting to supervisor mode, problem (user) mode, or both.
On AIX versions 4.2 and higher, performance counter status is saved and restored on context switches.
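Several of the events listed above accumulate a quantity every cycle (for example, the number of entries in the store queue, or the number of instructions completed), so they become meaningful only when divided by a cycle count measured in the same run. The following sketch assumes hypothetical counter readings; the variable names and values are illustrative only and do not come from a real measurement.

```python
def per_cycle_average(event_sum: int, cycles: int) -> float:
    """Average value per cycle of an event that is accumulated every cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return event_sum / cycles


# Hypothetical readings from one run, using three counters at once:
#   PMC2, selector 01 1111: entries in the store queue each cycle
#   PMC3, selector 0 0001:  processor cycles
#   PMC4, selector 0 0010:  instructions completed every cycle
store_queue_entries = 1_500_000
cycles = 1_000_000
instructions_completed = 1_200_000

avg_store_queue_occupancy = per_cycle_average(store_queue_entries, cycles)
instructions_per_cycle = per_cycle_average(instructions_completed, cycles)

print(avg_store_queue_occupancy, instructions_per_cycle)
```

Events that share a counter cannot be combined this way within a single run. The four "number of times N instructions were dispatched" events, for example, are all selected on PMC2, so a dispatch-width histogram would need one run per event.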
A.5 Intel Pentium Family
A.5.1 Intel Pentium
The Intel Pentium is a 32-bit CISC microprocessor. It has 2 performance counters. Most events can be
counted by either counter; a few can be counted only by a specific counter (as noted). Pentium processors
with the MMX extension define additional events, marked below as (MMX extension). We have left out all
events that are specific to the MMX functional unit, as compilers normally do not generate code for this
unit.
The events countable by both counters are:
• PE0PE1 0
00H DATA READ
Number of memory data read operations.
• PE0PE1 1
01H DATA WRITE
Number of memory data write operations.
• PE0PE1 2
02H DATA TLB MISS
Number of misses to the data cache translation look-aside buffer.
• PE0PE1 3
03H DATA READ MISS
Number of memory read accesses that miss the internal data cache.
• PE0PE1 4
04H DATA WRITE MISS
Number of memory write accesses that miss the internal data cache.
• PE0PE1 5
05H WRITE HIT TO M- OR E-STATE LINES
Number of write hits to exclusive or modified lines in the data cache.
• PE0PE1 6
06H DATA CACHE LINES WRITTEN BACK
Number of dirty lines that are written back.
• PE0PE1 7
07H EXTERNAL SNOOPS
Number of accepted external snoops.
• PE0PE1 8
08H EXTERNAL DATA CACHE SNOOP HITS
Number of external snoops to the data cache.
• PE0PE1 9
09H MEMORY ACCESSES IN BOTH PIPES
Number of data memory reads or writes that are paired in both pipes of the pipeline.
• PE0PE1 10
0AH BANK CONFLICTS
Number of actual bank conflicts.
• PE0PE1 11
0BH MISALIGNED DATA MEMORY OR I/O REFERENCES
Number of memory or I/O reads or writes that are misaligned.
• PE0PE1 12
0CH CODE READ
Number of instruction reads.
• PE0PE1 13
0DH CODE TLB MISS
Number of instruction reads that miss the code TLB.
• PE0PE1 14
0EH CODE CACHE MISS
Number of instruction reads that miss the internal code cache.
• PE0PE1 15
0FH ANY SEGMENT REGISTER LOADED
Number of writes into any segment register in real or protected mode.
• PE0PE1 16
12H BRANCHES
Number of taken or not taken branches, including conditional branches, jumps, calls, returns, soft-
ware interrupts, and interrupt returns.
• PE0PE1 17
13H BTB HITS
Number of BTB hits that occur.
• PE0PE1 18
14H TAKEN BRANCH OR BTB HIT
Number of taken branches or BTB hits that occur.
• PE0PE1 19
15H PIPELINE FLUSHES
Number of pipeline flushes that occur.
• PE0PE1 20
16H INSTRUCTIONS EXECUTED
Number of instructions executed (up to two per clock).
• PE0PE1 21
17H INSTRUCTIONS EXECUTED VPIPE
Number of instructions executed in the V pipe. This indicates the number of instructions that were
paired.
• PE0PE1 22
18H BUS CYCLE DURATION
Number of clocks while a bus cycle is in progress. This event measures bus use.
• PE0PE1 23
19H WRITE BUFFER FULL STALL DURATION
Number of clocks while the pipeline is stalled due to full write buffers.
• PE0PE1 24
1AH WAITING FOR DATA MEMORY READ STALL DURATION
Number of clocks while the pipeline is stalled while waiting for data memory reads.
• PE0PE1 25
1BH STALL ON WRITE TO AN E- OR M-STATE LINE
Number of stalls on writes to E- or M-state lines.
• PE0PE1 26
1CH LOCKED BUS CYCLE
Number of locked bus cycles that occur as the result of the LOCK prefix or LOCK instruction, page-
table updates, and descriptor table updates.
• PE0PE1 27
1DH I/O READ OR WRITE CYCLE
Number of bus cycles directed to I/O space.
• PE0PE1 28
1EH NONCACHEABLE MEMORY READS
Number of non-cacheable instruction or data memory read bus cycles.
• PE0PE1 29
1FH PIPELINE AGI STALLS
Number of address generation interlock (AGI) stalls.
• PE0PE1 30
22H FLOPS
Number of floating-point operations that occur. Transcendental instructions consist of multiple adds
and multiplies and will signal this event multiple times. Instructions generating the divide-by-zero,
negative square root, special operand, or stack exceptions will not be counted. Instructions generat-
ing all other floating-point exceptions will be counted. The integer multiply instructions and other
instructions which use the FPU will be counted.
• PE0PE1 31
23H BREAKPOINT MATCH ON DR0 REGISTER
Number of matches on register DR0 breakpoint.
• PE0PE1 32
24H BREAKPOINT MATCH ON DR1 REGISTER
Number of matches on register DR1 breakpoint.
• PE0PE1 33
25H BREAKPOINT MATCH ON DR2 REGISTER
Number of matches on register DR2 breakpoint.
• PE0PE1 34
26H BREAKPOINT MATCH ON DR3 REGISTER
Number of matches on register DR3 breakpoint.
• PE0PE1 35
27H HARDWARE INTERRUPTS
Number of taken INTR and NMI interrupts.
• PE0PE1 36
28H DATA READ OR WRITE
Number of memory data reads and/or writes.
• PE0PE1 37
29H DATA READ MISS OR WRITE MISS
Number of memory read and/or write accesses that miss the internal data cache.
• Counter-specific events:
– Specific to counter 0:
∗ PE0 0
2AH BUS OWNERSHIP LATENCY
The time from LRM bus ownership request to bus ownership granted (MMX extension).
∗ PE0 1
2CH CACHE M-STATE LINE SHARING
Number of times a processor identified a hit to a modified line due to a memory access in
the other processor (MMX extension).
∗ PE0 2
2DH EMMS INSTRUCTIONS EXECUTED
Number of EMMS instructions executed (MMX extension).
∗ PE0 3
2EH BUS UTILIZATION DUE TO PROCESSOR ACTIVITY
Number of clocks the bus is busy due to the processor’s own activity (MMX extension).
∗ PE0 4
30H NUMBER OF CYCLES NOT IN HALT STATE
Number of cycles the processor is not idle due to HLT instruction (MMX extension).
∗ PE0 5
32H FLOATING POINT STALLS DURATION
Number of clocks while pipe is stalled due to a floating-point freeze (MMX extension).
∗ PE0 6
33H D1 STARVATION AND FIFO IS EMPTY
Number of times D1 stage cannot issue ANY instructions since the FIFO buffer is empty
(MMX extension).
∗ PE0 7
35H PIPELINE FLUSHES DUE TO WRONG BRANCH PREDICTIONS
Number of pipeline flushes due to wrong branch predictions resolved in either the E-stage
or the WB-stage (MMX extension).
∗ PE0 8
37H MISPREDICTED OR UNPREDICTED RETURNS
Number of returns predicted incorrectly or not predicted at all (MMX extension).
∗ PE0 9
39H RETURNS
Number of returns executed (MMX extension).
∗ PE0 10
3AH BTB FALSE ENTRIES
Number of false entries in the Branch Target Buffer (MMX extension).
– Specific to counter 1:
∗ PE1 0
2AH BUS OWNERSHIP TRANSFERS
Number of bus ownership transfers (MMX extension).
∗ PE1 1
2CH CACHE LINE SHARING
Number of shared data lines in the L1 cache (MMX extension).
∗ PE1 2
2EH WRITES TO NONCACHEABLE MEMORY
Number of write accesses to non-cacheable memory (MMX extension).
∗ PE1 3
30H DATA CACHE TLB MISS STALL DURATION
Number of clocks the pipeline is stalled due to a data cache translation look-aside buffer
miss (MMX extension).
∗ PE1 4
31H TAKEN BRANCHES
Number of branches taken (MMX extension).
∗ PE1 5
33H D1 STARVATION AND ONLY ONE INSTRUCTION IN FIFO
Number of times the D1 stage issues just a single instruction since the FIFO buffer had just
one instruction ready (MMX extension).
∗ PE1 6
35H PIPELINE FLUSHES DUE TO WRONG BRANCH PREDICTIONS RESOLVED IN WB-
STAGE
Number of pipeline flushes due to wrong branch predictions resolved in the WB-stage
(MMX extension).
∗ PE1 7
37H PREDICTED RETURNS
Number of predicted returns (MMX extension).
∗ PE1 8
3AH BTB MISS PREDICTION ON NOT TAKEN BRANCH
Number of times the BTB predicted a not-taken branch as taken (MMX extension).
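Because some Pentium events are tied to one counter, not every pair of events can be measured in the same run. The following sketch checks whether two events fit onto the two counters. The table holds only a small excerpt of the events listed above, and the names EVENTS and assign are hypothetical; PCL's own interface is not shown here.

```python
from typing import Dict, Optional

BOTH = frozenset({0, 1})

# Excerpt of the lists above: event name -> (event code, allowed counters).
EVENTS = {
    "DATA READ": (0x00, BOTH),
    "FLOPS": (0x22, BOTH),
    "BUS OWNERSHIP LATENCY": (0x2A, frozenset({0})),
    "RETURNS": (0x39, frozenset({0})),
    "BUS OWNERSHIP TRANSFERS": (0x2A, frozenset({1})),
    "TAKEN BRANCHES": (0x31, frozenset({1})),
}


def assign(first: str, second: str) -> Optional[Dict[int, str]]:
    """Return {counter: event name} for two events, or None if they conflict."""
    for c_first, c_second in ((0, 1), (1, 0)):
        if c_first in EVENTS[first][1] and c_second in EVENTS[second][1]:
            return {c_first: first, c_second: second}
    return None


print(assign("DATA READ", "FLOPS"))
print(assign("TAKEN BRANCHES", "RETURNS"))
print(assign("RETURNS", "BUS OWNERSHIP LATENCY"))  # both need counter 0
```

The excerpt also shows why an event code alone does not identify an event: code 2AH means BUS OWNERSHIP LATENCY on counter 0 and BUS OWNERSHIP TRANSFERS on counter 1.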
By default, the instructions RDMSR and WRMSR to access the performance counter registers are kernel-
mode instructions (ring 0).
Software tools for the performance counters on Pentium-like processors are described in [12].
On Linux systems, libpperf is available to access the performance counters. It was written by M. Patrick
Goda and Michael S. Warren from Los Alamos National Laboratory. libpperf itself is based on the msr
device implemented by Stephan Meyer for Linux 2.0.x and 2.1.x.
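As an illustration of how the Pentium events listed above can be combined, the following sketch derives a data cache miss ratio from the events 29H (DATA READ MISS OR WRITE MISS) and 28H (DATA READ OR WRITE). Both can be counted by either counter, so one run with both counters is sufficient. The readings are hypothetical values, not measurements.

```python
def miss_ratio(misses: int, accesses: int) -> float:
    """Fraction of accesses that missed; 0.0 if nothing was accessed."""
    if accesses == 0:
        return 0.0
    return misses / accesses


# Hypothetical readings of the two counters after one run.
data_read_or_write = 1_000_000        # event 28H
data_read_miss_or_write_miss = 25_000  # event 29H

ratio = miss_ratio(data_read_miss_or_write_miss, data_read_or_write)
print(f"data cache miss ratio: {ratio:.3f}")
```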
A.5.2 Intel PentiumPro/Pentium II/Pentium III
To keep binary compatibility with the predecessor processors, the PentiumPro, Pentium II, and Pentium III
have 8 registers, each 32 bits wide. The first level cache is 8 KB for instructions (ICache) and 8 KB for
data (DCache) on the PentiumPro, and 16 KB for each of the two caches on the Pentium II and Pentium III.
As the PentiumPro, Pentium II, and Pentium III are CISC microprocessors (complex instruction set computer),
every instruction is divided internally into micro-operations (UOPs) of fixed length. Depending on its
complexity, an instruction is divided into 1-4 UOPs.
Figure A.5: Memory hierarchy of the Intel PentiumPro (registers: 8 * 32 bit; level-1 cache: 8 KByte ICache
and 8 KByte DCache; level-2 cache: 256 KByte, 512 KByte, or 1 MByte; main memory)
The PentiumPro, Pentium II, and Pentium III have 2 performance counters capable of counting a total
of 77 different events (at most two at a time), some of them with an additional unit mask as parameter to
further subdivide the event type. Some of the events are countable only by a specific counter. The Pentium
III has 4 additional events concerning Streaming SIMD Extensions. The events countable by both counters
are:
• PP0PP1 0
43H DATA MEM REFS
All memory references, both cacheable and non-cacheable.
• PP0PP1 1
45H DCU LINES IN
Number of allocated lines in the 1st level data cache.
• PP0PP1 2
46H DCU M LINES IN
Number of allocated lines in the 1st level data cache which have the status modified.
• PP0PP1 3
47H DCU M LINES OUT
Number of evicted lines in the 1st level data cache which were marked as modified.
• PP0PP1 4
48H DCU MISS OUTSTANDING
Weighted number of cycles while a 1st level data cache miss is outstanding. An access that also misses
the L2 is short-changed by 2 cycles (i.e., if the counter reports N cycles, the true value is N+2 cycles).
Subsequent loads to the same cache line do not result in any additional counts. The count value is not
precise, but still useful.
• PP0PP1 5
80H IFU IFETCH
Number of 1st level instruction cache loads.
• PP0PP1 6
81H IFU IFETCH MISS
Number of 1st level instruction cache misses.
• PP0PP1 7
85H ITLB MISS
Number of instruction translation look-aside buffer misses.
• PP0PP1 8
86H IFU MEM STALL
Number of cycles in which the instruction fetch pipe stage is stalled.
• PP0PP1 9
87H ILD STALL
Number of cycles the instruction length decoder is stalled.
• PP0PP1 10
28H L2 IFETCH
Number of instruction fetches from the 2nd level cache.
• PP0PP1 11
29H L2 LD
Number of data loads from the 2nd level cache.
• PP0PP1 12
2AH L2 ST
Number of data stores to the 2nd level cache.
• PP0PP1 13
24H L2 LINES IN
Number of lines allocated in the 2nd level cache.
• PP0PP1 14
26H L2 LINES OUT
Number of cache lines removed from the 2nd level cache.
• PP0PP1 15
25H L2 M LINES INM
Number of allocated cache lines in the 2nd level cache which have been modified.
• PP0PP1 16
27H L2 M LINES OUTM
Number of modified cache lines in the 2nd level cache which have been removed.
• PP0PP1 17
2EH L2 RQSTS
Number of requests to the 2nd level cache.
• PP0PP1 18
21H L2 ADS
Number of address strobes at 2nd level cache address bus.
• PP0PP1 19
22H L2 DBUS BUSY
Number of cycles during which the data bus was busy.
• PP0PP1 20
23H L2 DBUS BUSY RD
Number of cycles during which the data bus was busy transferring data from 2nd level cache to the
processor.
• PP0PP1 21
62H BUS DRDY CLOCKS
Number of cycles the DRDY-signal was active.
• PP0PP1 22
63H BUS LOCK CLOCKS
Number of processor clock cycles during which the LOCK-signal is asserted.
• PP0PP1 23
60H BUS REQ OUTSTANDING
Number of outstanding bus requests that result either from a cacheable read request for 1st level
data cache lines or from a bus operation that is still to be completed.
• PP0PP1 24
65H BUS TRAN BRD
Number of burst read transactions.
• PP0PP1 25
66H BUS TRAN RFO
Number of read for ownership transactions.
• PP0PP1 26
67H BUS TRANS WB
Number of write back transactions.
• PP0PP1 27
68H BUS TRAN IFETCH
Number of completed instruction fetch transactions.
• PP0PP1 28
69H BUS TRAN INVAL
Number of completed bus invalidate transactions.
• PP0PP1 29
6AH BUS TRAN PWR
Number of completed partial write transactions.
• PP0PP1 30
6BH BUS TRANS P
Number of completed partial transactions.
• PP0PP1 31
6CH BUS TRANS IO
Number of completed I/O transactions.
• PP0PP1 32
6DH BUS TRAN DEF
Number of completed deferred transactions.
• PP0PP1 33
6EH BUS TRAN BURST
Number of completed burst transactions.
• PP0PP1 34
70H BUS TRAN ANY
Number of all completed transactions.
• PP0PP1 35
6FH BUS TRAN MEM
Number of completed memory transactions.
• PP0PP1 36
64H BUS DATA RCV
Number of bus clock cycles during which this processor is receiving data.
• PP0PP1 37
61H BUS BNR DRV
Number of bus clock cycles during which this processor is driving the BNR pin.
• PP0PP1 38
7AH BUS HIT DRV
Number of bus clock cycles during which this processor is driving the HIT pin including cycles due
to snoop stalls.
• PP0PP1 39
7BH BUS HITM DRV
Number of bus clock cycles during which this processor is driving the HITM pin including cycles
due to snoop stalls.
• PP0PP1 40
7EH BUS SNOOP STALL
Number of clock cycles during which the bus is snoop stalled.
• PP0PP1 41
03H LD BLOCKS
Number of store buffer blocks.
• PP0PP1 42
04H SB DRAINS
Number of cycles in which the store buffer is draining.
• PP0PP1 43
05H MISALIGN MEM REF
Number of misaligned data memory references.
• PP0PP1 44
C0H INST RETIRED
Number of instructions retired.
• PP0PP1 45
C2H UOPS RETIRED
Number of micro-operations retired.
• PP0PP1 46
D0H INST DECODER
Number of instructions decoded and translated to UOP’s.
• PP0PP1 47
C8H HW INT RX
Number of hardware interrupts received.
• PP0PP1 48
C6H CYCLES INT MASKED
Number of processor cycles for which interrupts are disabled.
• PP0PP1 49
C7H CYCLES INT PENDING AND MASKED
Number of processor cycles for which interrupts are disabled and interrupts are pending.
• PP0PP1 50
C4H BR INST RETIRED
Number of branch instructions retired.
• PP0PP1 51
C5H BR MISS PRED RETIRED
Number of completed but mispredicted branches.
• PP0PP1 52
C9H BR TAKEN RETIRED
Number of completed taken branches.
• PP0PP1 53
CAH BR MISS PRED TAKEN RET
Number of completed taken, but mispredicted branches.
• PP0PP1 54
E0H BR INST DECODED
Number of decoded branch instructions.
• PP0PP1 55
E2H BTB MISSES
Number of branches that missed the BTB.
• PP0PP1 56
E4H BR BOGUS
Number of bogus branches.
• PP0PP1 57
E6H BACLEARS
Number of times BACLEAR-signal is asserted.
• PP0PP1 58
A2H RESOURCE STALLS
Number of cycles during which there are resource related stalls.
• PP0PP1 59
D2H PARTIAL RAT STALLS
Number of cycles or events for partial stalls.
• PP0PP1 60
06H SEGMENT REG LOADS
Number of segment register loads.
• PP0PP1 61
79H CPU CLK UNHALTED
Number of cycles during which the processor is not halted.
• PP0PP1 62
B0H MMX INSTR EXEC
Number of MMX-instructions executed.
• PP0PP1 63
B3H MMX INSTR TYPE EXEC
Number of MMX-instructions executed. The further parameter unit mask specifies which category
should be counted.
• PP0PP1 64
B1H MMX SAT INSTR EXEC
MMX saturated instructions executed.
• PP0PP1 65
B2H MMX uOPS EXEC
Number of MMX uops executed.
• PP0PP1 66
CCH FP MMX TRANS
Transitions from MMX instructions to FP instructions.
• PP0PP1 67
CDH MMX ASSIST
Number of MMX assists (EMMS instructions executed).
• PP0PP1 68
CEH MMX INSTR RET
Number of MMX instructions retired.
• PP0PP1 69
D4H SEG RENAME STALLS
Segment register renaming stalls.
• PP0PP1 70
D5H SEG REG RENAMES
Segment registers renamed.
• PP0PP1 71
D6H RET SEG RENAMES
Number of segment register rename events retired.
• PP0PP1 72
D8H EMON SSE INST RETIRED
Number of Streaming SIMD Extensions instructions retired.
• PP0PP1 73
D9H EMON SSE COMP INST RET
Number of Streaming SIMD Extensions computation instructions retired.
• PP0PP1 74
07H EMON SSE PRE DISPATCHED
Number of prefetch/weakly-ordered instructions dispatched (including speculative prefetches).
• PP0PP1 75
4BH EMON SSE PRE MISS
Number of prefetch/weakly-ordered instructions that miss all caches.
• Counter-specific events:
– Specific to counter 0:
∗ PP0 0
C1H FLOPS
Number of retired floating point instructions.
∗ PP0 1
10H FP COMP OPS EXE
Number of floating point operations started (not all of which may have completed).
∗ PP0 2
14H CYCLES DIV BUSY
Number of cycles during which the divider is busy.
– Specific to counter 1:
∗ PP1 0
11H FP ASSIST
Number of floating-point exception cases handled by microcode.
∗ PP1 1
12H MUL
Number of multiplies (integer and floating-point).
∗ PP1 2
13H DIV
Number of divides (integer and floating-point).
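The event codes listed above, together with the optional unit mask, are combined into one event-select value per counter. The following sketch shows one way to compose such a value. The bit positions used (event code in bits 0-7, unit mask in bits 8-15, user-mode flag at bit 16, operating-system flag at bit 17, enable flag at bit 22) are an assumption based on Intel's P6 family documentation and are not taken from this report; they should be verified before use. The function name event_select and the unit mask value in the example are hypothetical.

```python
# Assumed flag positions in the P6 event-select register (verify before use).
USR = 1 << 16  # count while the processor runs in user mode
OS = 1 << 17   # count while the processor runs in kernel mode
EN = 1 << 22   # enable counting


def event_select(event: int, unit_mask: int = 0, flags: int = USR | EN) -> int:
    """Compose an event-select value from event code, unit mask, and flags."""
    if not 0 <= event <= 0xFF:
        raise ValueError("event code must fit into 8 bits")
    if not 0 <= unit_mask <= 0xFF:
        raise ValueError("unit mask must fit into 8 bits")
    return event | (unit_mask << 8) | flags


INST_RETIRED = 0xC0  # C0H INST RETIRED, from the list above
L2_RQSTS = 0x2E      # 2EH L2 RQSTS, from the list above

print(hex(event_select(INST_RETIRED)))
# Illustrative unit mask value only; its meaning is not defined in this report.
print(hex(event_select(L2_RQSTS, unit_mask=0x0F)))
```

Writing such a value to the hardware requires privileged access, so on a real system this step is carried out by a kernel-level driver rather than by user code.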
All of the events can be counted on PentiumPro as well as on Pentium II and Pentium III. The Pentium
II and Pentium III have additional events defined mainly for MMX-extensions [13].
The same remarks as stated above in the Pentium-section concerning software environments apply to
the Pentium Pro, Pentium II, and Pentium III as well.
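Two ratios follow directly from the events listed above: the average number of UOPs per instruction (C2H UOPS RETIRED divided by C0H INST RETIRED) and the fraction of mispredicted branches (C5H BR MISS PRED RETIRED divided by C4H BR INST RETIRED). Since only two events can be counted at a time, each ratio needs its own run. The sketch below uses hypothetical readings, not measurements.

```python
def safe_ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 for an empty denominator."""
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical readings, run 1: events C2H and C0H.
uops_retired = 1_800_000
inst_retired = 1_200_000

# Hypothetical readings, run 2: events C5H and C4H.
br_miss_pred_retired = 4_000
br_inst_retired = 200_000

uops_per_instruction = safe_ratio(uops_retired, inst_retired)
branch_mispredict_ratio = safe_ratio(br_miss_pred_retired, br_inst_retired)

print(f"UOPs per instruction:       {uops_per_instruction:.2f}")
print(f"branch misprediction ratio: {branch_mispredict_ratio:.3f}")
```

Because the two counter pairs come from different runs, the ratios are comparable only if the measured code behaves the same way in both runs.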
