Measuring Software Performance on Linux by Becker, Martin & Chakraborty, Samarjit
Measuring Software Performance on Linux
Technical Report
November 21, 2018
Martin Becker
Chair of Real-Time Computer Systems
Technical University of Munich
Munich, Germany
martin.becker@tum.de
Samarjit Chakraborty
Chair of Real-Time Computer Systems
Technical University of Munich
Munich, Germany
martin.becker@tum.de
[Figure 1: event interaction diagram. Nodes: program (.text, .bss/.data),
instructions, data access, CPU, branch misprediction, pollution, L1i$ miss,
L1d$ miss, higher-level cache miss, DRAM miss, data prefetch, cache
coherency, TLB miss, page walk, page fault (minor, major), page cache,
readahead, disk I/O, multicore TLB coherency, mode switch, context switch,
core migrations, interrupt, scheduler, OS. Edges are marked +, − or +/−;
some carry the annotations (TLB flush), (KPTI TLB flush) and (TLB shootdown).]
Figure 1. Event interaction map for a program running on an Intel Core processor on Linux. Each event itself may cause
processor cycles, and inhibit (−), enable (+), or modulate (⊗) others.
Abstract
Measuring and analyzing the performance of software has
reached a high complexity, caused by more advanced pro-
cessor designs and the intricate interaction between user
programs, the operating system, and the processor’s microar-
chitecture. In this report, we summarize our experience on
how performance characteristics of software should be mea-
sured when running on a Linux operating system and a
modern processor. In particular, (1) we provide a general
overview about hardware and operating system features
that may have a significant impact on timing and explain
their interaction, (2) we identify sources of errors that need
to be controlled in order to obtain unbiased measurement
results, and (3) we propose a measurement setup for Linux
to minimize errors. Although not the focus of this report, we
describe the measurement process using hardware perfor-
mance counters, which can faithfully point to performance
bottlenecks on a given processor. Our experiments confirm
that our measurement setup has a large impact on the results.
More surprisingly, however, they also suggest that the setup
can be negligible for certain analysis methods. Furthermore,
we found that our setup maintains significantly better per-
formance under background load conditions, which means
it can be used to improve high-performance applications.
CCS Concepts • Software and its engineering→ Soft-
ware performance;
Keywords Software performance, Linux, Hardware Coun-
ters, Microarchitecture, Jitter
1 Introduction
Countless performance tests of software are available online
in blogs, articles, and so on, but often their significance can
be refuted by their measurement setup. Speedups are usu-
ally reported in units of time, yet without any insight into
how exactly the differences came about. Not only are the
results questionable when the software runs on a different
machine, but even on identical processors with identical
memory configurations and peripherals, there are many ex-
ternal factors that can influence the result. One such factor
is the operating system (OS) itself. Different configurations
and usages of OSes might nullify or magnify some effects,
and thus the performance measurements do not necessarily
reflect the characteristics of the software that we actually
want to analyze and improve.
In this report, we describe how performance measure-
ments of user software should be conducted, when running
on a mainline Linux OS and a modern multi-core processor.
Specifically, we are concerned with real-time measurements
taken on the real hardware, providing quantitative informa-
tion like execution time, memory access times and power
consumption. We expect that the setup of the OS and hard-
ware has a significant impact on the results, and that a
proper setup can greatly reduce measurement errors.
This document is structured as follows: We start with a
general survey of microarchitectural and OS elements affect-
ing performance in modern processors in Section 2. In the
next section we list possible sources of measurement errors,
followed by a proposed measurement setup in Section 4. Fi-
nally, we show several examples of how the setup influences
the results, before we conclude this report.
2 Microarchitectural and OS Performance
Modulators
We start by giving an overview on hardware and OS fea-
tures and how they influence the performance of software.
In essence, this section is an elaboration of the interactions
depicted in Figure 1. The well-informed reader might di-
rectly proceed with the subsequent section, which identifies
sources of measurement errors based on the details presented
here.
Although we try to keep the explanations generic, it would
be impractical to make only statements that cover any con-
ceivable processor. The details are therefore given for an In-
tel 2nd generation Core™ x86-64 microarchitecture (“Sandy
Bridge”). This is a pipelined, four-wide superscalar multi-
core processor with out-of-order processing, speculative
execution, a multi-level cache hierarchy, prefetching, and a
memory management unit (MMU). This kind of processor
is currently used in high-end consumer computers, and has
well-proven architectural features that we expect to see in
embedded processors in the coming years; many of them
are already available in ARM SoCs [3]. The details of our
machine are summarized in Tab. 1. As operating system (OS)
we consider Linux, specifically in an SMP configuration.
Unit               Properties
Processor          Intel Core i7-2640M @2.8GHz, dual-core
Microarchitecture  “Sandy Bridge”, Microcode version 0x25
L1i/d cache        32kB, 8-way, 64B/line, per core
L2 unified cache   256kB, 8-way, 64B/line, writeback, per core
L3 uncore cache    4MB, 16-way, 64B/line, shared between all cores
TLB                2-level, second level unified, per core
OS                 Debian 8.11 GNU/Linux SMP 3.16.51-3
Table 1. System specifications
Throughout this report, we give some numbers to illustrate
the magnitude of some effects. These numbers are given to
the best of our knowledge, based on the best available vendor
documentation, and supported by measurements that we have
taken. The numbers are all based on the following system
penalties, caused by the typical design of CPUs and OSes.
Penalties: The various latencies for our system are sum-
marized in Tab. 2, and discussed in the following. They are
taken from the processor documentation [12], and extended
by measurements using lmbench [24] and pmbench [32]. The
values denoted “best case” stem from the manufacturer doc-
umentation. Our measurements show that these values are
also the most frequently observed ones. Note, however, that
penalties can (and should) be hidden by out-of-order pro-
cessing. That is, not every page walk delays computation for
30 cycles.
Event                 Condition          Latency [cycles]
L1 cache hit          best case          4
L2 cache hit          best case          12
L3 cache hit          best case          26..31
DRAM access           best case          ≈200
branch misprediction  most frequent      20
TLB miss              2nd-level TLB hit  7
page walk             most frequent      ≈30
minor page fault      most frequent      ≈1,000 .. 4,000
major page fault      most frequent      ≈260k .. 560k
context switch        best case          ≈3,400
Table 2. Penalties for system in Table 1
Software Performance: In this report, we look at software
performance, mainly from a timing point of view. Specifically,
for the execution time of the process, we only consider the
time where the processor executes instructions on behalf of
the process (including kernel code and stalls), but not sleep-
ing or waiting states, since the latter are either voluntary or
depend on the execution context and not the program itself.
2.1 Microarchitecture
For an overview of the features discussed next, consider
the microarchitectural block diagram shown in Fig.2.
2.1.1 Number of Cycles and Instructions
As a single number indicating program speedup, we first
look at the number of processor cycles spent on execution.
Lower is better. The number of clock cycles is primarily
driven by the instructions being executed, whereas the exact
relationship is defined by the implementation of the microar-
chitecture. The relationship is usually nontrivial and not
fully documented; therefore, we have refrained from indicating
the relative influence of events on clock cycles in Fig.1. We will
provide typical numbers in the following descriptions, as far
as they are known.
[Figure 2: block diagram. Front end: L1 instruction cache (32KiB, 8-way),
instruction TLB, instruction fetch and predecode (16 B window), instruction
queue, 4-way decode (one complex and three simple decoders), microcode
sequencer ROM, decoded stream buffer/µOP cache (1.5k µOPs, 8-way), branch
predictor (BPU), stack engine, allocation queue (IDQ), loop stream detector
(LSD). Execution engine: register alias table (RAT), branch order buffer
(BOB, 48 entries), reorder buffer (168 entries), unified reservation station
(54 entries), integer and vector physical register files (160 and 144
registers), execution ports 0 to 5. Memory subsystem: L1 data cache (32KiB,
8-way), data TLB, store buffer, load buffer (64 entries), line fill buffers
(10 entries), L2 cache (256KiB, 8-way), unified STLB.]
Figure 2. Microarchitecture of Intel Sandy Bridge as example of a superscalar out-of-order processor with caches, branch
predictor and prefetcher. Taken from [4], and checked for consistency against manufacturer datasheets [12].
If we count both the number of retired instructions and
processor cycles, we can compute the ratio instructions per
cycle (IPC). While often used as a first performance indicator,
this figure is highly program-specific, and not helpful for
judging our software. For example, a program may execute
faster than before although the IPC has dropped, simply by
virtue of a reduced number of instructions.
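To make the last point concrete, the following minimal sketch computes the IPC and the execution time for two hypothetical versions of a program. The counter readings are invented for illustration and are not measurements from our machine; only the clock frequency is taken from Tab. 1.

```python
from dataclasses import dataclass

CLOCK_HZ = 2.8e9  # nominal clock frequency of the machine in Tab. 1


@dataclass
class Counters:
    """Hypothetical counter readings for one program run."""
    instructions: int  # retired instructions
    cycles: int        # processor cycles

    @property
    def ipc(self) -> float:
        return self.instructions / self.cycles

    @property
    def seconds(self) -> float:
        return self.cycles / CLOCK_HZ


# Invented numbers: the optimized version retires far fewer instructions.
before = Counters(instructions=8_000_000_000, cycles=4_000_000_000)
after = Counters(instructions=3_000_000_000, cycles=2_500_000_000)

for name, run in (("before", before), ("after", after)):
    print(f"{name}: IPC={run.ipc:.2f}, time={run.seconds:.3f} s")
```

In this made-up case the IPC falls from 2.0 to 1.2, yet the cycle count and hence the execution time decrease, which is why we do not judge software by IPC alone.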
Micro-operations (uops): Some processors split instruc-
tions into multiple smaller operations [3, 12], which pro-
vides more possibilities for out-of-order processing [8]. On
such targets, it is more meaningful to measure uops instead
of instructions, since a single instruction may become a
variable-length series of uops. Some, usually more complex
instructions, may even invoke Microcode.
Microcode: On some processors, instructions are not hard-
wired but interpreted, to be broken down to hardwired ma-
chine code during execution. This abstraction layer can incur
a slowdown, but it allows upgrading processors during their
lifetime, e.g., to fix processor errata. On x86, microcode is
only used for few complex instructions; specifically, only for
instructions generating more than four uops in our proces-
sor [8]. The performance can be impacted in two ways: First,
switching between the regular instruction stream and Mi-
crocode may incur a penalty, and second, Microcode generat-
ing a lot of uops may be limited by the frontend bandwidth.
Uops Cache: Instructions are decoded into uops by a hard-
ware circuit that can be limited in throughput, causing a
bottleneck if the average instruction length exceeds its ca-
pacity. Some processors add a cache to store decoded uops,
in an attempt to alleviate this problem [8].
2.1.2 Pipelining
Instead of processing one instruction or uop after another,
processing often overlaps in time to reduce execution
time, which is called pipelining. For example, while one
instruction is being executed, the subsequent one can already
be decoded to uops in parallel, and meanwhile the uops
from the previous instruction can be retired. Today it is very
rare to find processors or even microcontrollers that do not
implement some form of pipelining.
2.1.3 Out-of-Order and Superscalar Processing
Modern processors implement dynamic scheduling of in-
structions [3, 8]. That is, the order in which instructions are
executed, may deviate from the original order of the instruc-
tion stream given by the program counter. This enables the
processor to hide various processing latencies, by perform-
ing other work while waiting for an instruction to finish. For
example, an arithmetic division can take several cycles to
complete, meanwhile successor instructions can be executed,
up to the point where the result of the division is required.
Similarly, memory access latencies can be hidden. This is
one of the primary reasons why it is nontrivial to predict
the number of clock cycles by which a certain event slows down
a program.
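The latency hiding described above can be illustrated with a toy model. The sketch below compares strictly sequential execution with an idealized dataflow schedule, in which each operation starts as soon as its operands are available. The operation names and latencies are invented, and the model assumes unlimited execution units and no issue-width limit, so it is not a model of Sandy Bridge.

```python
# Each entry: (name, latency in cycles, names of operations it depends on).
# Latencies are invented for illustration, not taken from a datasheet.
PROGRAM = [
    ("div",  20, []),
    ("add1",  1, []),
    ("add2",  1, ["add1"]),
    ("mul",   3, ["add2"]),
    ("use",   1, ["div", "mul"]),  # first consumer of the division result
]


def sequential_cycles(program):
    """Every operation waits until its predecessor has completed."""
    return sum(latency for _, latency, _ in program)


def dataflow_cycles(program):
    """Every operation starts as soon as all of its operands are available.

    Assumes unlimited execution units and no issue-width limit.
    """
    finish = {}
    for name, latency, deps in program:
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + latency
    return max(finish.values())


print("sequential:", sequential_cycles(PROGRAM))  # 26
print("dataflow:  ", dataflow_cycles(PROGRAM))    # 21
```

The three independent operations complete in the shadow of the division, so their latencies do not add to the total. The same event can therefore cost anything between zero and its full latency, depending on the surrounding instructions.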
As a side effect, such out-of-order (OoO) processing en-
ables superscalarity. By having more than one instance of
each functional unit (e.g., several ALUs or FPUs), it becomes
possible to execute several instructions in parallel. Thus, OoO
processors often issue multiple instructions at the same clock
cycle; for our machine, four uops are issued simultaneously,
thus the maximum IPC is four.
An algorithm for OoO was proposed by Robert Tomasulo
in 1967 [28], and today’s out-of-order processor implemen-
tations still follow the same concept [8, 12]. The OoO im-
plementation is often called execution engine, and for the
purpose of our performance measurements, the following
constituents must be considered: 1. Register Renaming and
2. Scheduler.
Register Renaming: Register names used by the compiler
are those defined by the instruction set architecture (ISA),
called architectural registers. Due to the limited number of
registers, the compiler might be forced to re-use registers
for computations that are otherwise not related, creating
false data dependencies. However, actual implementations
of the ISA may have more physical registers than architec-
tural ones. Therefore, the first step is to map architectural to
physical registers, while resolving some false dependencies.
This renaming process therefore improves superscalar and
OoO processing [8].
Scheduler: The renamed operations now enter the main
core of OoO processing, the scheduler. Here operations are
queued up for execution units in reservation stations, and
will be held there until all operands become available. When
this becomes the case, the operation is started on the execu-
tion unit. As soon as it completes, results are propagated to
all subscribers (e.g., registers or reservation station entries
waiting for a result), and the operation is forwarded from the
reservation station to the Reorder Buffer. From here, most
schedulers also take care of retiring the instructions in their
original order. A bottleneck may occur when the scheduler
stalls execution due to lack of free reservation stations.
2.1.4 Branch Mispredictions
Many microarchitectures perform some kind of branch pre-
diction, to hide the latency for loading the instructions and
data after a branch [3, 8]. Predictions for both the outcome
of two-way branches, and the (possibly multi-way) target
address of indirect branches, are being made by the branch
prediction unit.
If the branch is incorrectly predicted, then the pipeline and
other resources must be flushed, which means there is a time
penalty to fetch and decode the correct instructions. The
magnitude of the penalty primarily depends on the pipeline
depth. In case of Sandy Bridge, the penalty for flushing and
resuming execution is about 20 cycles [8, 12].
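As a rough plausibility check, the share of cycles attributable to mispredicted branches can be estimated from two counter readings. The sketch below is only a first-order estimate under the assumption that every misprediction costs the full 20 cycles; the function name and the counter values are hypothetical, and the indirect effects discussed in the next section are ignored.

```python
MISPREDICT_PENALTY_CYCLES = 20  # Sandy Bridge value quoted in the text


def misprediction_share(total_cycles: int, mispredictions: int,
                        penalty: int = MISPREDICT_PENALTY_CYCLES) -> float:
    """Estimated fraction of cycles lost to mispredicted branches."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return min(1.0, mispredictions * penalty / total_cycles)


# Hypothetical counter readings, not measurements.
share = misprediction_share(total_cycles=1_000_000_000,
                            mispredictions=5_000_000)
print(f"about {share:.0%} of all cycles")
```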
2.1.5 Speculative Execution
The time window between branch prediction and learning
the actual branch outcome is spent with speculative execution.
That is, the processor continues the control flow at the as-
sumed branch target, and buffers the results until the actual
branch target becomes known. Whenever the prediction was
incorrect, it flushes the pipeline as described above. If the
prediction was correct, the speculatively executed instruc-
tions are allowed to be released from the buffer (“retire”),
and no time was lost waiting for the branch outcome.
There is one lesser known side effect, however, dubbed
pollution in Fig.1. Although instructions executed during
mis-speculation are not retired, they can still cause changes
in cache and buffer states. These effects are an indirect cost of
branch mispredictions, which manifest themselves during
later execution. These effects have recently been exploited
in the Spectre and Meltdown vulnerabilities [19].
Last but not least, even perfect speculation might become
a performance bottleneck in some cases. All speculatively
taken branches are stored in the branch order buffer (BOB)
until they are confirmed. However, too many speculations in
a short time window might cause the BOB to fill up, which
in turn stops the issue of new uops [12].
2.1.6 Machine Clears
When multiple threads can run truly in parallel (as on SMP
systems and especially with OoO processing), the ordering of
memory accesses must be monitored and ensured. If the CPU
detects that accesses complete in an order that differs from program order,
a machine clear is performed. This entails undoing some
operations, flushing the pipeline, and re-starting with the
correct operands [12]. The cost is comparable to a branch
misprediction. Further causes for machine clears are self-
modifying code and illegal AVX addresses.
2.1.7 Cache Hierarchy and Misses
Although memory access is fast these days, it can still be
orders of magnitude slower than the processor, a gap that has
become known as the Memory Wall [31]. Consider the penalty
for a major page fault (disk access) shown in Tab. 2: At least
five orders of magnitude lie between the processor and the
disk access, although we use a relatively fast Solid State
Disk (SSD). Caching is the omnipresent approach to coun-
teract this issue, by preloading or latching all data in faster,
smaller memories, and exploiting the principle of locality.
That is, data that has been accessed recently is likely to be
accessed again in the near future. It thus makes sense to
buffer recently used data in caches. Every time a data item
is requested (“data access” in Fig.1), we check for the desired
data in the faster cache, before accessing the slow memory.
If we find the data, called a cache hit, we have circumvented
waiting for the slow memory. If the cache does not contain
the data, called a cache miss, we have to pay the penalty for
accessing the slow memory (usually DRAM). After this, the
data is placed in the cache for future reference.
Since caches have to be small to be fast, there are inevitably
situations when data is not in the cache, and needs to be
loaded from slower memory. Even if we were able to per-
fectly predict what data (including instructions) is needed,
then there are still compulsory misses on first access. As a
result, execution can be slowed down by one or two orders
of magnitude even with fast caches and very high hit ratio.
For example, let us assume each instruction takes one cycle,
and that each cache miss costs five extra cycles. Then, even
with p = 95% hit ratio, a program with X instructions would
take X + X · (1 − p) · 5 = 1.25X cycles, i.e., experience a 25%
slowdown. Measuring cache behavior is therefore important.
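The example above generalizes to other miss penalties. The following sketch evaluates the same expression, assuming as in the example that every instruction takes one cycle and performs one access that may miss. The function name is ours; the second call plugs in the approximate DRAM latency from Tab. 2.

```python
def slowdown(hit_ratio: float, miss_penalty: float,
             base_cpi: float = 1.0) -> float:
    """Execution time relative to a run in which every access hits."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must lie in [0, 1]")
    return (base_cpi + (1.0 - hit_ratio) * miss_penalty) / base_cpi


print(f"{slowdown(0.95, 5):.2f}")    # penalty used in the example above
print(f"{slowdown(0.95, 200):.2f}")  # approximate DRAM latency from Tab. 2
```

With the 5-cycle penalty the result is 1.25, as computed above. With a penalty of about 200 cycles and the same 95% hit ratio, the factor rises to 11, which is the order-of-magnitude slowdown mentioned earlier.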
Victim Caches: These are small and fully associative caches,
holding items evicted from a larger, not fully-associative one.
Each miss in the large cache is first looked up in the victim
cache, before the slower memory is consulted. This masks
miss times for temporally close conflict misses. From a prac-
tical point of view, the victim cache need not be considered
separately; instead, a victim hit can be seen as a hit in the
faster cache. Vice versa, a victim miss can be seen as a hit in
the next-slower memory or cache.
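The masking of conflict misses can be reproduced with a small simulation. The sketch below models a direct-mapped cache backed by a fully associative victim cache. The sizes, the eviction policy of the victim cache (oldest entry first) and the access trace are assumptions chosen for illustration; they do not describe any particular processor.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Toy direct-mapped cache with a fully associative victim cache."""

    def __init__(self, num_sets: int, victim_entries: int):
        self.num_sets = num_sets
        self.victim_entries = victim_entries
        self.main = [None] * num_sets   # one cache line per set
        self.victim = OrderedDict()     # oldest entry is evicted first
        self.stats = {"hit": 0, "victim_hit": 0, "miss": 0}

    def _to_victim(self, line):
        if line is None or self.victim_entries == 0:
            return
        self.victim[line] = True
        self.victim.move_to_end(line)
        if len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)

    def access(self, line: int):
        index = line % self.num_sets
        if self.main[index] == line:
            self.stats["hit"] += 1
            return
        if line in self.victim:
            self.stats["victim_hit"] += 1  # served without slow memory
            del self.victim[line]
        else:
            self.stats["miss"] += 1        # slow memory must be consulted
        evicted, self.main[index] = self.main[index], line
        self._to_victim(evicted)


def run(trace, victim_entries):
    cache = DirectMappedWithVictim(num_sets=8, victim_entries=victim_entries)
    for line in trace:
        cache.access(line)
    return cache.stats


# Lines 0 and 8 map to the same set and keep evicting each other.
TRACE = [0, 8] * 5
print(run(TRACE, victim_entries=0))  # 10 misses
print(run(TRACE, victim_entries=2))  # 2 misses, 8 victim hits
```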
Hierarchy: Like most modern processors, our machine uses
a hierarchy of caches. That is, there are three caches in a
cascade (three “levels”) before the slower DRAM is accessed.
The first level (L1) is the fastest/smallest and separate for
data (L1d) and instruction (L1i), see Tables 1 and 2. Leaving
aside special architectural tricks, this is the only level that
can be accessed directly by the CPU. Higher levels (L2, L3)
are larger and slower, and usually unified. Depending on the
processor, some caches can be shared with peripherals or
other processors.
2.1.8 DRAM Access
The next-larger memory after the cache hierarchy is Dynamic
Random Access Memory (DRAM). Conceptually, the cache
hierarchy acts as a frontend and buffer to DRAM accesses.
Only if the lookup of data or instructions missed at all levels
in the cache hierarchy (Fig.1 “higher-level cache miss”; not
necessarily sequentially, though), then the DRAM is consulted.
Accesses to DRAM are typically 100 times slower
than the CPU, thus the penalty for missing the entire cache hierarchy
becomes steep. Even worse, DRAM is often accessed
via the Northbridge, possibly suffering contention with other
CPUs and DMA transfers [7].
2.1.9 Hardware Data Prefetching
To further reduce access times to slow memory, many pro-
cessors have a data prefetcher circuit, which predicts future
data accesses and actively pre-loads the data into caches.
Most prefetchers are triggered by certain access patterns in
cache misses [7]. Some newer processors may even cross
page boundaries and thus influence the TLB [12, §2.4.7]. In
summary, prefetch events both depend on and influence L1d
cache events, as shown in Fig.1.
2.1.10 ISA Extensions, Streaming/Vector/SIMD
Processors may extend the ISA with instructions applying
the same operation on multiple data items (vectors, single-
instruction-multiple data). Examples are the extensions AVX,
SSE, MMX on Intel and AMD, and NEON on ARM proces-
sors. Using these can greatly speed up certain calculations,
but switching to these modes may also cause extra penal-
ties [8, §9.1.2]. In general, any switch between ISA modes or
extensions may cause extra penalties.
2.1.11 Direct Memory Access (DMA)
Accessing secondary storage, such as hard drives, but also
traffic from network cards, usually takes place via DMA,
which enables peripherals to exchange data directly with
the DRAM. As mentioned earlier, this might lead to bus
contention and slow down DRAM accesses for the processor.
2.1.12 Cache Coherency Protocols and ring bus
On multi-core systems, further cache accesses are caused by
cache coherency protocols, e.g., between the caches of the
different cores [7]. These protocols become active during core migrations,
but also in the presence of data shared between cores.
2.1.13 Neglected Features
There are many peculiarities to each microarchitecture. We
have omitted many such details, in an attempt to focus on
those features prevalent in most processors. Of course, these
details can be important for the performance, and a careful
study of the microarchitecture is required to see which need
to be measured. Omissions include instruction and uop fu-
sion, loopback buffers, register stalls, exhaustion of execution
ports, cache bank conflicts, misaligned memory access, and
store forwarding stalls. These are not considered to have a
large or systematic performance impact on Sandy Bridge [8].
2.2 Microarchitecture and OS Interaction
2.2.1 Virtual Memory and Paging
Virtual Addressing is common in larger general-purpose
and application processors, and is done for a variety of rea-
sons, chief among which are process isolation and provision
of a contiguous address space from the process’ point of
view. However, it also enables paging, which helps to sig-
nificantly mitigate the latency of accessing slow secondary
storage, such as hard drives. Unfortunately, Virtual Addressing
comes with a performance penalty that can vary widely.
Translation needs to be performed for every instruction and
data reference (see Fig.1), thus there is an obvious incen-
tive to minimize delays. Therefore, in practice the address
translation process is very intricate and specific to the mi-
croarchitecture. An in-depth description can be found in [7].
Usually a hardware unit called the Memory Management Unit
(MMU) is responsible for the address translation. To ensure
that translations do not stall execution, most MMUs have
their own caches, called translation lookaside buffers (TLBs),
which hold the most recently translated addresses. In case
the buffers do not provide the needed translation (TLB miss),
then a slow search in memory has to be conducted, called
a page walk, see Fig.1. On x86 machines, this means that
a dedicated hardware circuit starts looking for page table
entries in memory, which incurs a penalty depending on
memory access latency. For other architectures, like some
PowerPC, this might be done in software, and thus is even
slower. In case the information was found in the page table,
the TLB is updated and execution resumes. Otherwise, the
operating system is signaled a page fault, and has to decide
whether the access is allowed. If so, the OS updates the page
table and TLB with the requested translation, and possibly
brings missing data into DRAM. The cost of virtual address
translation therefore consists of cycles spent in TLB
lookup (hardware), plus page walk cycles (often hardware),
plus cycles to handle page faults (OS).
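The translation sequence described above (TLB lookup, page walk, page fault) can be sketched as a toy model that counts events and accumulates penalties. This is a strong simplification: it uses a single-level TLB with least-recently-used replacement, treats every page fault as a minor fault, and charges fixed penalties taken from Tab. 2. The class name and the access sequence are hypothetical.

```python
from collections import OrderedDict

PAGE_WALK_CYCLES = 30       # "most frequent" value from Tab. 2
MINOR_FAULT_CYCLES = 1000   # lower end of the range in Tab. 2


class ToyMMU:
    """Single-level TLB plus page table; all faults are treated as minor."""

    def __init__(self, tlb_entries: int):
        self.tlb_entries = tlb_entries
        self.tlb = OrderedDict()  # virtual page -> frame, in LRU order
        self.page_table = {}      # filled lazily by the "OS"
        self.next_frame = 0
        self.events = {"tlb_hit": 0, "page_walk": 0, "page_fault": 0}
        self.penalty_cycles = 0

    def _fill_tlb(self, page, frame):
        self.tlb[page] = frame
        self.tlb.move_to_end(page)
        if len(self.tlb) > self.tlb_entries:
            self.tlb.popitem(last=False)  # evict least recently used entry

    def translate(self, page: int) -> int:
        if page in self.tlb:
            self.events["tlb_hit"] += 1
            self.tlb.move_to_end(page)
            return self.tlb[page]
        # TLB miss: search the page table (page walk).
        self.events["page_walk"] += 1
        self.penalty_cycles += PAGE_WALK_CYCLES
        if page not in self.page_table:
            # No translation found: the OS handles a page fault.
            self.events["page_fault"] += 1
            self.penalty_cycles += MINOR_FAULT_CYCLES
            self.page_table[page] = self.next_frame
            self.next_frame += 1
        frame = self.page_table[page]
        self._fill_tlb(page, frame)
        return frame


mmu = ToyMMU(tlb_entries=2)
for page in [0, 1, 0, 2, 0, 1]:
    mmu.translate(page)
print(mmu.events)          # 2 TLB hits, 4 page walks, 3 page faults
print(mmu.penalty_cycles)  # 4 * 30 + 3 * 1000 = 3120
```

The last access in the sequence shows a page walk without a page fault: page 1 has been evicted from the TLB, but its translation is still present in the page table.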
Translation Lookaside Buffer: TLBs are regularly flushed
by the OS, since it is responsible for the coherency between
TLBs and page tables in memory, and for ensuring that the TLBs do not
hold stale translations from a formerly running process. They
can either be flushed or selectively updated, depending on
the OS and hardware capabilities. Accesses after a flush are
TLB misses, therefore flushes degrade performance. In the
x86 architecture, there furthermore exists the case of TLB
shootdown. This stems from the need to have consistent TLB
entries between multiple cores in the presence of sharing.
Since x86 does not have a coherency protocol in place, TLB
contradictions between cores are avoided by triggering flush-
ing in hardware. Last but not least, TLB and cache access
can be executed in parallel, to reduce latency.
Page Walks: Examining the page tables in memory incurs a
penalty that depends on whether the tables are cached (L1d,
L2, L3), or whether the slow DRAM needs to be consulted.
On our machine, a typical page walk costs between 20 and
60 cycles. In an extreme case (“TLB thrashing”), this could
happen for every instruction, and thus become very expen-
sive. On some microarchitectures, as in Intel Sandy Bridge,
page walks go through the caching hierarchy. That is, page
tables are buffered in L1d and below, and thus caches can be
modified by page walks. Consequently, caching behavior and
TLB misses cannot always be separated, even in the absence
of page faults. Additionally, it has recently been disclosed
that Intel processors speculatively work on possibly invalid
cache entries in parallel to page walks (see “L1TF” vulnera-
bility [6]). It can therefore be assumed that even the latency
of page walks is partially hidden.
Page Faults: If a translation cannot be found in the page
table, a page fault is signalled from the MMU to the OS (Fig.1
bottom). There are three fundamental types of page faults:
1. Invalid page fault, 2. major page fault, or 3. minor page
fault.
Invalid page faults are those caused by an attempt to ac-
cess addresses that are beyond the process’ address space, or
where privileges are insufficient. Segmentation faults are an
example. We do not discuss them further, since they are
pathological events pointing to faulty software.
Major page faults require disk access, which is orders
of magnitude slower than the effects we are trying to observe
here. For our SSD-equipped machine, we used the
pmbench [32] tool and found the most frequent latency for
major faults to be between 262 thousand and 524 thousand cycles
(coinciding with the median), for both read and
write accesses. Additionally, the distribution is heavy-tailed,
similar to [32]. That is, the estimated average is about
one million cycles, due to some accesses exceeding several
dozen milliseconds.
Closely related to page faults, Linux implements a page cache:
the virtual memory buffers data blocks of recently used files.
When files are read (e.g., via fread or mmap), then access is by
default buffered via the page cache. Therefore, if a program
is executed a second time and there was sufficient memory,
there ideally are no major page faults due to file access. To
further reduce first-time access latency, the Linux kernel
proactively reads file data from disk before it is demanded
(“read ahead”), which is no longer counted as major page
fault. Finally, if DRAM is exhausted and swapping is enabled,
unused pages are temporarily written to secondary storage
(“swapped out”). When they are needed again, they have to
be brought back to DRAM, which also causes major page
faults [9].
Minor page faults are caused by memory allocation with-
out disk access, but may still be problematic for our mea-
surements. The memory can either be immediately allocated,
or only reserved (“lazy”). In the latter case, which is the de-
fault case in Linux, the page is only created when the first
write occurs (“copy-on-write”). This means that the penalty
of minor page faults consists of two parts: the cost for the
fault handler itself, and the conditional copy-on-write cost.
Measuring the cost of a minor page fault is considerably more
complicated than for major faults. First, the latency is relatively
short, resulting in inaccuracy when we try to measure it
using software. Tracing kernel functions is one option that
we have exercised, but it has some non-negligible overhead
in our kernel version, only giving us an upper bound. Hard-
ware measurements are not possible in this case. The result
is a somewhat wide range for the minor page fault cost. Us-
ing pmbench, we found that the mode lies between 1,024
and 4,096 cycles, again coinciding with the median. Note
that this includes lazy allocations. The numbers are consis-
tent with recent measurements by Torvalds on the successor
microarchitecture Haswell [29].
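While the latency of minor page faults is hard to measure, their number is easy to observe. The following sketch is not the method used for the numbers above; it merely counts minor faults via the getrusage interface (field ru_minflt) while a freshly mapped anonymous region is written for the first time. It assumes a Linux system; the function name and the region size are our choice, and the count includes any unrelated faults of the interpreter during the loop.

```python
import mmap
import resource


def minor_faults_while_touching(num_pages: int) -> int:
    """Count minor page faults while writing to freshly mapped pages."""
    page_size = mmap.PAGESIZE
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    # Anonymous mapping: address space is reserved, pages are created lazily.
    region = mmap.mmap(-1, num_pages * page_size)
    for offset in range(0, num_pages * page_size, page_size):
        region[offset] = 1  # first write to each page
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    region.close()
    return after - before


if __name__ == "__main__":
    pages = 256
    faults = minor_faults_while_touching(pages)
    print(f"touched {pages} pages, observed {faults} minor faults")
```

On a system with lazy allocation we would expect roughly one fault per touched page, but the exact count depends on the kernel configuration.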
2.2.2 Context Switch
Switching between processes happens frequently (discussed
later in Section 2.2.6), and is an expensive operation that
influences the TLB and caches, see Fig.1.
Besides executing a number of instructions to save and
load hardware registers, stack pointer and PC of suspended
and waking processes, the pipeline is flushed, and the TLB
must also be updated [21]. On most Linux versions the TLB
update takes the form of a flush, accompanied by switch-
ing the page table pointer. This can only be avoided with
hardware that supports process context identifiers (PCIDs)
and with newer kernels supporting this feature – such as
x86 on Kernel 4.14 onwards [22]. The penalty for such TLB
invalidation is called the direct cost of a context switch [21].
However, the CPU caches are also affected, which incurs
indirect costs: Because processes share caches among themselves
and also with the kernel, the waking process may not
find its data in the caches as it left them when it was suspended,
and thus experiences cache misses. This effect intensifies
with growing working set size [21]. Context switches can
happen involuntarily due to interrupts, or voluntarily due to
system calls (both discussed later). Using lmbench, we found
that context switches on our system take at least 3,400 cycles,
with a frequent value around 30,000 cycles. Note that this
number would increase if a core migration happens at the
same time, which is explained next.
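The distinction between voluntary and involuntary context switches is also visible from user space. The sketch below is an illustration and not part of our lmbench measurements: it reads the per-process counters ru_nvcsw and ru_nivcsw via getrusage before and after a workload. It assumes a Linux system; the helper names and the sleeping workload are hypothetical.

```python
import resource
import time


def context_switches_during(action):
    """Return (voluntary, involuntary) context switches caused by action()."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    action()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (after.ru_nvcsw - before.ru_nvcsw,
            after.ru_nivcsw - before.ru_nivcsw)


def sleepy_workload():
    # Each sleep is a system call that gives up the processor voluntarily.
    for _ in range(20):
        time.sleep(0.001)


if __name__ == "__main__":
    voluntary, involuntary = context_switches_during(sleepy_workload)
    print(f"voluntary: {voluntary}, involuntary: {involuntary}")
```

Such counts tell us how often a measured program was interrupted, but not how many cycles each switch cost.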
2.2.3 Core Migrations and Load Balancing
The OS (and the hardware, if Hyperthreading is enabled) may
migrate a running process between cores (“core migrations”
in Fig.1), to balance load or thermal stress. This causes a
context switch with additional overhead. The process must
be stopped, copied to another core’s run queue, cache lines
are moved to the new core [7], and only then both scheduling
domains are released. Naturally, this requires running a lot
of kernel code, which in turn increases the chances for events
like branch mispredictions, cache pollution and page walks.
We have observed latencies beyond 100,000 cycles for
migrations, depending on the working set size.
2.2.4 Mode Switch
Switching between user and kernel mode is not a context
switch, and thus has lower overhead. These switches are
caused by system calls done by the running program (thus
the bidirectional interaction between instructions and mode
switches in Fig.1) for which the kernel is supposed to perform
some work on behalf of the calling process. The kernel is
then said to be in process context.
In Linux on x86, these calls involve copying the arguments
to registers, triggering a trap (CPU changes to kernel mode),
executing the trap handler (which copies the arguments from
the registers to the kernel stack and then performs its work),
and eventually returning to user mode. Depending on the
processor and OS, the TLB might be invalidated. This has been
the case for x86 since the Spectre and Meltdown vulnerabilities
of 2017: kernel page table isolation (KPTI) now invalidates
TLB entries during each mode switch to protect the kernel from
attacks, unless the hardware supports PCIDs (see above).
Finally, when exiting the kernel mode, a context switch to a
different process might take place.
Our kernel has KPTI enabled, but no PCID support. We
created a test program to measure the direct penalty from
KPTI. We found that the fastest syscall drops from at most 660
cycles down to less than 130 cycles when KPTI is disabled¹.
Kernel developers have measured large slowdowns as well,
with up to 30% in networking code [5]. That is significant,
and thus needs to be considered during measurements.
In summary, mode switches also have a direct penalty
caused by the call overhead, and indirect penalties caused
by TLB and cache effects, depending on kernel version and
processor.
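The idea of taking the minimum latency over many calls of a cheap syscall can be sketched as follows. This is only an illustration and not the program from Appendix A: it measures in nanoseconds rather than cycles, the function name `min_syscall_latency_ns` is hypothetical, and the result includes Python interpreter overhead, so it is a loose upper bound on the mode-switch cost.

```python
import os
import time


def min_syscall_latency_ns(calls=100_000):
    """Minimum observed duration of one os.getuid() call, in nanoseconds.

    The minimum (rather than the mean) is used because it is least
    affected by interrupts, cache misses and other disturbances.
    """
    best = None
    for _ in range(calls):
        start = time.perf_counter_ns()
        os.getuid()  # cheap system call: forces a user/kernel mode switch
        elapsed = time.perf_counter_ns() - start
        if best is None or elapsed < best:
            best = elapsed
    return best
```

With a fixed clock frequency, the value can be converted to cycles by multiplying with the frequency in GHz. Comparing the result with and without KPTI only makes sense on otherwise identical configurations.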
2.2.5 Interrupts
Typically, a few dozen interrupts² can arrive at any point
in time. For example, there are thermal interrupts in case the
hardware overheats, and machine check exceptions indicat-
ing hardware errors in the CPU. Some interrupts cannot be
avoided, while others can be disabled or redirected to differ-
ent cores. Interrupts do not directly cause context switches
as described above. Instead, the CPU itself saves and restores
very few registers (notably, the PC), and then a mode switch
happens [2], see also Fig.1. After this, the interrupt handler
must take care of the rest. Specifically, the kernel then is
running in interrupt context, as opposed to process context,
and the work being done here is not attributed to any user
process.
The interrupt handler itself is supposed to be fast and
must not suspend execution. This is also called the top half.
Interrupts which require longer-lasting or I/O work to be
done, must defer such work for later processing in a so-called
bottom half [16].
Once the top half handler finishes, the OS switches back
to user mode. As explained before, a context switch might
happen during this transition. Specifically, if the interrupt
had deferred some work to a bottom half, then a context
switch to a kernel thread might happen at this point in time,
to finish the interrupt work.
¹ This upper bound was measured as the minimum latency over many
calls of getuid, with the program given in Appendix A. This program
experienced a 3x speedup!
² see /proc/interrupts
One special periodic interrupt is the local timer interrupt
driving the scheduler, also called “scheduling clock ticks”,
and discussed next.
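To see which interrupts arrive on which core, the per-CPU counters in /proc/interrupts can be summed up. The parser below is a minimal sketch that assumes the usual layout of that file (a header row of CPU names, then one row per interrupt source); the function names are hypothetical.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts content into (cpu_names, {label: [counts]})."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        # The first len(cpus) fields are counters; the rest is a description.
        for field in rest.split()[:len(cpus)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        table[label.strip()] = counts
    return cpus, table


def interrupts_per_cpu(text):
    """Sum all per-CPU interrupt counters; rows without per-CPU data are skipped."""
    cpus, table = parse_interrupts(text)
    totals = [0] * len(cpus)
    for counts in table.values():
        if len(counts) == len(cpus):
            for index, count in enumerate(counts):
                totals[index] += count
    return dict(zip(cpus, totals))
```

Calling `interrupts_per_cpu(open("/proc/interrupts").read())` before and after a measurement and subtracting the totals indicates how many interrupts hit each core in between.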
2.2.6 Scheduler Ticks
By default, the Linux scheduler runs periodically, to select
which process is executed on the hardware for the next time
slice. This is achieved by a timer interrupt firing typically
every four milliseconds. Considering that interrupts cause
mode switches and possibly context switches (see Fig.1),
running the scheduler itself may unnecessarily suspend our
user task. Luckily, there are two options to configure the
kernel differently, commonly referred to as “tickless” [14]:
1. Dyntick-Idle, enabled by the kernel configuration op-
tion CONFIG_NO_HZ_IDLE=y, describes a kernel config-
uration that omits scheduling ticks when there is no
task to be executed. This is often the default for
desktop and embedded systems to save energy, but it is
not useful for our purpose.
2. Adaptive ticks, enabled by kernel configuration option
CONFIG_NO_HZ_FULL=y, describes a kernel configura-
tion that additionally turns off ticks when there is only
one runnable task, thus preventing unnecessary in-
terruptions. On top of the configuration option, it is
necessary to add a kernel boot parameter specifying
for which CPUs it shall be applied. RCU calls need to
be offloaded to other CPUs.
Both of these configurations also have disadvantages,
including an increased number of instructions to switch to/from
idle mode, and more expensive mode switches [14]. Using
again the program shown in Appendix A, we have found
that mode switch times in our system increase to about 700
cycles with adaptive ticks and KPTI disabled. Consequently,
tickless operation may not be beneficial for workloads with
many syscalls, or for those going idle frequently.
2.2.7 Neglected Features
The most important mechanisms of OS-hardware interaction
have been explained. Nevertheless, there are many other
features and effects that depend on the specific version of
the OS. Omissions include speculative paging [11], a recent
development that is not yet commonly used, and page
migration between NUMA nodes, since our target is a single
NUMA node.
3 Measurement Errors
This section summarizes features that can create measure-
ment errors, building on the explanations given before and
visualized in Fig.1. By error we subsume all effects that stem
from sources other than the software under analysis and,
when not properly controlled, would therefore mislead us in
a systematic or random direction. For example, we consider
effects caused by other processes running in the background
of an operating system as measurement error. Inter alia, they
share caches with the software under analysis, and thus
the caching behavior of our software can look significantly
worse if there exists a data-heavy background process.
Table 3 depicts all major sources of measurement errors,
together with a classification whether they are caused by
hardware (HW) or software (SW), their measurable impact,
and the time when the effects manifest in the measurements.
Additional explanations are given in the notes below.
Notes for Table 3:
(a) Only traffic not caused by the software under analysis is
considered an error. Such traffic might be caused by all
other processes running on the same core, including the
kernel. See process isolation.
(b) If the number of demand data accesses shall be measured,
then the prefetcher skews the results, otherwise it is not
considered a source of error. Also, prefetching has the
same effect as demand access on caches and TLB, thus it
can skew TLB access metrics.
(c) Only interrupts not caused and required by the software
under analysis are considered sources of error. Interrupts
may cause other processes to be scheduled to execute
bottom halves.
(d) The mode switches themselves are often required to ex-
ecute system calls. But mode switch overhead can be
highly different depending on OS and HW. Therefore, it
may or may not contribute a significant error.
(e) Allocators for virtual memory have different characteris-
tics. Some may consume significantly more/less memory
than others, and some may force context switches.
(f) Non-tickless kernels may generate unnecessary context
switches and impair TLB, caches and thus performance.
None of those switches are caused by the software under
analysis, and thus considered errors.
(g) Major page faults cause disk access. If the effects to be
measured are rather small compared to disk access times,
then major faults would mask those effects due to large
jitter, and must be avoided. Furthermore, it is worth not-
ing that a process accessing a file might be sent into
waiting state, and then the number of unhalted processor
cycles is not incremented. Consequently, I/O is not di-
rectly visible in counter-based measurements, and even
I/O-limited applications can still exhibit a high IPC. In-
directly, however, it can be seen that the process spends
less CPU time than wall-clock time, pointing to an I/O-
bottleneck.
(h) Depending on the microarchitecture, DMA might slow
down DRAM access because of increased traffic on the
Northbridge [7]. Naturally, DMA also generates inter-
rupts. Furthermore, cache coherency might be a con-
cern [25], and generate extra cache traffic.
(i) Some caches might be shared with peripherals and thus
be influenced by their operation. For example, the GPU
and the processor share the L3 cache on our machine.
This means that even different monitor setups can change
cache miss statistics and bandwidth benchmarks³.
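Note (g) observes that an I/O bottleneck shows up indirectly as CPU time being smaller than wall-clock time. A minimal sketch of this check is given below; the function name `cpu_to_wall_ratio` is hypothetical, and what counts as a "low" ratio is a judgment call for the workload at hand.

```python
import time


def cpu_to_wall_ratio(workload):
    """Run workload() and return its CPU time divided by its wall-clock time.

    A ratio close to 1 means the process was computing almost all the time;
    a ratio well below 1 means it spent much of the time waiting (e.g., for I/O).
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    workload()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 1.0
```

For example, `cpu_to_wall_ratio(lambda: time.sleep(0.2))` yields a value near zero, because a sleeping process, like one waiting for disk access, accumulates wall-clock time but almost no CPU time.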
3.1 Short-Living Programs
When an OS is used, there is necessarily an overhead for
starting and terminating a process. This can become signif-
icant when we measure the performance of programs that
have a short execution time. After all, most developers are
not trying to optimize the OS or its libraries, but their own
software. To give a rough number, programs that execute fewer
than a million instructions should be considered too short,
with the specific number depending on the amount of work
the OS has to do.
As an example, consider the call graph of a program (the
nsichneu from the Mälardalen WCET benchmarks, com-
piled with gcc and -O3) shown in Figure 3a. The user code,
all in a single function (black node in the upper right half),
executes about 40,000 instructions. The entire program, from
startup to after termination, requires about 150,000 instruc-
tions. The OS performs tasks such as dynamic loading (starts
in upper left corner and spreads about two thirds into the
picture), which locates and loads into memory all dynamic
libraries used by the program. The actual user code is exe-
cuted thereafter, and followed by another cascade of calls
performing cleanup actions. Figure 3b shows the same pro-
gram with static linking. The remaining overhead is mainly
from the C library, although this program does not make
any explicit library calls, as evident from the call graph. In
this example, the user code is only responsible for about 26%
of the processor cycles with dynamic linking, and for 75%
with static linking.
In summary, if we are not aware of such OS and library
overhead when looking at the measurement results, we may
draw conclusions that merely reflect the operating system
and its libraries, rather than the software that we are trying
to analyze and improve.
4 Measurement Setup
In this section, we propose a measurement setup on SMP
Linux, that minimizes measurement errors and thus provides
a faithful quantification of the software under analysis, while
suppressing other influences as far as possible.
All event interactions described in previous sections are
highly dependent on the microarchitecture, the operating
system and its configuration, in addition to the program under
analysis itself. We describe our measurement setup for the
system detailed in Table 1, as a representative of an advanced
superscalar out-of-order processor with operating system,
MMU, caches, prefetchers and so on. For different targets or
OSs, the setup has to be adapted accordingly.
³ We have observed a 10% decline in throughput in the STREAM
benchmark [23], when executed on a dual-monitor vs. single-monitor setup.
Figure 3. Call graphs of a short-living program, suggesting that most of the work being done is OS “overhead”. The user
function is highlighted in black. Both figures show the same program, where (a) is dynamically linked and (b) statically linked.
| Source | Class | Impact | When | Note |
|---|---|---|---|---|
| Speed Stepping/Frequency Scaling | HW | varying execution time | immediate | |
| TLB Shootdown | HW | slower address translation | lagging | |
| DMA Transfers | HW/SW | slower memory access | immediate | (h) |
| Cache Coherency* | HW | additional accesses to cache | immediate & lagging | (a) |
| Hardware Prefetcher* | HW | more cache & TLB accesses, fewer or more misses | immediate & lagging | (b) |
| Load Balancing/Migrations | SW | more cache & TLB accesses, fewer or more misses | immediate & lagging | |
| Interrupts | HW/SW | longer execution time, more mode switches | immediate & lagging | (c) |
| Mode Switches | SW | more cache/TLB misses, more context switches | immediate & lagging | (d) |
| Context Switches | SW | more cache/TLB misses, longer execution time | immediate & lagging | |
| Allocator | SW | more context switches, different cache usage | immediate & lagging | (e) |
| Scheduler | SW | more or fewer context switches | immediate & lagging | (f) |
| Major Page Fault* | SW/HW | more context switches | immediate & lagging | (g) |
| Peripherals | HW | more variance in memory access, cache misses | immediate & lagging | (i) |

Table 3. Summary of sources for measurement errors. Sources marked with (*) may or may not be considered as creating
measurement errors, depending on the intended observation.
The main focus of our setup is how to control variables
that would otherwise lead to the measurement errors listed
in the previous section. Therefore, many of the suggestions
given here are about parameters and configuration of both
the OS and the CPU. In the end, we briefly describe the
measurement process itself.
4.1 Control Variables
4.1.1 Isolating and Pinning the Process
First, one or more (yet not all) CPU cores were isolated from
the scheduler and SMP balancing, which prevents other
userspace tasks from interfering (kernel boot parameter
isolcpus). We recommend leaving CPU0 non-isolated, since one
core is required to process the offloaded tasks, and CPU0
usually serves interrupts that cannot be moved to other cores
(e.g., some DMA controllers). Additionally, Hyperthreading
was turned off in BIOS, to avoid hardware context switches,
same-core migrations and some known errata with the hard-
ware counters. The process under analysis was subsequently
pinned to the isolated cores with command line tool taskset.
Note that each subprocess/thread needs pinning if a range of
cores is isolated, since otherwise migrations might happen.
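Pinning can also be done from within a program instead of with taskset. The sketch below expands a kernel-style CPU list (the format used by isolcpus, e.g. "1-3,5") and pins a process to it; the function names are hypothetical, and `os.sched_setaffinity` is only available on Linux.

```python
import os


def parse_cpulist(text):
    """Expand a kernel-style CPU list such as '1-3,5' into [1, 2, 3, 5]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def pin_to(cpus, pid=0):
    """Pin process `pid` (0 means the calling process) to the given CPUs."""
    os.sched_setaffinity(pid, cpus)
    return sorted(os.sched_getaffinity(pid))
```

For example, `pin_to(parse_cpulist("2"))` restricts the calling process to core 2. Child processes inherit the affinity at creation, but processes or threads that already exist have to be pinned individually, in line with the remark above.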
Interrupts: The affinity of interrupts was set to the
non-isolated cores, preventing them from skewing the
measurements⁴. Thermal interrupts were prevented by throttling the
CPU speed such that full load does not lead to critical
temperatures. This can be done by setting the governor's limits
in sysfs or with tools like cpufreq. Machine check exceptions were
disabled by kernel boot parameter mce=off.
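Interrupt affinities in procfs are expressed as hexadecimal CPU bitmasks. The helpers below are a small sketch for converting between a list of cores and such a mask; the function names are hypothetical, and the four-core layout in the usage note is an assumption for illustration.

```python
def cpus_to_mask(cpus):
    """Encode a list of CPU numbers as a hexadecimal affinity mask string."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def mask_to_cpus(mask_text):
    """Decode a hexadecimal affinity mask (commas between groups allowed)."""
    mask = int(mask_text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]
```

On a hypothetical four-core machine with cores 1 to 3 isolated, `cpus_to_mask([0])` gives "1", which is the value to write (as root) to the affinity files so that interrupts are served by CPU0 only. Some interrupts cannot be moved, in which case the write is rejected by the kernel.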
Ensuring tickless operation: To minimize the number of
context switches, a kernel with adaptive ticks should be
used, and enabled with kernel boot parameter nohz_full.
The remaining challenge is to keep kernel threads off the
run queue, to prevent scheduler ticks from getting generated.
There are several factors that can cause runnable kernel
threads, which are discussed in [15]. One such factor is
RCU callbacks, which should be offloaded from the isolated CPUs
with kernel boot parameter rcu_nocbs. Another is the
bottom halves of interrupts discussed earlier, which can
be kept off the isolated cores by setting the interrupt affinity.
⁴ /proc/irq/default_smp_affinity
Allocator Selection: Yet another reason why tickless
operation may fail is the kernel's memory allocator.
Depending on the distribution, different algorithms (called "SLAB",
"SLOB" or "SLUB") can be in use, with different impact on
cache miss/hit metrics and execution time. Notably, the SLAB
allocator requires periodic cleanups, which activate a kernel
thread and prevent tickless operation. The best option is
the SLUB allocator, which is the newest and the one most
commonly used in desktop distributions (one exception is Debian,
using SLAB and thus rendering adaptive ticks ineffective).
Changing the allocator requires building a custom kernel.
Watchdogs: More context switches could be caused by the
watchdog. This can be prevented with kernel boot
parameters nowatchdog and nosoftlockup.
Real-Time Priority: Depending on the Linux version, some
kernel threads [15] are still active on isolated cores and can
cause unwanted context switches (e.g., kworker). To avoid
this, we perform our measurements at real-time priority us-
ing the command line tool chrt. Additionally, real-time throt-
tling must be disabled (by setting sched_rt_runtime_us in
procfs), since otherwise some Linux versions force one invol-
untary context switch per second for CPU-bound tasks.
Summary: The shown configuration results in no other user
processes running on the isolated cores, and in no CPU
migrations of our process(es) taking place. Some interrupts may
still be left, but these should only cause mode switches. This
can be verified using perf⁵.
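The boot parameters named in this section can be collected into one kernel command line and compared against the running system. The following is a sketch only: the helper names and the default core range "1-3" are assumptions, and the list covers just the parameters mentioned above.

```python
def isolation_cmdline(isolated="1-3"):
    """Compose the boot parameters from this section for the given CPU list."""
    params = [
        f"isolcpus={isolated}",   # keep the scheduler off these cores
        f"nohz_full={isolated}",  # adaptive ticks on these cores
        f"rcu_nocbs={isolated}",  # offload RCU callbacks to other cores
        "mce=off",                # no machine check exceptions
        "nowatchdog",
        "nosoftlockup",
    ]
    return " ".join(params)


def missing_params(expected, actual_cmdline):
    """Return the expected parameters that are absent from a kernel command line."""
    present = set(actual_cmdline.split())
    return [param for param in expected.split() if param not in present]
```

Passing the content of /proc/cmdline as `actual_cmdline` shows which of the recommended parameters the running kernel was not booted with. An empty list does not prove that isolation works; the perf check mentioned above is still needed.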
4.1.2 Fixing Processor Speed
If the processor supports speed stepping/frequency scaling,
then the clock speed of the processor should be fixed for
two reasons. First, if the absolute execution time is part of
the measured metrics, a non-fixed clock speed could yield
different results in subsequent runs, depending on how the
governor selects the frequency. Second, we conservatively
prevent a potential impact of frequency scaling on processor
cycles; usually there is not enough information in the data
sheets to conclude there is no influence.
Further, fixing the clock speed also ensures more stable
DRAM access times when measured in processor clock cycles.
DRAM runs at its own bus frequency, which means that a
higher CPU frequency will make DRAM bottlenecks stand
out more. This is important for backend-bound benchmarks,
where the bottleneck becomes more severe with increasing
processor frequency.
Fixing the processor speed is best done in BIOS by dis-
abling technologies like speed stepping. However, it has to
be ensured that the processor has sufficient cooling under
full load, which often is not the case for mobile or laptop
devices. Alternatively, for Intel CPUs, the older ACPI driver can
be activated with boot parameter intel_pstate=disable,
which supports fixing the frequency of the processor.
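Whether the frequency limits are actually pinned to one value can be checked through the cpufreq entries in sysfs. This is a minimal sketch assuming the standard cpufreq layout; the function name and the `sysfs_root` parameter (which exists only to make the function testable) are hypothetical.

```python
from pathlib import Path


def frequency_is_fixed(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return True if the governor's lower and upper limits (in kHz) coincide."""
    cpufreq = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
    low = int((cpufreq / "scaling_min_freq").read_text())
    high = int((cpufreq / "scaling_max_freq").read_text())
    return low == high
```

Equal limits only show that the governor is not allowed to change the frequency; thermal throttling by the hardware can still occur, which is why sufficient cooling remains a precondition.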
4.1.3 Controlling TLB Flushes and Shootdowns
Even though we have isolated cores from the scheduler,
shootdowns may still happen because the Linux kernel runs
on the same core as the process it is serving. Thus, every
core that is used executes a part of the kernel at some point,
where shared kernel data can lead to shootdowns.
In our kernel version KPTI is enabled, yet PCID support is
not available. Therefore, each mode switch flushes the TLB,
significantly skewing our measurements (see Section 2.2.4).
We thus disabled KPTI with kernel boot option pti=off⁶.
This step is not recommended when PCIDs are supported by
both OS and CPU.
4.1.4 Controlling Page Faults
As earlier, we assume the user wants to avoid major page
faults as far as possible, since they have a penalty large
enough to hide other effects that may be of interest. Also,
there is not much the user can do if access to secondary
storage is logically required, unless the program under
analysis is redesigned.
In Linux, reading or writing data files is by default buffered
through the virtual memory. A read-ahead heuristic is used
to bring data from the slow disk to RAM, as soon as a
demand is foreseen. However, while this mechanism prevents
major page faults quite effectively, it causes DMA transfers
and might not be desirable. Once a file is buffered in virtual
memory ("page cache"), data can be accessed without disk
access taking place. Obviously, the page cache can only
prevent all disk access if it is large enough to hold all files that
will be in use, and is not flushed in between.
⁵ perf record -e "sched:sched_switch" ...
⁶ Note that this may open a security vulnerability!
The easiest way to make use of the page cache is to run the
measurement twice, and discard the results from the first
run. Alternatively, there exist tools which allow checking
and manipulating the page cache, e.g., vmtouch.
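The "run twice, discard the first" approach can be sketched as a small harness that also reports the major page faults of each retained run, so that a cold page cache does not go unnoticed. The function names are hypothetical, and the timing in seconds merely stands in for whatever metric is actually measured.

```python
import resource
import time


def run_once(workload):
    """Run workload() once; return (elapsed seconds, major page faults)."""
    faults_before = resource.getrusage(resource.RUSAGE_SELF).ru_majflt
    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start
    faults = resource.getrusage(resource.RUSAGE_SELF).ru_majflt - faults_before
    return elapsed, faults


def measure_warm(workload, runs=1):
    """Discard one warm-up run (which fills the page cache), then measure."""
    run_once(workload)  # warm-up: result intentionally thrown away
    return [run_once(workload) for _ in range(runs)]
```

If a retained run still reports major faults, the page cache was too small or was flushed in between, and the measurement should be treated with suspicion.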
4.1.5 Controlling Hardware Data Prefetching
Although it is possible to distinguish some events by their
cause (i.e., demand vs. speculation), it cannot be known
how much of the prefetcher-induced penalties are hidden
in out-of-order processing, or how many of certain events
are caused by the software directly. One shortcut to answer
this question is to turn off hardware prefetching and run the
benchmarks twice, then compare results. This is specific to
every processor family, and may not be possible on all CPUs.
For Intel CPUs, machine-specific registers (MSRs) allow con-
figuring the processor in many ways, inter alia to turn off
prefetchers.
4.1.6 Controlling Influences of Peripherals
If peripherals are sharing memory with the software under
analysis, it might be helpful to turn them off or minimize
their impact, in case an influence on the results cannot be
ruled out. For example, the graphics engine in Sandy Bridge
can be made to relinquish parts of the L3 cache by booting
into a low-resolution mode.
4.2 Taking Measurements
The act of taking measurements itself is straightforward, and
should provide meaningful results if the previous recommen-
dations have been followed. We therefore summarize this
only briefly.
4.2.1 Performance Monitoring Units
While there exist many different ways and tools to measure
performance, such as hardware tracing, we
briefly describe the use of the CPU's performance monitoring
unit (PMU) [12], which provides hardware event counters.
These counters are processor-specific registers that are in-
cremented on the occurrence of certain events, for example,
the number of L1 cache misses, or the number of processor
cycles. Under Linux, the perf tool allows setting up and
reading these registers, and separating event counts by process.
A growing number of CPUs and SoCs offer an equivalent to
the PMU, e.g., many ARM Cortex processors already do.
In addition to hardware events, perf also reads kernel
(software) counters, such as minor and major page faults,
context switches and CPU migrations. Note: We highly
recommend using the native names of the registers as opposed
to perf's names, to avoid misunderstandings (e.g., the perf
tool counts STLB hits under the name ITLB loads on Sandy
Bridge), and watching out for errata (such as counter
problems under HyperThreading).
Multiplexing and Grouping: PMUs have a limited number of
hardware counters (eight, in our case). If we request more
events than counters, perf starts time-multiplexing, and
extrapolates the results from the sampling window to the
entire lifetime of the process. It is thus possible to miss
certain events if the specific counter is not currently ac-
tive. Therefore, if the process behavior is not stationary for
long enough, the results may appear inconsistent. One way
around this issue is grouping of counters, which tries to
multiplex all counters of a group at the same time. How-
ever, some processors have scheduling restrictions for which
counters must not be used together. One way around this is
to avoid multiplexing at all, and execute the program multi-
ple times instead. This only makes sense if the workload is
repeatable. With Boolean options for multiplexing (M) and
grouping (G), there are four different possibilities to measure,
each with its own consequences:
1. Grouping and Multiplexing: Possibly unsafe and incom-
plete results, because the process might be undersam-
pled from M., and G. may create scheduling conflicts,
which disables some counters.
2. No Grouping and Multiplexing: Possibly unsafe but
complete, because the process might be undersampled
and counts inconsistent towards each other, but groups
can be built in arbitrary ways to resolve scheduling
conflicts.
3. No Multiplexing and no Grouping: Possibly unsafe but
complete, only works for repeatable workloads. Multi-
plexing, although not enabled, might still take place
implicitly in an attempt to resolve conflicting counters.
4. No Multiplexing and Grouping: Safe but possibly in-
complete, only works for repeatable workloads. The
grouping prevents implicit multiplexing from taking
place.
By unsafe we mean that the counts may be both imprecise
and contradict each other (e.g., it might be possible that
the L2 miss count is greater than the L3 access count). By
complete we mean that every event that has been requested
was indeed counted at some point in time.
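The extrapolation that takes place under multiplexing can be written down in one line: the raw count is scaled by the ratio of the time the event was enabled to the time it actually occupied a hardware counter. The sketch below illustrates this scaling; the function name is hypothetical.

```python
def extrapolate(raw_count, time_enabled, time_running):
    """Scale a multiplexed counter value to the full enabled time.

    Returns None if the event never occupied a hardware counter,
    which corresponds to an "incomplete" result in the sense above.
    """
    if time_running == 0:
        return None
    return raw_count * time_enabled / time_running
```

For instance, an event that counted 1000 occurrences while being active for a quarter of the time is reported as 4000. The estimate is only accurate if the process behaves the same while the counter is inactive, which is exactly the stationarity assumption discussed above.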
Recently, weak groups have been introduced in perf [17],
which can be broken to resolve scheduling conflicts, but oth-
erwise ensure that certain events are counted synchronously.
It is thus a mixture of cases 1 and 2, resulting in complete
results with minimal multiplexing artifacts, applicable to
non-repeatable workloads.
We suggest using grouping and avoiding multiplexing as
long as the workload is repeatable, to obtain self-consistent
and precise results. Otherwise, grouping and multiplexing
should be used, preferably using weak groups.
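On the perf command line, a group is written as a brace-enclosed event list, and a weak group carries the `:W` modifier. The helper below is a sketch that assembles such an invocation; the function name is hypothetical, and the generic event names in the usage note are placeholders for the native names recommended earlier.

```python
def perf_stat_argv(groups, command, weak=False):
    """Build a `perf stat` argument list with one {...} group per event list."""
    suffix = ":W" if weak else ""
    events = ",".join("{" + ",".join(group) + "}" + suffix for group in groups)
    return ["perf", "stat", "-e", events, "--"] + list(command)
```

For example, `perf_stat_argv([["cycles", "instructions"]], ["./bench"], weak=True)` yields an argument list that can be passed to `subprocess.run`. To avoid multiplexing altogether, each group should contain no more events than there are hardware counters, and the program is run once per group.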
For the system described in Tab. 1, the CPU supports
around 475 different events, offering deep insights into the
processor. However, looking at single counter values can be
misleading. Counter values can differ by several orders of
magnitude (e.g., page faults vs. cache misses), yet have
a similar impact on the resulting performance. Especially
when comparing two versions of the same program, large
differences may become meaningless when compared to the
absolute values.
4.2.2 Hierarchical Bottleneck Analysis
To identify the true bottleneck of an application, Yasin has
proposed a hierarchical analysis [33]. The analysis follows
the hierarchical anatomy of general OoO processors,
depicted in Fig. 4. As a first level, the user should only be
concerned with whether the majority of cycles is spent in the
Frontend (entails decoding instructions, L1i access and
microcode switches), in Bad Speculation (machine clears and
branch mispredictions), in the Backend (L1/2/3 and DRAM
data access), or in retiring
instructions (ideally, 100%). Once the most time-consuming
category is identified, the user should focus on its children
at the next level, determine the most prominent one, and so
on. For a full explanation, the reader is referred to [33].
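The drill-down itself is a simple descent through a tree: at each level, pick the child with the largest share of cycles and continue there. The sketch below illustrates this on a nested dictionary; the tree layout, the key names "share" and "children", and all numbers in the usage note are hypothetical and not taken from [33] or from our measurements.

```python
def find_bottleneck(node):
    """Follow the largest share at each level; return the path of category names."""
    path = []
    while node.get("children"):
        name, node = max(node["children"].items(),
                         key=lambda item: item[1]["share"])
        path.append(name)
    return path
```

With made-up shares of 0.15 for Frontend, 0.05 for Bad Speculation, 0.30 for Retiring and 0.50 for Backend, where Backend in turn is dominated by DRAM, the function returns `["Backend", "DRAM"]`. If Retiring turns out to be the largest category, there is no bottleneck in the sense of this analysis.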
Figure 4. Hierarchical View on Processor Performance [33].
A free implementation of this analysis for Intel CPUs is
available in a tool called toplev, which is part of Intel’s
pmu-tools [18]. It takes measurements using Linux’ perf
tool, and presents the results in the explained hierarchy,
but in ratios rather than absolute numbers. Multiplexing
and grouping are supported as described earlier. Taking the
measurements thus boils down to invoking the tool.
Only after the bottleneck has been identified should individual
counters be inspected using perf, since now we know
they are meaningful. Last but not least, it is worth noting that
pmu-tools also includes a convenience wrapper for perf,
called ocperf. This tool allows using native names for Intel
CPUs, and has better support for uncore events.
It should be noted that toplev does not allow localizing
causes of undesired behavior in the source code, since counters
operate cumulatively and are not associated with specific
locations in the source code of a program. A method for
localization is provided by perf in sampling mode, where
a history of events can be logged and annotated in the source
code. The tool offers many options, which are explained in the
documentation [13].
4.2.3 Uncertainty Propagation
All measured events can be associated with a measurement
uncertainty. This uncertainty indicates the precision of the
measured values, and prevents us from drawing false conclusions
when comparing two versions of a program.
Crucially, not all events can be measured directly on the
hardware; some must be calculated from other events, which are
either measured or themselves calculated. The calculated events
therefore have an uncertainty based on their constituents,
which needs to be properly tracked. As an example, on Sandy
Bridge the cache hit ratio r = h/a must be estimated from the
access counter a and the hit counter h, and its standard
deviation σr depends on both measurements as follows:
$$\sigma_r = \frac{h}{a} \sqrt{\left(\frac{\sigma_h}{h}\right)^2 + \left(\frac{\sigma_a}{a}\right)^2 - \frac{2\sigma_{a,h}}{a h}}, \qquad (1)$$
where σa,h is the covariance between accesses and hits. Anal-
ogously, the uncertainty is propagated for multiplication,
division, addition, and all other operations [20].
We have contributed patches to toplev, which track the
standard deviation of counters across multiple runs (--repeat)
through all calculations, while assuming statistically
independent variables, i.e., σa,h = 0. The resulting uncertainty
for each event is indicated with error bars in our plots.
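As a worked illustration of Eq. (1), the following sketch
propagates the uncertainty of hits and accesses into the hit
ratio. It is not the code contributed to toplev; the function
name ratio_variance and its parameters are hypothetical. It
works with variances (squared standard deviations), so the
standard deviation σr is the square root of the returned value.

```c
#include <assert.h>

/* Variance of r = h / a, i.e., the square of Eq. (1).
 * var_h and var_a are the variances of hits and accesses,
 * cov_ah is their covariance (0 for independent variables).
 * The relative form requires h != 0 and a != 0. */
double ratio_variance(double h, double var_h,
                      double a, double var_a,
                      double cov_ah)
{
    assert(h != 0.0 && a != 0.0);
    double r = h / a;
    double rel = var_h / (h * h)          /* (sigma_h / h)^2 */
               + var_a / (a * a)          /* (sigma_a / a)^2 */
               - 2.0 * cov_ah / (a * h);  /* covariance term */
    return r * r * rel;
}
```

For example, with the hypothetical values h = 90 (σh = 3) and
a = 100 (σa = 4) and independent variables,
ratio_variance(90, 9, 100, 16, 0) yields about 0.0022, that is,
σr of roughly 0.047 for a hit ratio of 0.9. If hits and accesses
vary together, the covariance term reduces the result, so
assuming σa,h = 0 is the conservative choice for positively
correlated counters.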
Warning: When measurements are taken in system-wide
mode across more than one logical processor core,
toplev currently forces perf to not aggregate the results
(flag -A). This in turn suppresses perf’s output of the
standard deviation, and leads to an optimistic uncertainty.
System-wide mode is also forced when HyperThreading (HT) is
activated; thus, HT suppresses the standard deviation as well.
There are thus three options to capture measurement
uncertainty correctly: (1) Recommended: Disable HT and avoid
system-wide mode. The perf tool will “follow” the software
under analysis to whichever core it runs on (if not pinned).
(2) Alternative 1: Disable HT and use system-wide mode
while filtering for one single core. The software
under analysis must be pinned to that single core.
(3) Alternative 2: Keep HT and use system-wide mode, and specify
--single-thread, which forces toplev to aggregate the
results across all CPUs. The system should otherwise be idle. In
all cases, the core on which the software is running should
be isolated, as explained in Section 4. Future releases of
perf/toplev might no longer have this caveat.
5 Experiments
We provide a few short examples illustrating the difference
between our proposed measurement setup, and a default
Linux environment. We have chosen three programs with
different characteristics:
• gnugo: This is a game engine for the Chinese game of Go.
The program consists of many branches and jumps,
many of which are hard to predict. Additionally, it has
a low spatial locality of instructions, and will therefore
likely suffer from many instruction cache misses.
Memory usage is low compared to the next program.
• stream: This is McCalpin’s memory bandwidth bench-
mark [23], which stresses the memory subsystem heav-
ily, but in a predictable pattern. Consequently, this
program occupies a lot of memory (1.1GB in our case),
but has few branches or jumps.
• syscalls: This is the program shown in Appendix A,
which exercises syscalls. The memory usage is low
compared to the others, but it causes many mode
switches and thus allows more context switches to
take place. This program should keep the functional
units busy.
All programs have been executed five times in a row to al-
low determining a standard deviation, which is then used for
our uncertainty propagation. We have repeated this for two
different measurement setups; first with the default settings
of the system described in Tab. 1, and another time with our
proposed measurement setup described in Section 4.
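How a spread can be derived from repeated runs is sketched
below. This is a generic estimator chosen for illustration (the
unbiased sample variance), not necessarily the exact computation
performed by perf or toplev; the names run_stats and
summarize_runs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct run_stats {
    double mean;
    double variance; /* unbiased sample variance, divides by n - 1 */
};

/* Summarize one counter over n repeated runs (n >= 2). */
struct run_stats summarize_runs(const double *counts, size_t n)
{
    assert(counts != NULL && n >= 2);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += counts[i];
    double mean = sum / (double)n;

    double sq_dev = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = counts[i] - mean;
        sq_dev += d * d;
    }

    struct run_stats s = { mean, sq_dev / (double)(n - 1) };
    return s;
}
```

The standard deviation is the square root of the returned
variance. With only five runs the estimate is itself coarse,
which is one reason to keep the workload repeatable.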
All measurements have been taken with similar parameters,
using the performance monitoring counters (see
Section 4.2.1). Specifically, multiplexing and HyperThreading
were disabled, to prevent undersampling and to ensure proper
uncertainty propagation, as explained in Section 4.2.3.
5.1 Hierarchical Bottleneck Analysis
Figure 5 shows the results of the first level of the hierarchical
analysis using toplev. It can be seen that the measurements
of all programs reflect the expected behavior: gnugo indeed
spends a lot of time in its frontend (FE), due to many cache
misses and branch mispredictions (see Fig. 6). The stream
benchmark spends most of its time in the backend (BE),
because it is mostly memory-bound (see Fig. 7). The syscalls
program shows itself to be mainly backend-heavy, but this
time due to core usage (see Fig. 8).
The difference between the default setup and ours is however
barely visible for this type of analysis. Only syscalls
shows a larger difference. The results with our proposed
setup make the program appear less back-end bound, in
exchange for more time spent retiring instructions. This
hierarchical analysis does not offer any more explanation, and
thus we will revisit this in the next section.
[Figure 5: bar chart of the ratio (%) of FE, BAD, BE, and RET,
from 0 to 100, for gnugo-dfl, gnugo-our, stream-dfl, stream-our,
syscalls-dfl, and syscalls-our.]
Figure 5. Results of hierarchical performance analysis for
default setup (dfl) and our proposed setup (our).
All figures have error bars, but only very few of them
are visible due to their low magnitude. This suggests that
a default setup may be sufficient in terms of measurement
uncertainty, and that the results of a hierarchical analysis
may be close enough between the two setups, at least for the
programs tested here. This is surprising, because the absolute
values of the counters, as well as program performance, differ
significantly, as we show next.
5.2 Absolute Event Counts and Absolute Performance
The hierarchical analysis did not show any large differences
between the measurement setups. It builds on ratios
rather than absolute values, so large differences can
become invisible. While this is acceptable and even desirable
for bottleneck analysis, it might be undesirable in other
scenarios, such as when absolute performance or precise
event counts are required. These absolute counts differ
significantly, as we show in the following.
Figure 9 shows the absolute event counts of gnugo for our
proposed setup (“tune”) and for the default one. First, it can
be seen that the program runs faster under the default setup.
The bar “task-clock (msec)” shows 11.4s vs 12.9s; note that
task-clock only increments for those cycles where the program
has been active on the processor, and is therefore always less
than or equal to the wall-clock time. As a better metric
reflecting wall-clock time, all plots show a bar called “IPC0”,
which is calculated as I/(f · tw), where I is the number of
instructions, f the processor frequency, and tw the wall-clock
time of the program. In Fig. 9, IPC0 is almost identical for both
measurement setups, thus the program would take about
the same time to execute in both setups.
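The definition of IPC0 can be written down directly. The
following sketch assumes that the frequency is given in Hz and
the wall-clock time in seconds; the function name ipc0 and all
example values are hypothetical and not taken from our
measurements.

```c
#include <assert.h>

/* IPC0 = I / (f * tw): instructions per cycle of wall-clock time.
 * Unlike the usual IPC (instructions / counted cycles), the
 * denominator also covers time during which the program was
 * not running on the processor. */
double ipc0(double instructions, double freq_hz, double wall_s)
{
    assert(freq_hz > 0.0 && wall_s > 0.0);
    return instructions / (freq_hz * wall_s);
}
```

For example, a program that retires 6e9 instructions on a 3 GHz
processor within 2 s of wall-clock time has an IPC0 of 1.0. If
the same program is kept off the processor for another 2 s, its
IPC0 drops to 0.5, although its task-clock does not change.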
The next apparent difference is the absence of context
switches (one switch is always necessary) and CPU migra-
tions in our setup, followed by no major faults in our setup,
and a different L1d caching behavior. All in all, the measure-
ment uncertainty again is comparable when the counts are
close.
The results for stream and syscalls are shown in Appendix B.
Again, there are some differences in the absolute
event counts. This time, IPC0 shows that stream runs faster
in our setup (1 − 0.6/0.8 = 25%), whereas syscalls is
significantly slower (0.68/0.15 ≈ 450%), as expected due to the
higher overhead for mode switches. Furthermore, syscalls
also shows large differences in most counters. Its entire
characteristics seem to be driven by the differences in the
mode switch cost, which even outweigh the fact that the default
setup performs many more context switches and CPU migrations
than our proposed setup.
In conclusion, the measurement setup can make a large
difference in absolute event counts and software performance.
Our experiments suggest that neither setup is always
dominating in performance, and that the choice of setup should
depend on the software under analysis.
5.3 Stability towards Background Processes
So far, all measurements have been taken on an idle system,
with no interactive user processes being executed. We now
examine how measurements change when we run some
background processes, as is often the case in production systems.
Specifically, we run four background processes which stress
caches and DRAM (stress-ng -C2 --vm 2 --vm-populate
--vm-bytes=512m).
5.3.1 Hierarchical Analysis
Figure 10 shows the results of a hierarchical analysis, which
confirm that all measurements under all setups are stable.
There are small differences, which however do not change
the results fundamentally. Again, the measurement uncertainty
is, surprisingly, always very low. The results suggest
that the hierarchical analysis is also robust against
background processes, at least at the uppermost level of the
hierarchy.
5.3.2 Absolute Event Counts and Absolute Performance
The absolute event counts give a similar picture as before.
However, the wall-clock time of the programs is considerably
more stable in our setup.
FE Frontend_Bound: 43.94 +- 0.02 % Slots <==
BAD Bad_Speculation: 14.16 +- 0.01 % Slots
RET Retiring: 35.07 +- 0.01 % Slots below
FE Frontend_Bound.Frontend_Latency: 22.77 +- 0.03 % Slots
FE Frontend_Bound.Frontend_Bandwidth: 21.13 +- 0.03 % Slots
BAD Bad_Speculation.Branch_Mispredicts: 14.07 +- 0.01 % Slots
FE Frontend_Bound.Frontend_Latency.ICache_Misses: 17.66 +- 0.00 % Clocks_Estimated
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 14.07 +- 0.01 % Clocks_Estimated
Figure 6. toplev output for gnugo with our proposed measurement setup.
FE Frontend_Bound: 5.62 +- 0.00 % Slots below
BAD Bad_Speculation: 0.24 +- 0.00 % Slots below
BE Backend_Bound: 72.20 +- 0.00 % Slots
RET Retiring: 21.94 +- 0.00 % Slots below
BE/Mem Backend_Bound.Memory_Bound: 52.55 +- 0.02 % Slots <==
BE/Core Backend_Bound.Core_Bound: 19.65 +- 0.02 % Slots
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 0.00 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L2_Bound: 1.73 +- 0.02 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L3_Bound: 17.80 +- 0.03 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.DRAM_Bound: 36.13 +- 0.04 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 10.27 +- 0.01 % Stalls below
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 21.92 +- 0.03 % Clocks
Figure 7. toplev output for stream with our proposed measurement setup.
FE Frontend_Bound: 21.40 +- 0.05 % Slots
BE Backend_Bound: 47.52 +- 0.08 % Slots
RET Retiring: 26.31 +- 0.01 % Slots below
FE Frontend_Bound.Frontend_Latency: 12.94 +- 0.03 % Slots below
BE/Mem Backend_Bound.Memory_Bound: 0.00 +- 0.00 % Slots below
BE/Core Backend_Bound.Core_Bound: 47.52 +- 0.08 % Slots
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 39.88 +- 0.20 % Clocks <==
Figure 8. toplev output for syscalls with our proposed measurement setup.
Figure 11 shows again absolute event counts for both
measurement setups of gnugo, when background processes
are running. A large change in IPC0 can be seen. In our
setup, the program almost maintains the same performance
as on an idle system (4% degradation, since the IPC0 drops
from 1.25 to 1.21), whereas in the default setup performance
significantly degrades (64% degradation, since IPC0 drops
from 1.22 to 0.45). The result is similar for stream, which
degrades by 16% in our setup, but by 67% in the default setup,
and for syscalls, which degrades by 60% in the default setup,
but only by 2% in our setup.
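The degradation figures are relative drops in IPC0. A minimal
sketch of this arithmetic follows; the function name
degradation_percent is hypothetical.

```c
#include <assert.h>

/* Relative loss in IPC0, in percent, when going from an idle
 * system to one with background load. */
double degradation_percent(double ipc0_idle, double ipc0_loaded)
{
    assert(ipc0_idle > 0.0);
    return 100.0 * (ipc0_idle - ipc0_loaded) / ipc0_idle;
}
```

Applied to the two-decimal IPC0 values quoted above,
degradation_percent(1.22, 0.45) gives about 63% and
degradation_percent(1.25, 1.21) about 3%. Because the quoted
IPC0 values are rounded, results recomputed this way can deviate
by roughly a percentage point from the percentages stated in the
text.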
In summary, the absolute event counts are considerably
different in our setup, and additionally the performance of
the program under background load is significantly better
maintained in our setup.
5.4 Other Influences
When performance counters are used, there is a low but
visible overhead, as with many other measurement methods.
We have neglected this overhead in our experiments.
6 Related Work
Hardware Performance Counters: Since all performance
counters are vendor- and processor-specific, different
profiling software exists to access them. A generic tool is Linux’
perf, which also supports many AMD, ARM, ARC,
Blackfin, MIPS, PowerPC and SPARC processors [30].
Specific to Intel CPUs, we have toplev, which we have
mentioned earlier [18], and the commercial tool VTune.
Interestingly, some processors also provide counters for power
events [27], for which our setup might be equally important.
Software-based Measurements: There exist many well-
known tools for debugging software performance. Perhaps
best known are software profilers, such as gprof (execution
time) and Visual Studio Profiler. More generally, there ex-
ists the larger class of dynamic program analyzers, which
observe programs during run-time, sometimes by instru-
mentation, other times through simulation. A survey about
those was recently given in [10]. One successful framework
worth mentioning is Valgrind [26], which provides tools
to track execution time and memory usage. However, such
tools change the program’s characteristics through instru-
mentation or simulation, and furthermore do not provide
any microarchitectural explanations. Finally, we should also
mention ftrace, which can collect software events during
the execution of a program, and is integrated with perf.
Reducing OS Noise: Another attempt at reducing the OS’
impact on workloads has been described by Akkan et al. [1].
However, their report is mainly concerned with the OS, and
furthermore neglects some important effects (e.g.,
HyperThreading, ftrace overhead and tick length). Nevertheless,
some pointers can be found there. For technical details on
various aspects of Linux, the Linux Kernel Mailing List (LKML),
the kernel docs, and LWN have been invaluable primary
sources for understanding the inner workings of the kernel.
We have given the references where applicable, and many
more can be found by the interested reader. Last but not
least, to erase even the faintest doubt of imprecise or
deprecated documentation, the Linux source code itself serves as
a definitive reference.
[Figure 9: bar chart titled “Event counter comparison for
"gnugo -b5 -r10 --level 17"”; y-axis: event count, normalized;
x-axis: measured events (task-clock, cycles, instructions, IPC0,
CPU migrations, context switches, page faults, and branch, TLB,
cache, and DRAM events); series: default/gnugo.perf.all and
tune/gnugo.perf.all.]
Figure 9. Absolute event counts for gnugo for default setup (default) and our proposed setup (tune). Red error bars indicate
measurement uncertainty.
[Figure 10: bar chart of the ratio (%) of FE, BAD, BE, and RET,
from 0 to 100, for gnugo, stream, and syscalls, each as -dfl,
-dfl-stress, -our, and -our-stress.]
Figure 10. Results of hierarchical performance analysis with
and without highly active background processes, for default
setup (dfl) and our proposed setup (our).
7 Concluding Remarks
We have described the most important features of both the
Linux operating system and a modern out-of-order super-
scalar processor, and how they influence software perfor-
mance. Based on the identified influences, we have proposed
a measurement setup that aims to reduce measurement er-
rors, such that not the OS or hardware is characterized, but
mainly the program under analysis. Towards that, we have
extended a hierarchical performance analysis method with
uncertainty propagation, to indicate the measurement errors.
Surprisingly, our experiments showed that for this
hierarchical and ratio-based performance analysis method, the
measurement setup makes little difference. In contrast, when
absolute event counts are of interest, e.g., when cache
access counts shall be correlated to source code, our setup has
shown significant improvements.
[Figure 11: bar chart titled “Event counter comparison for
"gnugo -b5 -r10 --level 17" under stress”; y-axis: event count,
normalized; x-axis: measured events (task-clock, cycles,
instructions, IPC0, CPU migrations, context switches, page
faults, and branch, TLB, cache, and DRAM events); series:
default/gnugo_stress_vm.perf.all and tune/gnugo_stress_vm.perf.all.]
Figure 11. Absolute event counts for gnugo for default setup (default) and our proposed setup (tune) with background processes.
Red error bars indicate measurement uncertainty. Large performance differences are visible.
As a side effect, we have found that the proposed setup is
more immune to background processes than a default setup.
Programs show considerably better stability in their observ-
able performance, whereas a performance degradation of up
to 64% has been observed in a default setup. Our proposed
setup can therefore be used as a guideline to tune the system
for high-performance applications.
Naturally, the results presented here may change with
different processors and OS versions. We have given refer-
ences for the reader to evaluate which parts of the setup
may become relevant, but an extrapolation of the presented
results to all targets would be unsafe.
In conclusion, the ultimate measurement setup does not
exist. It depends on both the requirements for the
measurements and the characteristics of the software under analysis.
For more robust measurements, we recommend our proposed
setup. This may however degrade performance for some
programs, and thus a subset of the presented configurations
might be chosen to achieve optimal results.
References
[1] Hakan Akkan, Michael Lang, and Lorie Liebrock. 2013. Understanding and isolating the noise in the Linux kernel. The International Journal of High Performance Computing Applications 27, 2 (2013), pp. 136–146.
[2] Roberto Arcomano. 2018. TLDP Kernel Analysis HowTo. Online. https://www.tldp.org/HOWTO/KernelAnalysis-HOWTO-3.html, retrieved 2018 Oct 30.
[3] ARM Ltd. 2016. Cortex-A57 Software Optimization Guide (ARM UAN 0015B ed.). ARM Ltd.
[4] At32Hz. 2017. Sandy Bridge Block Diagram. Online. https://en.wikichip.org/wiki/File:sandy_bridge_block_diagram.svg, retrieved 2018 Oct 30.
[5] Jonathan Corbet. 2017. KAISER: hiding the kernel from user space. Online. https://lwn.net/Articles/738975/, retrieved 2018 Oct 30.
[6] Jonathan Corbet. 2018. Meltdown strikes back: the L1 terminal fault vulnerability. Online. https://lwn.net/Articles/762570/, retrieved 2018 Oct 30.
[7] Ulrich Drepper. 2007. What every programmer should know about memory. Red Hat, Inc Vol. 11 (2007).
[8] Agner Fog. 2018. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers. Online. Retrieved 2018 Sept 3.
[9] Mel Gorman. 2007. Understanding the Linux Virtual Memory Manager. Open Publication License. Available at https://www.kernel.org/doc/gorman/html/understand/.
[10] Anjana Gosain and Ganga Sharma. 2015. A survey of dynamic program analysis techniques and tools. In Proc. International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA). Springer, pp. 113–122.
[11] Nur Hussein. 2017. Another attempt at speculative page-fault handling. Online. https://lwn.net/Articles/730531/.
[12] Intel Corp. 2018. Intel 64 and IA-32 Architectures Optimization Reference Manual. Intel Corp.
[13] kernel.org. 2018. Linux kernel profiling with perf. Online. https://perf.wiki.kernel.org/index.php/Tutorial.
[14] kernel.org. 2018. NO_HZ: Reducing Scheduling-Clock Ticks. Online. https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt, retrieved 2018 Oct 30.
[15] kernel.org. 2018. Reducing OS jitter due to per-cpu kthreads. Online. https://www.kernel.org/doc/Documentation/kernel-per-CPU-kthreads.txt, retrieved 2018 Oct 30.
[16] kernel.org. 2018. Software Interrupt Context: Softirqs and Tasklets. Online. https://www.kernel.org/doc/htmldocs/kernel-hacking/basics-softirqs.html, retrieved 2018 Oct 30.
[17] Andi Kleen. 2017. Support standalone metrics and metric groups for perf. Online. https://lwn.net/Articles/732567/, retrieved 2018 Oct 30.
[18] Andi Kleen. 2018. pmu-tools. GitHub repository.
[19] Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. 2018. Spectre attacks: Exploiting speculative execution. arXiv preprint arXiv:1801.01203 (2018).
[20] Harry H Ku et al. 1966. Notes on the use of propagation of error formulas. J. Res. Nat. Bur. Standards 70, 4 (1966).
[21] Chuanpeng Li, Chen Ding, and Kai Shen. 2007. Quantifying the cost of context switch. In Proc. Workshop on Experimental Computer Science. ACM, p. 2.
[22] Andy Lutomirski. 2017. PCID and improved laziness. Online. https://lkml.org/lkml/2017/6/14/16, retrieved 2018 Oct 30.
[23] John D McCalpin. 1995. Sustainable memory bandwidth in current high performance computers. Silicon Graphics Inc (1995).
[24] Larry W McVoy, Carl Staelin, et al. 1996. lmbench: Portable Tools for Performance Analysis. In USENIX annual technical conference. San Diego, CA, USA, pp. 279–294.
[25] David Miller, Richard Henderson, and Jakub Jelinek. 2018. Dynamic DMA mapping Guide. Online. https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt, retrieved 2018 Oct 30.
[26] Nicholas Nethercote and Julian Seward. 2007. Valgrind: a framework for heavyweight dynamic binary instrumentation. In ACM Sigplan notices, Vol. 42. ACM, pp. 89–100.
[27] Karan Singh, Major Bhadauria, and Sally A. McKee. 2009. Real Time Power Estimation and Thread Scheduling via Performance Counters. SIGARCH Comput. Archit. News 37, 2 (2009), pp. 46–55. https://doi.org/10.1145/1577129.1577137
[28] Robert M Tomasulo. 1967. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of research and Development 11, 1 (1967), pp. 25–33.
[29] Linus Torvalds. 2014. Google Plus. https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6.
[30] Vince Weaver. 2018. Linux support for various PMUs. Online. http://web.eece.maine.edu/~vweaver/projects/perf_events/support.html, retrieved 2018 Oct 30.
[31] William A. Wulf and Sally A. McKee. 1995. Hitting the memory wall: implications of the obvious. SIGARCH Computer Architecture News 23, 1 (1995), pp. 20–24. https://doi.org/10.1145/216585.216588
[32] Jisoo Yang and Julian Seymour. 2018. Pmbench: A Micro-Benchmark for Profiling Paging Performance on a System with Low-Latency SSDs. In Information Technology-New Generations. Springer, pp. 627–633.
[33] Ahmad Yasin. 2014. A Top-Down method for performance analysis and counters architecture. In International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE Computer Society, pp. 35–44. https://doi.org/10.1109/ISPASS.2014.6844459
A Mode Switch Benchmark for x86
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>

#define TWENTY_MILLION 20000000
#define NTRIALS TWENTY_MILLION

int main(void) {
    /* __rdtsc() returns unsigned long long */
    unsigned long long ini, end, now, best, tsc;
    int i;

/* run `code` NTRIALS times and keep the fewest cycles seen in `best` */
#define measure_time(code)          \
    for (i = 0; i < NTRIALS; i++) { \
        ini = __rdtsc();            \
        code;                       \
        end = __rdtsc();            \
        now = end - ini;            \
        if (now < best) best = now; \
    }

    /* time rdtsc itself (i.e. no code) */
    best = ~0ULL;
    measure_time(0);
    tsc = best;
    printf("rdtsc: %llu cycles\n", tsc);

    /* time one of the fastest syscalls */
    best = ~0ULL;
    measure_time(getuid());
    printf("getuid(): %llu cycles\n", best - tsc);
    return 0;
}
B Absolute Event Counts
[Figure 12: bar chart titled “Event counter comparison for
"stream"”; y-axis: event count, normalized; x-axis: measured
events (task-clock, cycles, instructions, IPC0, CPU migrations,
context switches, page faults, and branch, TLB, cache, and DRAM
events); series: default/stream.perf.all and tune/stream.perf.all.]
Figure 12. Absolute event counts for stream for default setup (default) and our proposed setup (tune).
[Figure 13: bar chart titled “Event counter comparison for
"syscalls"”; y-axis: event count, normalized; x-axis: measured
events (task-clock, cycles, instructions, IPC0, CPU migrations,
context switches, page faults, and branch, TLB, cache, and DRAM
events); series: default/syscalls.perf.all and
tune/syscalls.perf.all.]
Figure 13. Absolute event counts for syscalls for default setup (default) and our proposed setup (tune).
[Plot for Figure 14: "Event counter comparison for "stream" under stress", comparing the series default/stream_stress_vm.perf.all and tune/stream_stress_vm.perf.all. x-axis: measured events, from task-clock (msec), cycles, instructions, and IPC through decode-switch-cycles; y-axis: normalized event count; each event is annotated with its absolute count.]
Figure 14. Absolute event counts for stream for default setup (default) and our proposed setup (tune) with background
processes.
[Plot for Figure 15: "Event counter comparison for "syscalls" under stress", comparing the series default/syscalls_stress_vm.perf.all and tune/syscalls_stress_vm.perf.all. x-axis: measured events, from task-clock (msec), cycles, instructions, and IPC through decode-switch-cycles; y-axis: normalized event count (logarithmic scale); each event is annotated with its absolute count.]
Figure 15. Absolute event counts for syscalls for default setup (default) and our proposed setup (tune) with background
processes.
