W&M ScholarWorks
Dissertations, Theses, and Masters Projects

Theses, Dissertations, & Master Projects

2021

Combining Performance Profiling And Modeling For Accuracy And
Efficiency
Hao Xu
William & Mary - Arts & Sciences, hxu07@email.wm.edu

Follow this and additional works at: https://scholarworks.wm.edu/etd
Part of the Computer Sciences Commons

Recommended Citation
Xu, Hao, "Combining Performance Profiling And Modeling For Accuracy And Efficiency" (2021).
Dissertations, Theses, and Masters Projects. William & Mary. Paper 1638386790.
https://scholarworks.wm.edu/etd/1638386790

This Dissertation is brought to you for free and open access by the Theses, Dissertations, & Master Projects at
W&M ScholarWorks. It has been accepted for inclusion in Dissertations, Theses, and Masters Projects by an
authorized administrator of W&M ScholarWorks. For more information, please contact scholarworks@wm.edu.

Combining Performance Profiling and Modeling for Accuracy and Efficiency

Hao Xu
Williamsburg, VA, USA

Master of Science, University of Chinese Academy of Sciences, China, 2014

A Dissertation presented to the Graduate Faculty
of The College of William & Mary in Candidacy for the Degree of
Doctor of Philosophy

Department of Computer Science

College of William & Mary
July 2021

© Copyright by Hao Xu 2021

APPROVAL PAGE

This Dissertation is submitted in partial fulfillment of
the requirements for the degree of
Doctor of Philosophy

Hao Xu

Approved by the Committee, July 2021

Committee Chair
Xu Liu, Associate Professor, Computer Science
North Carolina State University

Bin Ren, Assistant Professor, Computer Science
College of William & Mary

Qun Li, Professor, Computer Science
College of William & Mary

Weizhen Mao, Professor, Computer Science
College of William & Mary

Guoliang Jin, Assistant Professor, Computer Science
North Carolina State University

ABSTRACT
Modern computer systems have evolved to employ powerful parallel
architectures, including multi-core processors, multi-socket chips, large
memory subsystems, and fast network communication. Given such powerful
hardware, developers rely on performance profiling and modeling to guide their
performance optimization. However, performance optimization faces new
challenges in efficiency and accuracy on emerging computer systems. In this
dissertation, we propose approaches to address these challenges.
We first study memory contention in Non-Uniform Memory Access (NUMA)
architectures. We present Dr-BW, a new tool based on machine learning to
identify bandwidth contention in NUMA architectures and provide optimization
guidance. Dr-BW collects performance data with low overhead (< 10%), feeds
the data into a novel machine learning model that identifies contention with
more than 96% accuracy, and associates the analysis results with both
programs and significant data objects.
Then, we study and fix measurement inaccuracy in modern profilers. We
investigate multiple modern architectures and quantify the inaccuracy of PMU
instruction profiling in these architectures with mathematical modeling. We
then design a systematic framework to evaluate the impact of PMU inaccuracy
on the profiling results. We propose a software-based technique to rectify the
measurement inaccuracy introduced by the PMU and demonstrate its effectiveness.
Our research reveals that profiling and modeling significantly benefit system
performance improvement. In addition, modeling-based profiling helps users
understand performance bottlenecks and guides performance
optimization.

TABLE OF CONTENTS

Acknowledgments
Dedication
List of Tables
List of Figures

1 Introduction
  1.1 Thesis Statement
  1.2 Problem Statements
  1.3 Contributions
  1.4 Dissertation Organization

2 Related Work
  2.1 Identifying Bandwidth Contention in NUMA Architectures
    2.1.1 Memory Bandwidth Measurement
    2.1.2 Heuristics for Bandwidth Contention
    2.1.3 Machine Learning in Performance Analysis
  2.2 Understanding and Fixing the Inaccuracy in Modern Profilers
    2.2.1 Edge Profile Prediction by Heuristics
    2.2.2 Accuracy Enhancement of Sampling Based Profiling
    2.2.3 Architecture Support for Profiling Accuracy

3 DR-BW: Identifying Bandwidth Contention in NUMA Architectures with Supervised Learning
  3.1 Introduction
  3.2 Dr-BW Methodology and Overview
    3.2.1 Dr-BW’s Profiler
      Address Sampling.
      Associate Samples with Channels.
      Attribute Samples to Data Objects.
    3.2.2 Dr-BW’s Classifier
      Mini-programs for Training.
      Identification of Performance Features.
      Collection of Training Data
      Classifier of the Decision Tree.
    3.2.3 Dr-BW’s Diagnoser
      Quantify Data Object’s Contribution to Contention.
      Metrics per channel.
      Metrics cross channels.
      Root-cause Blaming.
  3.3 Evaluation of the Decision-tree classifier
    3.3.1 Benchmark Classification Results
    3.3.2 Classification Statistics
  3.4 Case Studies
    3.4.1 AMG2006
    3.4.2 IRSmk
    3.4.3 Streamcluster
    3.4.4 LULESH
    3.4.5 Rodinia NW
    3.4.6 SP
    3.4.7 Blackscholes
  3.5 Conclusion

4 Understanding and Fixing Inaccuracy of Modern Profilers
  4.1 Introduction
  4.2 Background
    4.2.1 Profilers with PMU Sampling
      Hardware Performance Monitoring Units (PMU):
      Linux Perf_events:
      Profiling Mechanisms:
    4.2.2 Limitation of Precise Event Based Sampling
    4.2.3 Ground Truth Profiler
  4.3 Quantify Skid Effect
    4.3.1 Skid Effects Modeling in a Simple Loop
    4.3.2 Measurement of CPU Skid Duration
    4.3.3 Mathematic formulation of skid effect modeling on Simple loop
  4.4 Nullifying Skid Effect On Instruction Profiling
    4.4.1 Control Flow Graph Decomposition
    4.4.2 Mathematic formulation of skid effects elimination on Control Flow Graph
    4.4.3 Control Flow Graph Decomposition
  4.5 Evaluation
    4.5.1 Evaluation Setup
    4.5.2 Effectiveness Provenance
      Profiling Accuracy of Linux Perf
      Profiling Accuracy at Basic Block Level
  4.6 Case Study
    4.6.1 PowerGraph Pagerank
    4.6.2 SPEC CPU2006 astar
    4.6.3 SPEC CPU2006 hmmer
    4.6.4 SPEC CPU2006 soplex
    4.6.5 SPEC CPU2006 omnet
    4.6.6 SPEC CPU2006 mcf
    4.6.7 libsvm
    4.6.8 Effectiveness on Other Hardware Events
  4.7 Conclusion

5 Conclusion and Future Work

Bibliography

Vita

ACKNOWLEDGMENTS
This dissertation is written with the support and help of many individuals. I
would like to thank all of them.
First and foremost, I would like to express my deepest appreciation to my
advisor, Dr. Xu Liu. Without his guidance in my research, encouragement in
my life, and confidence in my abilities, this dissertation would not have been
possible.
I would also like to thank the rest of my committee members, Dr. Qun Li, Dr.
Bin Ren, Dr. Weizhen Mao, and Dr. Guoliang Jin for serving on my dissertation
committee, as well as for their insightful comments.
My sincere thanks also go to all the members of the research group past and
present, Dr. Du Shen, Dr. Shasha Wen, Dr. Qingsen Wang, Dr. Probir Roy,
Bolun Li, Jialiang Tan, Qidong Zhao, for the stimulating discussions,
constructive suggestions, generous assistance, and effective teamwork.
I also thank my friends, Yongsen Ma, Xiao Liu, and so on for all the fun we
have had in the past years.
Last but not least, I would like to thank my family. Thanks to my parents,
whose love and care have made me who I am today.


This dissertation is dedicated to my beloved parents for their endless and
selfless love and support.


LIST OF TABLES

3.1 Selected features to train Dr-BW’s classifier.
3.2 Summary of the collected training data.
3.3 Confusion matrix for the training data.
3.4 Benchmark classification.
3.5 Evaluating the accuracy of Dr-BW with all the real benchmarks.
3.6 Quantifying Dr-BW’s accuracy when analyzing remote memory bandwidth contention.
3.7 Dr-BW’s runtime overhead.
4.1 Exclusive function-level instruction counts for two hot procedures in PowerGraph PageRank [1]. The ground truth result is collected by CCTLib.
4.2 Evaluation platform configurations.
4.3 PAPI_SR_INS instruction counts for code line of hmmer.

LIST OF FIGURES

3.1 An example NUMA architecture with four fully connected sockets.
3.2 Overview of Dr-BW’s workflow for bandwidth contention detection and diagnosis.
3.3 The decision tree used by Dr-BW. The internal nodes are labeled with “features”, while the leaf nodes are labeled with “classifications”.
3.4 Contribution Fraction (CF) distribution across data objects in AMG2006.
3.5 Speedups of IRSmk with different input sizes and execution configurations.
3.6 Speedups of IRSmk with different input sizes and execution configurations.
3.7 Speedups for Streamcluster with different input sizes and execution configurations.
3.8 streamcluster: Contribution Fraction (CF) distribution across data objects.
3.9 Speedups for LULESH with different execution configurations.
3.10 Contribution Fraction (CF) distribution across data objects in LULESH.
3.11 Contribution Fraction (CF) distribution across data objects in Rodinia NW.
4.1 Comparison of instruction profiling inaccuracy loss. Z-axis value indicates the total mis-attributed instruction number at the function level.
4.2 Skid effect on cycle profiling result. The skid duration S is less than 2 cycles and every cycle has the same probability to cause the counter overflow. There are three instructions (1, 2, 3) executed in order. Two different kinds of time points are defined: 1) counter overflow point, a finished cycle causing the hardware performance counter to overflow; 2) sample point, a time point when the active instruction is blamed as the cause of the performance counter overflow. Each counter overflow point has a corresponding sample point (e.g., the counter overflow point A_o has its own sample point A_s). The time difference of each pair of counter overflow point and sample point is the skid duration S measured in cycles.
4.3 Skid effect emulation with three different skid values on a simple loop consisting of five instructions. Each emulation generates an instruction distribution D(S, [c1, c2, c3, c4, c5]) that is a vector of the number of samples of all instructions, while D is the actual instruction profiling result.
4.4 Measurement of skid cycle duration for multiple platforms. Y-axis represents the sum of errors described in the equation, while X-axis denotes the candidate skid duration from 0 to 300 cycles. We search for the skid duration value that minimizes the error sum.
4.5 Recover instruction profile of control flow graph in Figure 4.6. There are six steps in the recovering process. CPU cycle profile and instruction profile are provided by the sampling based profile. Steps (1) to (5) constitute a closed iterative loop to improve the recovering quality.
4.6 Control flow graph with multiple branches decomposition. After static analysis, we obtain an instruction-level control flow graph with 5 instructions (A, B, C, D, E). Decomposition generates 3 different simple loops: A → B → C → D, A → B → C → D, A → B → C → E.
4.7 The value hpctoolkit for chosen applications on five platforms. Smaller value is better.
4.8 The value Perf for chosen applications on five platforms. Smaller value is better.
4.9 The value fix for chosen applications on five platforms. Smaller value is better.
4.10 The value fix,bb for chosen applications on five platforms. Smaller value is better.
4.11 Fixed result of 2 mis-attributed functions’ instruction profile in PageRank for Intel-SandyBridge and Intel-Skylake.
4.12 Fixed result of 3 mis-attributed functions’ instruction profile in astar for Intel-SandyBridge and Intel-Skylake.
4.13 Fixed result of 2 mis-attributed functions’ instruction profile in soplex for Intel-SandyBridge and Intel-Skylake.
4.14 Fixed result of 2 mis-attributed functions’ instruction profile in omnet for Intel-SandyBridge and Intel-Skylake.
4.15 mcf, Intel-SandyBridge
4.16 Fixed result of 2 mis-attributed functions’ instruction profile in mcf for Intel-SandyBridge and Intel-Skylake.
4.17 Fixed result of 4 mis-attributed functions’ instruction profile in libsvm for Intel-SandyBridge and Intel-Skylake.

Combining Performance Profiling and Modeling for Accuracy and
Efficiency

Chapter 1

Introduction
Computer systems have become increasingly complex, and it is challenging to tune software on a specific hardware platform for high performance [2, 3]. For example, many
software applications achieve only 5% to 15% of the peak performance on modern architectures. To address this problem, users rely on optimizations that maximize
the overall system performance. Such optimizations are usually guided by performance
profiling and modeling, which face challenges from different layers: hardware, performance
tools, and design. In this dissertation, we focus on topics addressing these challenges.
Non-Uniform Memory Access (NUMA) architectures are widely used in mainstream
multi-socket computer systems to scale memory bandwidth. Without a NUMA-aware design, programs can suffer significant performance degradation due to inter-socket
bandwidth contention. However, identifying bandwidth contention is challenging. Existing methods measure bandwidth consumption, but consumption alone is insufficient to quantify bandwidth contention. Furthermore, existing methods diagnose
bandwidth for the entire program execution and lack the ability to associate bandwidth
performance with the source code and data structures involved. As a result, no existing tool
can both identify memory contention problems and pinpoint the problematic code and data
structures.
In addition to memory contention, the inaccuracy of performance profiling tools is another important topic in performance optimization. Modern computer systems provide
performance monitoring units (PMUs) [4, 5, 6], which can monitor more than one
hundred performance events, such as CPU cycles, cache misses, floating point operations, and many others. Performance tools utilize PMUs to identify performance inefficiencies, attribute them to different code segments, and provide optimization guidance
for hardware designers, compiler developers, and application programmers. However,
hardware limitations in handling PMU samples cause inaccurate measurements. There
has already been some work studying these inaccurate measurements, but
it does not quantify the impact of the hardware flaw on measurements. Thus, it is essential to build a mathematical model to quantify the skid effect and provide guidance on
rectifying the inaccurate measurement.

1.1 Thesis Statement
Modern sampling-based profilers, assisted by machine learning and measurement-based modeling techniques, can identify performance bottlenecks (e.g., bandwidth contention in NUMA architectures) with low overhead, high accuracy, and insightful information that guides optimization.

1.2 Problem Statements
In this dissertation, we investigate how to address the challenges of performance optimization. Specifically, we work on the following problems:
(1) Identifying problematic code that incurs bandwidth contention in NUMA architectures. Runtime variation from hardware prefetching, parallel instruction pipelining, and operating system interference makes it difficult to predict contention via static
analysis. While performance monitoring units (PMUs) are able to measure memory
bandwidth consumption and count remote memory access requests, this information
is insufficient to indicate whether bandwidth is suffering from contention. First, a high
bandwidth consumption does not necessarily mean bandwidth contention. Furthermore,
bandwidth contention can occur in any interconnect channel between sockets, so it is
crucial to understand the NUMA topology to identify and localize contention. Even if
we can identify contention, deriving insights that would allow us to mitigate the problem
remains challenging. Identifying contention indicates whether the problem exists, but
without knowledge of the causes of the contention in an application, resolving the problem requires a substantial amount of domain knowledge and manual effort. Existing
tools such as HPCToolkit-NUMA [7] and MemProf [8] measure memory access latency
to identify lines of code that cause problematic remote memory accesses, but neither
addresses contention.
(2) Understanding and Fixing the Inaccuracy in Modern Profilers. Measurement inaccuracy arises from hardware limitations in handling PMU samples. There
is always a time delay between the PMU overflow interrupt and the signal delivery to
the performance tool, which is known as “skid” [9]. Skid exists in most
architectures, such as x86, PowerPC, and ARM, and is the major source of the inaccuracy. However, the impact of skid on measurement accuracy has not been
extensively studied. Prior work [10] relies on hardware support to minimize the skid,
such as precise event-based sampling (PEBS) [4] and the last branch record (LBR) [11]
in Intel processors. However, PEBS and LBR are not available in all processors. Moreover, even PEBS is not guaranteed to be accurate, as described
in Section 4.2 of Chapter 4.
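To make the notion of skid concrete, the following toy emulation may help. It is only a sketch under stated assumptions and is not the model developed in Chapter 4: a loop whose instructions have fixed, hypothetical cycle costs, an instruction-retired counter that is equally likely to overflow at any instruction, and a constant skid after which the active instruction is blamed. The function name `emulate_skid` and the cycle costs are illustrative.

```python
from collections import Counter

def emulate_skid(cycles, skid):
    """Toy model: per-instruction sample counts over one loop iteration.

    cycles[i] is the assumed constant cycle cost of instruction i.
    The counter overflows once at the last cycle of each instruction
    (an instruction-retired event); the sample is attributed to the
    instruction that is active `skid` cycles later.
    """
    # timeline[t] = index of the instruction executing during cycle t
    timeline = [i for i, cost in enumerate(cycles) for _ in range(cost)]
    period = len(timeline)
    counts = Counter()
    cycle = 0
    for cost in cycles:
        cycle += cost
        overflow_cycle = cycle - 1                        # last cycle of this instruction
        sample_cycle = (overflow_cycle + skid) % period   # the loop wraps around
        counts[timeline[sample_cycle]] += 1
    return [counts[i] for i in range(len(cycles))]

# With no skid, every instruction receives exactly its own sample.
print(emulate_skid([1, 1, 6, 1, 1], skid=0))
# With a 2-cycle skid, the long-latency instruction absorbs samples
# that belong to the instructions preceding it.
print(emulate_skid([1, 1, 6, 1, 1], skid=2))
```

Even in this simplified setting, a constant skid shifts samples toward long-latency instructions, which is the kind of mis-attribution the problem statement refers to.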

1.3 Contributions
This dissertation makes two contributions toward addressing challenges in performance optimization. The overall contributions are as follows:
Identifying Bandwidth Contention in NUMA Architectures. We develop Dr-BW, a lightweight profiler that automatically identifies bandwidth contention in
NUMA architectures using supervised machine learning (ML) techniques. We make the
following three contributions in Dr-BW:
• Dr-BW employs a supervised ML technique to train a highly accurate classifier for
bandwidth contention. To the best of our knowledge, Dr-BW is the first profiler that
applies ML to diagnose bandwidth contention.
• Dr-BW adopts a lightweight sampling mechanism available in modern CPU architectures. It collects all performance data to train our classifier in a single run,
incurring less than 10% runtime overhead, on average.
• Dr-BW not only identifies programs that suffer from memory contention, but also
pinpoints problematic data structures used in the code. This insight provides
straightforward guidance to optimize bandwidth contention.

Understanding and Fixing inaccuracy of PMU Sampling based Profilers. To address the limitations in existing PMU sampling based profilers, we systematically study
the instruction measurement accuracy of performance tools based on the PMU sampling.
• We show the instruction measurement inaccuracy in state-of-the-art performance
tools by comparing with the ground truth and summarize the characteristics of
victims of the inaccurate measurement from the software perspective.
• We design and implement a measurement technique to quantify the skid in various
architectures, as the root cause of the inaccuracy from the hardware perspective.
• We propose a novel software scheme without any extra hardware support to quantify the measurement inaccuracy by solving an optimization problem.
• We evaluate our techniques on several benchmarks and applications and show
significant improvement in the accuracy of instruction profiles.


1.4 Dissertation Organization
The rest of this dissertation is structured as follows. In Chapter 2, we discuss related
work. In Chapter 3, we present our tool for identifying bandwidth contention in NUMA architectures: Dr-BW. In Chapter 4, we study and fix inaccuracy of modern PMU sampling
based profilers. Finally, we conclude in Chapter 5.


Chapter 2

Related Work
This chapter reviews related work in diagnosing problematic memory bandwidth performance and handling inaccurate hardware event sampling.

2.1 Identifying Bandwidth Contention in NUMA Architectures
2.1.1 Memory Bandwidth Measurement
A number of existing tools such as HPCToolkit [12], VTune [13], and Perf [14] collect
off-chip memory requests from hardware performance counters to quantify bandwidth
consumption. However, the bandwidth consumption does not tell whether contention
exists or not. For example, a regular pattern with high bandwidth consumption may
not cause any bandwidth contention, while a random pattern with a low bandwidth consumption may incur intensive contention.
Instead of directly measuring bandwidth usage, Eklov et al. developed Bandwidth
Bandit [15] to empirically measure a program’s susceptibility to bandwidth contention
problems. Bandwidth Bandit creates interference threads, which can be tuned to consume different amounts of memory bandwidth. Because the available bandwidth is reduced, the monitored program may suffer a performance slowdown. If the interference
thread incurs a large slowdown, the monitored program is bandwidth bound and subject to contention. Otherwise, the monitored program is not bottlenecked on bandwidth.

Casas and Bronevetsky applied a similar approach to parallel programs [16].
The approach of utilizing interference threads is beneficial in determining whether a
program’s performance is sensitive to bandwidth contention but does not actually detect the occurrence of contention in an unmodified program. Furthermore, interference
threads need to run on spare cores, but many parallel programs, especially HPC applications, use up all the cores available in the machine. In addition, this approach limits
the analysis to the whole-program level and offers no performance insights into fine-grained
program contexts or semantics.
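The interference-based approach can be summarized as a sensitivity test on measured runtimes. The sketch below is a hypothetical illustration, not the Bandwidth Bandit implementation: the function names, the input format, and the 10% tolerance are all assumptions chosen for the example.

```python
def slowdown_curve(baseline_seconds, runs):
    """Return (interference bandwidth, slowdown) pairs, sorted by bandwidth.

    runs: list of (interference_bandwidth_gbps, runtime_seconds) pairs
    measured while interference threads consume the given bandwidth.
    """
    return [(bw, runtime / baseline_seconds) for bw, runtime in sorted(runs)]

def is_bandwidth_sensitive(baseline_seconds, runs, tolerance=1.10):
    """Flag the program if any interference level slows it past `tolerance`.

    The 1.10 default (a 10% slowdown) is an arbitrary illustrative cutoff.
    """
    return any(slowdown > tolerance
               for _, slowdown in slowdown_curve(baseline_seconds, runs))

# Hypothetical measurements: runtime grows as interference bandwidth grows.
print(is_bandwidth_sensitive(10.0, [(5, 10.2), (10, 11.5), (20, 14.0)]))
```

Note that this only characterizes sensitivity under injected interference; as the text observes, it does not detect contention in an unmodified run.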

2.1.2 Heuristics for Bandwidth Contention
There are a number of tools that use heuristics to identify bandwidth contention issues
and perform optimizations. Such tools exist within compilers [17], runtime systems [18],
or operating systems [19, 20]. One approach determines bandwidth contention based
on whether data allocated in one NUMA socket is accessed from threads in all sockets [21]. While effective for many workloads, this heuristic may not hold if the hardware
pre-fetcher loads data into local caches in advance or if accesses from multiple sockets
do not overlap in time. Other approaches use memory access latency as a heuristic—
accesses that exceed a certain latency threshold are classified as contentious [20].
However, access latency varies due to a number of factors, and may not be indicative of contention in particular. In addition, determining an adequate threshold is usually
difficult; some tools [7] determine its value via simple experiments.
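The two heuristics described above can be sketched in a few lines. This is a simplified illustration under assumed inputs (socket identifiers per access and sampled latencies in cycles); the function names and the threshold are hypothetical and do not come from the cited tools.

```python
def accessed_from_all_sockets(accesses, home_socket, num_sockets):
    """Heuristic 1: data homed on one socket is touched by threads on every socket.

    accesses: iterable of (thread_socket, data_socket) pairs.
    """
    sockets = {thread for thread, data in accesses if data == home_socket}
    return sockets == set(range(num_sockets))

def contentious_fraction(latencies_cycles, threshold_cycles):
    """Heuristic 2: share of sampled accesses slower than a fixed threshold."""
    if not latencies_cycles:
        return 0.0
    slow = sum(1 for latency in latencies_cycles if latency > threshold_cycles)
    return slow / len(latencies_cycles)

# Hypothetical samples on a four-socket machine.
samples = [(0, 0), (1, 0), (2, 0), (3, 0), (1, 1)]
print(accessed_from_all_sockets(samples, home_socket=0, num_sockets=4))
print(contentious_fraction([100, 200, 900, 1000], threshold_cycles=500))
```

Both functions make the limitations discussed above visible: the first ignores prefetching and timing overlap, and the second depends entirely on how `threshold_cycles` is chosen.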
Because no performance monitoring units currently exist to quantify bandwidth contention, using heuristics based on related measurements has proven feasible, but all
aforementioned approaches are limited to the domains where the heuristics hold true.
Dr-BW builds upon the idea of heuristic-based detection, but instead of employing a single predefined heuristic, Dr-BW adapts a statistical model for bandwidth contention by
employing machine learning techniques on related performance measurements. Thus,
Dr-BW overcomes many of the limitations in previous work.
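The difference between a predefined threshold and one learned from labeled runs can be illustrated with a one-feature decision stump. This is a minimal sketch and not Dr-BW's classifier, which is described in Chapter 3; the feature (a hypothetical average remote-access latency) and the labels are invented for the example.

```python
def train_stump(samples, labels):
    """Pick the threshold on one feature that minimizes training errors.

    samples: list of feature values (needs at least two distinct values).
    labels:  list of bools, True meaning the run was labeled as contended.
    Returns (threshold, errors); a value above the threshold predicts True.
    """
    candidates = sorted(set(samples))
    best = (None, len(samples) + 1)
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2  # midpoint between neighboring values
        errors = sum((value > threshold) != label
                     for value, label in zip(samples, labels))
        if errors < best[1]:
            best = (threshold, errors)
    return best

# Hypothetical labeled runs: low latencies uncontended, high latencies contended.
latencies = [100, 120, 130, 300, 320, 350]
contended = [False, False, False, True, True, True]
print(train_stump(latencies, contended))
```

A full decision tree repeats this kind of split search recursively over several features, so the cutoffs come from training data rather than from a hand-picked constant.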


2.1.3 Machine Learning in Performance Analysis
Recent work has applied supervised learning to HPC performance bottleneck analysis.
Sanath et al. [22] use machine learning to detect false sharing and inefficient memory
access. By training a classifier on a set of micro-benchmarks with understood behaviors, they are able to detect the presence of false sharing in an application execution.
ElMoustapha et al. [23] build model trees to analyze architecture performance. They
aim to identify performance problems and estimate the potential gain from addressing a specific performance issue. Wucherl Yoo builds ADP [24], an automated system to model,
detect, and provide optimization suggestions for known performance “pathologies”. ADP
collects several hardware events for all functions in an application and uses them as
inputs to a decision tree to classify each function according to a set of known pathologies. Vetter [25] uses machine learning to classify communication inefficiencies in
distributed applications. Instead of analyzing the application as a whole, Vetter examines each individual communication operation to determine whether it is efficient and reveals the cause of any inefficiency.

2.2 Understanding and Fixing the Inaccuracy in Modern Profilers
2.2.1 Edge Profile Prediction by Heuristics
Ball et al. [26] propose to collect edge frequency profiles by optimally inserting monitoring code, which incurs acceptable overhead. They also derive heuristics for
predicting edge frequency profiles. Moreover, Wu et al. [27] show that such branch
frequency heuristics can guide static profiling to estimate edge execution frequencies. Anderson et al. [28] introduce a two-step framework that correlates the sample
counts of instructions with execution frequency heuristics to improve the prediction accuracy of execution frequencies. Our research goes beyond heuristics by building a
mathematical model of skid effects in a complex control flow graph.


2.2.2 Accuracy Enhancement of Sampling Based Profiling
Much work has been done to tame inaccurate hardware event sampling results. Levin et
al. [29] show that constructing an edge profile from basic block sample counts can
be formalized as a Minimum Cost Circulation problem. Chen et al. [30] extend the Minimum Cost Circulation model with additional performance counters to improve the
quality of sampling profiles. They apply supervised learning techniques to minimize the
skid effect on sampling profiles. A later study [31] by Wu et al. points out that varying
the sampling rate does not improve the accuracy of the collected profiling results. In an earlier exploration [32], Dimakopoulou et al. study event scheduling optimization
in the Linux kernel to minimize hardware performance counter corruption. Mytkowicz et
al. [33] study the accuracy of multiple Java profilers and find that sampling only
at yield points, a JVM mechanism for supporting maintenance operations, biases the profiling results. Lim et al. show that intelligently selecting how events are
multiplexed based on their rate of change can improve a profiler’s accuracy [34]. Moreover, Mathur et al. quantify the error caused by event multiplexing and propose new
estimation algorithms to improve accuracy [35]. None of these works quantifies the
impact of the CPU hardware flaw (the skid effect) on profile accuracy. In this dissertation, we propose
a mathematical model to quantify the skid effect in loop-based programs, measure the skid
duration for different CPUs, and then formulate an optimization problem over the control flow
graph to eliminate the mis-attribution caused by the skid effect.
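For context on the multiplexing error mentioned above: when more events are requested than there are hardware counters, each event is counted for only part of the run, and tools commonly extrapolate linearly from the time the event was actually scheduled. The sketch below shows that standard scaling; the function name is hypothetical, and the estimation algorithms proposed in the cited works are more sophisticated than this baseline.

```python
def scale_multiplexed_count(raw_count, time_enabled, time_running):
    """Linearly extrapolate a count that was observed for only part of the run.

    time_enabled: how long the event was enabled.
    time_running: how long it was actually scheduled on a hardware counter.
    """
    if time_running == 0:
        return None  # the event was never scheduled; no estimate is possible
    return raw_count * time_enabled / time_running

# An event counted 1000 times while scheduled for half of the enabled time.
print(scale_multiplexed_count(1000, time_enabled=200, time_running=100))
```

The extrapolation assumes the event rate is uniform over time, which is exactly the assumption that rate-aware multiplexing schemes try to relax. It is a separate source of error from skid, which affects where a sample is attributed rather than how many events are counted.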

2.2.3 Architecture Support for Profiling Accuracy
Intel x86 provides Last Branch Records (LBRs) [11] to continuously record the most recent branches, which can help count basic block execution frequency [36]. Work by
Chen et al. [10] and Nowak [37] has demonstrated the effectiveness of instruction profiling at the
basic block level using LBRs. However, to calculate each basic block’s frequency with LBRs, users must sample a branch instruction event, and the sampling
result for taken branches can itself be affected by skid. LBRs cannot eliminate
the skid effect, since it is caused by a hardware flaw in the CPU design. Moreover,
not all CPU vendors (e.g., AMD or ARM) provide a branch logging facility like Intel’s.
Modern CPUs often support more advanced forms of sampling, such as Intel’s Precise Event-Based Sampling (PEBS). PEBS is supported directly by hardware and tries to keep the skid small [38]. Even though Nowak [37] claims that PEBS obtains more
accurate profiling results than the standard hardware event sampling method, we find
that PEBS still suffers from the skid effect, and its sampling result cannot be described by
a mathematical model with a constant skid. Our method eliminates the skid effect from
the software side regardless of the underlying hardware architecture.

Chapter 3

DR-BW: Identifying Bandwidth
Contention in NUMA Architectures
with Supervised Learning
3.1 Introduction
The number of CPU cores per node in High Performance Computing (HPC) systems
has increased rapidly in recent years, but main memory bandwidth has not scaled
at the same pace. Thus, bandwidth contention across cores due to main memory accesses has become a critical bottleneck. To mitigate this contention, modern HPC systems adopt Non-Uniform Memory Access (NUMA) architectures to scale the bandwidth
of main memory. Figure 3.1 shows a typical NUMA architecture, which integrates four
fully interconnected sockets, each with its own memory attached. Each core can access local memory attached to its own socket or remote memory attached to another socket. Local
accesses have much lower latency and higher bandwidth than remote accesses. While
this design makes it conceptually possible for cores to operate on independent portions
of memory in parallel, it is also possible for cores on different sockets to contend for
available memory bandwidth. With careless software design, bandwidth contention can
occur in any memory controller, so it is critical to identify the root causes of bandwidth
contention in NUMA applications to address performance problems.

Figure 3.1: An example NUMA architecture with four fully connected sockets.

Identifying problematic code that incurs bandwidth contention in NUMA architectures is challenging for several reasons. Runtime variation from hardware prefetching, parallel instruction pipelining, and operating system interference makes it difficult to
predict contention via static analysis. While performance monitoring units (PMUs) are
able to measure memory bandwidth consumption and count remote memory access
requests, this information is insufficient to indicate whether bandwidth is suffering from
contention. First, a high bandwidth consumption does not necessarily mean bandwidth
contention. Furthermore, bandwidth contention can occur in any interconnect channel
between sockets, so it is crucial to understand the NUMA topology to identify and localize contention.
Even if we can identify contention, deriving insights that would allow us to mitigate
the problem remains challenging. Identifying contention indicates whether the problem
exists, but without knowledge of the causes of the contention in an application, resolving
the problem requires a substantial amount of domain knowledge and manual effort.
Existing tools such as HPCToolkit-NUMA [7] and MemProf [8] measure memory access
latency to identify lines of code that cause problematic remote memory accesses, but
neither addresses contention.
In this chapter, we develop Dr-BW, a lightweight profiler that automatically identifies
bandwidth contention in NUMA architectures using supervised machine learning (ML)
techniques. Dr-BW works on fully optimized binary code that runs at large scale on
modern NUMA machines. It does not require any hardware or OS extensions. We train
Dr-BW with a set of micro benchmarks and apply the trained Dr-BW to a set of real
benchmarks from the Sequoia [39], Rodinia [40], NPB [41], and PARSEC [42] suites.
Dr-BW correctly detects bandwidth contention with 95% accuracy. Guided by Dr-BW,
we are able to optimize the code and achieve up to a 6.5× speedup on modern NUMA
architectures. Dr-BW and the related benchmarks used in this work are open sourced
at https://github.com/xuhao417347761/DR-BW.

3.2 Dr-BW Methodology and Overview
Dr-BW addresses three major challenges in identifying bandwidth contention, from the hardware, software, and tool perspectives.
Bandwidth contention in complex interconnect channels. From the hardware
perspective, the interconnect bandwidths between sockets vary, even for channels in
opposing directions between sockets [43]. Understanding contention in specific interconnects and directions is important for guiding optimization. Dr-BW monitors data
transmission in each channel and associates contention with specific channels.
Root-cause analysis. From the software perspective, problematic memory accesses can be buried deep in complex codebases. Moreover, understanding the problematic access instructions alone does not lead to straightforward optimization strategies. Dr-BW applies root-cause analysis to associate problematic accesses with data
objects. Accesses to these data objects can then be optimized by modifying their allocation schemes for different NUMA architectures.
Bandwidth contention mini programs. Another challenge is collecting meaningful
data on which to train the classifier. No standard benchmark suite exists for bandwidth
contention, so we developed a set of problem-specific mini programs, each tunable to
run with different configurations.
Figure 3.2 shows an overview of Dr-BW. The profiler monitors the execution of fully
optimized binary code and collects performance data. The performance data is fed
into a decision-tree classifier to determine whether contention has occurred in the program execution or not. If Dr-BW identifies contention in a program, its diagnoser further
analyzes the code to identify the root cause of the contention, including accesses to
problematic data objects. Dr-BW associates the analysis with source code to provide
intuitive optimization guidance.

Figure 3.2: Overview of Dr-BW’s workflow for bandwidth contention detection and diagnosis. (Components shown: executable binaries, data-centric profiler, data sets, decision-tree classifier, root-cause diagnoser, problematic objects.)

3.2.1 Dr-BW’s Profiler
The profiler component in Dr-BW collects memory accesses and extracts performance
features that are used as input to the contention training and detection phases. To
guarantee lightweight analysis, Dr-BW relies on low-overhead hardware performance
monitoring units (PMUs) to do the measurement. PMUs collect a variety of statistics
during the execution, quantify some performance metrics, and help to identify potential
bottlenecks. In addition, to accurately predict the bandwidth for complex interconnection
topologies and diagnose the root cause of the contention, there are two extra features
implemented in the profiler. One is to distribute the samples to corresponding channels;
the other is to associate the samples with allocated data objects in the program.

Address Sampling. In order to provide Dr-BW with sufficient information for measurement, prediction, and diagnosis, the PMU sampling we use should:
• Report the effective memory address that is read or written by the sample, and the
memory layer this sample is touching: L1, L2, L3, local DRAM or remote DRAM.
• Collect memory related metrics along with the sample. Such metrics include local/remote NUMA memory accesses, cache misses, or latency.
• Record the CPU ID where each sampled memory access instruction executes.
Address sampling supported in modern PMUs can meet all of these requirements.
Examples include Intel’s precise event-based sampling (PEBS) with latency extensions [44]
(supported on Nehalem, SandyBridge, IvyBridge, Haswell, and Broadwell microarchitectures); AMD Opteron processors with instruction-based sampling for micro ops (IBS
op) [45]; and IBM POWER5 and later generations of POWER architectures that count
marked events (MRK) [46]. In this work, we conduct all our experiments on an Intel
SandyBridge machine. We use PEBS to sample the memory latency event (footnote 1) and to report the associated metrics. We
sample one of every 2000 memory accesses independently in each thread. We plan to extend Dr-BW to AMD
and IBM platforms in future work.

Associate Samples with Channels. A single memory sample can be located on the
channel from any NUMA node to another. We assume that bandwidth issues on one
channel are mainly identified by accesses on that channel. For example, we use only
samples observed between nodes 0 and 1 to diagnose performance problems on the
bus connecting nodes 0 and 1 (not samples that occurred between nodes 0 and 2).
Instead of predicting problems for the entire execution, Dr-BW detects bandwidth issues
per-channel.
Thus, it is necessary to associate all the samples with channels based on their
sources and targets. The source of a sample, also known as the accessing node,
is the node where the processor triggers the memory access. With the precise CPU
ID and the hardware topology, the NUMA node on which a core resides is easy to obtain.
The target of a sample is the locating node, where the data reside. To find the target,
we use the libnuma library to look up the node that holds the precise memory address
reported by the sample. With the sources and targets, the samples are then associated
with different channels.

Footnote 1: MEM_TRANS_RETIRED:LATENCY_ABOVE_THRESHOLD
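
The association of samples with channels described above can be sketched as follows. This is a minimal illustration, not Dr-BW's implementation: the contiguous core numbering (`CORES_PER_NODE`), the sample tuple layout, and the `toy_target_node` lookup are all assumptions made for the example. In the real tool the target node comes from a libnuma-based lookup, which the `target_node` callable stands in for here.

```python
from collections import defaultdict

# Hypothetical topology: 4 NUMA nodes with 8 cores each, numbered contiguously.
CORES_PER_NODE = 8


def source_node(cpu_id):
    """Accessing node: the NUMA node of the CPU that issued the access."""
    return cpu_id // CORES_PER_NODE


def group_by_channel(samples, target_node):
    """Group samples by directed channel (source node, target node).

    samples: iterable of (cpu_id, address, latency_cycles) tuples.
    target_node: callable mapping an address to the node holding its data;
        it stands in for the libnuma-based lookup described in the text.
    """
    channels = defaultdict(list)
    for cpu_id, address, latency in samples:
        channel = (source_node(cpu_id), target_node(address))
        channels[channel].append((cpu_id, address, latency))
    return dict(channels)


# Toy lookup for illustration only: pretend each node owns a 1 GiB address range.
def toy_target_node(address):
    return (address >> 30) % 4


samples = [(0, 0x00000000, 90), (1, 0x40000000, 410), (9, 0x00001000, 380)]
channels = group_by_channel(samples, toy_target_node)
```

Keying channels by the ordered pair keeps opposing directions separate, which matters because interconnect bandwidth can differ by direction. Pairs with equal source and target are local accesses.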

Attribute Samples to Data Objects. After sample distribution, Dr-BW can use the
samples from one channel to predict whether there is contention on that channel. When
contention is predicted, to better understand the root cause of the contention, Dr-BW
attributes the samples to the data objects.
A program can contain three kinds of data: static data, stack data, and
heap-allocated data. Compared with static and stack data, dynamically allocated
data usually have larger sizes and suffer more from bandwidth contention. So in the
current implementation of Dr-BW’s profiler, we focus on heap-allocated data.

3.2.2 Dr-BW’s Classifier
The decision-tree classification algorithm is widely used in predicting the answer to a
yes/no question. Dr-BW uses a decision tree to answer whether there would be bandwidth contention for a program. To build and train this decision tree, we need a set
of benchmarks which cover both bandwidth contention and non-contention scenarios.
There does not exist a standard benchmark suite for this purpose, so we develop several
micro benchmarks and tune the data sizes to run these benchmarks in either bandwidth
friendly mode or contention mode. In this section, we describe how we build our micro
benchmarks and how we train the decision-tree model.

Mini-programs for Training. The first set of programs is OpenMP multithreaded vector operations. The summary of these programs is as follows:
• vector summation (sumv): each thread computes the summation of its own share
of vector data.
• dot-product of vectors (dotv): each thread computes the dot-product of its own
share of vector data.
• count for vectors (countv): each thread counts the number of occurrences of a
specific number in its own share of vector data.
We use the names sumv, dotv, and countv to refer to these three programs in later sections. They differ in memory usage and memory access pattern.
The size of the vector is tuned to adjust the bandwidth friendliness. As the size of the input
data grows, cache misses and remote accesses increase. When the execution time of
these programs does not grow in proportion to the input size, we conclude that contention in
remote bandwidth occurs, because contention can greatly delay memory
requests and incur significantly longer execution times.
We design the bandit program to continuously issue memory requests without cache
hits. We use the bandit program to study main memory bandwidth while avoiding interference from caches. The implementation of the bandit program follows an existing
approach [15]. Every memory access issued by the bandit program conflicts in the caches with the
previous one, so the request goes to main memory. To achieve this, we
first allocate several huge pages to ensure a deterministic mapping
between the page offset and the cache set. We then form a stream of memory accesses
with a pointer-chasing pattern. All memory accesses touch addresses that are mapped
to the same cache sets, causing conflict misses. By placing the huge pages in remote
memory, we are able to evaluate the remote memory bandwidth.
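
The address arithmetic behind such a same-set pointer chain can be sketched as follows. This is only an illustration under stated assumptions: a simple modulo-indexed cache model, and hypothetical values for `LINE_SIZE` and `NUM_SETS`. Real last-level caches may hash addresses differently, and Python itself does not generate the memory traffic; the sketch only shows how the offsets and the chain are built.

```python
import random

LINE_SIZE = 64      # bytes per cache line (assumed)
NUM_SETS = 2048     # sets in the targeted cache (assumed)


def cache_set(offset):
    """Set index under a simple modulo-indexed cache model."""
    return (offset // LINE_SIZE) % NUM_SETS


def same_set_offsets(count, target_set=0):
    """Byte offsets within a contiguous region that all map to target_set."""
    stride = NUM_SETS * LINE_SIZE
    return [target_set * LINE_SIZE + i * stride for i in range(count)]


def build_chain(offsets, seed=0):
    """Link the offsets into one cycle: chain[x] is the offset visited after x."""
    order = list(offsets)
    random.Random(seed).shuffle(order)
    return {cur: order[(i + 1) % len(order)] for i, cur in enumerate(order)}


def chase(chain, start, steps):
    """Follow the chain; each step depends on the previous one, as in pointer chasing."""
    visited = []
    pos = start
    for _ in range(steps):
        visited.append(pos)
        pos = chain[pos]
    return visited


offsets = same_set_offsets(count=32)
chain = build_chain(offsets)
trace = chase(chain, offsets[0], len(offsets))
```

When the number of offsets exceeds the cache's associativity, each access in the cycle evicts a line that a later access needs, so under this model every access misses.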
In the training phase, we tune the number of streams in one bandit instance and
the number of co-running bandit instances to impose different demands on memory
bandwidth.
To simplify the description of whether one run of a program suffers from bandwidth
contention, we define the following two modes for each running instance:
• good: no remote memory bandwidth contention
• rmc: remote memory bandwidth contention
When an application running under a given configuration has contention, we say this run
results in the “rmc” mode.

Identification of Performance Features. To predict memory contention, Dr-BW
mainly measures memory-related information. Given one memory access, all
the features related to it can be classified into the following three categories:
• Identification means features that can be used to identify the memory access, including the memory address and the source node, CPU id, thread id that triggered this access.
• Location includes features specifying where this memory is located and which
layer of the memory hierarchy this access is touching. Information like L1 Hit,
L2 Hit, L3 Hit, L3 Miss, DRAM access, Remote DRAM access belong to this
category.
• Latency, or how many CPU cycles it takes to complete an access.
Dr-BW records all these features for each memory sample. When a batch of memory
samples has been collected, the categories above are aggregated into the following statistical features:
• Statistics Identification includes features such as the number of memory accesses triggered per CPU id, thread id, and node id.
• Statistics Location includes features counting the total number of accesses to each
memory layer, such as Num_L1_Hit, Num_L2_Hit, Num_L3_Hit, Num_L3_Miss,
Num_DRAM_access, Num_RemoteDRAM_access.
• Statistics Latency adds features quantifying the ratios of samples above different latency thresholds and the average latency of memory accesses across different memory
layers.
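
The aggregation of per-sample records into statistical features can be sketched as follows. This is a minimal sketch, not Dr-BW's code: the sample layout, the memory-layer labels, and the feature names are hypothetical, and the latency thresholds are the ones listed in Table 3.1.

```python
LATENCY_THRESHOLDS = (1000, 500, 200, 100, 50)  # cycles, as listed in Table 3.1


def mean(values):
    """Average of a list, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def batch_features(samples):
    """Aggregate a batch of samples into statistical features.

    samples: list of (level, latency_cycles); the level labels are hypothetical
    strings such as "L1", "L3", "LOCAL_DRAM", "REMOTE_DRAM".
    """
    latencies = [lat for _, lat in samples]
    remote = [lat for level, lat in samples if level == "REMOTE_DRAM"]
    local = [lat for level, lat in samples if level == "LOCAL_DRAM"]
    total = len(latencies)

    features = {}
    for threshold in LATENCY_THRESHOLDS:
        above = sum(1 for lat in latencies if lat > threshold)
        features[f"ratio_latency_above_{threshold}"] = above / total if total else 0.0
    features["num_remote_dram"] = len(remote)
    features["avg_remote_dram_latency"] = mean(remote)
    features["num_local_dram"] = len(local)
    features["avg_local_dram_latency"] = mean(local)
    features["num_samples"] = total
    features["avg_latency"] = mean(latencies)
    return features


batch = [("L1", 4), ("LOCAL_DRAM", 120), ("REMOTE_DRAM", 600), ("REMOTE_DRAM", 1200)]
features = batch_features(batch)
```

Using ratios rather than raw counts for the latency buckets keeps the features comparable across batches of different sizes.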
All these statistics features are included in a candidate list for training and prediction.
We call it a candidate list because it is impractical and unnecessary to use all the features
in the prediction. It is reasonable to select and use only the features that are highly relevant
to bandwidth contention.
In the selection phase, each of our multi-threaded mini-programs is executed in both
"good" and "rmc" modes, with different thread counts (e.g., 1, 2, 4, 8, and 16 in a
NUMA node). We measure each candidate feature. If there is a significant difference in
the statistics between "good" and "rmc" for a majority of mini-programs, the candidate
feature is selected as relevant for prediction.
In the experiments, we notice that some remote memory events that we
expected to be related to contention are actually not, such as the remote memory
access with cache miss event (footnote 2). Table 3.1 shows the selected features Dr-BW uses.

Figure 3.3: The decision tree used by Dr-BW. The internal nodes are labeled with
“features”, while the leaf nodes are labeled with “classifications”.

Collection of Training Data. We use the decision tree classification algorithm in the Statistics
and Machine Learning Toolbox of Matlab 2016a. The training process is conducted on
the platform described in Section 3.3.
Training datasets are the selected statistics collected by running the mini-programs.
Each mini-program is run with multiple configurations. A configuration consists of a
problem size, a number of threads, and a thread-to-node binding. Each configuration
is either in "good" mode or "rmc" mode. We then collected all the resulting data and
manually examined each instance. The result of data collection is summarized in Table
3.2. Our overall training data set has 192 instances. We manually label each training
data instance by adding the corresponding mode ("good" or "rmc") as a separate field.

Footnote 2: Mem_Load_Uops_LLC_Miss_Retired.Remote_DRAM

Table 3.1: Selected features to train Dr-BW’s classifier.
Feature  Feature Description
1        Ratio of latency above 1000 among all samples
2        Ratio of latency above 500 among all samples
3        Ratio of latency above 200 among all samples
4        Ratio of latency above 100 among all samples
5        Ratio of latency above 50 among all samples
6        # of remote dram access sample
7        Average remote dram access latency
8        # of local dram access sample
9        Average local dram access latency
10       Total # of memory access sample
11       Average memory access latency
12       Total # of line fill buffer access sample
13       Line fill buffer access latency

Table 3.2: Summary of the collected training data.
mini-programs           good   rmc   Total
sumv                      24    24      48
dotv                      24    24      48
countv                    24    24      48
bandit                    48            48
Full training data set   120    72     192

Classifier of the Decision Tree. Figure 3.3 shows the decision tree generated with
the training datasets. The model uses two features (features 6 and 7 in
Table 3.1). At every internal node in the tree, the branch goes right if the normalized
value of the corresponding feature is above a threshold and left otherwise. To
verify the effectiveness of the decision-tree algorithm, we applied stratified 10-fold
cross validation to the training data. It shows a 187/192 (or 97.4%) overall success rate.
Table 3.3 shows the confusion matrix. Misclassification sometimes occurs because
Dr-BW depends on hardware sampling, which does not monitor every memory access.

Table 3.3: Confusion matrix for the training data.
                Predicted Class
Actual Class    good    rmc
good             118      2
rmc                3     69

3.2.3 Dr-BW’s Diagnoser
Once Dr-BW’s classifier detects that the application has a bandwidth contention
issue, we apply Dr-BW’s root-cause diagnoser to further identify why this contention
happens. In the diagnoser, we develop metrics to quantify the contribution of each data object to the contention. The data objects that contribute most to the contention require
further investigation.

Quantify Data Object’s Contribution to Contention. In Dr-BW’s profiler component,
we tagged each sample with an allocation point, so we know exactly which data object
each sample touches. Moreover, Dr-BW’s classifier detects contention for each channel, so we know which samples in which channel cause
the contention. To quantify each data object’s contribution, we measure how these samples are
distributed among the data objects.

Metrics per channel. For a channel c connecting two NUMA nodes that has a contention issue, all the samples in c are aggregated based on the data objects they
touch. We define the Contribution Fraction (CF) for a data object A in channel c as follows:

$$\mathrm{CF}_c(A) = \frac{\mathrm{Samples}(c, A)}{\mathrm{Samples}(c, \mathrm{ALL})}$$

Samples(c, A) is the total number of samples accessing data object A in channel c, while
Samples(c, ALL) is the total number of samples that occur in this channel. The Contribution Fraction of a data object specifies its contribution to the contention in that
channel.

Metrics across channels. When accumulating the contribution across channels, we
count all the samples in the channels that have contention issues. For those channels
that do not have any contention issue, we do not further analyze their samples. Thus,
the Contribution Fraction (CF) for data object A over all the channels involved in the program
is:

$$\mathrm{CF}(A) = \frac{\sum_{c=0}^{N} \mathrm{Samples}(c, A)}{\sum_{c=0}^{N} \mathrm{Samples}(c, \mathrm{ALL})}$$

N is the total number of contended channels. The sum of CF for all the data objects
used in the program should be 1.
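
The Contribution Fraction defined above can be computed with a few lines of code. The sketch below is illustrative only: the input layout (one data-object name per sample, grouped by channel) and the object names are assumptions for the example. Passing a single channel yields the per-channel metric; passing all contended channels yields the cross-channel metric.

```python
from collections import Counter


def contribution_fraction(samples_by_channel, contended_channels):
    """Return {data_object: CF} over the given contended channels.

    samples_by_channel: {channel: [data_object_name, ...]}, one entry per sample.
    contended_channels: channels flagged by the classifier; samples on other
        channels are ignored, as in the text.
    """
    per_object = Counter()
    total = 0
    for channel in contended_channels:
        objects = samples_by_channel.get(channel, [])
        per_object.update(objects)
        total += len(objects)
    if total == 0:
        return {}
    return {obj: count / total for obj, count in per_object.items()}


# Hypothetical samples: channel (2, 3) is not contended and is therefore ignored.
samples_by_channel = {
    (0, 1): ["A", "A", "B"],
    (1, 0): ["A"],
    (2, 3): ["C", "C", "C", "C", "C"],
}
cf_all = contribution_fraction(samples_by_channel, [(0, 1), (1, 0)])
cf_one = contribution_fraction(samples_by_channel, [(0, 1)])
ranked = sorted(cf_all, key=cf_all.get, reverse=True)
```

Sorting the objects by CF, as in `ranked`, gives the ordering used to pick candidates for optimization.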

Root-cause Blaming. After we obtain the CF for all the data objects used in a program, we
can rank the data objects by their CF values. The data objects with the highest
CF are the root causes of bandwidth contention. To alleviate the contention, one should
apply optimization methods, such as co-locating the data with their computation, to
the top data objects identified by Dr-BW.

3.3 Evaluation of the Decision-tree classifier
In this section, we evaluate our decision-tree classifier with real-world benchmarks from
the following benchmark suites:
• NAS Parallel Benchmarks (NPB) [41] are a small set of programs derived from
computational fluid dynamics applications. The suite includes benchmarks for unstructured
adaptive mesh, parallel I/O, multi-zone applications, and computational grids.
• PARSEC [42], short for the Princeton Application Repository for Shared-Memory
Computers, contains multithreaded programs focusing on emerging workloads.
• Rodinia [40] is a parallel benchmark suite containing computation-intensive applications for diverse accelerators. Parallel codes are provided for different
engines; we run the OpenMP versions.
• Sequoia [39] is a benchmark suite published by Lawrence Livermore National Laboratory (LLNL). Memory access patterns in these benchmarks are highly representative.
• LULESH [47], also developed by LLNL, solves the Sedov blast wave problem for
one material in 3D. We use the OpenMP-parallelized C++ version.
Our experiment platform is a 32-core (8 cores × 4 sockets) Intel Xeon E5-4650 machine clocked at 2.70GHz. The machine has a 32KB L1 cache and a 256KB L2 cache
per core, a 20MB L3 cache per socket, and 256GB (64GB × 4 sockets) of DRAM. All the
benchmarks are compiled with gcc 4.8.5 -O3.

3.3.1 Benchmark Classification Results
Table 3.4: Benchmark classification.
Class   Benchmarks
good    BT, CG, DC, EP, FT, IS, LU, MG, UA
        Blackscholes, Bodytrack, Ferret, Fluidanimate,
        Freqmine, Raytrace, Swaptions, X264
rmc     SP
        Streamcluster
        Needleman_Wunsch
        AMG2006, IRSmk, LULESH

We applied our classifier model to 23 benchmarks selected from the benchmark
suites above. Each program is run with different combinations of input sets, numbers of
threads, and NUMA nodes.
We run PARSEC benchmarks with four input sets: native, simLarge, simMedium,
and simSmall. NPB benchmarks are run with CLASS A, B, and C. For Rodinia and
Sequoia benchmarks, we run with the provided default input size and also tune the
parameters to make the input both smaller and larger.
We use Tt-Nn to represent a specific configuration with t total threads and n nodes
used. The t threads are evenly distributed among the n nodes, so each node is assigned t/n threads. Threads are also
bound to cores; e.g., for the T16-N4 configuration, threads 0-3 are bound to node 0,
threads 4-7 to node 1, threads 8-11 to node 2, and threads 12-15 to
node 3. Our experimental platform has 4 nodes and 32 cores with Hyper-Threading Technology. We set t to 16, 24, 32, and 64 and n to 2, 3, and 4.
In total, we have eight configurations (T16-N4, T24-N4, T32-N4,
T64-N4, T24-N3, T16-N2, T24-N2, T32-N2).
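
The Tt-Nn naming and the even thread-to-node distribution can be made concrete with a short sketch. The helper names `parse_config` and `node_of_thread` are hypothetical, and the block assumes threads are assigned to nodes in contiguous groups, as in the T16-N4 example.

```python
def parse_config(name):
    """Split a configuration name such as "T16-N4" into (threads, nodes)."""
    threads_part, nodes_part = name.split("-")
    return int(threads_part[1:]), int(nodes_part[1:])


def node_of_thread(thread_id, total_threads, num_nodes):
    """Node that a thread is bound to when threads are spread evenly over nodes."""
    if total_threads % num_nodes != 0:
        raise ValueError("threads must divide evenly among nodes")
    threads_per_node = total_threads // num_nodes
    return thread_id // threads_per_node


CONFIGS = ["T16-N4", "T24-N4", "T32-N4", "T64-N4",
           "T24-N3", "T16-N2", "T24-N2", "T32-N2"]

threads, nodes = parse_config("T16-N4")
binding = [node_of_thread(i, threads, nodes) for i in range(threads)]
```

For T16-N4 this reproduces the binding in the text: threads 0-3 on node 0 through threads 12-15 on node 3.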
We applied our classifier to each channel and use the following rules to classify the
detection result:
1. For a specific case of a benchmark (a case denotes specific inputs, thread
counts, and NUMA node affinity), if at least one remote access channel
is detected to have contention, we treat this case as "rmc". Otherwise, it
is treated as "good".
2. For a benchmark program across all its cases, if at least one case
has a remote memory contention issue, we treat this program as "rmc". Otherwise,
it is treated as "good".
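
The two rules above amount to a pair of "any" reductions, first over channels and then over cases. A minimal sketch follows; the function names and the dictionary layout of per-channel predictions are illustrative assumptions.

```python
def classify_case(channel_predictions):
    """Rule 1: a case is "rmc" if any remote channel is predicted "rmc".

    channel_predictions: {channel: "rmc" or "good"} for one run of a benchmark.
    """
    return "rmc" if any(p == "rmc" for p in channel_predictions.values()) else "good"


def classify_benchmark(case_labels):
    """Rule 2: a benchmark is "rmc" if any of its cases is "rmc"."""
    return "rmc" if any(label == "rmc" for label in case_labels) else "good"


# Hypothetical predictions for two cases of one benchmark.
case_a = {(0, 1): "good", (1, 0): "good"}
case_b = {(0, 1): "good", (1, 0): "rmc"}
labels = [classify_case(case_a), classify_case(case_b)]
overall = classify_benchmark(labels)
```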
Table 3.4 shows the classification summary of the 23 benchmarks. The classification
is the overall result considering all different cases. 17 benchmarks are classified as
"good" because they show no remote bandwidth contention with any of the input
sizes and thread counts we tried. Six programs have remote memory contention issues,
meaning that contention happens in at least one configuration run. We discuss these results
in detail in the following subsections.

3.3.2 Classification Statistics
It is not straightforward to test how accurate our classifier is on the real benchmarks
because there is no prior knowledge about the existence of bandwidth contention in these
benchmarks. Moreover, contention varies with hardware parameters and execution configurations. To address this issue, we build our evaluation
on the assumption that programs with remote bandwidth contention benefit from memory interleaving. Because memory interleaving balances the memory requests across
different NUMA domains, it alleviates memory bandwidth contention. Thus, if the speedup
of the interleaved version over the original code exceeds a predefined threshold of 10%,
we consider the benchmark to suffer from a contention issue. We treat the results obtained
from this method as the ground truth.

Table 3.5: Evaluating the accuracy of Dr-BW with all the real benchmarks.
                            Actual            Detected
Benchmark        # cases    RMC   NO RMC      RMC   NO RMC
Swaptions             32      0       32        0       32
Blackscholes          32      0       32        0       32
Bodytrack             16      0       16        0       16
Freqmine              32      0       32        0       32
Ferret                32      0       32        0       32
Fluidanimate          32      0       32        4       28
X264                  32      0       32        0       32
Streamcluster         16     13        3       16        0
IRSmk                 24     15        9       15        9
AMG2006                8      8        0        8        0
NW                    24     16        8       17        7
BT                    24      0       24        0       24
CG                    24      0       24        0       24
DC                    16      0       16        0       16
EP                    24      0       24        0       24
FT                    24      0       24        2       22
IS                    24      0       24        0       24
LU                    24      0       24        0       24
MG                    24      0       24        0       24
UA                    24      0       24        9       15
SP                    24     11       13       11       13
Total (Overall)      512     63      449       82      430

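
The ground-truth rule described above (a case counts as contended when memory interleaving speeds it up by more than 10%) can be written down directly. The sketch assumes speedup is measured as the ratio of original to interleaved execution time; the function name and the timing values are illustrative, not measurements from this work.

```python
SPEEDUP_THRESHOLD = 0.10  # 10% threshold from the text


def ground_truth_label(original_time, interleaved_time, threshold=SPEEDUP_THRESHOLD):
    """Label a case "rmc" if interleaving yields more than `threshold` speedup."""
    speedup = original_time / interleaved_time
    return "rmc" if speedup - 1.0 > threshold else "good"


# Hypothetical timings in seconds.
label_fast = ground_truth_label(original_time=100.0, interleaved_time=80.0)
label_flat = ground_truth_label(original_time=100.0, interleaved_time=95.0)
```

Here a run that drops from 100 s to 80 s (a 1.25× speedup) is labeled "rmc", while a drop to 95 s stays below the threshold and is labeled "good".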
Table 3.5 summarizes the benchmarks as classified by our classifier and by the interleaving-based classification method. The table shows the total number of
instances we run for each benchmark with different combinations of inputs, threads, and
nodes. Among those instances, we identify how many are detected to have
bandwidth contention and how many are contention free. The Actual columns
show the results of the interleaving-based classification method (ground truth), while the Detected
columns show the results of our decision-tree classifier. We can see that for most cases,
the two classification methods show the same results, which highlights the accuracy of
Dr-BW.
We further evaluate our detection method in Table 3.6. Compared with the “Actual” results across all
benchmarks, Dr-BW detects remote memory contention with no false
negatives and 96.3% overall correctness. Thus, we can infer that Dr-BW successfully
detects remote memory contention problems in Streamcluster, AMG2006, IRSmk, SP,
and NW.
Table 3.6: Quantifying Dr-BW’s accuracy when analyzing remote memory bandwidth
contention.
                       Detection Classification
Actual                 RMC    No RMC
RMC                     63         0
No RMC                  19       430
Correctness            (430+63)/(0+63+19+430) = 96.3%
False positive rate    19/(19+430) = 4.2%
False negative rate    0/(0+63) = 0%
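
The rates in Table 3.6 follow from the four confusion-matrix counts. The sketch below recomputes them; the function and key names are illustrative, and "positive" is taken to mean RMC.

```python
def detection_metrics(true_pos, false_neg, false_pos, true_neg):
    """Correctness, false positive rate, and false negative rate from counts."""
    total = true_pos + false_neg + false_pos + true_neg
    return {
        "correctness": (true_pos + true_neg) / total,
        "false_positive_rate": false_pos / (false_pos + true_neg),
        "false_negative_rate": false_neg / (false_neg + true_pos),
    }


# Counts from Table 3.6: actual RMC row is (63, 0), actual No RMC row is (19, 430).
table_3_6 = detection_metrics(true_pos=63, false_neg=0, false_pos=19, true_neg=430)
percentages = {name: round(value * 100, 1) for name, value in table_3_6.items()}
```

The same function applied to the training confusion matrix in Table 3.3 (69, 3, 2, 118) gives the 187/192 success rate reported there.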

3.4 Case Studies
Table 3.7: Dr-BW’s runtime overhead.
                 Execution time (s)
Code             Without profiling   With profiling   Overhead (%)
IRSmk                        118.1            119.2           +0.9
AMG2006                      122.5            132.1           +7.8
Streamcluster                245.2            222.6           -9.2
NW                            55.8             59.4           +6.4
SP                           411.0            425.7           +3.6
LULESH                       118.2            130.0          +10.0
Average                                -                      +3.3

Among the 23 benchmarks we investigate, Dr-BW’s classifier detects six that suffer from remote bandwidth contention. In this section, we further study these
benchmarks with Dr-BW’s root-cause diagnoser to pinpoint and optimize the problematic
data objects.
The profiling overhead when using all 64 hardware threads (32 cores with Hyper-Threading) across four NUMA nodes for
these benchmarks is shown in Table 3.7. Times are averaged over four executions. As the
table shows, the highest overhead is 10.0%, for LULESH. The average
overhead of the six benchmarks is 3.3%. Notably, Streamcluster runs 9% faster
with profiling because, to the best of our knowledge, the profiling code interferes with
the original memory accesses and reduces the bandwidth contention. In the following
subsections, we discuss each benchmark one by one.
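
The overhead figures in Table 3.7 can be checked with a short calculation. The sketch assumes overhead is the relative change in execution time with profiling enabled; only three rows of the table are used here.

```python
def overhead_percent(without_profiling, with_profiling):
    """Relative change in execution time, in percent (negative means faster)."""
    return (with_profiling - without_profiling) / without_profiling * 100.0


# Execution times in seconds, taken from Table 3.7.
times = {
    "IRSmk": (118.1, 119.2),
    "Streamcluster": (245.2, 222.6),
    "LULESH": (118.2, 130.0),
}
overheads = {name: round(overhead_percent(*pair), 1) for name, pair in times.items()}
```

A negative value, as for Streamcluster, means the profiled run finished sooner than the unprofiled one.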

3.4.1 AMG2006
Figure 3.4: Contribution Fraction (CF) distribution across data objects in AMG2006. (Plot of % Contribution Fraction for configurations T16-N4, T24-N4, T32-N4, T64-N4, T24-N3, T16-N2, T24-N2, and T32-N2; legend: RAP_diag_j, diag_j, diag_data, hyper_VectorData, Other variables.)
AMG2006, one of the LLNL Sequoia benchmarks, is a parallel algebraic multi-grid solver
for linear systems arising from problems on unstructured grids. It consists of three
phases: initialization, setup, and solver. It is written in C with MPI and OpenMP.
With Dr-BW’s diagnoser analysis, we calculate the CF for all the data objects used
in AMG2006. Figure 3.4 shows the distribution of CF among the data objects when
running with a 30 × 30 × 30 grid size. Four of them have high CF values and are strongly related
to the contention across different numbers of threads pinned to different
numbers of nodes.
Among the four data objects listed, array RAP_diag_j is the most strongly related to
the contention no matter how many threads and nodes are used to run the program. Moreover, the contention contribution of arrays diag_j and diag_data grows when more NUMA
nodes are used for computation. We then manually check how these four arrays are
used in the program. We find that they are used in an OpenMP parallel for loop, and
each thread handles a contiguous segment of the array. To optimize the code, we break
the data into multiple segments and co-locate each with its computation at the array allocation point. We use libnuma [48] to control the memory allocation.
Figure 3.5: Speedups of IRSmk with different input sizes and execution configurations. (Four panels, (a) A, (b) B, (c) C, and (d) D, each plotting speedup (times) of the interleave and co-locate optimizations for configurations T16-N4, T24-N4, T32-N4, T64-N4, T24-N3, T16-N2, T24-N2, and T32-N2.)

We optimize with data-computation co-location for all four data objects shown
in Figure 3.4; Figure 3.5 demonstrates the speedup of our optimization with different
execution parameters in different execution phases. We also compare this speedup with
the memory-interleaving optimization, which interleaves all the memory pages allocated
for the entire program. We can see that our optimization, which focuses on the data objects
identified by Dr-BW, achieves higher speedups.
As shown in Figure 3.5, the interleave optimization achieves good performance (1.5×
on average) in the solver phase, but this coarse-grained optimization hurts the setup and
initialization phases. Our co-locate optimization guided by Dr-BW achieves the same
high speedup in the solver phase without hurting the setup and initialization phases.
Thus, our optimization has higher speedups for the entire program execution than the
interleave optimization.
We profile the optimized code with 64 threads on four NUMA nodes. The total number of remote memory accesses is reduced by 87.8% and the average memory access
latency is decreased by 83%.

3.4.2 IRSmk
Figure 3.6: Speedups of IRSmk with different input sizes and execution configurations. (Two panels, (a) Medium (64 works) and (b) Large (96 works), each plotting speedup (times) of the interleave and co-locate optimizations for configurations T16-N4, T24-N4, T32-N4, T64-N4, T24-N3, T16-N2, T24-N2, and T32-N2.)
IRSmk, also from LLNL Sequoia benchmarks, is a parallel Implicit Radiation Solver for
diffusion equation on a three-dimensional, block structured mesh. It is implemented in C
with OpenMP. We evaluate a highly optimized version of IRSmk obtained from previous
work [49]. Dr-BW’s diagnoser detects 29 problematic arrays, including arrays b and k and
27 other arrays, which are of the same size and show similar access patterns. These
arrays share similar CF values and uniformly contribute to the bandwidth contention.
To optimize the code, we apply the co-locate optimization to all 29 variables.
We compare the resulting speedup against the interleave optimization with different input sizes,
nodes, and threads, as shown in Figure 3.6. Medium and large mean that we run with
64×64×64 and 96×96×96 input meshes, respectively.
When the input size is smaller, e.g., with the configuration of T16-N4, neither optimization strategy shows a significant speedup. However, as the input size grows, the benefit of the co-locate and interleave policies becomes more significant. The
maximum speedup can be as high as 6.2×. With all four NUMA nodes utilized and
fewer than eight threads bound to each NUMA node, interleave can have a
slightly shorter execution time than co-locate. However, co-locate performs much better
when fewer NUMA nodes are used. When running the large input with 64 threads and 4
NUMA nodes, the total number of remote memory accesses is reduced by 72.5% and the average
memory access latency is decreased by 88.9%.

3.4.3 Streamcluster
Figure 3.7: Speedups for Streamcluster with different input sizes and execution configurations. Panels: (a) simLarge; (b) Native. Y-axis: speedup (times); x-axis: execution configuration (T16-N4 through T32-N2); series: interleave and replicate.

Dr-BW identifies that Streamcluster from PARSEC has a remote memory bandwidth contention issue, which has been verified in previous work [20]. We
evaluate Streamcluster with two different input sizes, native and simLarge. Figure 3.8
shows the problematic data structures identified by Dr-BW’s diagnoser. When running with
the native input, two arrays (block and points.p) account for more than 90% of the contention. Among them, the array block is more important since it has the highest CF
value.

Figure 3.8: Streamcluster: Contribution Fraction (CF) distribution across data objects. Y-axis: % Contribution Fraction (CF); x-axis: execution configuration (T16-N4 through T32-N2); series: block, points.p, and other variables.

Our further analysis shows that block is randomly accessed by all the threads and
the data is never overwritten after the initialization. Thus, we create a shadow replica of block for the threads in each NUMA node, so all the accesses to block can
go to local memory. Figure 3.7 shows the speedups with the replicate optimization as
well as the interleave optimization when running with different inputs. We can see that
when running with a larger number of nodes, e.g., three or four, interleave and replicate show similar improvement over the original execution. However, when fewer nodes
and threads are used, replicate performs much better. This is because the interleave
optimization balances the bandwidth requests, but it also introduces more remote memory accesses. When running with only two nodes and a small number of threads, the
bandwidth contention is not that serious, so the overhead of increased remote accesses
surpasses the benefit of bandwidth contention reduction.

Figure 3.9: Speedups for LULESH with different execution configurations. Y-axis: speedup (times); x-axis: execution configuration (T16-N4 through T32-N2); series: interleave and co-locate.

Figure 3.10: Contribution Fraction (CF) distribution across data objects in LULESH. Y-axis: % Contribution Fraction (CF); x-axis: execution configuration (T16-N4 through T32-N2); series: heap data and other variables.

3.4.4 LULESH
We evaluate LULESH with one large input size. In LULESH, there are over 40 heap-allocated arrays, which show similar sizes and access patterns. As shown in Figure 3.10,
the heap data objects allocated at lines 2158-2238 account for a combined CF higher than
50%. There are some other intractable data objects that account for a non-negligible CF.
In particular, two static data objects used in LULESH incur significant memory accesses.
Dr-BW currently does not support tracing static data objects, so we leave this as future
work.
To optimize the heap-allocated data, similarly to IRSmk, we co-locate the data with
the computations. Figure 3.9 shows the execution time speedups of our co-locate optimization and of
interleave. We can clearly see that co-locate performs much better than
interleave. When running with the T16-N4 configuration, there is no significant speedup,
since Dr-BW’s classifier puts the execution with this configuration in the "good" category.
This is because only four threads bound to each NUMA node are not enough to saturate
the remote memory bandwidth.

3.4.5 Rodinia NW
Figure 3.11: Contribution Fraction (CF) distribution across data objects in Rodinia NW. Y-axis: % Contribution Fraction (CF); x-axis: execution configuration (T16-N4 through T32-N2); series: reference, input_itemsets, and other variables.

In Rodinia NW, Dr-BW pinpoints two problematic data structures, reference and
input_itemsets, with CF values shown in Figure 3.11. Both arrays are allocated by
the master thread but accessed by threads across all NUMA nodes. To address this
problem, we co-locate the allocation of these two arrays with computations across all
NUMA nodes by using libnuma. This achieves a speedup of 32.6%. After the optimization, the average memory access
latency is reduced by 60%.

3.4.6 SP
Dr-BW’s classifier detects that SP with the C class input size has a remote memory bandwidth contention issue. All the data objects used in this benchmark are global data, which are
statically allocated. As we do not track static data, we simply quantify the execution
speedups obtained with the interleave optimization. When the number of threads per node is high
(e.g., > 8), the speedup is as high as 1.75×, obtained when running with 64 threads across four
NUMA nodes.

3.4.7 Blackscholes
To evaluate Dr-BW, we also choose benchmarks which fall into the "good" category
and apply optimizations to see whether we can obtain performance benefit. The study
of Blackscholes from PARSEC is for this purpose. Dr-BW’s classifier reports that
Blackscholes with native input size has no remote memory bandwidth contention issue.
We evaluate the execution time of interleave optimization with all configurations. The
difference with the original execution is negligible. With further analysis, Dr-BW highlights the array buffer associated with the highest CF score. We optimize the buffer
by collocating the data with computations, but the speedup is < 1%. Thus, Dr-BW successfully shows that Blackscholes does not incur bandwidth contention, which applies
to other benchmarks in the "good" category.

3.5 Conclusion
In this chapter, we present the design and implementation of Dr-BW, a profiler that uses
machine learning techniques to identify bandwidth contention in NUMA architectures.
Dr-BW collects performance data with low overhead, feeds the data into a novel machine learning model to identify contention, and associates the analysis results with both
programs and significant data objects. We study a number of benchmarks and show that
Dr-BW achieves more than 96% accuracy. With several case studies, we demonstrate
that Dr-BW is able to guide performance optimization and yield up to a 6.5× speedup. In
the future, we will extend Dr-BW to identify resource contention beyond memory bandwidth using machine learning techniques, such as contention in instruction issue slots,
different levels of caches, and I/O devices.

Chapter 4

Understanding and Fixing
Inaccuracy of Modern Profilers
4.1 Introduction
PMU sampling (also known as event-based sampling) is widely used in mainstream
performance tools, such as Oprofile [50], Linux Perf [51], HPCToolkit [52], and Intel
VTune [53]. These tools configure the PMUs to a preset value MAX − P, where MAX
is the maximum value a 64-bit PMU register can represent, while P is the sampling
period predefined for the monitored event. When the event occurs P times, the PMU
triggers an overflow interrupt, known as a sample. The monitored program suspends
its execution and switches to the signal handler routine installed by the performance
tool. In the signal handler, the performance tool is able to associate the sample with
the program code obtained from the signal context. Furthermore, Perf can create an
internal buffer to store all samples and use poll() to read the buffer to further lower the
overhead. Performance tools based on PMU sampling typically have low overhead with
reasonable sampling periods and thus are more desirable in production.
There has been a large volume of work [38, 54, 55, 52, 53, 50] showing the usefulness of the sampling-based tools in guiding performance diagnosis and optimization.
Instructions per cycle (IPC) is one of the most important evaluation metrics during performance analysis, and it largely relies on accurate instruction and cycle profiling results.
However, there is a lack of systematic study of the accuracy of sampling-based instruction measurement. The common knowledge about PMU sampling is that hot
procedures (i.e., procedures with more samples) tend to have more reliable profiling
results, while the measurement results of cold procedures (i.e., procedures with fewer
samples) come with more noise. Since performance tools usually focus on hot procedures more than cold ones, such inaccuracy is tolerable. However, this common
knowledge is derived from theoretical analysis or intuition [56], with little work on quantifying the accuracy in practice. We show that state-of-the-art tools based on this common
knowledge can produce misleading results.

Table 4.1: Exclusive function-level instruction counts for two hot procedures in
PowerGraph PageRank [1]. The ground truth result is collected by CCTLib.

Procedure         VTune             Perf              HPCToolkit        Ground Truth
gather            1.40e12 (50.1%)   1.37e12 (49.6%)   1.63e12 (59.3%)   9.32e11 (33.8%)
execute_gathers   1.36e12 (49.9%)   1.39e12 (50.4%)   1.13e12 (40.7%)   1.83e12 (66.2%)

We use Intel VTune, Linux Perf, and HPCToolkit to collect retired instructions via
PMUs on a real code base—PowerGraph [1], one of the most popular graph engines, as
discussed in Section 4.6.1. Table 4.1 shows the profiling results for two hot procedures:
gather and execute_gathers. These two procedures account for more than 80% of
CPU cycles of the entire graph processing phase. We can see that all three tools report
similar results but significantly different from the ground truth, which is collected with
CCTLib [57], a tool based on binary instrumentation. This profiling result incorrectly
identifies the hotspot, potentially wasting optimization efforts.
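
To make the size of the discrepancy in Table 4.1 concrete, the signed relative error of a tool's count against the ground truth can be computed as below. This is only an arithmetic sketch; the function name relative_error_pct is hypothetical and is not part of any of the tools discussed.

```c
#include <assert.h>

/* Signed relative error of a measured instruction count against the
 * ground-truth count, in percent. Positive means over-attribution.
 * (Hypothetical helper, introduced only for illustration.) */
double relative_error_pct(double measured, double truth) {
    assert(truth > 0.0);   /* the ground truth must be a positive count */
    return 100.0 * (measured - truth) / truth;
}
```

With the Table 4.1 values, relative_error_pct(1.40e12, 9.32e11) comes out at roughly +50% for VTune on gather, and relative_error_pct(1.36e12, 1.83e12) at roughly -26% for VTune on execute_gathers, which is why the sampled profile ranks the two procedures in the wrong order.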
This inaccurate measurement arises due to the hardware limit in handling the PMU
samples. There is always a time delay between the PMU overflow interrupt and the
signal delivery to the performance tool, which is known as “skid” [9]. Such skid pervasively exists in most architectures, such as x86, POWERPC, and ARM, and is the major
source of the inaccuracy. However, the impact of the skid on the measurement accuracy
has not been extensively studied. Prior work [10] relies on hardware support to minimize
the skid, such as the precise event-based sampling (PEBS) [4] and last branch record
(LBR) [11] in Intel processors. However, PEBS and LBR are not generally available in
all processors. Moreover, even PEBS is not guaranteed to be accurate, which will be
described in Section 4.2.
The remainder of this chapter is organized as follows. Section 4.2 presents the background and motivation of this work. Section 4.3 describes our approach to quantifying
the skid in various CPU architectures. Section 4.4 elaborates on the methodology of
fixing the measurement inaccuracy with a pure software technique. Section 4.5 and 4.6
depict our evaluation and case study on real applications and benchmarks, respectively.
Finally, Section 4.7 presents some conclusions.

4.2 Background
This section introduces the background knowledge of existing profilers based on PMU
sampling, reveals the provenance of inaccurate measurement, and discusses our scheme
of obtaining ground truth for the accuracy study.

4.2.1 Profilers with PMU Sampling
Hardware Performance Monitoring Units (PMU): CPU’s PMUs offer a programmable
way to count hardware events such as retired instructions, CPU cycles, and cache misses.
A PMU can trigger an overflow interrupt once a preset number of occurrences of
an event is reached. A profiler, running in the address space of the monitored program,
can handle the interrupt and attribute the measurement “appropriately”. We refer to a
PMU counter overflow as a “sample”. PMUs pervasively exist in CPU processors from
different vendors, such as Intel, AMD, IBM, and ARM.

Linux Perf_events: Linux provides APIs to configure, enable, and disable PMUs at
thread granularity (e.g., perf_event_open [51], ioctl). Once a PMU event counter
overflows, the Linux kernel signals the corresponding thread with the details of the event
(e.g., instruction pointer). The signal handler in the user space then examines the event
information and attributes proper measurements. Some PMU facilities, such as Intel’s
precise event-based sampling, allocate a kernel buffer to record multiple samples and
allow tools to read the buffer via poll().

Profiling Mechanisms: Many tools utilize call path profiling [58], a profiling technique
in which runtime events (e.g., cache misses) are attributed to the full call path seen at
the time of the event. Call path profiling offers insightful details in complex applications
with deep call chains. The calling context of an event is a set of active procedure frames
when the event happens. A calling context begins at a process or thread entry function
such as main and ends at the instruction pointer (IP) of the instruction that triggers the
event. With the calling context, tools are able to associate samples with all functions
in the call chain. The accumulated metrics from all the callees are known as inclusive
metrics, while metrics not accumulated from the callees are known as exclusive metrics.
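
The relation between exclusive and inclusive metrics can be sketched over a small calling-context tree. The CallNode layout, the MAX_CALLEES limit, and the function name inclusive_metric are assumptions made for this illustration; they do not reflect the data structures of any particular profiler.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLEES 4   /* illustrative fan-out limit (assumption) */

/* One procedure frame in a calling-context tree (hypothetical layout). */
typedef struct CallNode {
    const char *name;
    long exclusive;                              /* samples in this frame only */
    size_t ncallees;
    const struct CallNode *callees[MAX_CALLEES];
} CallNode;

/* Inclusive metric = own exclusive metric + inclusive metrics of all callees. */
long inclusive_metric(const CallNode *node) {
    assert(node != NULL && node->ncallees <= MAX_CALLEES);
    long total = node->exclusive;
    for (size_t i = 0; i < node->ncallees; i++)
        total += inclusive_metric(node->callees[i]);
    return total;
}
```

For a chain main -> f -> g with exclusive counts 10, 30, and 60, the inclusive counts are 100, 90, and 60: the root's inclusive metric always equals the total number of samples in its subtree.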

4.2.2 Limitation of Precise Event Based Sampling
Modern CPU processors provide various precise sampling mechanisms to alleviate or
eliminate the skid, such as Intel’s precise event-based sampling (PEBS) [4] and AMD’s
instruction-based sampling (IBS) [9]. These mechanisms use specific PMU registers to
record the precise instruction pointers (IP) that trigger PMU counter overflows. However,
not all CPU vendors (e.g., ARM) support these precise mechanisms. Moreover, these
mechanisms do not always provide reliable profiling results. For IBS, PMU needs to
tag an instruction at the issuing point to monitor its execution in the pipeline. If this
instruction is not retired due to the speculative execution, PMU will not capture any
sample in this period. PEBS suffers from the shadow effect [4]. When the PMU selects
an instruction being retired in the pipeline to report, there can be multiple candidates.
Yet PEBS is more likely to report the one with the highest execution latency, leading to
biased profiling results.

Figure 4.1: Comparison of instruction profiling accuracy loss. Panels: (a) PEBS; (b) Traditional PMU Sampling. The horizontal axes give the number of div operations in function g() and the number of add operations in function f() (1 to 10 each). The z-axis (Absolute Error Value) indicates
the total number of mis-attributed instructions at the function level.

4.2.3 Ground Truth Profiler
We cannot rely on PMUs to collect the ground truth. Alternatively, one may use simulators [59, 60, 61, 62] to measure the hardware-related events (e.g., cycles, cache
misses). However, since it is difficult to accurately simulate every feature of a processor, simulators may not truly produce ground truth for these events. We observe that
software-related events such as retired instruction and floating point operations, do not
depend on hardware. We are able to adopt a software method to collect ground truth
profile of these software-related events, regardless of hardware platforms. Thus, we
develop a tool that uses a pure software method to measure the number of retired instructions of procedures in their calling contexts. We build the tool based on CCTLib [57],
a Pin tool that is able to determine the calling context of each monitored instruction in
a parallel program. We design a client tool atop CCTLib to count the number of retired
instructions in each procedure within its calling context.

4.3 Quantify Skid Effect
In this section, we introduce a mathematical model to quantify the skid effect for a simple
program that only consists of a simple loop of instructions.
Definition 4.3.1. Simple Loop of Instructions (footnote 1): A repeatedly executed loop that consists
of a fixed number of instructions and contains no conditional branches except the one
for loop control.
The mathematical model relies on the following assumptions:
• The skid effect can be quantified in CPU cycles [30].
• Each instruction takes a fixed number of cycles on average, i.e., its Cycle Per Instruction (CPI) stays the same.
• CPU can issue multiple instructions at the same time.
• When sampled by CPU cycles, each cycle has the same chance to overflow the hardware event counter, regardless of the instruction being currently executed.
Figure 4.2 illustrates the skid effect of a simple loop under instruction profiling, which
helps us draw two conclusions:
CPU cycle profiling result for each instruction is not affected by skid effects.
As shown in Figure 4.2, Instruction 2 with a duration of 4 cycles should trigger 4 counter
overflows (C_o, D_o, E_o, F_o) and own 4 samples (footnote 2). With a skid of 2 cycles, the sample
points E_s and F_s are finally attributed to Instruction 3. In the meantime, another two
sample points A_s and B_s triggered by previous instructions are attributed to Instruction
2, so Instruction 2 still owns 4 sample points. The number of sample points
within an instruction stays constant whatever the skid duration is. Thus, the skid
effect does not affect the cycle profiling result [10]. Instead, sampling with
a period of T and a skid of S is equivalent to sampling with a period of T + S without any
skid. This property can be used to estimate the CPI value of each instruction from the cycle
profiling result. To minimize the overhead of the profiler, T is usually significantly greater
than S, which is usually very small.

Footnote 1: It is also called Simple Cycle. We deliberately use Simple Loop to avoid any confusion with CPU
cycles.
Footnote 2: More accurately, it is 4p samples instead of 4, where p is the probability of a cycle triggering an overflow.

Figure 4.2: Skid effect on cycle profiling result. The skid duration S is less than 2 cycles
and every cycle has the same probability of causing the counter overflow. There are three
instructions (1, 2, 3) executed in order. Two different kinds of time points are defined: 1)
counter overflow point, a finished cycle causing the hardware performance counter to
overflow; 2) sample point, a time point when the active instruction is blamed as the cause
of the performance counter overflow. Each counter overflow point has a corresponding sample
point (e.g., the counter overflow point A_o has its own sample point A_s). The time
difference between each pair of counter overflow point and sample point is the skid duration S
measured in cycles.
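
The claim that skid leaves the per-instruction cycle profile unchanged can be checked with a small simulation. The sketch below assumes a steady-state simple loop, integer instruction durations, and one overflow per cycle (p = 1); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the instruction that is executing at cycle t of one loop
 * iteration, given integer instruction durations dur[0..n-1]. */
static size_t owner_of_cycle(const int *dur, size_t n, long t) {
    long end = 0;
    for (size_t i = 0; i < n; i++) {
        end += dur[i];
        if (t < end) return i;
    }
    return n - 1;   /* not reached when 0 <= t < total cycles */
}

/* Every cycle of the loop overflows the counter once; each sample is blamed
 * on the instruction active `skid` cycles later, wrapping around the loop. */
void cycle_profile_with_skid(const int *dur, size_t n, long skid, long *samples) {
    long total = 0;
    assert(skid >= 0);
    for (size_t i = 0; i < n; i++) {
        assert(dur[i] > 0);
        total += dur[i];
        samples[i] = 0;
    }
    for (long t = 0; t < total; t++) {
        long blamed = (t + skid) % total;
        samples[owner_of_cycle(dur, n, blamed)] += 1;
    }
}
```

Because shifting every overflow by the same number of cycles only permutes the cycles of the loop, each instruction still receives as many samples as it has cycles, for any skid value.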
CPU cycle profiling result is not affected by instruction-level parallelism (ILP).
When performance counters are employed for sampling CPU cycles, the CPU has its own
mechanism to decide which of the instructions in flight is blamed at the sample
point. Each instruction has its own probability of being blamed at the sample point. Such
a probability is not affected by the skid effect and depends on the CPU's PMU design. Since the skid
effect only increases the sampling period and the sampling period does not alter the
profiling result, instruction-level parallelism (ILP) with the skid effect will not introduce any
further effect on CPU cycle profiling.

Figure 4.3: Skid effect emulation with three different skid values on a simple loop
consisting of five instructions. Each emulation generates an instruction distribution
D(S, [c_1, c_2, c_3, c_4, c_5]) that is a vector of the number of samples of all instructions, while
D is the actual instruction profiling result. In the figure, D = [0, 2, 0, 2, 1]; emulation (1) yields [0, 2, 0, 1, 2], emulation (2) yields [0, 2, 0, 2, 1], and emulation (3) yields [1, 1, 1, 1, 1].
In the next few subsections, we will first explain how skid effect emulation can generate relative instruction distribution in a simple loop of instructions, and then design an
algorithm to measure the CPU cycle duration of the skid.

4.3.1 Skid Effects Modeling in a Simple Loop
We use a simple loop with 5 instructions I_1, I_2, I_3, I_4, and I_5 to illustrate our skid effect
model. Their CPI values are c_1, c_2, c_3, c_4, and c_5, respectively. Figure 4.3 shows skid effect emulations with different skid durations S in CPU cycles. Each
instruction has the same probability of causing the performance counter overflow when
the PMU is used to sample instructions with a fixed sampling rate. When a retired instruction causes the counter overflow, it takes S CPU cycles (the skid duration) to stop, and
the sample is then attributed to the instruction in flight at the sample point. We can therefore construct
a mapping from the instruction causing the counter overflow to the instruction at the sample
point. With the CPI information, this mapping for a fixed skid duration can generate
the relative sampled instruction distribution D(S, [c_1, c_2, c_3, c_4, c_5]). In Figure 4.3, we emulate the skid effect under three different skid values, thus generating three relative
sampled instruction distributions. For example, skid emulation (2) generates a distribution of D(S, [c_1, c_2, c_3, c_4, c_5]) = [0, 2, 0, 2, 1], which is closest to the actual profiling result
D = [0, 2, 0, 2, 1].
For a simple loop with a specific CPI vector c, a specific skid duration S will generate a corresponding instruction distribution D(S, c) upon skid effect emulation. We
rely on this property to design an algorithm to measure the skid duration S for a specific
CPU, which will be explained in the next subsection. Formal mathematical modeling is
provided in Section 1 of our supplementary document [63].
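
A minimal sketch of this emulation is shown below. It assumes that each instruction of the loop causes one overflow per emulated iteration and that the blamed instruction is the first one whose cumulative CPI, counted from the instruction after the overflow, reaches S. The function names and the CPI vector used in the usage note are hypothetical and were chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Skid length in instructions for an overflow at instruction m (0-based):
 * the smallest g >= 1 such that S <= c[m+1] + ... + c[m+g], wrapping around
 * the loop. With S == 0 the overflowing instruction itself is blamed. */
size_t skid_length(double S, size_t m, const double *cpi, size_t n) {
    assert(n > 0 && S >= 0.0);
    size_t g = 0;
    double elapsed = 0.0;
    while (elapsed < S) {               /* walk forward until S cycles pass */
        g++;
        double c = cpi[(m + g) % n];
        assert(c > 0.0);                /* a zero CPI would never terminate */
        elapsed += c;
    }
    return g;
}

/* Relative instruction distribution D(S, c) for one loop iteration:
 * each instruction overflows once and its sample lands g instructions later. */
void emulate_skid(double S, const double *cpi, size_t n, int *dist) {
    for (size_t i = 0; i < n; i++) dist[i] = 0;
    for (size_t m = 0; m < n; m++)
        dist[(m + skid_length(S, m, cpi, n)) % n] += 1;
}
```

With the hypothetical CPI vector c = [1, 4, 1, 4, 2], emulate_skid(2.0, ...) produces [0, 2, 0, 2, 1]: short instructions are skipped over, so their samples pile up on the long-latency instructions that follow them. With S = 0 the distribution is the uniform [1, 1, 1, 1, 1].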

4.3.2 Measurement of CPU Skid Duration
The skid model of a simple loop introduced in the previous subsection involves three
variables:
• S, skid duration in cycles,
• c, a vector of cycle duration of instructions, equivalent to {c_1, c_2, ..., c_N},
• D, a vector of the number of samples of all instructions, i.e., the profiling result.
As CPU skid changes the instruction distribution of a simple loop rather than the total
number of sampled instructions, we have

\[ \|D\|_1 = \|D(S, c)\|_1 = \|D(0, c)\|_1 . \]

Since the cycle profiling result of a simple loop is immune to the skid effect, we can
estimate each instruction’s CPI from the cycle profiling result. More specifically, if the loop
is executed N_e times, we can derive the CPI value of an instruction i by

\[ c_i = \frac{C_i}{N_e}, \]

where C_i is the total number of cycles spent executing instruction i according to the cycle profiling
result. After applying this equation to all the instructions, we can get the CPI values of
all instructions, which constitute a vector c.
Knowing c, we can further obtain D(S, c) under different values of S as described
in Section 4.3.1. To quantify how close our emulation result D(S, c) is to the actual
instruction sampling result D, we introduce an error metric Error(C, S), where

\[ \mathrm{Error}(C, S) = \|D(S, c) - D\|_2 . \]

If Error(C, S) is small enough, we can assume the chosen skid value S is correct.
There may be more than one "correct" skid value for one simple loop. Thus we adopt
several similar but different simple loops to eliminate other possible values. Listing 4.1
shows a mini-benchmark template that is used to measure the skid duration S (footnote 3). It
consists of two functions in a simple loop. By altering the number of add operations P
and div operations Q from 1 to 10, respectively, we end up with M = 10 × 10 = 100
programs in total. Then we collect all programs’ cycle profiling results (denoted as
{C_1, C_2, ..., C_M}) and instruction profiling results (denoted as {D_1, D_2, ..., D_M}) to estimate the CPI vectors for all programs on the same CPU (denoted as {c_1, c_2, ..., c_M}).
By choosing an arbitrary skid value S and emulating the skid effect on all the programs,
we can get Error(C, S) of all the programs. The correct skid duration S of this CPU
should minimize the sum of Error(C, S), i.e.,

\[ \arg\min_{S} \sum_{m=1}^{M} \mathrm{Error}(C_m, S). \tag{4.1} \]

Algorithm 1 describes how we measure the skid duration on a specific platform. We
exhaustively set S from 0 to 300 (Line 3). Under each value of S, we emulate the skid
effect on all the programs and sum up the error as shown in Equation 4.1 (Lines 4-8).
The value of S is only kept when the corresponding error sum is the minimal value seen
so far (Lines 9-12).

Footnote 3: The mini-benchmark code and scripts to collect the data are in https://github.com/xuhao417347761/ics19-minibenchmark.git.

    #ifndef NUM_TIME
    #define NUM_TIME 1000000L    /* loop count; this default is an assumption */
    #endif
    volatile double a = 1.0;     /* volatile keeps the operations from being optimized away */
    long i;
    void g() {
        a /= i;                  /* P div operations */
        /* ... */
    }
    void f() {
        a += i;                  /* Q add operations */
        /* ... */
    }
    int main() {
        for (i = 1; i <= NUM_TIME; i++) {
            g();
            f();
        }
        return 0;
    }

Listing 4.1: Mini-benchmark template for measuring skid cycle duration. A for-loop in main calls
two functions g() and f(), which contain P div operations and Q add operations, respectively.

Figure 4.4: Measurement of skid cycle duration for multiple platforms (Intel-Xeon Phi, AMD-Opteron, Intel-SandyBridge, Intel-Broadwell, and Intel-Skylake). The y-axis (Error Value, logarithmic scale) represents
the sum of errors described in Equation 4.1, while the x-axis (Skid Cycle Duration) denotes the candidate skid
duration from 0 to 300 cycles. We search for the skid duration value that minimizes the
error sum.
Figure 4.4 plots the error sums when S changes from 0 to 300 on five platforms (AMD-Opteron, Intel-SandyBridge, Intel-Broadwell, Intel-Skylake, and Intel-Xeon
Phi). We can always find a global minimum point for every platform, where its S value
makes the error sum minimal. The skid duration measurement process is a one-time
job for a specific CPU platform, which takes around 10 minutes to finish running all mini-benchmarks and searching for the optimal value. As we discussed earlier, the S value
at the minimal point is the skid duration of this platform. The skid duration of all Intel
CPUs is less than 20 cycles, while AMD-Opteron has a much longer skid, which is 34
cycles.

Algorithm 1: Measurement of Skid Duration
Input: Cycle profiling result {C_1, C_2, ..., C_M}, instruction profiling result {D_1, D_2, ..., D_M}
Output: Skid duration S
 1  min ← 0;
 2  minErrorTemp ← ∞;
 3  for S ← 0 to 300 with step of 0.5 do
 4      errorSum ← 0;
 5      for m ← 1 to M do
 6          c ← C_m / N_e;
 7          errorSum ← errorSum + ‖D(S, c) − D_m‖_2;
 8      end
 9      if errorSum < minErrorTemp then
10          minErrorTemp ← errorSum;
11          min ← S;
12      end
13  end
14  S ← min;
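
The search in Algorithm 1 can be sketched in C as follows. This is an illustration under stated assumptions, not the code used for the measurements: the LoopProfile type, the MAX_INSTR bound, and all function names are hypothetical, the observed distribution is assumed to be normalized to one loop iteration, and a small Newton iteration stands in for the library square root so that the sketch has no libm dependency.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INSTR 16   /* illustrative upper bound on loop length (assumption) */

/* One mini-benchmark: CPI vector c (derived from the cycle profile) and the
 * observed per-instruction sample counts D, normalized to one iteration. */
typedef struct {
    size_t n;
    double cpi[MAX_INSTR];
    double observed[MAX_INSTR];
} LoopProfile;

/* Emulated distribution D(S, c): every instruction overflows once and the
 * sample lands on the instruction in flight S cycles later. */
static void emulate(double S, const LoopProfile *p, double *dist) {
    for (size_t i = 0; i < p->n; i++) dist[i] = 0.0;
    for (size_t m = 0; m < p->n; m++) {
        size_t g = 0;
        double elapsed = 0.0;
        while (elapsed < S) {
            g++;
            assert(p->cpi[(m + g) % p->n] > 0.0);
            elapsed += p->cpi[(m + g) % p->n];
        }
        dist[(m + g) % p->n] += 1.0;
    }
}

/* Newton iteration for the square root (avoids linking against libm). */
static double sqrt_newton(double x) {
    if (x <= 0.0) return 0.0;
    double r = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 60; i++) r = 0.5 * (r + x / r);
    return r;
}

/* ||D(S, c) - D||_2 for one mini-benchmark. */
static double l2_error(double S, const LoopProfile *p) {
    double dist[MAX_INSTR], sum = 0.0;
    emulate(S, p, dist);
    for (size_t i = 0; i < p->n; i++) {
        double d = dist[i] - p->observed[i];
        sum += d * d;
    }
    return sqrt_newton(sum);
}

/* Sweep S over [0, 300] in steps of 0.5 and keep the first minimizer of the
 * summed error, mirroring Lines 3-12 of Algorithm 1. */
double measure_skid(const LoopProfile *progs, size_t M) {
    double best_S = 0.0, best_err = -1.0;   /* -1 marks "no candidate yet" */
    for (int k = 0; k <= 600; k++) {
        double S = 0.5 * k, err = 0.0;
        for (size_t m = 0; m < M; m++) {
            assert(progs[m].n > 0 && progs[m].n <= MAX_INSTR);
            err += l2_error(S, &progs[m]);
        }
        if (best_err < 0.0 || err < best_err) {
            best_err = err;
            best_S = S;
        }
    }
    return best_S;
}
```

A single loop generally matches a whole interval of skid values, which is why several loops with different CPI vectors are combined. For instance, the hypothetical loops c = [1, 4, 1, 4, 2] with D = [0, 2, 0, 2, 1] and c = [1.75, 3, 2] with D = [0, 2, 1] are jointly matched on this grid first at S = 2.0, whereas the first loop alone is also matched by S = 1.5.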

Besides the arithmetic instructions, we have also developed another mini-benchmark
suite with heavy memory access instructions. The skid measurement results are quite
close to Figure 4.4. We will release all the mini-benchmarks for the skid duration measurement once the paper gets accepted.
After we obtain the skid value S of a platform, we can use it to emulate the skid
effect happening in any simple loop, which should resemble the actual instruction profiling result. However, we encounter a challenge when applying the skid effect emulation
to a control flow graph with multiple branches. The PMU does not quantify the execution frequency of every branch, which makes it impossible to estimate the CPI of each instruction
(one of the skid emulation inputs). We elaborate on how to eliminate the skid effect on
complex control flow graphs with multiple branches in the next section.

4.3.3 Mathematical Formulation of Skid Effect Modeling on a Simple Loop
The following definitions help explain our mathematical model:
• N: the number of instructions in a simple loop.
• I = {I_1, I_2, I_3, ..., I_N} (|I| = N): a simple loop consisting of N instructions. All instructions are executed one by one.
• c_n: the duration in cycles of instruction I_n, with 0 < n ≤ N. Here c_n is also called the cycles per instruction
(CPI) of instruction I_n. We can also define a CPI vector for this cyclical code
sequence: c = {c_1, c_2, c_3, ..., c_N}.
• S: the skid duration, quantified in cycles.
• N_s: the total number of sampled instructions in a simple loop.
• T: the instruction sampling period.
• N_e: the number of times the simple loop I is executed.
• g(S, m, c): the skid length of the instruction I_m, quantified in instructions.
• D(S, c): the instruction distribution of the loop I affected by skid effects with a
skid duration S in cycles.
Since all instructions in this loop are executed circularly, another instance of the instruction sequence I starts exactly after I is executed. Formally, we have:

\[ I_{N \cdot K + n} = I_n, \qquad c_{N \cdot K + n} = c_n, \qquad K \in \mathbb{N}^{+},\ 0 < n \le N \]

Ideal Instruction Sampling Without Skid: Before we go deep into the skid problem,
we need a mathematical explanation of ideal sampling without skid effects. Such
an ideal model of the sampling scheme helps us understand more complex scenarios.
The sampling period T is usually a very large number (T ≫ N) to minimize the sampling
overhead. Additionally, T is set to a prime number to ensure that instruction sample
points are evenly distributed over all instructions in I. The total number of sampled instructions in this
loop is:

\[ N_s = \frac{N \cdot N_e}{T} \]


For the whole loop, we can use N_s to estimate the total number of instructions, N_s · T, without bias. To
ensure that there are enough sample points N_s, we need to guarantee N · N_e ≫ T.
The distribution of sample points D(0, c), with the instruction index written above each entry, can be represented as:

\[ D(0, c) = \big[\, \overset{1}{N_e} \;\; \overset{2}{N_e} \;\; \cdots \;\; \overset{N}{N_e} \,\big] = \Big[\, \overset{1}{\tfrac{N_s \cdot T}{N}} \;\; \overset{2}{\tfrac{N_s \cdot T}{N}} \;\; \cdots \;\; \overset{N}{\tfrac{N_s \cdot T}{N}} \,\Big] \]

Since we can guarantee that the instruction profiling result is not biased at the instruction
level, each basic block or function will also get an unbiased profiling result. However,
in real sampling-based profiling, skid effects cause an offset between the sampled
instruction and the instruction causing the hardware counter overflow. We explain how
to model skid effects and calculate the distribution of sample points for a skid duration in cycles
in the next subsection.
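
The role of the prime sampling period in the ideal (skid-free) model can be illustrated with a short simulation. It samples every T-th retired instruction of a loop with N instructions; the function name and the small parameter values in the usage note are hypothetical and far smaller than realistic sampling periods.

```c
#include <assert.h>

/* Ideal instruction sampling without skid: the loop of n instructions runs
 * `iterations` times and every period-th retired instruction is sampled.
 * samples[i] counts how often instruction i (0-based) is sampled. */
void ideal_sampling(long n, long iterations, long period, long *samples) {
    assert(n > 0 && iterations > 0 && period > 0);
    for (long i = 0; i < n; i++) samples[i] = 0;
    long total = n * iterations;            /* total retired instructions */
    for (long k = period; k <= total; k += period)
        samples[(k - 1) % n] += 1;          /* k is the 1-based retirement index */
}
```

With N = 5, N_e = 700, and T = 7, the 500 samples are spread evenly, 100 per instruction, matching N_s = N · N_e / T. With T = 10, which shares a factor with N, all 350 samples fall on the last instruction of the loop, which shows why a period that is coprime to the loop length is needed.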
Mathematical Modeling of Skid Effect. To describe the mathematical model quantifying the skid
effects, an instruction $I_m$ from the simple loop instructions I is used to
explain the offset caused by skid effects. If the instruction counter overflow happens at
$I_m$ and the attributed instruction is $I_{m'}$, we can guarantee $m' > m$. For a skid length of
S cycles, we can obtain its range:

\[ \sum_{n=m+1}^{m'-1} c_n < S \le \sum_{n=m+1}^{m'} c_n \]

The skid length quantified in instructions, g(S, m, c), can be represented as:

\[ g(S, m, c) = m' - m \]

g(S, m, c) is decided by three inputs:
• Skid duration S in cycles. We assume $S \le \sum_{n=1}^{N} c_n$.
• Instruction index m for $I_m$ in the code sequence I.
• All CPI information c.

For the instruction $I_m$ in I at which the counter overflow happens, g(S, m, c) can be defined
as a periodic step function of S:

\[
g(S, m, c) =
\begin{cases}
1, & 0 < S \le c_{m+1} \\
2, & c_{m+1} < S \le c_{m+1} + c_{m+2} \\
3, & c_{m+1} + c_{m+2} < S \le \sum_{n=m+1}^{m+3} c_n \\
\vdots \\
N - 1, & \sum_{n=m+1}^{m+N-2} c_n < S \le \sum_{n=m+1}^{m+N-1} c_n \\
N, & \sum_{n=m+1}^{m+N-1} c_n < S \le \sum_{n=m+1}^{m+N} c_n
\end{cases}
\tag{4.2}
\]
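
The step function in Equation 4.2 amounts to accumulating CPI values after $I_m$ until the skid duration is covered. The sketch below is one possible reading of it; the function name `skid_length` and the example CPI vector are assumptions made for illustration.

```python
def skid_length(skid_cycles, m, cpi):
    """Return g(S, m, c): the number of instructions a skid of S cycles spans.

    cpi[n - 1] holds c_n for n = 1..N. Indices wrap around because the loop
    repeats (c_{N*K+n} = c_n). Requires 0 < S <= sum(cpi), as assumed in the text.
    """
    n_instr = len(cpi)
    if not 0 < skid_cycles <= sum(cpi):
        raise ValueError("S must satisfy 0 < S <= sum of all CPI values")
    elapsed = 0
    for step in range(1, n_instr + 1):
        elapsed += cpi[(m + step - 1) % n_instr]  # adds c_{m+step}
        if skid_cycles <= elapsed:
            return step
    return n_instr  # only reached through floating-point rounding


# Hypothetical loop of four instructions in which I_3 is heavy (10 cycles).
cpi = [1, 1, 10, 1]
print(skid_length(2, 1, cpi))   # 2: an overflow at I_1 is attributed to I_3
print(skid_length(2, 2, cpi))   # 1: an overflow at I_2 is also attributed to I_3
print(skid_length(13, 1, cpi))  # 4: a skid of one full loop wraps back to I_1
```

The first two calls illustrate why a heavy instruction attracts samples: overflows at both $I_1$ and $I_2$ land on $I_3$.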

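The sampling point distribution of $I_m$ without skid effects can be defined as a vector whose
only nonzero entry is at position m:

\[
d(0, c, m) = \left[\, \overset{1}{0} \;\; \cdots \;\; \overset{m-1}{0} \;\; \overset{m}{1} \;\; \overset{m+1}{0} \;\; \cdots \;\; \overset{N}{0} \,\right]
\]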

After we obtain the skid length g(S, m, c) for a specific instruction $I_m$, the sampling point
distribution of $I_m$ with skid S can be obtained using the N × N cyclic shift matrix:

\[
d(S, c, m) = d(0, c, m) \times
\begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
1 & 0 & 0 & \cdots & 0
\end{bmatrix}^{g(S, m, c)}
\]

The distribution of all attributed instructions D(S, c) can be obtained after sampling:

\[ D(S, c) = \frac{N_s \cdot T}{N} \sum_{m=1}^{N} d(S, c, m) \tag{4.3} \]
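
Equation 4.3 can be evaluated without building the shift matrix explicitly, since multiplying a one-hot vector by the g-th power of the cyclic shift matrix simply moves the 1 from position m to position m + g (wrapping around). The sketch below assumes this equivalence; the function names and the example numbers are illustrative, not measured values.

```python
def skid_length(skid_cycles, m, cpi):
    """g(S, m, c) from Equation 4.2, with cpi[n - 1] = c_n and cyclic indices."""
    n_instr = len(cpi)
    elapsed = 0
    for step in range(1, n_instr + 1):
        elapsed += cpi[(m + step - 1) % n_instr]
        if skid_cycles <= elapsed:
            return step
    return n_instr


def emulate_skid_distribution(skid_cycles, cpi, n_samples, period):
    """Return D(S, c) as a list indexed by instruction 1..N (Equation 4.3)."""
    n_instr = len(cpi)
    scale = n_samples * period / n_instr  # N_s * T / N
    dist = [0.0] * n_instr
    for m in range(1, n_instr + 1):
        g = skid_length(skid_cycles, m, cpi)
        target = (m - 1 + g) % n_instr  # zero-based position of I_{m'}
        dist[target] += scale
    return dist


# Hypothetical inputs: S = 2 cycles, heavy third instruction, N_s = 4, T = 100.
print(emulate_skid_distribution(2, [1, 1, 10, 1], 4, 100))
# [100.0, 100.0, 200.0, 0.0]
```

In this toy case the total $N_s \cdot T$ is preserved, but the heavy instruction $I_3$ receives twice its true share and $I_4$ receives nothing, which is the loop-level bias the rest of the chapter sets out to undo.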

4.4 Nullifying Skid Effect On Instruction Profiling
In this section, we explain how to eliminate the skid effect on instruction profiling results
for real programs by using the skid duration S obtained in Section 4.3. Unlike instructions in
simple loops, instructions from a real program usually belong to a control flow graph with
many branches. To apply our skid emulation with the measured skid duration S in CPU cycles,
we decompose the control flow graph into several simple loops and aggregate the skid
effects on these simple loops. An optimization problem is formulated to obtain the simple
loop frequencies with the measured skid duration in cycles. In this section, we present the
intuition; the formal mathematical description is in the supplementary material [63].

[Figure: flow diagram of the recovery process for a control flow graph with instructions A
to E and simple loop frequencies F1, F2, F3, taking the CPU cycle profile and the CPU
instruction profile as inputs.]
Figure 4.5: Recovering the instruction profile of the control flow graph in Figure 4.6. The
recovery process has six steps. The CPU cycle profile and the instruction profile are provided
by the sampling-based profiler. Steps (1) to (5) constitute a closed iterative loop that
improves the recovery quality.

4.4.1 Control Flow Graph Decomposition
Extracting Instruction Control Flow Graph. We rely on static analysis [52] to extract
the basic-block level control flow graph of the whole program. Since instructions within
the same basic block are executed in a deterministic order, we are able to deduce the
corresponding instruction control flow graph, which is the result generated after step
(1) in Figure 4.6. In practice, we only focus on the code blocks with large amounts of
instructions, which are most interesting to profiler users.
Every simple loop has its own execution frequency, and the instruction profile is determined
by the execution frequencies of these simple loops. We cannot directly derive the CPI values
of a simple loop from the CPU cycle profiling result, because the number of times the loop
executes (its execution frequency) is unknown. Finding the execution frequencies of all simple
loops is therefore formulated as an optimization problem, solved in the six steps shown in
Figure 4.5. We use the control flow graph in Figure 4.6 as a running example to explain
our methodology.
[Figure: a control flow graph over instructions A, B, C, D, E, processed by static analysis
and then by decomposition into simple loops.]
Figure 4.6: Decomposition of a control flow graph with multiple branches. After static
analysis, we obtain an instruction-level control flow graph with 5 instructions (A, B, C, D, E).
Decomposition generates 3 different simple loops: A → B → C → D, A → B → C → D, A
→ B → C → E.

(1) Obtain Cycles Per Instruction. With a fixed set of simple loop frequencies, we are
able to obtain the execution frequency of each instruction. As in a simple loop, a skid effect
with a fixed CPU cycle duration does not affect the cycle profiling distribution in the
control flow graph [30]. Since the CPU cycle profiling result provides CPU cycle
counts for each instruction, the cycles per instruction (CPI) of each instruction can
be obtained by dividing its CPU cycle count by its instruction count. As a result, we are
able to obtain the CPI information for each simple loop to emulate the skid.
(2) Emulate Skid Effect on Simple Loops. Skid emulation on each simple loop produces
the instruction distribution of that loop.
(3) Aggregate Instruction Distribution. We aggregate the instruction distributions of all
simple loops, weighted by their frequencies, to generate the instruction distribution for
the entire control flow graph. The emulation error is the difference between the generated
instruction distribution and the actual instruction sampling result.
(4) Generate New Simple Loop Frequencies. From the emulation error, we derive a cost
function that evaluates how close the sum of the emulations on all simple loops is to the
instruction profile generated by the sampling-based profiler. We formulate an optimization
problem to find the best solution for all simple loop frequencies based on this cost
function. To avoid brute-force enumeration of all possible values of the simple loop
frequencies, we adopt the Gibbs sampling method [64], which searches for the best simple
loop frequency vector in polynomial time.
(5) Update Simple Loop Frequencies. While the emulation error is still large, we start a
new iteration by updating the simple loop frequencies in steps (1) and (3) with the more
accurate ones obtained from the last step.
(6) Output Recovered Instruction Profile. When the emulation error is significantly smaller
than the initial emulation error and no longer decreases with additional iterations, the
iteration process stops. Combined with the structure of the simple loops, we then obtain
an instruction profile close to the ground truth. We implement this algorithm in MATLAB.
The optimization problem can be solved in less than 1 minute. After solving this
optimization problem, we obtain each instruction's profiling result inside the problematic
loop, from which each function's profiling result can be recovered.
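
The iteration in steps (4) and (5) can be pictured as a search over frequency vectors driven by a cost function. The sketch below is deliberately simplified: it replaces Gibbs sampling with a plain accept-if-better random search, it is written in Python rather than MATLAB, and the cost function is a toy stand-in for the emulation error. The names `search_loop_frequencies` and `toy_cost` are hypothetical.

```python
import random


def search_loop_frequencies(cost, initial, iterations=5000, step=1, seed=0):
    """Search non-negative integer loop frequencies that lower cost(F).

    Each iteration perturbs one randomly chosen loop frequency by +/- step
    and keeps the change only if the cost decreases.
    """
    rng = random.Random(seed)
    best = list(initial)
    best_cost = cost(best)
    for _ in range(iterations):
        trial = list(best)
        idx = rng.randrange(len(trial))
        trial[idx] += rng.choice((-step, step))
        if trial[idx] < 0:
            continue  # frequencies cannot be negative
        trial_cost = cost(trial)
        if trial_cost < best_cost:
            best, best_cost = trial, trial_cost
    return best, best_cost


# Toy stand-in for the emulation error: squared distance to a hidden vector.
hidden = [30, 50, 20]


def toy_cost(freqs):
    return sum((f - h) ** 2 for f, h in zip(freqs, hidden))


best, best_cost = search_loop_frequencies(toy_cost, [33, 33, 34])
print(best, best_cost)
```

In the actual method, the cost would be computed by running steps (1) to (3) for each candidate frequency vector; the search strategy shown here only illustrates the closed loop and makes no claim about the convergence behavior of Gibbs sampling.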

4.4.2 Mathematical Formulation of Skid Effect Elimination on a Control Flow Graph
Below are some definitions used to explain our methodology:
• K: a control flow graph containing multiple conditional branches; it can be decomposed
into multiple simple loops. For K, we have L simple loops.
• $K_l$: a simple loop in K. The number of executions of simple loop $K_l$ is $F_l$. K can also
be represented as $\{K_1, K_2, K_3, ..., K_l, ..., K_L\}$.
• $I'_i$: an instruction in the complex loop K.
• $F = \{F_1, F_2, ..., F_L\}$: the set of execution counts of the simple loops in K.
• $B_b$: a basic block in the control flow graph of K, which contains several instructions.
One basic block may belong to one or several simple loops in K.
• $\delta(i, b)$: a binary value indicating whether instruction $I_i$ belongs to basic block $B_b$.
• $\lambda_{i,l}$: the index of instruction $I'_i$ in simple loop $K_l$. If instruction $I_i$ is not in
simple loop $K_l$, it is 0.
• $d(S, c_l, \lambda_{i,l})$: the value of the $\lambda_{i,l}$-th element of $D(S, c_l)$. We define $d(S, c_l, 0) = 0$.
• $E_i$: the instruction profiling result for $I'_i$ from the sampling-based profiler.
• $E_i(F)$: the execution frequency of instruction $I'_i$, given a simple loop execution
frequency vector F.
• $\hat{E}_i(F)$: the instruction count of $I'_i$ after skid effect emulation with a specific simple
loop frequency vector F.
• $C_i$: the cycle profiling result for $I'_i$ from the sampling-based profiler.
• $c_i$: the average CPI of instruction $I'_i$.
• $\phi(F)$: a metric quantifying how close the edge frequency result at the basic block level
is to the sampling profile result.

4.4.3 Control Flow Graph Decomposition
For a control flow graph K, we cannot directly emulate skid effects with the instruction and
cycle profiling results. We therefore focus on the simple loops $\{K_1, K_2, K_3, ..., K_l, ..., K_L\}$
to perform skid effect emulation. Each $K_l$ can be treated as a simple loop of the kind
modeled earlier.
The total number of instructions for a specific loop is not biased by the sampling-based
profiler. For the complex loop K, the total instruction count |E| of the whole loop can be
represented with the simple loop execution frequencies F and the basic blocks in each simple
loop $K_l$ as:

\[ |E| = \sum_{F_l \in F} F_l \left( \sum_{B_b \in K_l} |B_b| \right) \tag{4.4} \]

As a result, F has L − 1 degrees of freedom here. Furthermore, the number of instructions
$|B_b|$ in each basic block is fixed and can be obtained by static analysis. For a specific
instruction $I'_i$, its execution frequency is:

\[
E_i(F) = \sum_{F_l \in F} F_l \left( \sum_{B_b \in K_l} \delta(i, b) \right),
\qquad
\delta(i, b) =
\begin{cases}
1, & I'_i \in B_b \\
0, & I'_i \notin B_b
\end{cases}
\tag{4.5}
\]
Moreover, as we explained in Section 4.3, the cycle profiling result is also
unbiased. For a specific instruction $I'_i$, based on its cycle profiling result $C_i$ and
execution frequency $E_i(F)$, the CPI estimate for $I'_i$ can be obtained as:

\[ c_i = \frac{C_i}{E_i(F)} \tag{4.6} \]
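
Equations 4.4 to 4.6 translate directly into a few dictionary operations. In the sketch below, the data layout (loops as lists of basic block names, blocks as lists of instruction names) and all numbers are assumptions chosen for illustration; they do not correspond to a measured program.

```python
def execution_counts(loops, blocks, freqs):
    """E_i(F): executions of each instruction, following Equation 4.5."""
    counts = {}
    for loop_blocks, f_l in zip(loops, freqs):
        for block in loop_blocks:
            for instr in blocks[block]:
                counts[instr] = counts.get(instr, 0) + f_l
    return counts


def cpi_estimates(cycle_profile, counts):
    """c_i = C_i / E_i(F), following Equation 4.6.

    Instructions with a zero execution count are skipped.
    """
    return {i: cycle_profile[i] / e for i, e in counts.items() if e > 0}


# Hypothetical graph: block b0 = {A, B, C} is shared by both simple loops.
blocks = {"b0": ["A", "B", "C"], "b1": ["D"], "b2": ["E"]}
loops = [["b0", "b1"], ["b0", "b2"]]
freqs = [30, 70]                      # candidate F = {F_1, F_2}
cycles = {"A": 100, "B": 200, "C": 1000, "D": 30, "E": 140}  # C_i

counts = execution_counts(loops, blocks, freqs)
print(counts)                  # {'A': 100, 'B': 100, 'C': 100, 'D': 30, 'E': 70}
print(sum(counts.values()))    # 400, the total |E| of Equation 4.4
print(cpi_estimates(cycles, counts))
# {'A': 1.0, 'B': 2.0, 'C': 10.0, 'D': 1.0, 'E': 2.0}
```

Changing `freqs` changes the counts of D and E and therefore their CPI estimates, which is why the CPI of each simple loop can only be evaluated for a given candidate F.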

Unlike in the mathematical model for a simple loop, the instruction and cycle profiling
results cannot provide the execution frequency $F_l$ of each simple loop $K_l$. We cannot
directly obtain the CPI information for each instruction $I'_i$ in $K_l$ without edge frequency
information for all branches. Since each instruction's CPI depends on the simple loop
frequency vector F, we can emulate skid effects to obtain the skid-affected instruction
distribution for any candidate F. Based on this property, an optimization problem can be
formulated to obtain F. We describe the optimization formulation in detail below.
With a candidate simple loop frequency vector F, the CPI information $c_l$ for each
simple loop can be obtained from Equation 4.6. Since the skid duration S in cycles can
be measured for a specific machine, skid effect emulation can be done for simple loop
$K_l$ using Equation 4.3 to generate the instruction distribution $D(S, c_l)$.
After emulating skid effects for each simple loop with F, we can get the overall
instruction count $\hat{E}_i(F)$ of a specific instruction $I'_i$, which can be represented as:

\[
\hat{E}_i(F) = \sum_{F_l \in F} F_l \sum_{B_b \in K_l} \left( \sum_{I_i \in B_b} \delta(i, b)\, d(S, c_l, \lambda_{i,l}) \right)
\tag{4.7}
\]

We can use sampling-based profiling to get the instruction count $E_i$ for instruction $I_i$.
To quantify how close the edge frequency result is to the sampling profile result, we
define a metric for the difference between $E_i$ and $\hat{E}_i(F)$ at the basic block level:

\[
\phi(F) = \sum_{B_b} \left( \sum_{I_i \in B_b} E_i - \sum_{I_i \in B_b} \hat{E}_i(F) \right)^2
\]

Then we can formulate an optimization problem to obtain F:

\[ \arg\min_{F} \; \phi(F) \]

The optimization problem formulation is based on the connection between the simple loop
execution frequencies F and the CPI information. If F is close enough to the ground truth,
skid effect emulation with F can reproduce the instruction distribution caused by skid
effects for all basic blocks in this complex loop.
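
The cost $\phi(F)$ compares sampled and emulated counts per basic block rather than per instruction. A minimal sketch follows; the function name `emulation_cost`, the block layout, and the counts are hypothetical and chosen only to show the arithmetic.

```python
def emulation_cost(blocks, sampled, emulated):
    """phi(F): sum over basic blocks of the squared gap between the sampled
    counts E_i and the emulated counts E-hat_i(F), each summed per block."""
    total = 0.0
    for instrs in blocks.values():
        gap = sum(sampled[i] for i in instrs) - sum(emulated[i] for i in instrs)
        total += gap * gap
    return total


# Hypothetical counts for three basic blocks.
blocks = {"b0": ["A", "B", "C"], "b1": ["D"], "b2": ["E"]}
sampled = {"A": 90, "B": 100, "C": 110, "D": 20, "E": 80}     # E_i
emulated = {"A": 100, "B": 100, "C": 100, "D": 30, "E": 70}   # E-hat_i(F)

print(emulation_cost(blocks, sampled, emulated))  # 200.0
```

In this example the differences inside b0 cancel (both sides sum to 300), so only the gaps of b1 and b2 contribute, 100 each. Summing per block before squaring makes the cost insensitive to how counts are spread among instructions of the same block.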

4.5 Evaluation
In this section, we evaluate the accuracy of instruction profiles collected from 23 applications
(21 from the SPEC CPU2006 benchmark suite [65] and 2 real applications) on five
platforms. We also evaluate our methodology on problematic applications. Our study
demonstrates that our method rectifies the skid effect effectively. It also indicates that the
instruction profile of an application with small hot functions is more likely to be
mis-attributed at the function level.

4.5.1 Evaluation Setup
Table 4.2: Evaluation platform configurations.

Microarchitecture   Processor                 SMT × # Cores   L1/L2/L3 Cache, Memory     Compiler
Intel-SandyBridge   Xeon E5 4650@2.7GHz       2 × 8           32KB/256KB/20MB, 64GB      gcc 4.8.5 -O3
Intel-Broadwell     Xeon E5-2650 v4@2.2GHz    2 × 12          64KB/256KB/30MB, 128GB     gcc 4.8.5 -O3
Intel-Skylake       Xeon E3-1240 v5@3.5GHz    2 × 4           64KB/256KB/8MB, 64GB       gcc 5.4.0 -O3
AMD-Opteron         Opteron 6168@1.6GHz       1 × 12          64KB/512KB/10MB, 16GB      gcc 4.8.5 -O3
Intel-Xeon Phi      KNL 7210@1.30GHz          2 × 64          32KB/32MB/-, 32GB          gcc 4.8.5 -O3

The experiments are performed on five platforms, whose configurations are shown
in Table 4.2. We run SPEC CPU2006 integer and floating-point benchmarks with reference
inputs [6, 5]. For real applications like PowerGraph and libsvm, we use their
representative datasets as inputs. Two profiling tools, HPCToolkit [52] and CCTLib [57],
are adapted to profile each application (see footnote 4):
• HPCToolkit is an integrated suite of tools for the measurement and analysis of
program performance, based on statistical sampling of hardware performance counters.
HPCToolkit incurs low overhead during the whole sampling process, but its profiling
result may suffer from skid effects [52].
• CCTLib [57] is an instrumentation tool based on Intel Pin [66]. It instruments every
instruction instance dynamically, but with significantly higher overhead than HPCToolkit.
We treat its result as the ground truth to identify mis-attribution problems in
the HPCToolkit profiling result.
Our method recovers the instruction profiling result at the basic block level. However,
we present all profiling results at the function level, since function-level results are
more straightforward for users and help them pinpoint hotspots and
optimize problematic code blocks more effectively.

4.5.2 Effectiveness Provenance
After processing the instruction profiling result generated by the sampling-based profiler
HPCToolkit, our algorithm generates a fixed (recovered) instruction profiling result. We use
this property to identify applications with problematic instruction profiling results at the
function level. A metric $\epsilon$ is defined to quantify the difference between the sampling-based
instruction profile and the ground-truth instruction profiling result provided by CCTLib. Since
we only care about hot functions in each application, we focus on the top 10
functions with the most executed instructions in the recovered instruction profiling result;
these functions are denoted as the set $\Phi$. We define $\pi_{\varphi,h}$ and $\pi_{\varphi,c}$ as the instruction
profiling result (in percent) of function $\varphi$ from HPCToolkit and CCTLib, respectively. Then
we define $\epsilon_{hpctoolkit}$ as:

\[ \epsilon_{hpctoolkit} = \sum_{\varphi \in \Phi} |\pi_{\varphi,h} - \pi_{\varphi,c}| \]

[Figures: bar charts of the metric value per application (y-axis "Metric Value in Percent",
0.0 to 0.4), one series per platform: Intel-SandyBridge, Intel-Broadwell, Intel-Skylake,
Intel-Xeon Phi, AMD-Opteron.]
Figure 4.7: The value of $\epsilon_{hpctoolkit}$ for chosen applications on five platforms. Smaller value
is better.
Figure 4.8: The value of $\epsilon_{perf}$ for chosen applications on five platforms. Smaller value is
better.
Figure 4.9: The value of $\epsilon_{fix}$ for chosen applications on five platforms. Smaller value is
better.

Footnote 4: We have explained in the previous section the inconsistent profiling results of
PEBS techniques. Besides, AMD-Opteron and Intel-Xeon Phi do not support PEBS. Thus, we do
not choose PEBS as our baseline to prove the effectiveness of our method.

If $\epsilon_{hpctoolkit}$ is close to 0, the HPCToolkit profiling result is close to the ground truth,
which means that HPCToolkit produces an accurate instruction profile. The $\epsilon_{hpctoolkit}$ values
of all applications are summarized in Figure 4.7. Most of the applications have a nearly ideal
sampling-based instruction profile, while several applications with $\epsilon_{hpctoolkit} > 0.1$ are
problematic. The profiling results of omnet, libsvm and pagerank are severely mis-attributed
on all five machines. Two other applications, astar and soplex, fail to
get accurate profiling results on four machines, all except AMD-Opteron. As for mcf, only
AMD-Opteron and Intel-Skylake produce accurate results.
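
The metric is a sum of absolute differences over the hot-function set. The sketch below shows the computation on made-up shares; the function names and all numbers are hypothetical, and shares are written as fractions of 1 on the assumption that this matches the 0.1 threshold used above.

```python
def top_functions(profile, k=10):
    """Return the k functions with the largest share in `profile`."""
    return sorted(profile, key=profile.get, reverse=True)[:k]


def attribution_error(profile, ground_truth, hot_functions):
    """Sum of |pi_profile - pi_truth| over the hot-function set Phi."""
    return sum(abs(profile.get(f, 0.0) - ground_truth.get(f, 0.0))
               for f in hot_functions)


# Made-up per-function instruction shares for one application.
truth = {"f": 0.60, "g": 0.30, "h": 0.10}      # instrumentation result
sampled = {"f": 0.50, "g": 0.45, "h": 0.05}    # sampling result
recovered = {"f": 0.59, "g": 0.31, "h": 0.10}  # after skid correction

phi = top_functions(recovered)  # Phi is chosen from the recovered profile
print(round(attribution_error(sampled, truth, phi), 3))    # 0.3
print(round(attribution_error(recovered, truth, phi), 3))  # 0.02
```

With these invented numbers, the sampled profile would be flagged as problematic (0.3 > 0.1), while the recovered profile would not.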
To demonstrate the effectiveness of our approach, we define another metric $\epsilon_{fix}$:

\[ \epsilon_{fix} = \sum_{\varphi \in \Phi} |\pi_{\varphi,f} - \pi_{\varphi,c}|, \]

where $\pi_{\varphi,f}$ denotes the instruction profiling result (in percent) of function $\varphi$ after
applying our approach to the HPCToolkit profiling results $\pi_{\varphi,h}$. As shown in Figure 4.9, most
of the applications have an $\epsilon_{fix}$ below 0.1, indicating that our recovered profiling
result is quite close to the ground truth. Besides, the low $\epsilon_{fix}$ in Figure 4.9 indicates that
our approach also generates accurate instruction profiles for non-problematic applications.
Finally, our method achieves an average error reduction ($\epsilon_{hpctoolkit} - \epsilon_{fix}$) of nearly
0.178, which is a significant improvement in instruction profile accuracy.

Profiling Accuracy of Linux Perf. In the same way as $\epsilon_{hpctoolkit}$ for HPCToolkit, we define
the metric $\epsilon_{perf}$:

\[ \epsilon_{perf} = \sum_{\varphi \in \Phi} |\pi_{\varphi,p} - \pi_{\varphi,c}|, \]

where $\pi_{\varphi,p}$ represents the instruction profiling result (in percent) of function $\varphi$
collected by Perf. Figure 4.8 indicates that the Perf profiling results are very close to those
of HPCToolkit.

Profiling Accuracy at Basic Block Level. We collect all basic blocks belonging to the function
set $\Phi$, which are denoted as B. In the same way as $\pi_{\varphi,h}$ and $\pi_{\varphi,c}$, we define $\pi_{b,f}$ and
$\pi_{b,c}$ as the instruction profiling result (in percent) of basic block b from HPCToolkit with
our approach and from CCTLib, respectively. Then we define $\epsilon_{fix,bb}$ as:

\[ \epsilon_{fix,bb} = \sum_{b \in B} |\pi_{b,f} - \pi_{b,c}| \]

[Figure: bar chart of the metric value (y-axis "Metric Value in Percent", 0.0 to 0.4) for
omnet, astar, mcf, soplex, libsvm and pagerank, one series per platform: Intel-SandyBridge,
Intel-Broadwell, Intel-Skylake, Intel-Xeon Phi, AMD-Opteron.]
Figure 4.10: The value of $\epsilon_{fix,bb}$ for chosen applications on five platforms. Smaller value
is better.

Figure 4.10 shows the $\epsilon_{fix,bb}$ values of the six problematic applications.
Most of the applications have good basic-block-level instruction profiling accuracy with
our approach ($\epsilon_{fix,bb} < 0.1$). The $\epsilon_{fix,bb}$ values of omnet and astar are higher than those
of the other applications, since they have more branches within their hot functions. Our
technique improves the profile accuracy at the basic block level, which results in accuracy
improvement at the function level.

4.6 Case Study
In this section, we evaluate a few benchmarks with high $\epsilon_{hpctoolkit}$ values, as seen in
Figure 4.7. For each application, we select the functions with the largest $\epsilon_{hpctoolkit}$, use our
algorithm to fix the problematic instruction profile, and discuss the root cause of the
mis-attribution. Due to the page limit, we only show the fixed results of Pagerank, astar
and hmmer on two different Intel CPUs: Intel-Skylake and Intel-SandyBridge. The rest of
the results are provided in Section 3 of the supplementary material [63]. Our study shows
that the mis-attribution at the function level of a problematic application is caused by
heavy-weight instructions located near small hot functions.

4.6.1 PowerGraph Pagerank
PowerGraph is a high-performance graph processing framework written in C++ [1]. We
evaluate PageRank [67, 68, 69, 70], a representative data analytics application built on it.
The input graph is derived from the Amazon product purchasing network
with 0.4 million vertices and 3.3 million edges [71]. We configure PageRank to run 50
iterations and only consider the application execution part as the profiling domain (the
preprocessing part is excluded from profiling). As the graph data are very large and cannot be
entirely loaded into the cache, instructions that load data from main memory are quite
heavy, causing severe instruction mis-attribution in the profiling result.
[Figure: stacked bar charts of the instruction share (y-axis "Percent") of execute_gathers
and gather under Ground Truth, Sampling and Fixed. Panels: (a) pagerank, Intel-SandyBridge;
(b) pagerank, Intel-Skylake.]
Figure 4.11: Fixed result of 2 mis-attributed functions' instruction profile in PageRank for
Intel-SandyBridge and Intel-Skylake.

Fixed results are shown in Figure 4.11a and Figure 4.11b for Intel-SandyBridge and
Intel-Skylake, respectively. The function gather is overestimated in the instruction profiling
result. Users may underestimate the overhead of memory loading operations in the
whole system when they rely on the sampling-based profiling result to calculate IPC. We
effectively recover the sampling-based instruction profiling result of these functions inside
the loop from execute_gathers (line 9 in Listing 4.2), with an $\epsilon_{fix}$ value close
to 0.

 1  // callee gather called by execute_gathers
 2  double gather(icontext_type& context, const vertex_type& vertex, edge_type& edge) const {
 3    return (edge.source().data() / edge.source().num_out_edges());
 4  }
 5  ...
 6  // caller execute_gathers's loop in synchronous_engine.hpp
 7  execute_gathers(const size_t thread_id) {
 8    ...
 9    foreach(local_edge_type local_edge, local_vertex.in_edges()) {
10      edge_type edge(local_edge);
11      if (accum_is_set) {
12        accum += vprog.gather(context, vertex, edge);
13      } else {
14        accum = vprog.gather(context, vertex, edge);
15        accum_is_set = true;
16      }
17      ++edges_touched;
18    }
19    ...
20  }

Listing 4.2: Code of problematic loop in PowerGraph Pagerank.

4.6.2 SPEC CPU2006 astar
The benchmark astar is an application based on a 2D path-finding library for game
AI development [72]. The profiling result shows that over 38% of the instructions are
mis-attributed on Intel-SandyBridge. Listing 4.3 shows the functions releasepoint and
addtobound in Way2_.cpp and the function add in Arrays.cpp. These three functions
account for 41.3% of the total instructions. Function addtobound is called in a nested
for-loop in function releasepoint. It is very small and invokes add near its exit. The modulo
computation at line 10 of addtobound consumes many cycles. Figure 4.12a plots the
normalized instruction profiling result for releasepoint, addtobound and add. Here add is
over-attributed by more than 3 times, while releasepoint and addtobound are
under-estimated. After applying our algorithm, we obtain a function-level instruction
profile very close to the ground truth generated by CCTLib.

4.6.3 SPEC CPU2006 hmmer
The application hmmer searches for patterns in DNA sequences with hidden Markov
models [72]. We profile it with the reference input. Its $\epsilon_{hpctoolkit}$ value indicates
that the instruction profile at the function level is not significantly mis-attributed. However,
we find that the profiling result for store instructions (see footnote 5) is quite inconsistent
at the line level. A loop in fast_algorithms.c, shown in Listing 4.4, contributes nearly 99% of
all store instructions. There is only one store instruction each in line 134 and line 137. Since
line 137 is conditionally executed, the attributed value of line 137 should not exceed that of
line 134. However, the actual sampling result shown in Table 4.3 reveals that the attributed
value of line 137 is even greater than that of line 134, which is unreasonable. Compared with
the ground truth, the value of line 134 is under-attributed by more than 50%.

[Figure: stacked bar charts of the instruction share (y-axis "Percent") under Ground Truth,
Sampling and Fixed. Panels: (a) astar, Intel-SandyBridge, with way2obj::releasepoint,
way2obj::addtobound and way2obj::add; (b) astar, Intel-Skylake, with regwayobj::makebound2,
regwayobj::addtobound and regwayobj::isaddtobound.]
Figure 4.12: Fixed result of 3 mis-attributed functions' instruction profile in astar for
Intel-SandyBridge and Intel-Skylake.

Table 4.3: PAPI_SR_INS instruction counts for code lines of hmmer.

Code line   HPCToolkit   CCTLib
134         5.23e09      1.35e10
137         1.43e10      1.35e10
Since we are able to obtain the execution frequency of each basic block, we can
recover the store instruction profiling result for each code line. Our method reports the same
result as CCTLib. The execution frequencies of the three basic blocks within
this loop are very close. This mis-attribution is not caused by heavy instructions, which
shows that the skid effect can also incur mis-attribution at the code line level.
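
As a quick check of the numbers in Table 4.3, the relative errors of the HPCToolkit counts can be worked out directly, taking the CCTLib counts as ground truth. This is plain arithmetic on the reported values, added here for illustration:

```latex
\[
\text{line 134: } \frac{1.35 \times 10^{10} - 5.23 \times 10^{9}}{1.35 \times 10^{10}} \approx 0.61,
\qquad
\text{line 137: } \frac{1.43 \times 10^{10} - 1.35 \times 10^{10}}{1.35 \times 10^{10}} \approx 0.06
\]
```

Line 134 is thus under-attributed by roughly 61%, consistent with the statement that it is under-attributed by more than 50%, while line 137 is over-attributed by roughly 6%.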
Footnote 5: The corresponding PAPI event name is PAPI_SR_INS.

 1  // callee flexarray<eobj>::add in Arrays.cpp
 2  template <class eobj> inline void flexarray<eobj>::add(const eobj& e) {
 3    if (elemqu == maxelemqu) doubling(true);
 4    ep[elemqu] = e;
 5    elemqu++;
 6  }
 7  // callee addtobound in Way2_.cpp
 8  void way2obj::addtobound(i32 x, i32 y) {
 9    i32 boundnum;
10    boundnum = ((filltact + movetime(x, y)) % (maxmovetact + 1));
11    boundar[boundnum].add(pointt(x, y));
12  }
13  // caller releasepoint in Way2_.cpp
14  void way2obj::releasepoint(i32 px, i32 py) {
15    ...
16    for (y = y1; y <= y2; y++)
17      for (x = x1; x <= x2; x++)
18        if ((x != px) || (y != py))
19          if (waymap[x + y * mapsizex].fillnum == fillnum) {
20            ...}
21          else if (isaddtobound(x, y))
22            addtobound(x, y);
23    ...
24  }

Listing 4.3: Code of problematic functions in astar.

133  for (k = 1; k <= M; k++) {
134    mc[k] = mpp[k-1] + tpmm[k-1];
135    ...
136    if (k < M) {
137      ic[k] = mpp[k] + tpmi[k];
138      if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
139      ic[k] += is[k];
140      if (ic[k] < -INFTY) ic[k] = -INFTY;
141    }
142  }

Listing 4.4: Code for for-loop in fast_algorithms.c from hmmer in SPEC CPU2006.

4.6.4 SPEC CPU2006 soplex
soplex is a benchmark that solves linear programs, written in C++; we run it with the
reference input. Our metric $\epsilon$ reports that 15% of the instructions are mis-attributed at the
function level, and its problematic loop accounts for nearly 20% of the total instructions.
Listing 4.5 lists the code for the problematic loop and the small function called by this
for-loop. This loop belongs to function entered4X from spxsteeppr.cc.
In this loop, an inline operator function operator* from svector.h is called at line 8.
This operator function computes the dot product of two vectors. Before the
dot product computation, the two vectors need to be loaded to the CPU. These memory
load instructions are much heavier than the other instructions of this function. As a result,
operator*'s instruction count is under-attributed, while entered4X's instruction count is
over-attributed. Such mis-attribution severely affects the IPC result of function entered4X.
In particular, on the Intel-Skylake platform, the IPC of entered4X is over-estimated to
1.82×.
Figure 4.13a and Figure 4.13b show the fixed result of the two functions. For both
entered4X and operator*, our method recovers the instruction profile with less than
18% error, compared with the ground truth result from CCTLib.

 1  // caller entered4X in spxsteeppr.cc
 2  void SPxSteepPR::entered4X(SPxId, int n, int start2, int incr2, int start1, int incr1) {
 3    ...
 4    for (j = pIdx.size() - 1 - start2; j >= 0; j -= incr2) {
 5      i = pIdx.index(j);
 6      xi_ip = xi_p * pVec[i];
 7      x = penalty_ptr[i] += xi_ip * (xi_ip * pi_p
 8          - 2 * (thesolver->vector(i) * workVec));
 9      if (x < delta)
10        penalty_ptr[i] = delta;
11      else if (x > infinity)
12        penalty_ptr[i] = 1 / thesolver->epsilon();
13    }
14    ...}
15  ...
16  // callee operator* in svector.h
17  Real operator*(const Vector& w) const {
18    Real x = 0;
19    int n = size();
20    Element* e = m_elem;
21    while (n--) {
22      x += e->val * w[e->idx];
23      e++;
24    }
25    return x;
26  }

Listing 4.5: Code of problematic loop in soplex.

4.6.5 SPEC CPU2006 omnet
The application omnet performs discrete event simulation of a large Ethernet network. It
is also written in C++ and configured with the reference input. Based on omnet's $\epsilon$ value,
27% and 36% of overall instructions are mis-attributed.
Listing 4.6 gives the code of the problematic loop, which comes from the function
cSubModIterator::operator++ (operator++ for short) in cmodule.cc. This loop accounts
for 33.3% of the total instructions. The inline function module from csimul.h is called
by this loop at line 6. The purpose of this loop is to search for a target element in an array
and return its parent. Every element's parent needs to be checked with a pointer-chasing
memory access pattern at line 7, which incurs heavy data loading from memory. Due to
this heavy instruction outside module, module's instruction profile is significantly biased.
Figure 4.14a plots the instruction counts for these two functions from HPCToolkit,
CCTLib and the fixed version. The HPCToolkit result shows the IPC of module as only 10%
of the ground truth. Such a result will mislead the programmer's optimization. Our algorithm
fixes it with at most 11.25% error.

[Figure: stacked bar charts of the instruction share (y-axis "Percent") of
SPxSteepPR::entered4X and soplex::SVector::operator* under Ground Truth, Sampling and
Fixed. Panels: (a) Intel-SandyBridge; (b) Intel-Skylake.]
Figure 4.13: Fixed result of 2 mis-attributed functions' instruction profile in soplex for
Intel-SandyBridge and Intel-Skylake.

[Figure: stacked bar charts of the instruction share (y-axis "Percent") of
cSubModIterator::operator++ and cSimulation::module under Ground Truth, Sampling and
Fixed. Panels: (a) Intel-SandyBridge; (b) Intel-Skylake.]
Figure 4.14: Fixed result of 2 mis-attributed functions' instruction profile in omnet for
Intel-SandyBridge and Intel-Skylake.

 1  // caller cSubModIterator::operator++ in cmodule.cc
 2  cModule *cSubModIterator::operator++(int) {
 3    ...
 4    do {
 5      id++;
 6      cModule *mod = simulation.module(id);
 7      if (mod != NULL && parent == mod->parentModule())
 8        return mod;
 9    }
10    while (id <= lastId);
11    ...
12  }
13  ...
14  // callee module in csimul.h
15  cModule *module(int id) const {
16    return id >= 0 && id < size ? vect[id] : NULL;
17  }

Listing 4.6: Code for problematic loop in omnet.

4.6.6 SPEC CPU2006 mcf

Percent

200
bea_is_dual_infeasible
175
primal_bea_mpp
150
125
100
75
69.7%
69.9%
86.0%
50
25 30.3%
30.1%
14.0%
0 Ground Truth Sampling
Fixed

Figure 4.15: mcf,Intel-SandyBridge
Figure 4.16: Fixed result of 2 mis-attributed functions’ instruction profile in mcf for IntelSandyBridge and Intel-Skylake.
mcf is an application implemented to do single-depot vehicle scheduling in public
mass transportation. We configure it with reference input. Its  value indicates 13% of
the total instructions are mis-attributed on Intel-SandyBridge.
Listing 4.7 lists the code for the problematic loop, which accounts nearly 40% of
the total instructions. This loop comes from function primal_bea_mpp in pbeampp.c.
68

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

// caller primal_bea_mpp in pbeampp.c
arc_t *primal_bea_mpp(long m, arc_t *arcs, arc_t *stop_arcs,
                      cost_t *red_cost_of_bea) {
    ...
    for (; arc < stop_arcs; arc += nr_group) {
        if (arc->ident > BASIC) {
            red_cost = arc->cost - arc->tail->potential + arc->head->potential;
            if (bea_is_dual_infeasible(arc, red_cost)) {
                basket_size++;
                perm[basket_size]->a = arc;
                perm[basket_size]->cost = red_cost;
                perm[basket_size]->abs_cost = ABS(red_cost);
            }
        }
    }
    ...
}
...
// callee bea_is_dual_infeasible in pbeampp.c
int bea_is_dual_infeasible(arc_t *arc, cost_t red_cost) {
    return ((red_cost < 0 && arc->ident == AT_LOWER)
         || (red_cost > 0 && arc->ident == AT_UPPER));
}

Listing 4.7: Code for problematic loop in mcf.

primal_bea_mpp calls bea_is_dual_infeasible, an inline function, inside this loop. The loop
performs many data loads with a strided memory access pattern. These memory load
instructions take more cycles than the other instructions in bea_is_dual_infeasible. As a
result, the instruction count of bea_is_dual_infeasible is under-estimated in the HPCToolkit
result.
Figure 4.15 gives the instruction profiling result for these two functions from HPCToolkit,
CCTLib, and the fixed result. HPCToolkit under-estimates bea_is_dual_infeasible's IPC to 46%
of the ground truth. Our algorithm fixes its IPC with an error of less than 5.7%.
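
The 46% ratio can be checked against the bar labels of Figure 4.15. The following is a minimal sketch, assuming that the sampled share of bea_is_dual_infeasible is 14.0% and that its ground-truth share is 30.3%; these two values are a reading of the figure labels, not numbers stated in the text, and the helper names share_ratio and relative_error are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Ratio of a function's measured instruction share to its ground-truth share.
// A ratio of 1.0 means the profile attributes exactly the right share.
double share_ratio(double measured_percent, double ground_truth_percent) {
    assert(ground_truth_percent > 0.0);  // a zero ground truth has no ratio
    return measured_percent / ground_truth_percent;
}

// Relative attribution error, expressed as a fraction of the ground-truth share.
double relative_error(double measured_percent, double ground_truth_percent) {
    return std::fabs(share_ratio(measured_percent, ground_truth_percent) - 1.0);
}
```

Under these assumed readings, share_ratio(14.0, 30.3) is roughly 0.46, which is consistent with the under-estimation reported for HPCToolkit.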

4.6.7 libsvm
The library libsvm is a popular software package for support vector machines [73]. We
run the training phase with the default radial basis function kernel on the input cod-rna,
which contains 59,535 training samples with two-class labels. Listing 4.8 shows the hottest
for-loop in svm.cpp, from the function SVC_Q::get_Q, which covers nearly 85% of the total
executed instructions. It calls Kernel::kernel_rbf, which further invokes the math functions
exp and Kernel::dot to compute the dot product in each loop iteration.

Figure 4.17: Fixed result of 4 mis-attributed functions' instruction profile in libsvm for
Intel-SandyBridge and Intel-Skylake. (a) libsvm, Intel-SandyBridge. (b) libsvm, Intel-Skylake.
[Plots omitted: stacked bars of instruction share (y-axis: Percent) for Kernel::dot, exp,
Kernel::kernel_rbf, and SVC_Q::get_Q under Ground Truth, Sampling, and Fixed.]

// Callee: Kernel::kernel_rbf in svm.cpp
double kernel_rbf(int i, int j) const {
    return exp(-gamma * (x_square[i] + x_square[j] - 2 * dot(x[i], x[j])));
}
...
// Caller: SVC_Q::get_Q's for-loop in svm.cpp
for (j = start; j < len; j++)
    data[j] = (Qfloat)(y[i] * y[j] * (this->*kernel_function)(i, j));

Listing 4.8: Code of problematic functions in libsvm.

Figure 4.17a and Figure 4.17b show the fixed result of this hot for-loop on Intel-SandyBridge
and Intel-Skylake, respectively. Comparing the HPCToolkit and CCTLib profiling results of
this for-loop, we find that SVC_Q::get_Q is significantly under-attributed on both platforms.
SVC_Q::get_Q contains light for-loop control instructions, as opposed to heavier data load
and multiply instructions. While Kernel::dot's for-loop is self-contained, the few for-loop
instructions skid to other functions. There is a heavy divide instruction inside exp, but it is
not very close to exp's exit, which makes the profiling result of exp vary between the two
platforms. Kernel::kernel_rbf is a small function of around 30 instructions that calls
Kernel::dot and exp. As a result, Kernel::kernel_rbf's instructions are divided into several
fragments, and each fragment is easily affected by other functions. The share of the function
Kernel::kernel_dot in the instruction profile is doubled compared with the ground truth.
However, SVC_Q::get_Q's IPC is also substantially overestimated, which could mislead users'
optimization strategies.

4.6.8 Effectiveness on Other Hardware Events
We have already shown that our algorithm can successfully recover a sampling-based
instruction profile from the mis-attribution caused by the skid effect. For other
instruction-related hardware events (e.g., load or store instructions), static analysis can
classify each instruction by the instruction type we are interested in. For instance, we can
rely on static analysis to determine whether a specific instruction is a memory load
instruction [74]. Based on this, the function-level instruction profile recovered by our
algorithm can also deliver a recovered function-level profile for a specific instruction type.
For the CPU cycle event, we have explained that CPU cycle events are not affected by the skid
effect of sampling-based profilers when the skid duration in CPU cycles is a fixed value.
Hardware-related Events. Hardware-related events, such as cache misses or memory
accesses, are not evenly distributed across instructions [55, 75]. Each instruction on the
control flow graph has some probability of causing a given hardware-related event, and this
probability differs from instruction to instruction; for instance, every instruction has a
different probability of causing a cache miss. Modern profilers cannot provide ground truth
for these hardware-related events, so we are unable to verify the accuracy of a recovered
profile for them. Because the probability of causing a counter overflow is not the same for
each instruction, we cannot determine the exact instruction that triggers the counter overflow
directly from the mapping between an overflow point and its sample point. With all branch
execution frequencies, however, we can formulate an optimization problem, based on skid
effect emulation, that recovers the probability of each instruction causing a specific
hardware-related event. With these probabilities, the hardware-related event profile can be
recovered. Extending our method to recover hardware-related event profiles remains
an important direction of future work.
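
One way such an optimization problem might be written down is sketched here. It is an illustrative least-squares formulation under assumed notation, not a formulation developed or evaluated in this work. Let n_i be the execution count of instruction i derived from the branch execution frequencies, p_i the unknown probability that one execution of i triggers the event, A_{ji} the fraction of overflows at i that skid effect emulation maps to sample point j, T the sampling period, and s_j the number of samples observed at j. All of these symbols are introduced only for this sketch.

```latex
\min_{p}\; \sum_{j} \Bigl( s_j - \frac{1}{T} \sum_{i} A_{ji}\, n_i\, p_i \Bigr)^{2}
\quad \text{subject to} \quad 0 \le p_i \le 1 \ \text{for all } i .
```

Given a solution p, the event count attributed to a function F would then be estimated as the sum of n_i p_i over the instructions i in F.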

4.7 Conclusion
This chapter describes a framework that rectifies instruction profile inaccuracy at the
function level. We design a measurement approach to quantify the skid effect on five
different CPUs. Furthermore, we devise a novel software scheme to minimize the skid
effect and recover the instruction profiling result from the mis-attribution. We study
several CPU2006 benchmarks and real applications to demonstrate the effectiveness of our
approach on different CPU architectures. We foresee that our scheme can be integrated into
modern profilers as an important component for producing accurate measurements.

Chapter 5

Conclusion and Future Work
In this dissertation, we investigate how to address challenges in system performance
optimization. Specifically, we work on the following topics:
We first present Dr-BW, a tool that uses machine learning techniques to identify
bandwidth contention in NUMA architectures. Dr-BW collects performance data with
low overhead, feeds the data into a novel machine learning model to identify contention,
and associates the analysis results with both programs and significant data objects. We
study a number of benchmarks and show that Dr-BW achieves more than 96% accuracy.
With several case studies, we demonstrate that Dr-BW is able to guide performance
optimization and yield up to a 6.5× speedup.
Next, we propose a framework that rectifies instruction profile inaccuracy at the function
level [76]. We design a measurement approach to quantify the skid effect on five
different CPUs. Furthermore, we devise a novel software scheme to minimize the skid
effect and recover the instruction profiling result from the mis-attribution. We study
several CPU2006 benchmarks and real applications to demonstrate the effectiveness of our
approach on different CPU architectures. We foresee that our scheme can be integrated into
modern profilers as an important component for producing accurate measurements. This
dissertation reveals that profiling and modeling significantly benefit system performance
improvement. In addition, modeling-based profiling helps users understand performance
bottlenecks and guides performance optimization. For future work, we are
considering the following research directions:
• We will extend Dr-BW to identify resource contention beyond memory bandwidth
using machine learning techniques, such as contention in instruction issue slots,
different levels of caches, and I/O devices.
• Hardware-related events, such as cache misses or memory accesses, are not evenly
distributed across instructions [55, 75]. Each instruction on the control flow graph has
some probability of causing a given hardware-related event, and this probability, and
hence the probability of causing a counter overflow, differs from instruction to
instruction. For instance, every instruction has a different probability of causing a
cache miss. We therefore cannot determine the exact instruction that triggers a counter
overflow directly from the mapping between an overflow point and its sample point. With
all branch execution frequencies, we can formulate an optimization problem, based on
skid effect emulation, to recover the probability of each instruction causing a specific
hardware-related event. We are then able to recover the hardware-related event profile.

Bibliography
[1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos
Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs.
In Presented as part of the 10th USENIX Symposium on Operating Systems Design
and Implementation (OSDI 12), pages 17–30, Hollywood, CA, 2012. USENIX.
[2] Maotong Xu, Sultan Alamro, Tian Lan, and Suresh Subramaniam. Chronos: A unifying optimization framework for speculative execution of deadline-critical mapreduce jobs. In 2018 IEEE 38th International Conference on Distributed Computing
Systems (ICDCS), pages 718–729. IEEE, 2018.
[3] Maotong Xu, Sultan Alamro, Tian Lan, and Suresh Subramaniam. Optimizing speculative execution of deadline-sensitive jobs in cloud. In ACM SIGMETRICS Performance Evaluation Review, volume 45, pages 17–18. ACM, 2017.
[4] Intel 64 and ia-32 architectures software developer's manual. https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html.
[Accessed: 10-22-2018].
[5] Q. Wu, S. Flolid, S. Song, J. Deng, and L. K. John. Invited paper for the hot workloads special session hot regions in spec cpu2017. In 2018 IEEE International
Symposium on Workload Characterization (IISWC), pages 71–77, Sep. 2018.
[6] R. Panda, S. Song, J. Dean, and L. K. John. Wait of a decade: Did spec cpu 2017
broaden the performance horizon? In 2018 IEEE International Symposium on High
Performance Computer Architecture (HPCA), pages 271–282, Feb 2018.
[7] Xu Liu and John Mellor-Crummey. A tool to analyze the performance of multithreaded programs on NUMA architectures. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014.
[8] Renaud Lachaize, Baptiste Lepers, and Vivien Quéma. MemProf: A memory profiler for NUMA multicore systems. In Proc. of the 2012 USENIX Annual Technical
Conf., Berkeley, CA, USA, 2012.
[9] Paul J Drongowski. Instruction-based sampling: A new performance analysis technique for amd family 10h processors. Advanced Micro Devices, 2007.
[10] Dehao Chen, Neil Vachharajani, Robert Hundt, Xinliang Li, Stephane Eranian,
Wenguang Chen, and Weimin Zheng. Taming hardware event samples for precise
and versatile feedback directed optimizations. IEEE Transactions on Computers,
62(2):376–389, 2013.
[11] An introduction to last branch records. https://lwn.net/Articles/680985/. [Accessed: 10-24-2018].
[12] L. Adhianto et al. HPCToolkit: Tools for performance analysis of optimized parallel
programs. Concurrency and Computation: Practice and Experience, 2010.
[13] Intel Corporation. Intel VTune performance analyzer. http://www.intel.com/software/products/vtune.
[14] Intel Corporation. Linux performance tool. http://www.brendangregg.com/linuxperf.html.
[15] D. Eklov, N. Nikoleris, D. Black-Schaffer, and E. Hagersten. Bandwidth bandit:
Quantitative characterization of memory contention. In 2013 IEEE/ACM International Symposium on Code Generation and Optimization, 2013.
[16] Marc Casas and Greg Bronevetsky. Active measurement of memory resource consumption. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014.
[17] Guilherme Piccoli, Henrique N. Santos, Raphael E. Rodrigues, Christiane Pousa,
Edson Borin, and Fernando M. Quintão Pereira. Compiler support for selective
page migration in numa architectures. In Proceedings of the 23rd International
Conference on Parallel Architectures and Compilation, 2014.
[18] Jaydeep Marathe and Frank Mueller. Hardware profile-guided automatic page
placement for ccNUMA systems. In Proceedings of the eleventh ACM SIGPLAN
symposium on Principles and practice of parallel programming, pages 90–99. ACM,
2006.
[19] Matthias Diener, Eduardo H.M. Cruz, Philippe O.A. Navaux, Anselm Busse, and
Hans-Ulrich Heiß. kmaf: Automatic kernel-level management of thread and data
affinity. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, 2014.
[20] Mohammad Dashti et al. Traffic management: a holistic approach to memory placement on NUMA systems. In Proc. of the 18th Intl. Conf. on Architectural Support
for Programming Languages and Operating Systems, 2013.
[21] Xu Liu and John M. Mellor-Crummey. A data-centric profiler for parallel programs.
In Proc. of the 2013 ACM/IEEE Conference on Supercomputing, 2013.
[22] Sanath Jayasena, Saman Amarasinghe, Asanka Abeyweera, Gayashan Amarasinghe, Himeshi De Silva, Sunimal Rathnayake, Xiaoqiao Meng, and Yanbin Liu.
Detection of false sharing using machine learning. In 2013 SC-International Conference for High Performance Computing, Networking, Storage and Analysis (SC),
pages 1–9. IEEE, 2013.

[23] ElMoustapha Ould-Ahmed-Vall, James Woodlee, Charles Yount, Kshitij A Doshi,
and Seth Abraham. Using model trees for computer architecture performance analysis of software applications. In 2007 IEEE International Symposium on Performance Analysis of Systems & Software, pages 116–125. IEEE, 2007.
[24] Wucherl Yoo. Automated performance characterization of applications using hardware monitoring events. PhD thesis, University of Illinois at Urbana-Champaign,
2013.
[25] Jeffrey Vetter. Performance analysis of distributed applications using automatic
classification of communication inefficiencies. In Proceedings of the 14th international conference on Supercomputing, pages 245–254. ACM, 2000.
[26] Thomas Ball and James R Larus. Optimally profiling and tracing programs. ACM
Transactions on Programming Languages and Systems (TOPLAS), 16(4):1319–1360, 1994.
[27] Youfeng Wu and James R Larus. Static branch frequency and program profile
analysis. In Proceedings of the 27th annual international symposium on Microarchitecture, pages 1–11. ACM, 1994.
[28] Jennifer M Anderson, Lance M Berc, Jeffrey Dean, Sanjay Ghemawat, Monika R
Henzinger, Shun-Tak A Leung, Richard L Sites, Mark T Vandevoorde, Carl A Waldspurger, and William E Weihl. Continuous profiling: Where have all the cycles
gone? In ACM SIGOPS Operating Systems Review, volume 31, pages 1–14. ACM, 1997.
[29] Roy Levin, Ilan Newman, and Gadi Haber. Complementing missing and inaccurate
profiling using a minimum cost circulation algorithm. In International Conference
on High-Performance Embedded Architectures and Compilers, pages 291–304.
Springer, 2008.

[30] Dehao Chen, Neil Vachharajani, Robert Hundt, Shih-wei Liao, Vinodha Ramasamy, Paul Yuan, Wenguang Chen, and Weimin Zheng. Taming hardware
event samples for fdo compilation. In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, pages 42–52. ACM,
2010.
[31] Bo Wu, Mingzhou Zhou, Xipeng Shen, Yaoqing Gao, Raul Silvera, and Graham
Yiu. Simple profile rectifications go a long way. In European Conference on Object-Oriented Programming, pages 654–678. Springer, 2013.
[32] Maria Dimakopoulou, Stéphane Eranian, Nectarios Koziris, and Nicholas Bambos.
Reliable and efficient performance monitoring in linux. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and
Analysis, page 34. IEEE Press, 2016.
[33] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F Sweeney. Evaluating the accuracy of java profilers. In ACM Sigplan Notices, volume 45, pages
187–197. ACM, 2010.
[34] Robert V Lim, David Carrillo-Cisneros, W Alkowaileet, and I Scherson. Computationally efficient multiplexing of events on hardware counters. In Linux Symposium,
pages 101–110. Citeseer, 2014.
[35] Wiplove Mathur and Jeanine Cook. Toward accurate performance evaluation using
hardware counters. In ITEA Modeling and Simulation Workshop, pages 23–32,
2003.
[36] David Levinthal. Performance analysis guide for intel core i7 processor and intel
xeon 5500 processors. Intel Performance Analysis Guide, 30:18, 2009.
[37] Andrzej Nowak, Ahmad Yasin, Avi Mendelson, and Willy Zwaenepoel. Establishing
a base of trust with performance counters for enterprise workloads. In USENIX
Annual Technical Conference, pages 541–548, 2015.
[38] Vincent M Weaver. Advanced hardware profiling and sampling (pebs, ibs, etc.):
Creating a new papi sampling interface. 2016.
[39] Lawrence Livermore National Laboratory. LLNL Sequoia Benchmarks.
https://asc.llnl.gov/sequoia/benchmarks. Last accessed: Dec. 12, 2013.
[40] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer,
Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous
computing. In Proc. of the 2009 IEEE Intl. Symp. on Workload Characterization
(IISWC), 2009.
[41] D. H. Bailey, E. Barszcz, et al. The NAS parallel benchmarks – summary and
preliminary results. In Proc. of the 1991 ACM/IEEE conference on Supercomputing,
1991.
[42] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC
benchmark suite: characterization and architectural implications. In Proc. of the
17th Intl. Conf. on Parallel Architecture and Compilation Techniques (PACT), 2008.
[43] Baptiste Lepers, Vivien Quéma, and Alexandra Fedorova. Thread and memory
placement on numa systems: Asymmetry matters. In Proceedings of the 2015
USENIX Conference on Usenix Annual Technical Conference.
[44] Intel® 64 and ia-32 architectures software developer's manual. 2010.
[45] Paul J. Drongowski. Instruction-based sampling: A new performance analysis technique for AMD family 10h processors. http://developer.amd.com/Assets/AMD_IBS_paper_EN.pdf, November 2007. Last accessed: Dec. 13, 2013.
[46] M Srinivas, B Sinharoy, RJ Eickemeyer, R Raghavan, S Kunkel, T Chen, W Maron,
D Flemming, A Blanchard, and P Seshadri. IBM POWER7 performance modeling,
verification, and evaluation. IBM Journal of Research and Development, 55(3):4–1,
2011.

[47] Lawrence Livermore National Laboratory. Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). https://codesign.llnl.gov/lulesh.php.
Last accessed: Dec. 12, 2013.
[48] Andreas Kleen. A NUMA API for Linux.
http://developer.amd.com/wordpress/media/2012/10/LibNUMA-WP-fv1.pdf, 2005. Last accessed: Dec. 12, 2013.
[49] Xu Liu, Kamal Sharma, and John Mellor-Crummey. Arraytool: a lightweight profiler
to guide array regrouping. In Proceedings of the 23rd international conference on
Parallel architectures and compilation, pages 405–416. ACM, 2014.
[50] William E Cohen. Tuning programs with oprofile. Wide Open Magazine, 1:53–62,
2004.
[51] Vincent M Weaver. Linux perf_event features and overhead. In The 2nd International Workshop on Performance Analysis of Workload Optimized Systems, FastPath, volume 13, 2013.
[52] Laksono Adhianto, Sinchan Banerjee, Mike Fagan, Mark Krentel, Gabriel Marin,
John Mellor-Crummey, and Nathan R Tallent. Hpctoolkit: Tools for performance
analysis of optimized parallel programs. Concurrency and Computation: Practice
and Experience, 22(6):685–701, 2010.
[53] Intel Vtune. https://software.intel.com/en-us/intel-vtune-amplifier-xe.
[Accessed: 08-12-2017].
[54] Shasha Wen, Xu Liu, John Byrne, and Milind Chabbi. Watching for software inefficiencies with witch. In Proceedings of the Twenty-Third International Conference on
Architectural Support for Programming Languages and Operating Systems, pages
332–347. ACM, 2018.
[55] Hao Xu, Shasha Wen, Alfredo Gimenez, Todd Gamblin, and Xu Liu. Dr-bw: identifying bandwidth contention in numa architectures with supervised learning. In
Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International,
pages 367–376. IEEE, 2017.
[56] Nathan R. Tallent. Performance Analysis for Parallel Programs: From Multicore to
Petascale. Ph.D. dissertation, Department of Computer Science, Rice University,
March 2010.
[57] Milind Chabbi, Xu Liu, and John Mellor-Crummey. Call paths for pin tools. In
Proceedings of Annual IEEE/ACM International Symposium on Code Generation
and Optimization, page 76. ACM, 2014.
[58] Robert J Hall. Call path profiling. In Proceedings of the 14th international conference on Software engineering, pages 296–306. ACM, 1992.
[59] Trevor E Carlson, Wim Heirman, and Lieven Eeckhout. Sniper: exploring the level
of abstraction for scalable and accurate parallel multi-core simulation. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, page 52. ACM, 2011.
[60] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K Reinhardt, Ali Saidi,
Arkaprava Basu, Joel Hestness, Derek R Hower, Tushar Krishna, Somayeh Sardashti, et al. The gem5 simulator. ACM SIGARCH Computer Architecture News,
39(2):1–7, 2011.
[61] Rafael Ubal, Julio Sahuquillo, Salvador Petit, and Pedro Lopez. Multi2sim: A simulation framework to evaluate multicore-multithreaded processors. In 19th International Symposium on Computer Architecture and High Performance Computing
(SBAC-PAD’07), pages 62–68. IEEE, 2007.
[62] Matt T Yourst. Ptlsim: A cycle accurate full system x86-64 microarchitectural simulator. In Performance Analysis of Systems & Software, 2007. ISPASS 2007. IEEE
International Symposium on, pages 23–34. IEEE, 2007.

[63] Complement material. https://github.com/simon4173/ics_complement_materials/blob/master/Complement_ICS19.pdf.
[64] Alan E Gelfand, Susan E Hills, Amy Racine-Poon, and Adrian FM Smith. Illustration
of bayesian inference in normal data models using gibbs sampling. Journal of the
American Statistical Association, 85(412):972–985, 1990.
[65] SPEC Corporation. SPEC CPU2006 benchmark suite. http://www.spec.org/cpu2006.
3 November 2007.
[66] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff
Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building
customized program analysis tools with dynamic instrumentation. In Acm sigplan
notices, volume 40, pages 190–200. ACM, 2005.
[67] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank
citation ranking: Bringing order to the web. Technical report, Stanford InfoLab,
1999.
[68] S. Song, M. Li, X. Zheng, M. LeBeane, J. H. Ryoo, R. Panda, A. Gerstlauer, and
L. K. John. Proxy-guided load balancing of graph processing workloads on heterogeneous clusters. In 2016 45th International Conference on Parallel Processing
(ICPP), pages 77–86, Aug 2016.
[69] Shuang Song, Xu Liu, Qinzhe Wu, Andreas Gerstlauer, Tao Li, and Lizy K. John.
Start late or finish early: A distributed graph processing system with redundancy
reduction. Proc. VLDB Endow., 12(2):154–168, October 2018.
[70] S. Song, X. Zheng, A. Gerstlauer, and L. K. John. Fine-grained power analysis
of emerging graph processing workloads for cloud operations management. In
2016 IEEE International Conference on Big Data (Big Data), pages 2121–2126,
Dec 2016.

[71] Jure Leskovec, Lada A Adamic, and Bernardo A Huberman. The dynamics of viral
marketing. ACM Transactions on the Web (TWEB), 1(1):5, 2007.
[72] John L Henning. Spec cpu2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4):1–17, 2006.
[73] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27,
2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[74] Pengfei Su, Shasha Wen, Hailong Yang, Milind Chabbi, and Xu Liu. Redundant
loads: A software inefficiency indicator. arXiv preprint arXiv:1902.05462, 2019.
[75] Gangyi Zhu and Gagan Agrawal. A performance prediction framework for irregular
applications. In 2018 IEEE 25th International Conference on High Performance
Computing (HiPC), pages 304–313. IEEE, 2018.
[76] Hao Xu, Qingsen Wang, Shuang Song, Lizy Kurian John, and Xu Liu. Can we
trust profiling results? understanding and fixing the inaccuracy in modern profilers.
In Proceedings of the ACM International Conference on Supercomputing, pages
284–295, 2019.

VITA

Hao Xu

Hao Xu has been working on his Ph.D. degree in the Department of Computer
Science at the College of William and Mary since Fall 2014. He is working with
Dr. Xu Liu in the fields of building profiling tools for performance optimizations.
Hao Xu got his M.S. in 2014 from University of Chinese Academy of Sciences,
China, and B.S. in 2011 from Wuhan University of Technology, China.


