Techniques for Detection, Root Cause Diagnosis, and Classification of In-Production Concurrency Bugs by Kasikci, Baris Can Cengiz
FOR THE DEGREE OF DOCTEUR ÈS SCIENCES
accepted on the proposal of the jury:
Prof. W. Zwaenepoel, jury president
Prof. G. Candea, thesis director
Prof. E. Berger, examiner
Dr M. Musuvathi, examiner
Prof. C. Kozyrakis, examiner
THESIS NO. 6873 (2015)
ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE
PRESENTED ON 17 DECEMBER 2015
AT THE SCHOOL OF COMPUTER AND COMMUNICATION SCIENCES
DEPENDABLE SYSTEMS LABORATORY
DOCTORAL PROGRAM IN COMPUTER AND COMMUNICATION SCIENCES
Switzerland
2015
BY
Baris Can Cengiz KASIKCI

Everybody who learns concurrency thinks they understand it, ends
up ﬁnding mysterious races they thought weren’t possible, and
discovers that they didn’t actually understand it yet after all.
— Herb Sutter
Any fool can know. The point is to understand.
— Albert Einstein

ABSTRACT
Concurrency bugs are among the worst problems that plague software. They slow down software development because it can take weeks or even months before developers can identify and fix them.
In-production detection, root cause diagnosis, and classification of concurrency bugs are challenging, because these activities require heavyweight analyses, such as exploring program paths and determining failing program inputs and schedules, none of which are suited for software running in production.
This dissertation develops practical techniques for the detection, root cause diagnosis, and classification of concurrency bugs for in-production software. Furthermore, we develop ways for developers to better reason about concurrent programs. This dissertation builds upon the following principles:
— The approach in this dissertation spans multiple layers of the
system stack, because concurrency spans many layers of the
system stack.
— It performs most of the heavyweight analyses in-house and re-
sorts to minimal in-production analysis in order to move the
heavy lifting to where it is least disruptive.
— It eschews custom hardware solutions that may be infeasible to
implement in the real world.
Relying on the aforementioned principles, this dissertation intro-
duces:
1. Techniques to automatically detect concurrency bugs (data races
and atomicity violations) in-production by combining in-house
static analysis and in-production dynamic analysis.
2. A technique to automatically identify the root causes of in-pro-
duction failures, with a particular emphasis on failures caused
by concurrency bugs.
3. A technique that, given a data race, automatically classifies it based on its potential consequences, allowing developers to answer questions such as “can the data race cause a crash or a hang?” or “does the data race have any observable effect?”
We build a toolchain that implements all the aforementioned tech-
niques. We show that the tools we develop in this dissertation are
effective, incur low runtime performance overhead, and have high
accuracy and precision.
Keywords: Concurrency bugs, data race, atomicity violation, static
analysis, dynamic analysis

RÉSUMÉ
Concurrency bugs are at the heart of the worst problems encountered by programs in production. These bugs slow down software development, requiring weeks or even months before developers can identify and fix them.
Detecting concurrency bugs in production, diagnosing their root cause, and classifying them is a challenge. Indeed, these activities require heavyweight analyses, such as exploring the different paths reachable by the program and determining the inputs and the schedule of the offending program. These activities are not suited to programs deployed in production.
This thesis develops techniques usable in production for detecting the root cause of concurrency bugs and for classifying them. We also develop various means of helping developers reason better about concurrent programs. This thesis is based on the following principles:
— Span several layers of the system, just like the bugs it tackles.
— Run the heaviest analyses in the background and keep only a minimum of analysis in production, so as to affect the system as little as possible.
— Avoid solutions that require custom-built hardware, which may not be feasible in the real world.
Based on these principles, this thesis introduces:
1. Techniques to automatically detect concurrency bugs (data races and atomicity violations) in production by combining static analysis in the background with dynamic analysis in production.
2. A technique to automatically identify the root cause of problems in production, with a particular emphasis on concurrency bugs.
3. A technique that, for a given data race, automatically classifies it according to its potential consequences, allowing a developer to quickly answer questions such as “Does this data race crash the program or make it hang?” or “Does this data race have an observable effect?”
We built a set of tools that implement the techniques listed above.
We show that the tools developed in this thesis are effective, have a low impact on performance, and are highly precise.
Keywords: Concurrency bugs, data race, atomicity violation, static analysis, dynamic analysis
PUBLICATIONS
This dissertation primarily builds upon the ideas presented in the following publications:
— Baris Kasikci, Benjamin Schubert, Cristiano Pereira, Gilles Pokam, et al. “Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures.” In: Symp. on Operating Systems Principles. 2015
— Baris Kasikci, Cristiano Pereira, Gilles Pokam, Benjamin Schubert, et al. “Failure Sketches: A Better Way to Debug.” In: Workshop on Hot Topics in Operating Systems. 2015
— Cristian Zamfir, Baris Kasikci, Johannes Kinder, Edouard Bugnion, et al. “Automated Debugging for Arbitrarily Long Executions.” In: Workshop on Hot Topics in Operating Systems. 2013
— Baris Kasikci, Cristian Zamfir, and George Candea. “RaceMob: Crowdsourced Data Race Detection.” In: Symp. on Operating Systems Principles. 2013
— Baris Kasikci, Cristian Zamfir, and George Candea. “Data Races vs. Data Race Bugs: Telling the Difference with Portend.” In: Intl. Conf. on Architectural Support for Programming Languages and Operating Systems. 2012
— Baris Kasikci, Cristian Zamfir, and George Candea. “Automated Classification of Data Races Under Both Strong and Weak Memory Models.” In: ACM Transactions on Programming Languages and Systems 37.3 (2015)
— Baris Kasikci, Thomas Ball, George Candea, John Erickson, et al. “Efficient Tracing of Cold Code Via Bias-Free Sampling.” In: USENIX Annual Technical Conf. 2014

ACKNOWLEDGMENTS
As I set out to write these acknowledgements, I realized that many
people contributed to the formation of this dissertation. I would like
to apologize upfront if I forgot to mention the names of the people
who helped me throughout my PhD; it is an honest mistake.
First and foremost, I would like to thank my advisor George Candea, who taught me how to do research that matters. His unwavering perfectionism is omnipresent in this dissertation. To me, George is the embodiment of the ideal advisor: he is resourceful, he always follows up, and he always supports his students. I try to follow his example with the students I work with.
I would like to thank Emery Berger, Christos Kozyrakis, Madan Musuvathi, and Willy Zwaenepoel for being on my dissertation committee. It is an honor to get the input and criticism of such world-class experts. They helped me greatly improve this dissertation.
I would like to thank all the members and alumni of DSLAB, who have been amazing company over the past five years. I learned a lot from this amazingly talented and smart team. This dissertation benefited greatly from the honest and unbiased feedback of DSLAB.
Many thanks go to Silviu Andrica, Radu Banabic, Alexandre Bique,
Stefan Bucur, Amer Chamseddine, Vitaly Chipounov, Alexandru Copot,
Francesco Fucci, Loïc Gardiol, Horatiu Jula, Johannes Kinder, Vova
Kuznetsov, Georg Schmid, Benjamin Schubert, Ana Sima, Jonas Wag-
ner, Cristian Zamﬁr, Peter Zankov, Arseniy Zaostrovnykh, and Lisa
Zhou. I would like to especially thank Nicoletta Isaac, DSLAB’s ad-
ministrative assistant, for making my life at EPFL easier.
I would like to thank Babak Falsaﬁ and Edouard Bugnion for all
the advice and the support they gave me. They are great people and
mentors, and I am very lucky to have met them during my PhD. I
would also like to thank Nisheeth Vishnoi, for all the information he
gave me about life in academia.
I would like to thank my mentors and colleagues at Microsoft Re-
search, VMware and Intel. Many ideas in this dissertation developed
from lengthy discussions with them. In particular, many thanks go
to Thomas Ball, John Erickson, Chandra Prasad, Wolfram Schulte,
and Danny van Velzen of Microsoft; Eric von Bayer, Dilpreet Bindra,
Swathi Koundinya, Hiep Ma of VMware; Mohammad Haghighat,
Cristiano Pereira, and Gilles Pokam of Intel.
I would like to thank Kerem Kapucu, Eren Can Erdoğan, and Hakan Ertuna for their great friendship. I would also like to thank other close friends Duygu Ceylan, Roy Combe, Berra Erkoşar, Cansu Kaynak, Onur Kazanç, Onur Koçberber, Cüneyt Songüler, and Pınar Tözün.
I would like to thank Vikram Adve, Gustavo Alonso, Katerina Ar-
gyraki, Olivier Crameri, Sotiria Fytraki, Christopher Ming-Yee Iu,
Ryan Johnson, James Larus, Petros Maniatis, Yanlei Zhao, and all the
anonymous reviewers of my work whose comments helped greatly
improve this dissertation.
I would like to thank VMware, Intel and the European Research
Council for supporting my research.
Last but not least, I would like to thank my family: Yildiz, Baris,
and Tia. They provided their continuous love and support through-
out my PhD. Without their company, this dissertation wouldn’t have
been possible.
CONTENTS
I Setting the Stage
1 Introduction
1.1 Problem Definition
1.2 Challenges
1.2.1 The Runtime Performance Overhead Challenge
1.2.2 The Accuracy Challenge
1.2.3 The In-Production Challenge
1.3 Overview of Prior Solution Attempts
1.3.1 Attempts at Addressing the Overhead Challenge
1.3.2 Attempts at Addressing the Accuracy Challenge
1.3.3 Attempts at Addressing the In-Production Challenge
1.4 Solution Overview
1.5 Summary of Contributions
1.6 Summary of Results
2 Background and Related Work
2.1 Definitions
2.1.1 Data Race
2.1.2 Atomicity Violation
2.1.3 Root Cause
2.2 Concurrency Bug Surveys
2.3 Data Race Detection Literature
2.3.1 Static Data Race Detection
2.3.2 Dynamic Data Race Detection
2.3.3 Mixed Static-Dynamic Data Race Detection
2.3.4 Detecting Data Races In Production
2.3.5 Data Race Avoidance
2.4 Atomicity Violation Detection
2.4.1 Static Atomicity Violation Detection
2.4.2 Dynamic Atomicity Violation Detection
2.5 Root Cause Diagnosis of In-Production Failures
2.6 Concurrency Bug Classification
II Eliminating Concurrency Bugs from In-Production Systems
3 RaceMob: Detecting Data Races in Production
3.1 Design Overview
3.2 Static Data Race Detection
3.3 Dynamic Data Race Validation
3.3.1 Dynamic Context Inference
3.3.2 On-Demand Data Race Detection
3.3.3 Schedule Steering
3.4 Crowdsourcing the Validation
3.5 Reaching a Verdict
3.6 Implementation Details
4 Gist: Root Cause Diagnosis of In-Production Failures
4.1 Design Overview
4.2 Static Slice Computation
4.3 Slice Refinement
4.3.1 Adaptive Slice Tracking
4.3.2 Tracking Control Flow
4.3.3 Tracking Data Flow
4.4 Identifying the Root Cause
4.5 Implementation Details
5 Portend: Classifying Data Races During Testing
5.1 A Fine-Grained Way to Classify Data Races
5.2 Design Overview
5.3 Single-Path Analysis
5.4 Multi-Path Analysis
5.5 Symbolic Output Comparison
5.6 Multi-Schedule Analysis
5.7 Symbolic Memory Consistency Modeling
5.8 Classification Verdicts
5.9 Portend’s Debugging Aid Output
5.10 Implementation Details
6 Evaluation
6.1 RaceMob’s Evaluation
6.1.1 Experimental Setup
6.1.2 Effectiveness
6.1.3 Efficiency
6.1.4 Comparison to Other Detectors
6.1.5 Comparison to Concurrency Testing Tools
6.1.6 Scalability with Application Threads
6.2 Gist’s Evaluation
6.2.1 Experimental Setup
6.2.2 Automated Generation of Sketches
6.2.3 Accuracy of Failure Sketches
6.2.4 Efficiency
6.3 Portend’s Evaluation
6.3.1 Experimental Setup
6.3.2 Effectiveness
6.3.3 Accuracy and Precision
6.3.4 Efficiency
6.3.5 Comparison to Existing Data Race Detectors
6.3.6 Efficiency and Effectiveness of Symbolic Memory Consistency Modeling
6.3.7 Memory Consumption of Symbolic Memory Consistency Modeling
III Wrapping Up
7 Ongoing and Future Work
7.1 Enhancing Security through Path Profiling
7.2 Privacy Implications of Collaborative Approaches
7.3 Exposing Concurrency Bugs
7.4 Concurrency In Large-Scale Distributed Systems
8 Conclusions
Bibliography
LIST OF FIGURES
Figure 1 Example of a switch statement adapted from [25]
Figure 2 False negatives in happens-before (HB) dynamic race detectors: the data race on x is not detected in Execution 1, but it is detected in Execution 2.
Figure 3 Two executions from the same program without a data race. Execution 1 has a race condition, because the program’s specification defines executions where x is set to 2 in T2 after it is set to 1 in T1 as erroneous.
Figure 4 Two executions from different programs. Both executions violate the atomicity requirement of writing to x and reading from it atomically in T1. Execution 1 has data races, whereas execution 2 does not have any data races.
Figure 5 RaceMob’s crowdsourced architecture: A static detection phase, run on the hive, is followed by a dynamic validation phase on users’ machines.
Figure 6 Minimal monitoring in DCI: For this example, DCI stops tracking synchronization operations as soon as each thread goes once through the barrier.
Figure 7 The state machine used by the hive to reach verdicts based on reports from program instances. Transition edges are labeled with validation results that arrive from instrumented program instances; states are labeled with RaceMob’s verdict.
Figure 8 The failure sketch of pbzip2 bug.
Figure 9 The architecture of Gist
Figure 10 Adaptive slice tracking in Gist
Figure 11 Example of control (a) and data (b) flow tracking in Gist. Solid horizontal lines are program statements, circles are basic blocks.
Figure 12 Four common atomicity violation patterns (RWR, WWR, RWW, WRW). Adapted from [8].
Figure 13 A sample execution failing at the second read in T1 (a), and three potential concurrency errors: a RWR atomicity violation (b), 2 WR data races (c-d).
Figure 14 Portend taxonomy of data races.
Figure 15 High-level architecture of Portend. The six shaded boxes indicate new code written for Portend, whereas clear boxes represent reused code from KLEE [35] and Cloud9 [33].
Figure 16 Increasing levels of completeness in terms of paths and schedules: [a. single-pre/single-post], [b. single-pre/multi-post], [c. multi-pre/multi-post].
Figure 17 Simplified example of a harmful data race from Ctrace [141] that would be classified as harmless by classic data race classifiers.
Figure 18 Portend prunes paths during symbolic execution.
Figure 19 A program to illustrate the benefits of symbolic output comparison
Figure 20 Simple multithreaded program
Figure 21 Lamport clocks and a happens-before graph
Figure 22 Write Buffering
Figure 23 Example debugging aid report for Portend.
Figure 24 Breakdown of average overhead into instrumentation-induced overhead and detection-induced overhead.
Figure 25 Contribution of each technique to lowering the aggregate overhead of RaceMob. Dynamic detection represents detection with TSAN. RaceMob without DCI and on-demand detection just uses static data race detection to prune the number of accesses to monitor.
Figure 26 Concurrency testing benchmarks: bench1 is shown in Fig. 2, thus not repeated here. In bench2, the accesses to x in thread T1 and T3 can race, but the long sleep in T3 and T4 causes the signal-wait and lock-unlock pairs to induce a happens-before edge between T1 and T4. bench3 has a similar situation to bench2. In bench4, the accesses to variables x, y, z from T1 and T2 are racing if the input is either in1, in2, or in3.
Figure 27 Data race detection coverage for RaceMob vs. RaceFuzzer. To do as well as RaceMob, RaceFuzzer must have a priori access to all test cases (the RaceFuzzer3 curve).
Figure 28 RaceMob scalability: Induced overhead as a function of the number of application threads.
Figure 29 The failure sketch of Curl bug #965.
Figure 30 The failure sketch of Apache bug #21287. The grayed-out components are not part of the ideal failure sketch, but they appear in the sketch that Gist automatically computes.
Figure 31 Accuracy of Gist, broken down into relevance accuracy and ordering accuracy.
Figure 32 Contribution of various techniques to Gist’s accuracy.
Figure 33 Gist’s average runtime performance overhead across all runs as a function of tracked slice size.
Figure 34 Tradeoff between slice size and the resulting accuracy and latency. Accuracy is in percentage, latency is in the number of failure recurrences.
Figure 35 Comparison of the full tracing overheads of Mozilla rr and Intel PT.
Figure 36 Breakdown of the contribution of each technique toward Portend’s accuracy. We start from single-path analysis and enable one by one the other techniques: ad-hoc synchronization detection, multi-path analysis, and finally multi-schedule analysis.
Figure 37 Simplified examples for each data race class from real systems. (a) and (b) are from ctrace, (c) is from memcached and (d) is from pbzip2. The arrows indicate the pair of racing accesses.
Figure 38 Change in classification time with respect to number of preemptions and number of dependent branches for some of the data races in Table 9. Each sample point is labeled with data race id.
Figure 39 Portend’s accuracy with increasing values of k.
Figure 40 A program with potential write reordering.
Figure 41 A program with potential write reordering that leads to a crash.
Figure 42 A program with no potential for write reordering.
Figure 43 A program that uses barriers and has a potential write reordering that leads to a crash.
Figure 44 Running time of Portend-weak and Portend-seq
Figure 45 Memory usage of Portend-weak and Portend-seq
LIST OF TABLES
Table 1 Data race detection with RaceMob. The static phase reports Data race candidates (row 2). The dynamic phase reports verdicts (rows 3-10). Causes hang and Causes crash are data races that caused the program to hang or crash. Single order are true data races for which either the primary or the alternate executed (but not both) with no intervening synchronization; Both orders are data races for which both executed without intervening synchronization.
Table 2 Runtime overhead of data race detection as a percentage of uninstrumented execution. Average overhead is 2.32%, and maximum overhead is 4.54%.
Table 3 Data race detection results with RaceMob, ThreadSanitizer (TSAN), and RELAY. Each cell shows the number of reported data races. The data races reported by RaceMob and TSAN are all true data races. The only true data races among the ones detected by RELAY are the ones in the row “RaceMob”. To the best of our knowledge, two of the data races that cause a hang in SQLite were not previously reported.
Table 4 RaceMob aggregate overhead vs. TSAN’s average overhead, relative to uninstrumented execution. RaceMob’s aggregate overhead is across all the executions for all users. For TSAN, we report the average overhead of executing all the available test cases.
Table 5 RaceMob vs. concurrency testing tools: Ratio of data races detected in each benchmark to the total number of data races in that benchmark.
Table 6 Bugs used to evaluate Gist. Bug IDs come from the corresponding official bug database. Source lines of code are measured using sloccount [214]. We report slice and sketch sizes in both source code lines and LLVM instructions. Time is reported in minutes:seconds.
Table 7 Programs analyzed with Portend. Source lines of code are measured with the cloc utility.
Table 8 “Spec violated” data races and their consequences.
Table 9 Summary of Portend’s classification results. We consider two data races to be distinct if they involve different accesses to shared variables; the same data race may be encountered multiple times during an execution. These two different aspects are captured by the Distinct data races and Data race instances columns, respectively. Portend uses the stack traces and the program counters of the threads making the racing accesses to identify distinct data races. The last 5 columns classify the distinct data races. The states same/differ columns show for how many data races the primary and alternate states were different after the data race, as computed by the Record/Replay Analyzer [152].
Table 10 Portend’s classification time for the 93 data races in Table 9.
Table 11 Accuracy for each approach and each classification category, applied to the 93 data races in Table 9. “Not-classified” means that an approach cannot perform classification for a particular class.
Table 12 Portend’s effectiveness in bug finding and state coverage for two memory model configurations: sequential memory consistency mode and Portend’s weak memory consistency mode.
Part I
SETTING THE STAGE
In this part, we define the problem tackled in this dissertation along with the associated challenges for solving it, and prior solution attempts. We give a brief overview of the solution we propose, followed by a thorough treatment of related work on detection, root cause diagnosis, and classification of concurrency bugs.

1
INTRODUCTION
In this chapter, we elaborate on the definition of the problem addressed in this dissertation (§1.1); we describe the challenges of detection, root cause diagnosis, and classification of concurrency bugs for in-production software (§1.2); and we summarize prior attempts at solving the problem (§1.3). We then give an overview of the solution we propose in this dissertation (§1.4), and we summarize our contributions (§1.5) and our results (§1.6).
1.1 Problem Definition
Concurrency bugs such as data races, atomicity violations, and deadlocks are at the root of many software problems [132]. These problems have led to losses of human lives [124], caused massive material losses [198], and triggered various security vulnerabilities [49, 78, 222]. Perhaps more subtly, concurrency bugs increase the difficulty of reasoning about concurrent programs because of their sporadic occurrence and unpredictable effects.
Concurrency bugs proliferated in modern software after the advent of multicore processors. As hardware became increasingly parallel, developers wrote more programs that tried to leverage such parallelism by relying on concurrency. Since then, multithreading and parallel programming have become widespread. Concurrency is desirable for getting more performance out of parallel hardware, but it comes at a cost: concurrent programs are hard to write correctly, and therefore it is easy to make mistakes (e.g., introduce data races) when writing such programs.
During the transition to the multicore era (early 2000s), mainstream programming languages were not designed to support concurrent programming natively, which contributed to the proliferation of concurrency bugs. For example, C and C++, which were among the most popular programming languages when this transition happened [201], were specified as single-threaded languages [28], without reference to the semantics of threads.
Note: Threads were in use by the mid-90s for multiprocessor systems; however, it is the transition to multicore architectures that made them mainstream.
Concurrency was added to these mainstream languages through libraries (e.g., Pthreads [87] and Windows threads [217]), which added informal constructs that developers could use to restrict access to shared variables (e.g., pthread_mutex_lock). These constructs were informal because they did not change the nature of the C/C++ compilers, which were inherently oblivious to concurrency.
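As a concrete illustration of such a library-level construct, the following minimal sketch (not taken from the dissertation) uses pthread_mutex_lock and pthread_mutex_unlock to serialize increments of a shared counter. The names shared_counter, counter_lock, worker, and run_workers, as well as the thread and iteration counts, are illustrative assumptions.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4
#define INCREMENTS_PER_THREAD 100000

/* Illustrative shared state (names are assumptions, not from the text). */
static long shared_counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker increments the shared counter while holding the mutex. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++) {
        int rc = pthread_mutex_lock(&counter_lock);
        assert(rc == 0);
        (void)rc;
        shared_counter++; /* read-modify-write, serialized by the lock */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Starts the workers, waits for them, and returns the final counter
 * value, or -1 if a thread could not be created. */
long run_workers(void) {
    pthread_t threads[NUM_THREADS];
    int created = 0;
    shared_counter = 0;
    while (created < NUM_THREADS &&
           pthread_create(&threads[created], NULL, worker, NULL) == 0) {
        created++;
    }
    for (int i = 0; i < created; i++) {
        pthread_join(threads[i], NULL);
    }
    return created == NUM_THREADS ? shared_counter : -1;
}
```

With the lock in place, run_workers() is expected to return NUM_THREADS * INCREMENTS_PER_THREAD. Removing the lock/unlock pair would introduce a data race on shared_counter, and the result could then differ from run to run. On most toolchains the sketch is built with the -pthread flag.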
 1  unsigned x;
 2  ...
 3  if (x < 4) {
 4    ... code that doesn't change x ...
 5    switch (x) {
 6      case 0:
 7       ...
 8      case 1:
 9       ...
10      case 2:
11       ...
12      case 3:
13    }
14  }
Figure 1 – Example of a switch statement adapted from [25]
Despite the presence of libraries attempting to add concurrency support to C/C++, the associated compilers would generate code as if the programs were single-threaded, thereby occasionally violating the intended semantics of concurrent programs. For instance, consider the program snippet in Fig. 1, where a compiler could emit a branch table for the switch statement and omit the bounds check for the branch table because it already knows that x < 4. If the resulting program loads x twice, once on line 3 and once on line 5, and x is modified between the two loads by another thread (i.e., there is a data race on x), the program may take a wild branch and will most probably crash.
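The double-load scenario can be modeled deterministically. The sketch below is a hypothetical model of the code a thread-oblivious compiler might emit for Fig. 1, not actual compiler output: the two loads of x are passed in explicitly as first_load and second_load, and the function reports the out-of-bounds table index instead of actually jumping to it. The names branch_table, case0 through case3, and dispatch are assumptions made for illustration.

```c
#include <stddef.h>

#define TABLE_SIZE 4u

/* Stand-ins for the four case bodies of the switch; purely illustrative. */
static int case0(void) { return 100; }
static int case1(void) { return 101; }
static int case2(void) { return 102; }
static int case3(void) { return 103; }

/* The branch table the compiler emits for `switch (x)`. */
static int (*const branch_table[TABLE_SIZE])(void) = {
    case0, case1, case2, case3
};

/* first_load models the value of x read by the check on line 3;
 * second_load models the value read by the switch on line 5.
 * In a race-free execution both values are equal.
 * Returns the result of the selected case, 0 if the switch is skipped,
 * or -1 if the compiled code would index past the table. */
int dispatch(unsigned first_load, unsigned second_load) {
    if (first_load < TABLE_SIZE) {
        /* The compiled code has NO check here, because the compiler
         * assumes x cannot change. The check exists only so that this
         * model can report the wild branch instead of crashing. */
        if (second_load >= TABLE_SIZE) {
            return -1;
        }
        return branch_table[second_load]();
    }
    return 0;
}
```

In this model, dispatch(2, 2) selects the third case and returns 102, whereas dispatch(3, 7) returns -1, which corresponds to another thread changing x from 3 to 7 between the two loads.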
Atomicity violations, data races, and deadlocks can all cause similarly subtle behavior and cause software to fail in hard-to-predict circumstances [26]. Moreover, the subtle behavior of such bugs complicates reasoning about concurrent programs.
Fixing concurrency bugs is challenging as it is. However, if failures due to concurrency bugs only occur in production, the problem is exacerbated. This is because developers traditionally rely on reproducing failures in order to understand the associated bugs and fix them. If such bugs only recur in production and cannot be reproduced in-house, diagnosing the root cause and fixing the bugs is truly hard. In [178], developers noted: “We don’t have tools for the once every 24 hours bugs in a 100 machine cluster.” An informal poll on Quora [171] asked “What is a coder’s worst nightmare,” and the most popular answer was “The bug only occurs in production and can’t be replicated locally.”
Note: Intuitively, a root cause is the real reason behind a failure; we talk about root causes in detail in §2.
To address these problems, this dissertation introduces techniques for the detection, root cause diagnosis, and classification of concurrency bugs that occur in production. The techniques we introduce are applicable to concurrency bugs in general; however, we focus on concurrency bugs that occur in production, because such bugs present additional challenges, as we describe in the next section (§1.2).
1.2 Challenges
Researchers and practitioners have observed that concurrency bugs are hard to detect and fix [75, 100, 110, 111, 178]. In this section, we first explain the fundamental challenges of the detection, root cause diagnosis, and classification of concurrency bugs, namely the runtime performance overhead challenge (§1.2.1) and the accuracy challenge (§1.2.2). We then elaborate on why performing these tasks is even more challenging in production (§1.2.3).
1.2.1 The Runtime Performance Overhead Challenge
The runtime tracing that is required for the detection, root cause diagnosis, and classification of concurrency bugs incurs high runtime performance overhead. In this section, we discuss the challenges that arise from the runtime overheads of techniques and tools that perform dynamic program analysis, because purely static analysis has no runtime overhead.
Note: Although static analysis tools do not impose runtime overhead, they suffer from false positives, which is related to the accuracy challenge we discuss in §1.2.2.
Dynamic concurrency bug detection, whether it is the detection of data races, atomicity violations, or deadlocks, is expensive. This is because concurrency bug detection requires monitoring memory accesses and synchronization operations, and performing intensive computations at runtime [51, 183] (e.g., building a happens-before relationship [119] graph for data race detection).
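To give a sense of the bookkeeping involved, the sketch below encodes the happens-before relation with vector clocks, a common compact alternative to an explicit graph. This is an illustrative choice and not necessarily how the tools cited above are implemented. The names vclock, mem_access, vc_tick, vc_join, happens_before, and is_race are assumptions, and the sketch assumes that every access is preceded by a vc_tick of the accessing thread's clock.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_THREADS 2

/* One logical clock component per thread. */
typedef struct { unsigned c[MAX_THREADS]; } vclock;

/* A recorded access to one fixed memory location. */
typedef struct { vclock when; bool is_write; } mem_access;

/* Thread tid performs a local event (e.g., a memory access). */
void vc_tick(vclock *v, int tid) {
    assert(tid >= 0 && tid < MAX_THREADS);
    v->c[tid]++;
}

/* dst learns everything src knows; used at a synchronization edge,
 * e.g., a lock acquire that follows a release by another thread. */
void vc_join(vclock *dst, const vclock *src) {
    for (int i = 0; i < MAX_THREADS; i++) {
        if (src->c[i] > dst->c[i]) dst->c[i] = src->c[i];
    }
}

/* a happens-before b iff a <= b component-wise and a != b. */
bool happens_before(const vclock *a, const vclock *b) {
    bool strictly_less = false;
    for (int i = 0; i < MAX_THREADS; i++) {
        if (a->c[i] > b->c[i]) return false;
        if (a->c[i] < b->c[i]) strictly_less = true;
    }
    return strictly_less;
}

/* Two accesses to the same location race if at least one is a write
 * and neither is ordered before the other. */
bool is_race(const mem_access *x, const mem_access *y) {
    if (!x->is_write && !y->is_write) return false;
    return !happens_before(&x->when, &y->when) &&
           !happens_before(&y->when, &x->when);
}
```

Even in this simplified form, every monitored access needs a clock update and a comparison that is linear in the number of threads, which hints at why the runtime cost grows with the number of monitored operations.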
For instance, dynamic data race detection needs to monitor many memory accesses and synchronization operations, and therefore incurs high runtime overhead (as high as 200× in industrial-strength tools like Intel Parallel Studio [89]). The lion’s share of instrumentation overhead is due to monitoring memory reads and writes, reported to account for as much as 96% of all monitored operations [64].
Similarly, atomicity violation detectors incur high overheads (up to 45× in the case of the state-of-the-art detector AVIO-S [133] and up to 65× in the case of SVD [221]). The overhead of atomicity violation detection stems from tracking updates to each monitored memory access and performing the necessary checks for determining whether a given access constitutes an atomicity violation or not.
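The kind of per-access check involved can be sketched using the four atomicity violation patterns that the dissertation names in the caption of Figure 12 (RWR, WWR, RWW, WRW). The sketch assumes that each pattern lists, in order, the first local access, the interleaved remote access, and the second local access. That reading of the letter order, and the names acc_kind and is_unserializable, are assumptions rather than details taken from the text.

```c
#include <stdbool.h>

typedef enum { ACC_READ, ACC_WRITE } acc_kind;

/* Returns true if a remote access interleaved between two local
 * accesses to the same location matches one of the four patterns
 * RWR, WWR, RWW, or WRW. */
bool is_unserializable(acc_kind first_local, acc_kind remote,
                       acc_kind second_local) {
    const acc_kind R = ACC_READ;
    const acc_kind W = ACC_WRITE;
    return (first_local == R && remote == W && second_local == R) ||
           (first_local == W && remote == W && second_local == R) ||
           (first_local == R && remote == W && second_local == W) ||
           (first_local == W && remote == R && second_local == W);
}
```

A detector has to evaluate such a check for every monitored access, after first finding the previous local access and any interleaved remote access to the same location, which is where the tracking cost comes from.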
With regard to the classification of concurrency bugs, prior work mostly focused on data race classification [95, 100, 109, 110, 152, 200], because data race detectors tend to report many data races. The abundance of data races in modern software pushes developers to understand which data races have higher impact, in order to prioritize their fixing.
Note: By classification, we mean the classification of true positives (i.e., real bugs). Identification of false positives (i.e., reports that do not correspond to real bugs) is considered separately in this dissertation.
Classifying data races according to their potential consequences requires more computationally intensive analyses than mere data race detection, and therefore imposes significant runtime overhead. In order to classify data races based on their consequences, not only do data races need to be detected, but further analyses need to be enabled to monitor data races’ effects on the program state and output. Moreover, accurate classification of data races requires exploring multiple program paths and schedules to gain sufficient confidence in the classification results, and this further increases the runtime performance overhead. For example, a state-of-the-art data race classification tool, Record/Replay Analyzer [152], incurs 45× runtime overhead when performing data race classification.
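The core idea of consequence-based classification can be illustrated with a toy model: run the two racing operations in both orders from the same starting state and compare the observable output. The sketch below is only a simplified illustration under that assumption. It does not reproduce how Record/Replay Analyzer or any other cited tool works, and it explores a single path and a single pair of orderings. The names prog_state, racing_op, verdict, and compare_orderings are hypothetical.

```c
/* Toy program state: one shared variable and one observable output. */
typedef struct { int x; int output; } prog_state;

/* A racing operation, modeled as a function that mutates the state. */
typedef void (*racing_op)(prog_state *);

typedef enum { OUTCOME_SAME, OUTCOME_DIFFERS } verdict;

/* Executes a-then-b and b-then-a on copies of the initial state and
 * compares the observable output of the two orderings. */
verdict compare_orderings(prog_state initial, racing_op a, racing_op b) {
    prog_state primary = initial;
    prog_state alternate = initial;
    a(&primary);
    b(&primary);
    b(&alternate);
    a(&alternate);
    return primary.output == alternate.output ? OUTCOME_SAME
                                              : OUTCOME_DIFFERS;
}

/* Example racing operations (illustrative). */
void write_one(prog_state *s) { s->x = 1; }
void write_one_again(prog_state *s) { s->x = 1; }
void read_into_output(prog_state *s) { s->output = s->x; }
```

In this model, a racing write and read of x produce different outputs depending on the order, while two writes of the same value do not. A real classifier has to repeat this kind of comparison on replayed executions, across multiple paths and schedules, which is what drives the overhead described above.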
Finally, root cause diagnosis of concurrency bugs requires tracking
memory accesses and certain relations among memory accesses (e.g.,
their execution order), and therefore incurs large runtime overhead.
The overhead of root cause diagnosis of concurrency bugs is further
exacerbated because root cause diagnosis techniques typically require
gathering execution information from multiple program executions
in order to isolate the failing thread schedules and inputs [136]. For
example, the Delta Debugging technique [45, 231], a state-of-the-art
technique for isolating bug-inducing inputs and thread schedules,
requires gathering execution information from several dozen runs
(50 to 100) before homing in on bugs’ root causes. Another
state-of-the-art concurrency bug isolation technique, CBI [98], also
relies on gathering execution information from multiple inputs, and
it incurs overheads as high as 460×.
1.2.2 The Accuracy Challenge
Static detection of concurrency bugs works without actually run-
ning programs, therefore it does not incur any runtime performance
overhead [149, 150]. However, this comes at the expense of false pos-
itives (i.e., bug reports that do not correspond to actual bugs). False
positives arise because static analysis cannot reason about the pro-
gram’s full runtime execution context.
False positives in static analysis of concurrency bugs arise for
four main reasons. First, static detectors perform some approximations,
such as conflating program paths during analysis or constraining
the analysis to be intraprocedural (as opposed to interprocedural),
in order to scale to large code bases. Second, static analyzers cannot
always accurately infer which program contexts are multithreaded.
Third, static analyzers typically model the semantics of lock/unlock
synchronization primitives but not other primitives, such as barriers,
semaphores, or wait/notify constructs. Finally, static analyzers can-
not accurately determine whether two memory accesses alias or not.
Static classiﬁcation of concurrency bugs typically relies on heuris-
tics, and therefore inherently has false positives as is the case with
most heuristic-based approaches. For instance, DataCollider [100]
prunes data race reports that appear to correspond to updates of
statistics counters and to read-write conﬂicts involving different bits
of the same memory word, or that involve variables known to devel-
opers to have intentional data races (e.g., a “current time” variable is
read by many threads while being updated by the timer interrupt).
Updates on a statistics counter might be considered harmless for the
cases investigated by DataCollider, but if a counter gathers critical
statistics related to resource consumption in a language runtime, clas-
sifying a race on such a counter as harmless may be incorrect. More
importantly, even data races that developers consider harmless may
become harmful (e.g., cause a crash or a hang) when the code is
compiled with a different compiler or when the program executes on
some hardware with a different memory model [26, 29].
Dynamic detectors and classiﬁers [82, 89, 187] tend to report fewer
false positives. Developers prefer tools that have fewer false posi-
tives, because they do not have the time to cherry-pick true positives
(i.e., reports corresponding to real bugs) in the presence of false posi-
tives [20].
Dynamic root cause diagnosis techniques [8, 9, 98, 108, 127, 180,
204] typically rely on statistical analysis for isolating the root causes
of bugs, and therefore they are susceptible to false positives. These
techniques gather execution information from multiple failing and
successful executions to determine the key differences between those
executions. The accuracy of statistical analysis hinges on the number
of samples gleaned, therefore dynamic root cause diagnosis techniques
can have false positives if they cannot monitor a sufficiently
large sample of executions.
[Margin note: To the best of our knowledge, there do not exist root
cause diagnosis schemes that are fully static. However, we cautiously
speculate that static root cause diagnosis will suffer similarly from
false positives.]
At the other end of the spectrum are false negatives (i.e., real
bugs that are missed). False negatives can be an artifact of the
approximations used in static analysis, or they may occur because
a certain analysis (static or dynamic) is unable to analyze a certain
portion of the code.
False negatives are typical of dynamic detection, root cause diagnosis,
and classification of concurrency bugs, because dynamic analysis
can only operate on executions it witnesses, which are typically only
a tiny subset of a program’s possible executions.
[Margin note: Data race detection using causal precedence [191] can
predict some data races that do not occur during actual program
executions without any false positives.]
False negatives also arise because of fortuitous events. For example,
while monitoring a subset of executions, dynamic data race detectors
may incorrectly infer happens-before relationships that are mere arti-
facts of the witnessed thread interleaving. To illustrate this point, con-
sider Fig. 2. In execution 1, the accesses to the shared variable x are
ordered by an accidental happens-before relationship (due to a fortu-
itous ordering of the lock acquire and release operations) that masks
the true data race. Therefore, a precise dynamic detector would not
ﬂag this as a data race. However, this program does have a data
race, which manifests itself under a different thread schedule. This
is shown in execution 2, where there is no happens-before relationship
between accesses to x; a precise dynamic detector would have
reported a data race only if it witnessed this latter thread schedule.
[Figure 2: two executions of the same program over shared variable x,
with time flowing downward. Thread T1 runs x = 1; lock(l); ...;
unlock(l). Thread T2 runs lock(l); ...; unlock(l); x = 2. Execution 1
has an HB edge between the two threads; Execution 2 has no HB edge.]
Figure 2 – False negatives in happens-before (HB) dynamic race detectors:
the data race on x is not detected in Execution 1, but it is detected
in Execution 2.
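The scenario of Fig. 2 can be replayed on recorded traces. The sketch below is a simplified vector-clock happens-before check written for illustration; it is not the detector developed in this dissertation. The trace encoding, the name hb_races, the thread numbering (0 for T1, 1 for T2), and the restriction to write accesses are all assumptions.

```python
# Each trace entry is (thread id, operation, target); thread 0 is T1, thread 1 is T2.
EXECUTION_1 = [
    (0, "write", "x"), (0, "lock", "l"), (0, "unlock", "l"),
    (1, "lock", "l"), (1, "unlock", "l"), (1, "write", "x"),
]
EXECUTION_2 = [
    (1, "lock", "l"), (1, "unlock", "l"),
    (0, "write", "x"), (0, "lock", "l"), (0, "unlock", "l"),
    (1, "write", "x"),
]


def hb_races(trace, n_threads=2):
    """Return (earlier writer, later writer, variable) for writes unordered by HB."""
    # One vector clock per thread; each thread starts at time 1 in its own slot.
    clocks = [[0] * n_threads for _ in range(n_threads)]
    for t in range(n_threads):
        clocks[t][t] = 1
    released = {}    # lock -> vector clock recorded at its most recent unlock
    last_write = {}  # variable -> (writer thread, snapshot of the writer's clock)
    races = []
    for tid, op, target in trace:
        vc = clocks[tid]
        if op == "lock" and target in released:
            # Acquire: inherit everything that happened before the last release.
            for i, stamp in enumerate(released[target]):
                vc[i] = max(vc[i], stamp)
        elif op == "unlock":
            released[target] = list(vc)
            vc[tid] += 1
        elif op == "write":
            if target in last_write:
                prev_tid, prev_vc = last_write[target]
                # The earlier write is ordered before this one only if this
                # thread has already observed the writer's clock value.
                if prev_tid != tid and prev_vc[prev_tid] > vc[prev_tid]:
                    races.append((prev_tid, tid, target))
            last_write[target] = (tid, list(vc))
    return races
```

Under these assumptions, hb_races(EXECUTION_1) is empty, because the unlock in T1 and the later lock in T2 happen to order the two writes, while hb_races(EXECUTION_2) reports the pair of writes to x.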
1.2.3 The In-Production Challenge
Any in-production detection, classiﬁcation, and root cause diagno-
sis technique needs to incur very low performance overhead and min-
imally perturb real-user executions. The overhead challenge (§1.2.1)
is exacerbated in production, because users will not tolerate perfor-
mance degradation—even if it comes with increased reliability. Solu-
tions that perturb the actual behavior of production runs nondeter-
ministically may mask the bug frequently but not always, and thus
make it harder to detect the bug and remove the potential for (even
occasional) failure [147].
Moreover, a great challenge is posed by bugs that only recur in pro-
duction and cannot be reproduced in-house. The ability to reproduce
failures is essential for detecting, classifying and diagnosing the root
causes of bugs. A recent study at Google [178] revealed that develop-
ers’ ability to reproduce bugs is crucial to ﬁxing them. However, in
practice, it is not always possible to reproduce bugs, and practition-
ers report that it takes weeks to ﬁx hard-to-reproduce concurrency
bugs [75].
To summarize this section, detection, classiﬁcation and root cause
diagnosis of concurrency bugs pose signiﬁcant challenges. In particu-
lar, it is hard to efﬁciently and accurately perform these tasks. These
challenges are further exacerbated in production.
1.3 overview of prior solution attempts
In this section, we present an overview of prior attempts at address-
ing the challenges of detection, root cause diagnosis and classiﬁcation
of concurrency bugs. We brieﬂy summarize how existing techniques
attempt to address the aforementioned challenges. We also elaborate
on the shortcomings of prior attempts.
1.3.1 Attempts at Addressing the Overhead Challenge
In order to reduce the runtime performance overhead, certain dy-
namic concurrency bug detectors combine static analysis with dy-
namic analysis. For example, Goldilocks [56] uses thread escape anal-
ysis [151] to reduce the set of memory accesses that needs to be mon-
itored at runtime. A similar approach was proposed earlier by Choi,
et al. [44], using a variant of escape analysis. Certain approaches [3,
181] introduce a type system to reduce the overhead of data race and
atomicity violation detection. Despite these assisting static analyses,
existing concurrency bug detectors still incur overheads that make
them impractical for in-production use.
Another way in which existing dynamic detectors and root cause
diagnosis techniques attempt to address the overhead challenge is
sampling. Sampling-based concurrency bug detection and root cause
diagnosis tracks synchronization operations whenever sampling is
enabled. For instance, sampling-based data race detectors [30, 139]
reduce runtime performance overhead. Although sampling reduces
runtime overhead, this may come at the expense of false negatives:
since these detectors do not monitor all runtime events, they may
miss certain bugs.
Another common way prior work tries to cope with the overhead
challenge is through the usage of customized hardware. HARD [233]
uses special hardware support for data race detection. LBRA/LCRA [9]
uses hardware extensions to diagnose root causes of bugs.
These techniques indeed alleviate the overhead challenge, but the
hardware support they introduce has not been implemented and
deployed in the real world.
[Margin note: To the best of our knowledge, there have been no
attempts to overcome the overhead challenge for classification of
concurrency bugs.]
1.3.2 Attempts at Addressing the Accuracy Challenge
In order to deal with false positives, dynamic tools employ ﬁlter-
ing, which is typically unsound. Unsound ﬁltering can ﬁlter out true
positives along with false positives. Although this type of ﬁltering
reduces false positives, it cannot eliminate all of them. For example,
even after attempting to ﬁlter out false data race reports, RacerX still
has 37%–46% false positives [58].
False negatives in concurrency bug detection can be trivially re-
duced or even eliminated by ﬂagging more bug reports, but this will
come at the expense of increased false positives. For instance, a data
race detector could report a data race for every pair of memory ac-
cesses in a program. This strategy will eliminate all false negatives,
but it will likely introduce a lot of false positives. Static concurrency
bug detection tools (e.g., RELAY [206]) do not go to such extremes.
Nevertheless, they rely on unsound techniques, such as inaccurate
but fast alias analysis, to flag as many bugs as possible (i.e., to
reduce false negatives), and consequently suffer from a high rate of
false positives (84%).
Hybrid data race detectors [157] overcome the false negatives due
to fortuitous happens-before relationships by combining two of the
primary dynamic data race detection algorithms, namely
happens-before-based data race detection and lockset-based data race
detection. Although this combination reduces false negatives, it can
introduce false positives due to the imprecise nature of lockset-based
data race detection.
[Margin note: We explain hybrid data race detectors in detail in
§2.3.2.3.]
[Margin note: To the best of our knowledge, no prior attempts were
made to overcome the accuracy challenge of classifying concurrency
bugs.]
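To make the trade-off concrete, the following sketch implements a bare-bones lockset check in the spirit of lockset-based detection. It is an illustrative simplification, not the algorithm of any cited tool: the trace encoding, the name lockset_warnings, and the "signal"/"wait" operations are assumptions, and only write accesses are considered.

```python
# Each trace entry is (thread id, operation, target).
# Same shape as Execution 1 of Fig. 2: both writes to x occur outside the lock.
FORTUITOUS_ORDER = [
    (0, "write", "x"), (0, "lock", "l"), (0, "unlock", "l"),
    (1, "lock", "l"), (1, "unlock", "l"), (1, "write", "x"),
]
# Both writes are consistently protected by lock l.
PROTECTED = [
    (0, "lock", "l"), (0, "write", "x"), (0, "unlock", "l"),
    (1, "lock", "l"), (1, "write", "x"), (1, "unlock", "l"),
]
# The writes are ordered by a hypothetical signal/wait pair, not by a lock.
ORDERED_WITHOUT_LOCKS = [
    (0, "write", "x"), (0, "signal", "s"),
    (1, "wait", "s"), (1, "write", "x"),
]


def lockset_warnings(trace):
    """Return the variables written by several threads with no common lock."""
    held = {}        # thread -> locks it currently holds
    candidates = {}  # variable -> locks held at every write seen so far
    writers = {}     # variable -> threads that wrote it
    warnings = set()
    for tid, op, target in trace:
        locks = held.setdefault(tid, set())
        if op == "lock":
            locks.add(target)
        elif op == "unlock":
            locks.discard(target)
        elif op == "write":
            if target in candidates:
                candidates[target] &= locks   # keep only locks held every time
            else:
                candidates[target] = set(locks)
            writers.setdefault(target, set()).add(tid)
            if len(writers[target]) > 1 and not candidates[target]:
                warnings.add(target)
        # Any other operation (e.g., "signal"/"wait") is ignored by the lockset check.
    return warnings
```

On FORTUITOUS_ORDER the check warns about x regardless of the observed interleaving, which is how it compensates for the happens-before false negative. On ORDERED_WITHOUT_LOCKS it also warns about x, even though the writes are ordered by synchronization it does not model, which illustrates the imprecision mentioned above.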
1.3.3 Attempts at Addressing the In-Production Challenge
Recall from §1.2.3 that the in-production challenge exacerbates the
overhead challenge; therefore, prior work used similar methods to cope
with the in-production challenge as it did for the overhead challenge.
Below, we outline a few of the techniques that prior work used to
deal with the in-production challenge in addition to the techniques
used to cope with the overhead challenge (which we discussed in
§1.3.1).
To alleviate the aggravated overhead challenge, prior work employs
a variant of sampling. In particular, a common way prior
work addresses the in-production challenge is through collaborative
approaches like CCI [98] and CBI [127] that rely on monitoring
executions at multiple user endpoints. There are two outstanding
issues with collaborative approaches. First, although they reduce runtime
overhead per user endpoint for which they perform detection or root
cause diagnosis, the reduced overhead (up to 9×) is still not suitable
for most in-production environments. Second, because these
collaborative approaches sample a subset of the executions (in order
not to impose overhead for every execution they monitor), they may miss
rare failures that only recur in production. This happens because
sampling further reduces the probability of encountering failures that
rarely recur in the first place.
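A back-of-the-envelope model shows how quickly sampling erodes the chance of observing a rare failure. The sketch below assumes that runs are independent, that a failure manifests in a fraction p_fail of runs, and that each run is monitored with probability sample_rate; these parameters and the function name are hypothetical and are not taken from the cited systems.

```python
def detection_probability(p_fail, sample_rate, n_runs):
    """Probability that at least one failing run is monitored among n_runs runs."""
    # A single run is both failing and monitored with probability p_fail * sample_rate.
    return 1.0 - (1.0 - p_fail * sample_rate) ** n_runs


# A failure that occurs in 1 out of 1,000 runs, observed over 1,000 runs.
always_on = detection_probability(0.001, 1.0, 1000)    # every run is monitored
sampled = detection_probability(0.001, 0.01, 1000)     # 1% of runs are monitored
```

With these illustrative numbers, always-on monitoring observes the failure with a probability of roughly 63%, whereas monitoring 1% of the runs observes it with a probability of roughly 1%.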
1.4 solution overview
In this section, we present an overview of the solution we propose
to the problem we deﬁned in §1.1.
In this dissertation, we address the challenge of in-production de-
tection and root cause diagnosis of concurrency bugs by ﬁrst employ-
ing deep static program analysis ofﬂine, and subsequently perform-
ing lightweight dynamic analysis online at user endpoints. Static anal-
ysis and dynamic analysis work synergistically in a feedback loop:
static analysis reduces the overhead of the ensuing dynamic analysis
and dynamic analysis improves the accuracy of static analysis.
More speciﬁcally, with regards to data race detection, our key objec-
tive is to have a good data race detector that can be used (1) always-on
and (2) in production. This is why we use static analysis to reduce
the number of memory accesses that need to be monitored at run-
time, thereby reducing overhead by up to two orders of magnitude
compared to existing sampling-based techniques. Because we don’t
rely on sampling an execution during data race detection, our data
race detection ends up being more accurate.
We attack the problem of detection of atomicity violations and root
cause diagnosis of failures due to concurrency bugs using a technique
we call failure sketching. Failure sketching is a technique that automat-
ically produces a high level execution trace called the failure sketch
that includes the statements that lead to a failure and the differences
between the properties of failing and successful program executions.
We show in this dissertation that these differences, which are com-
monly accepted as pointing to root causes [127, 180, 231], indeed
point to the root causes of the failures we evaluated (§6). Identify-
ing the root causes of failures also allows detecting the bugs that are
associated with those failures.
Addressing the challenge of data race classiﬁcation requires ﬁrst
addressing the challenge of in-production data race detection, which
we do via hybrid static-dynamic analysis as we just mentioned.
We then do the classiﬁcation entirely ofﬂine, because classiﬁca-
tion is a computationally-expensive process: multiple program paths
and schedules need to be explored in order to understand the conse-
quences of a data race, and it is not possible to do such analyses in
production without incurring prohibitive runtime performance over-
heads or utilizing many more resources.
In this dissertation, we do not introduce new hardware mecha-
nisms that would conveniently solve the aforementioned challenges.
There are two key reasons for this: (1) not inventing our own custom
hardware solution that solves a challenge we are facing allows us to
come up with novel contributions (detailed below in §1.5); (2) the
techniques we develop are broadly applicable, because they do not
depend on a hardware feature that has not been implemented and
deployed in the real world.
1.5 summary of contributions
This dissertation introduces the ﬁrst data race detector that can
both be used always-on in production and provides good accuracy.
Data race detection with low overhead has been a longstanding
problem. Because data race detection is very costly, to our knowl-
edge, prior work has not attempted to explore data race detection in-
production. In this dissertation, we tackle the problem of in-produc-
tion data race detection via:
— A two-phase static-dynamic approach for detecting data races
in real world software in a way that is more accurate than the
state of the art.
— A new algorithm for dynamically detecting data races on-demand,
which has lower overhead than state-of-the-art dynamic detec-
tors, including those based on sampling.
— A crowdsourcing framework that, unlike traditional testing, taps
directly into real user executions to detect data races.
The second contribution of this dissertation is failure sketching,
a low overhead technique to automatically build failure sketches,
which succinctly represent a failure’s root cause.
Root cause diagnosis of in-production failures—especially failures
due to concurrency bugs—has long been explored. To our knowl-
edge, there is no prior work that can perform root cause diagnosis
of in-production failures with low overhead and without resorting
to custom hardware or system state checkpointing infrastructure. In
this dissertation, we achieve root cause diagnosis of in-production
failures via:
— A hybrid static-dynamic approach that combines in-house static
program analysis with in-production collaborative and adaptive
dynamic analysis.
— A ﬁrst practical demonstration of how Intel Processor Trace, a
new technology that started shipping in early 2015 Broadwell
processors [90], can be used to perform root cause diagnosis.
The third and ﬁnal contribution of this dissertation is a technique
to automatically classify data races based on their potential conse-
quences.
Prior work on data race classiﬁcation has not been accurate, either
because it relied on heuristics or because the abstraction-level of the
classiﬁcation criteria was not correctly identiﬁed. In this dissertation,
we solve the classiﬁcation problem via:
— A four-category taxonomy of data races that is ﬁner grain, more
precise and, we believe, more useful than what has been em-
ployed by the state of the art.
— A technique for predicting the consequences of data races that
combines multi-path and multi-schedule analysis with symbolic
program-output comparison to achieve high accuracy in conse-
quence prediction, and thus classiﬁcation of data races accord-
ing to their severity.
— Symbolic memory consistency modeling, a technique that can
be used to model various architectural memory models in a
principled way in order to perform data race classiﬁcation un-
der those memory models.
1.6 summary of results
We built prototypes of all the techniques we present in this disserta-
tion, and we evaluated them. In this section, we give an overview of
our evaluation results. Later in §6, we detail these evaluation results.
We evaluated RaceMob, our in-production data race detector, on
ten different systems, including Apache, SQLite, and Memcached. It
found 106 real data races while incurring an average runtime overhead
of 2.32% and a maximum overhead of 4.54%. Three of the data
races hang SQLite, four data races crash Pbzip2, and one data
race in Aget causes data corruption. Of all the 841 data race candidates
found during the static detection phase, RaceMob labeled 77%
as likely false positives. Compared to three state-of-the-art data race
detectors [30, 187, 206] and two concurrency testing tools [110, 185],
RaceMob has lower overhead and better accuracy than all of them.
We evaluated Gist, our root cause diagnosis prototype, using 11
failures from 7 different programs, including Apache, SQLite, and
Memcached. The Gist prototype managed to automatically build failure
sketches, which point developers to failures’ root causes, with an
average accuracy of 96% for all the failures, while incurring an average
performance overhead of 3.74%. On average, Gist incurs 166×
less runtime performance overhead than a state-of-the-art
record/replay system.
We evaluated our data race classification prototype, Portend, by
applying it to 93 data race reports from 7 real-world applications:
it classiﬁed 99% of the detected data races accurately in less than
5 minutes per data race on average. Compared to state-of-the-art
data race classiﬁers, Portend is up to 89% more accurate in predict-
ing the consequences of data races (§6.3.7). This improvement comes
from Portend’s ability to perform multi-path and multi-thread sched-
ule analysis, as well as Portend’s ﬁne grained classiﬁcation scheme.
We found not only that multi-path multi-schedule analysis is criti-
cal for high accuracy, but also that the “post-race state comparison”
approach used in state-of-the-art classiﬁers does not work well on
real-world programs, despite being perfect on simple microbench-
marks (§6.3.3).

2
BACKGROUND AND RELATED WORK
In this chapter, we ﬁrst deﬁne important terms used in this disser-
tation (§2.1). Then, we brieﬂy review surveys that examine concur-
rency bug characteristics (§2.2). Finally, we talk about the literature
on the detection (§2.3, §2.4), root cause diagnosis (§2.5), and classiﬁca-
tion (§2.6) of concurrency bugs. Throughout this chapter, we explain
how prior work relates to this dissertation whenever applicable.
2.1 definitions
In this section, we give deﬁnitions for key concepts used in this
dissertation.
2.1.1 Data Race
Two memory accesses are conflicting if they access a shared memory
location and at least one of the two accesses is a write. A data
race occurs when two threads make conflicting accesses, and these
accesses are not ordered by a happens-before relationship [119]: if
the memory effects of an operation O1 in a process P1 become visible
to a process P2 before P2 performs O2, we say that O1 happened
before O2.
[Margin note: Data races can occur in single-threaded programs with
signal handlers as well [197].]
A happens-before relationship can only be established using non-
ad hoc synchronization. Ad hoc synchronization is custom synchro-
nization devised by a developer that relies on loops to synchronize
shared variables. By ad hoc synchronization, we do not refer to
custom correct implementations of synchronization constructs, but
rather to the incorrect synchronization operations that are widespread
in real-world code [220], and that lead to concurrency bugs.
The terms data race and race condition are often incorrectly used in-
terchangeably. There is a subtle yet important distinction between
these terms that has garnered attention from both the academic [101,
154] and the practical [15] community. A data race is a condition,
which can be precisely deﬁned as we did above. This precise deﬁni-
tion allows the accurate detection of a data race to be automated. A
race condition on the other hand is a ﬂaw that occurs in the timing or
the ordering of events that leads to erroneous program behavior. It is
not always possible to precisely deﬁne different types of race condi-
tions, therefore accurate detection of race conditions may not always
be possible.
[Figure 3: two executions over shared variable x, with time flowing
downward. Thread T1 runs lock(l); x = 1; unlock(l). Thread T2 runs
lock(l); x = 2; unlock(l). Execution 1 and Execution 2 differ in the
order in which the two threads acquire l.]
Figure 3 – Two executions from the same program without a data race.
Execution 1 has a race condition, because the program’s specification
defines executions where x is set to 2 in T2 after it is set to 1 in T1
as erroneous.
We note that data races and race conditions are neither a necessary
nor a sufficient condition for the occurrence of one another. To see
why this is the case, consider the example in Fig. 3. In this example,
the writes to the shared variable x in threads T1 and T2 are protected
by locks, therefore they always happen in some order enforced
by the order in which the locks are acquired at runtime (either as
in execution 1 or as in execution 2). That is, the writes’ atomicity
cannot be violated; there is always a happens-before relationship between
the two writes in any execution. It is not possible to determine which
write happened before the other until after the program executes,
because locks do not impose a fixed ordering between the writes.
If the program’s correctness is
compromised, say when the write to x in T2 is followed by the write
to x in T1 (execution 1), we say that there is a race condition, although
technically, there is no data race.
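The distinction can be checked by enumerating the two possible lock acquisition orders. The sketch below is a deterministic model rather than real multithreaded code; the names WRITES, final_x, and violates_spec are assumptions, and the specification used (a final value of 2 is erroneous) follows the caption of Fig. 3.

```python
from itertools import permutations

# Value each thread stores into x inside its own critical section on lock l.
WRITES = {"T1": 1, "T2": 2}


def final_x(lock_order):
    """Run the critical sections one after another in the given lock order."""
    x = 0
    for thread in lock_order:
        x = WRITES[thread]  # the lock makes each write atomic, but fixes no order
    return x


# Both acquisition orders are legal executions, and neither contains a data race.
outcomes = {order: final_x(order) for order in permutations(WRITES)}

# Assumed specification: x must not end up as 2 (T2 writing after T1 is erroneous).
violates_spec = [order for order, x in outcomes.items() if x == 2]
```

Every execution in outcomes is free of data races, yet one of the two lock orders violates the assumed specification, which is what makes it a race condition.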
2.1.2 Atomicity Violation
Atomicity is a property of a multithreaded program segment that
allows the segment to appear as if it occurred instantaneously to the
rest of the system. In that regard, atomicity is similar to the lin-
earizability [84] property for concurrent objects and the serializability
property of database transactions [158].
An atomicity violation occurs when operations that are supposed
to execute atomically do not, because the operations do not reside
in the same critical section. This happens because developers
make incorrect assumptions about which operations should execute
atomically and thus fail to enclose such operations in a critical section
(e.g., via using locks) [134].
In this dissertation, we define atomicity violations as bugs. In that
regard, we do not employ the same definition of atomicity violation
as some prior work [66] that treats any violation of serializability as
an atomicity violation. While this may be technically true, programs
can have many violations of serializability that are not indicative of a
bug. In other words, in the context of this dissertation, an atomicity
violation is a violation of serializability that leads to a failure (e.g.,
crash or hang).
[Figure 4: two executions over shared variable x, with time flowing
downward. In both, thread T1 writes x = 1 and later evaluates
if(x){...}, and thread T2 writes x = 2 in between. Execution 1 performs
these accesses without holding a lock; in Execution 2 each access is
enclosed in its own lock(l)/unlock(l) critical section.]
Figure 4 – Two executions from different programs. Both executions violate
the atomicity requirement of writing to x and reading from it
atomically in T1. Execution 1 has data races, whereas execution 2
does not have any data races.
In that regard, we consider atomicity violations as race conditions.
In other words, atomicity violations are flaws in the timing or the
ordering of events. It is generally not possible to automatically and
accurately detect atomicity violations unless portions of the code that
ought to execute atomically are pre-specified.
[Margin note: If atomic regions were easy to pre-specify, developers
could simply enclose them in critical sections.]
Atomicity violations may or may not involve a data race. Fig. 4
demonstrates this. Let’s assume that for both execution 1 and execu-
tion 2, the write to x and the read from x in thread T1 should occur
atomically. Note that in execution 1, the accesses in thread 1 are in-
volved in a data race with the access in T2, whereas in execution 2,
there are no data races. However, both executions violate the atomic-
ity requirement of the program.
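The lock-protected case (execution 2 in Fig. 4) can be modeled by enumerating where the write in T2 may fall relative to the two steps in T1. This is a deterministic sketch under stated assumptions: every step is treated as its own critical section on the same lock (so there is no data race), and the step names and the helper functions are hypothetical.

```python
T1_STEPS = ["T1:write", "T1:read"]  # T1 intends these two steps to be atomic
T2_STEP = "T2:write"


def value_read_by_t1(schedule):
    """Replay a schedule of critical sections and return what T1 reads from x."""
    x = 0
    seen = None
    for step in schedule:
        if step == "T1:write":
            x = 1
        elif step == "T2:write":
            x = 2
        elif step == "T1:read":
            seen = x
    return seen


def schedules():
    """Yield every interleaving that preserves the program order of T1."""
    for pos in range(len(T1_STEPS) + 1):
        yield T1_STEPS[:pos] + [T2_STEP] + T1_STEPS[pos:]


# Atomicity is violated whenever T1 does not read back the value it just wrote.
violations = [s for s in schedules() if value_read_by_t1(s) != 1]
```

Only the schedule in which the write in T2 falls between the write and the read in T1 ends up in violations. Locks rule out the data race but do not rule out this interleaving.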
2.1.3 Root Cause
Intuitively, a root cause is the gist of the failure; it is a cause, or a
combination of causes, which when removed from the program, pre-
vents the failure associated with the root cause from recurring [215].
A more precise attempt in [228] describes a root cause in terms of
its relation to a failure. In particular, a failure occurs when a program
produces wrong output according to a speciﬁcation. Then, the root
cause of a failure is the negation of the predicate that needs to be
enforced so that the execution is constrained to not encounter the
failure.
Despite the aforementioned precise definition attempt, it is difficult
to formally define a failure’s root cause. This is because, in general,
there are multiple ways to fix a given failure. Different developers
may choose to eliminate a failure in different ways (e.g., by
enforcing different predicates on the execution as per the above
definition). In that regard, developers’ perception of a root cause may
vary.
Due to the difﬁculty of providing a precise deﬁnition, in the context
of this dissertation, we resort to a purely statistical deﬁnition of a root
cause. In particular, we deﬁne root causes as events that are primarily
correlated with failures. We later show empirically that such events
indeed point to root causes that developers end up eliminating in
order to ﬁx bugs (§6.2).
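One way to read this statistical definition is as a ranking of events by how well their presence predicts failure. The sketch below scores each event with an F1-style combination of precision and recall over a set of runs. The choice of statistic, the run encoding, and the event names are assumptions made for illustration, since this section does not specify them.

```python
def rank_events(runs):
    """Rank events by how strongly their presence correlates with failure.

    runs is a list of (events, failed) pairs, where events is the set of
    events observed in a run and failed tells whether that run failed.
    """
    failing_runs = sum(1 for _, failed in runs if failed)
    scores = {}
    for event in set().union(*(events for events, _ in runs)):
        outcomes = [failed for events, failed in runs if event in events]
        precision = sum(outcomes) / len(outcomes)  # runs with the event that fail
        recall = sum(outcomes) / failing_runs if failing_runs else 0.0
        if precision + recall == 0:
            scores[event] = 0.0
        else:
            scores[event] = 2 * precision * recall / (precision + recall)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical runs: event B appears in every failing run and in no successful one.
RUNS = [
    ({"A", "B"}, True),
    ({"A", "B", "C"}, True),
    ({"A"}, False),
    ({"A", "C"}, False),
]
ranking = rank_events(RUNS)
```

In this toy data set, event B ranks first because it appears in exactly the failing runs, whereas A and C also appear in successful runs and score lower.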
2.2 concurrency bug surveys
As concurrent programming gained more momentum after the early
2000s, concurrency bugs proliferated, and researchers studied con-
currency bugs. One of the ﬁrst bug studies to include concurrency
bugs collected only 12 concurrency bugs [37] from three systems ap-
plications, namely MySQL, GNOME and Apache. This study vali-
dated the hypothesis that generic recovery techniques such as process
pairs [77] can be used to mask concurrency bugs.
The ﬁrst comprehensive concurrency bug study is due to Shan Lu
et al. [131, 132]. This study examined 105 randomly selected bugs
from 4 real world systems, namely MySQL, Apache, Mozilla, and
OpenOfﬁce. A key ﬁnding of this study was that almost all of the
non-deadlock concurrency bugs were due to developers’ violation
of an ordering or atomicity assumption. This key result highlights
the importance of developing techniques to detect and ﬁx data races,
atomicity violations and order violations in order to deal with con-
currency bugs.
Another study conducted at Microsoft [75] further revealed results
highlighting the importance of concurrency bugs. 72% of all the
participants in this study considered detecting and fixing concurrency
bugs to be hard. Participants said that fixing concurrency bugs
can take days (63.4%) to weeks (8.3%) to months (0.9%). 66% of
participants said that they have to deal with concurrency bugs as part of
their daily routine.
A more recent study [179] in the context of root cause diagnosis
of bugs determined that although most of the bugs can be repro-
duced in-production by running the program with the same set of
inputs (82%), the remainder of the bugs had non-deterministic behav-
ior. One of the conclusions of this study was that determining fault-
triggering inputs for concurrency bugs and reproducing failures due
to concurrency bugs is signiﬁcantly harder than for other bugs.
Although not a survey per se, the common vulnerability and
exposure database contains a comprehensive list of common
vulnerabilities and exposures (CVEs) related to concurrency bugs [49]. At
the time of the writing of this dissertation, the common vulnerability
database contains 336 vulnerabilities related to concurrency bugs. These
CVEs impact kernels (Linux, Windows), browsers (Chrome, Internet
Explorer), and firewalls (in Cisco IOS), among other software.
RADBench [94] is a study and a benchmark suite of concurrency
bugs in popular software such as Chromium, Mozilla SpiderMonkey,
Apache httpd, and Memcached. RADBench comes bundled with test
cases to reproduce these concurrency bugs.
2.3 data race detection literature
Detection of data races garnered a lot of attention, because data
races are one of the most notorious concurrency bugs. They can
cause other bugs such as atomicity violations and deadlocks; their
occurrence is typically sporadic, and their effects subtle.
Data race detection can be broadly classiﬁed into three classes:
static data race detection, dynamic data race detection, and mixed
static-dynamic data race detection.
2.3.1 Static Data Race Detection
Static data race detectors [58, 65, 86, 105, 104, 150, 166, 206, 149,
168, 191] analyze the program source code without executing it. They
reason about multiple program paths at once, and thus typically miss
few data races (i.e., have a low rate of false negatives) [157]. Static
detectors can also run fast, and they can scale to large code bases
if they employ necessary approximations (e.g., conﬂating program
paths). The problem is that static data race detectors tend to have
many false positives (i.e., produce reports that do not correspond to
real data races). For instance, 84% of data races reported by RELAY
are not true data races [206]. This can send developers on a wild
goose chase, making the use of static detectors potentially frustrating
and expensive.
In this dissertation, we use static data race detection to help re-
duce the overhead of our mixed static-dynamic data race detection
technique (§3).
2.3.2 Dynamic Data Race Detection
Dynamic data race detectors [7, 14, 30, 44, 52, 53, 55, 57, 64, 82,
89, 100, 111, 139, 142, 148, 155, 164, 165, 167, 177, 182, 187, 226]
typically monitor memory accesses and synchronization operations
at runtime, and determine if the monitored accesses race with each
other. Such detectors can achieve low rates of false positives. Alas,
dynamic detectors miss all the data races that are not seen in the
directly observed execution (i.e., they have false negatives), and these
can be numerous.
Moreover, the instrumentation required to monitor all executed
memory accesses makes dynamic detectors incur high runtime over-
heads (200× for Intel Thread Checker [89], 30× for Google Thread-
Sanitizer [187]). As a result, dynamic detectors are not practical for
in-production use, rather only during testing—this deprives them of
the opportunity to observe real-user executions, thus missing data
races that only occur in real-user environments. Some dynamic data
race detectors employ sampling [30, 139] to decrease runtime overhead,
but this comes with further false negatives. Sampling causes
false negatives because, most of the time, sampled events are not useful
for the purposes of bug detection. For instance, the study in [125]
found that fewer than 1 in 25,000 randomly sampled events were
indicative of failures, and that over 99.996% of the sampled execu-
tion proﬁle was discarded as not being relevant for bug detection. In
this dissertation, we rely on static program analysis to make more
informed decisions for gathering execution information.
Below, we present a policy-centric classiﬁcation of dynamic data
race detection algorithms. Whenever applicable, we also talk about
various mechanisms with which these data race detection policies are
implemented.
2.3.2.1 Happens-Before-Based Data Race Detection
Detectors based on the happens-before relationship [119] [30, 44, 64, 82,
109, 110, 111, 139, 142, 184, 187] track the happens-before relationships
between memory accesses of a program during execution, and
detect data races based on those relationships.
More speciﬁcally, if two memory accesses access the same mem-
ory location, at least one of the accesses is a write, and there is no
happens-before relationship between the two accesses, these detec-
tors ﬂag a data race.
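The rule above can be sketched with vector clocks, one common way to track happens-before relationships at runtime. This is a minimal illustration under stated assumptions: the class `HappensBeforeDetector` and its methods are hypothetical names, only lock acquire and release create happens-before edges, and none of this is the algorithm of any specific detector cited here.

```python
from collections import defaultdict


class HappensBeforeDetector:
    """Toy vector-clock race detector (illustrative sketch only)."""

    def __init__(self, num_threads):
        # One vector clock per thread; each thread starts at local time 1.
        self.thread_vc = [[1 if i == t else 0 for i in range(num_threads)]
                          for t in range(num_threads)]
        # Clock published by the last release of each lock.
        self.lock_vc = defaultdict(lambda: [0] * num_threads)
        # addr -> list of (thread id, is_write, clock snapshot)
        self.history = defaultdict(list)
        self.races = []

    def acquire(self, tid, lock):
        # Acquiring a lock joins the clock published by its last release.
        current = self.thread_vc[tid]
        self.thread_vc[tid] = [max(a, b)
                               for a, b in zip(current, self.lock_vc[lock])]

    def release(self, tid, lock):
        # Releasing publishes the thread's clock, then advances local time.
        self.lock_vc[lock] = list(self.thread_vc[tid])
        self.thread_vc[tid][tid] += 1

    def access(self, tid, addr, is_write):
        now = self.thread_vc[tid]
        for other_tid, other_write, other_vc in self.history[addr]:
            conflicting = other_tid != tid and (is_write or other_write)
            # The earlier access happens-before this one iff its clock is
            # component-wise <= the current thread's clock.
            ordered = all(a <= b for a, b in zip(other_vc, now))
            if conflicting and not ordered:
                self.races.append((addr, other_tid, tid))
        self.history[addr].append((tid, is_write, list(now)))
```

In this sketch, two unsynchronized accesses to the same address with at least one write are flagged, while accesses separated by a release and a later acquire of the same lock are not, even if that lock has nothing to do with the accessed data.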
Dynamic detectors that solely use happens-before relationships do
not have false positives as long as they are aware of all the synchro-
nization mechanisms employed by the developer. Happens-before
based dynamic data race detectors can have false positives if develop-
ers use custom synchronization primitives to which the detectors are
oblivious.
As we discussed in the limitations of prior work (§1.2.2),
happens-before-based data race detection is susceptible to false negatives
because of fortuitous happens-before edges that get created merely as
an artifact of an arbitrary execution. In some other executions, these
edges may not get created, and happens-before-based data race detection
could flag a data race. In the next section (§2.3.2.2), we describe
another dynamic data race detection algorithm that avoids false
negatives due to fortuitous happens-before edges.
In this dissertation, we present a happens-before-based data race
detection algorithm that aggressively starts and stops tracking
happens-before relationships based on events at runtime in order to reduce
runtime performance overhead (§3).
2.3.2.2 Lockset-Based Data Race Detection
Locksets describe the locks held by a program at any given point
in the execution.
Lockset-based data race detection [52, 155, 167, 177, 182, 233] checks
whether all shared memory accesses follow a consistent locking disci-
pline. A locking discipline is a policy that ensures the absence of data
races. A trivial locking discipline would require all memory accesses
in a program to be protected by the same lock.
The simple locking discipline that Eraser [182] (the first
lockset-based data race detector) uses states that every shared variable access
should be protected by some common lock. In other words, Eraser
requires any thread accessing a given shared variable to hold a common
lock while it is performing the access in order to consider that
access as non-racing. If Eraser determines that the program violates
this locking discipline at runtime, it will flag a data race.
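The common-lock discipline can be sketched as a candidate lockset per variable that is intersected with the locks held at every access. The sketch below is a simplified illustration, not Eraser's implementation: the class name `LocksetDetector` and its methods are hypothetical, and it deliberately omits any refinements, so it also flags a lone unlocked access.

```python
class LocksetDetector:
    """Toy lockset checker: a variable must always be accessed under
    at least one common lock (illustrative sketch only)."""

    def __init__(self):
        self.held = {}        # thread id -> set of locks currently held
        self.candidates = {}  # variable -> candidate lockset
        self.reports = []     # variables whose candidate lockset became empty

    def acquire(self, tid, lock):
        self.held.setdefault(tid, set()).add(lock)

    def release(self, tid, lock):
        self.held.setdefault(tid, set()).discard(lock)

    def access(self, tid, var):
        held = self.held.get(tid, set())
        if var not in self.candidates:
            # First access: every lock held now is a candidate.
            self.candidates[var] = set(held)
        else:
            # Later accesses: keep only locks held on every access so far.
            self.candidates[var] &= held
        if not self.candidates[var]:
            # No single lock protects all accesses to var.
            self.reports.append(var)
```

Because the check depends only on which locks are held, not on the observed interleaving, it does not rely on happens-before edges that a particular execution happens to create.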
Eraser’s simple locking discipline is overly strict, and as a result,
it can report many false positives. To lower the number of false
positives, Eraser employs several reﬁnements. For example, Eraser
will not report data races due to the initialization of shared variables,
which is frequently done without holding a lock. Eraser employs
other similar heuristics to lower false positives; however, Eraser
cannot completely eliminate false positives. Furthermore, these
heuristics can potentially introduce false negatives.
HARD [233] is a hardware implementation of lockset-based
data race detection. HARD uses Bloom filters [23] to store locksets
and uses bitwise logic operations on the locksets. HARD was able
to detect 54 data races out of 60 randomly injected data races in six
SPLASH-2 applications (20% more than happens-before-based data
race detection) with overheads ranging from 0.1% to 2.6%.
In the next section (§2.3.2.3), we present a software-only data race
detection technique that reduces the false positive rates of lockset
based data race detection.
2.3.2.3 Hybrid Data Race Detection
Perhaps unfortunately, in the data race detection literature, hybrid
data race detection implies data race detection that combines the
two major dynamic data race detection algorithms, namely
happens-before-based data race detection and lockset-based data race
detection. This is because the first piece of work that called a data race
detection algorithm “hybrid” [157] combined these two dynamic data
race detection algorithms.
[Margin note: We believe that the term “hybrid” is better suited for the
combination of static and dynamic data race detection.]
Hybrid data race detection works in two stages: an always-on
lockset-based data race detector flags potential data races, and a
happens-before data race detector then verifies whether each potential
data race is indeed a true data race.
Hybrid data race detection improves the accuracy of data race de-
tection by reporting fewer false positives than lockset-based data race
detection [157].
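The two-stage structure can be illustrated on a pair of accesses to the same memory location. This is a sketch under stated assumptions, not the algorithm of [157]: the `Access` record, its fields, and the function names are hypothetical, and each access is assumed to carry the set of locks held and a vector clock snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    thread: int
    is_write: bool
    locks: frozenset  # locks held at the time of the access
    clock: tuple      # vector clock snapshot at the time of the access


def lockset_suspects(a, b):
    """Stage 1 (cheap): conflicting accesses that share no lock."""
    conflicting = a.thread != b.thread and (a.is_write or b.is_write)
    return conflicting and not (a.locks & b.locks)


def ordered(a, b):
    """True if a happens-before b according to the vector clocks."""
    return a.clock != b.clock and all(x <= y for x, y in zip(a.clock, b.clock))


def hybrid_is_race(a, b):
    """Stage 2 runs only on pairs that stage 1 flagged."""
    if not lockset_suspects(a, b):
        return False
    return not ordered(a, b) and not ordered(b, a)
```

A pair that shares no lock but is ordered by some other synchronization is filtered out in the second stage, which is where the reduction in false positives comes from.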
2.3.3 Mixed Static-Dynamic Data Race Detection
Some data race detectors combine static analysis and dynamic anal-
ysis in order to reduce the runtime overhead of data race detection.
Goldilocks [56] and the mixed static-dynamic detector from [44]
used a static thread escape analysis phase to eliminate the need to
track thread-local variables. This dissertation takes a similar approach
to these techniques, but uses a complete static data race detector—
which is more accurate than just using thread escape analysis—to
detect most data races (i.e., to have few false negatives). It then uses a
novel dynamic data race detection algorithm to achieve lower runtime
overhead and higher accuracy than existing mixed static-dynamic
data race detectors.
2.3.4 Detecting Data Races In Production
To our knowledge, this dissertation and the RaceMob system are the
first to explore always-on in-production detection of data races.
A recent system called LiteCollider [22] also explores in-production
detection of data races. LiteCollider has a two-stage in-house data
race detection scheme (as opposed to a single-stage in-house static
analysis in RaceMob): first, LiteCollider detects data races statically;
then, it uses in-house alias analysis and lockset-based data race
detection to further prune the set of potential data races to detect at
production time. Then, in production, LiteCollider uses collision analysis
(similar to DataCollider [100]) to detect data races. Although
LiteCollider’s in-house dynamic analysis reduces the number of candidate
data races to dynamically detect in production, it also introduces false
negatives [22].
[Margin note: Collision analysis uses various techniques to try to make
conflicting accesses occur simultaneously.]
2.3.5 Data Race Avoidance
In addition to the previously described data race detection
techniques, there are a number of techniques that rely on system or
language support to avoid data races at execution time or by
construction.
To our knowledge, the first principled approach to avoiding data
races using language support was due to Hoare’s work on
monitors [85]. Monitors bundle a number of variables and procedures
together with a lock that is automatically acquired at entry to each
procedure in the bundle and released at the exit from the procedure.
The shared variables in the monitor can only be accessed by proce-
dures in the monitor when the monitor lock is held.
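The monitor structure described above can be approximated as a class whose every public method takes the same lock. The names `CounterMonitor` and `run_workers` are hypothetical, and this is only a sketch: Python enforces the discipline by convention, whereas a language with built-in monitors can check it at compile time.

```python
import threading


class CounterMonitor:
    """Monitor-style object: every public procedure holds the monitor lock."""

    def __init__(self):
        self._lock = threading.Lock()  # the monitor lock
        self._count = 0                # shared variable owned by the monitor

    def increment(self):
        with self._lock:               # acquired at procedure entry,
            self._count += 1           # released at procedure exit

    def value(self):
        with self._lock:
            return self._count


def run_workers(num_threads=4, increments=1000):
    """Hammer one monitor from several threads and return the final count."""
    monitor = CounterMonitor()

    def work():
        for _ in range(increments):
            monitor.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return monitor.value()
```

Since `_count` is only touched inside methods that hold `_lock`, concurrent increments cannot interleave within the read-modify-write.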
Monitors provide a static (i.e., compile-time) guarantee that
accesses to static shared global variables are data race free, but if shared
variables are allocated dynamically, monitors do not work well.
Lampson and Redell note [120], in their experience with using monitors
in the Mesa programming language, that the limited applicability of
monitors was a significant drawback when designing systems such
as the Pilot operating system [174].
Research on deterministic execution systems has gained tremendous
popularity in recent years [12, 18, 19, 24, 48, 50, 130, 199].
Deterministic execution requires making the program merely a function of
its inputs [202], thereby eliminating the unpredictable behavior due to
data races. These systems allow executing arbitrary concurrent software
deterministically using hardware, runtime, or operating system
support. Alas, deterministic execution systems typically incur high
overheads. Therefore, they have not yet been adopted in production
software. DMP [50] proposes hardware extensions that allow arbitrary
software to be executed deterministically with low overhead. However,
the hardware support required by DMP is not readily available.
[Margin note: According to some experts, deterministic execution systems
do not inherently simplify parallel programming [80].]
Determinator [12] is a novel operating system kernel that aims to
deterministically execute arbitrary programs. Determinator allocates
each thread of execution a private working space that is a copy of
the global state. Threads reconcile their view of the global state at
well-deﬁned points in a program’s execution. The use of private
workspaces eliminates all read/write conﬂicts in Determinator, and
write/write conflicts are transformed into runtime exceptions.
Determinator allows running a number of coarse-grained parallel benchmarks
with comparable performance to a nondeterministic kernel. Current
operating system kernels are not built with Determinator’s determin-
ism guarantees, and it is unclear if they will be in the future.
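The private-workspace idea can be illustrated with a small sketch. It is a simplification under stated assumptions, not Determinator's design: the function `run_with_private_workspaces` and the exception `WriteWriteConflict` are hypothetical names, shared state is a flat dictionary, tasks stand in for threads, and a write is detected by comparing values against the original state (so rewriting the same value goes unnoticed).

```python
class WriteWriteConflict(Exception):
    """Raised when two tasks modify the same key in their private copies."""


def run_with_private_workspaces(shared, tasks):
    """Run each task on a private copy of `shared`, then merge at the join."""
    base = dict(shared)
    merged = dict(base)
    writer_of = {}  # key -> index of the task that modified it
    for index, task in enumerate(tasks):
        private = dict(base)   # private workspace: a copy of the global state
        task(private)          # reads and writes touch only the copy
        for key, value in private.items():
            if key in base and base[key] == value:
                continue       # unchanged by this task
            if key in writer_of:
                raise WriteWriteConflict(
                    f"{key!r} written by tasks {writer_of[key]} and {index}")
            writer_of[key] = index
            merged[key] = value
    return merged
```

Every task reads only the state from before the parallel region, so a task that reads `x` while another writes `x` still sees the old value, and the merged result does not depend on the order in which tasks run.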
Some deterministic execution techniques rely on language support.
StreamIt [199] is a stream-based programming language that allows
threads to communicate only via explicitly deﬁned streams, and there-
fore provides determinism for stream-based programs. DPJ [24] is a
type and effect system for Java that is deterministic by default and
only allows explicit non-determinism. These techniques that rely on
language support allow the developer to build deterministic systems
by construction; however, they are not widely adopted yet.
In order to achieve deterministic execution in the general case, data
races must be eliminated from programs in some way. However,
eliminating all data races leads to high overheads due to excessive
increase in synchronization operations. We believe that this overhead
has been an important obstacle to the widespread adoption of deter-
ministic execution systems in production software. Combined with
the techniques developed in this dissertation, it may be possible to re-
lax determinism guarantees and eliminate data races that really mat-
ter from the point of view of a developer or user, and make determin-
istic execution more practical.
Another way to achieve deterministic execution is by using trans-
actional memory systems [70, 79, 83]. Transactional memory systems
avoid concurrency bugs by rolling back system state upon a conﬂict.
Transactional memory systems have not been widely adopted in pro-
duction yet, but this may change in the future with commercial hard-
ware companies providing transaction support in hardware [91].
Some programming languages like Rust [175] do not allow developers
to write code with data races, thereby eliminating data races
by construction. Rust achieves data race freedom using its ownership
system. In particular, the compiler ensures that only a single
thread can have a mutable reference to a data element at a time,
effectively eliminating data races. Although data race freedom is useful,
Rust requires developers to reason about the ownership model when
writing code, which may complicate the already difficult task of
concurrent programming. This dissertation takes an alternate approach
by developing techniques that developers can use to eliminate data
races and other concurrency bugs from code written in their language
of choice.
[Margin note: Rust can only provide data race freedom, and programs
written in Rust can still have other concurrency bugs like atomicity
violations.]
2.4 atomicity violation detection
It is challenging to build an atomicity violation detector that does
not have false positives. The reason behind this is that as opposed to
data races, atomicity violations are high-level semantic bugs.
Atomicity violations occur because developers’ assumptions regarding the
atomicity properties of program statements (or segments) are incorrect. Alas,
in the absence of a formal speciﬁcation (or annotations [63]) of correct
atomicity requirements for a program, it is challenging to detect the
violations of atomicity.
Atomicity violation detectors mainly come in two ﬂavors: static [169,
160] and dynamic [21, 63, 66, 133].
2.4.1 Static Atomicity Violation Detection
Static atomicity violation detectors operate similarly to static data
race detectors in that they reason about atomicity violations using the
source code and without executing the program.
Von Praun et al. [169] developed a technique that relies on an ab-
stract model of threads and data to detect potential atomicity vio-
lations. While this technique has few false negatives when run on
programs with previously-known synchronization problems, it suf-
fers from false positives. Another heuristic-based approach [160] stat-
ically searches for a pattern that is typically indicative of an atomicity
violation. Similarly, this system has a low number of false negatives,
but it suffers from false positives.
2.4.2 Dynamic Atomicity Violation Detection
Atomizer [63] relies on annotations to denote blocks that are sup-
posed to execute atomically and checks whether such blocks indeed
execute atomically at runtime or not. Atomizer can also use heuristics
to automatically annotate certain blocks as atomic (e.g., all
synchronized blocks [96]). Atomizer performs atomicity checking by
combining a lockset-based analysis with Lipton’s theory of reduction for
parallel programs [129]. If the program uses synchronization mecha-
nisms other than locks, Atomizer can report false positives.
Velodrome [66] soundly and completely checks for violations of se-
rializability in a program by recording a trace of the execution and
reasoning about the dependencies between operations in the trace.
Although Velodrome’s strategy ensures that it detects all atomicity
violations in a program, not all serializability violations are neces-
sarily bugs. Therefore, according to our atomicity violation deﬁni-
tion (§2.1.2), Velodrome reports false positives.
AVIO [133] automatically extracts invariants that aim to capture de-
velopers’ assumptions about atomic code regions. Then at runtime,
AVIO checks whether these invariants are violated in order to detect
atomicity violations. AVIO is effective at detecting atomicity viola-
tions; nevertheless, AVIO’s atomicity invariant extraction is imperfect
and can lead to false positives.
Atom-Aid [134] uses architectural support to arbitrarily group con-
secutive memory operations to reduce the probability of atomicity
violations. Atom-Aid reduces the probability that an atomicity viola-
tion will lead the program to a failure by 98.7% to 100%. Systems such
as Atom-Aid can partially use hardware transactional memory [83,
172] support. Nevertheless, Atom-Aid requires dynamically selecting
program segments to execute in a transaction, whereas current trans-
actional memory support requires explicitly deﬁning transactions in
the code.
In summary, false positives in atomicity violation detection are
hard to avoid for both static and dynamic techniques that attempt
to detect patterns for atomicity violations. In contrast, in this disser-
tation, we adopt a statistics-based approach to detect atomicity viola-
tions. In short, to statistically detect atomicity violations, we correlate
failures with events that look like violations of atomicity across many
executions at user endpoints. Prior work uses similar techniques for
ﬁnding the root cause of failures due to atomicity violations and other
bugs as well, which we talk about in the next section (§2.5).
2.5 root cause diagnosis of in-production failures
In this section, we review a variety of techniques that have been
developed to date to understand the root causes of failures and to
help developers with debugging. We review general techniques for
root cause diagnosis as well as special techniques targeted towards
failures due to concurrency bugs. We talk about techniques that are
geared towards both testing time root cause diagnosis as well as in-
production root cause diagnosis.
Delta debugging [231] isolates program inputs and variable values
that cause a failure by systematically narrowing the state difference
between a failing and a successful run. Delta debugging achieves
this by repeatedly reproducing the failing and successful run, and
altering variable values. Delta debugging has also been extended to
isolate failure-inducing control ﬂow information [45]. As opposed to
delta debugging, in this dissertation, we target bugs that are hard to
reproduce and aim to generate a (potentially imperfect) explanation
of the root cause of a failure even with a single failing execution.
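To make the repeated-reproduction requirement concrete, here is a sketch of the simpler, input-minimizing variant of delta debugging (commonly called ddmin), not the state-isolating variant cited above. The function name `ddmin` and the `fails` predicate are illustrative assumptions; `fails` stands for rerunning the program on a candidate input and observing whether it still fails.

```python
def ddmin(failing_input, fails):
    """Shrink `failing_input` (a list) while `fails` keeps returning True."""
    assert fails(failing_input), "the starting input must fail"
    current = list(failing_input)
    n = 2  # number of chunks to split the input into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk]
                   for i in range(0, len(current), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [x for j, s in enumerate(subsets)
                          if j != i for x in s]
            if fails(subset):
                # One chunk alone reproduces the failure: keep only it.
                current, n, reduced = subset, 2, True
                break
            if fails(complement):
                # The failure survives without this chunk: drop it.
                current, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(current):
                break  # finest granularity reached, nothing more to remove
            n = min(len(current), n * 2)
    return current
```

Every call to `fails` corresponds to one more attempt to reproduce the failure, which is exactly what is hard to do for sporadic concurrency bugs.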
Cooperative approaches such as cooperative bug isolation (CBI)
[127], cooperative concurrency bug isolation (CCI) [98], PBI [8], and
LBRA/LCRA [9] utilize statistical techniques to isolate failure root causes.
CBI, CCI and PBI rely on sampling executions in order to reduce
the runtime performance overhead. LBRA/LCRA does not resort to
sampling, because it introduces custom hardware extensions to do
root cause diagnosis with low overhead. LBRA/LCRA relies on ob-
serving a failure multiple times to statistically isolate the root cause.
However, LBRA/LCRA only works well for bugs with short root
cause to failure distances, because the hardware support that it re-
lies on has limited capacity to record events such as branches and
cache coherency messages [159]. LBRA/LCRA preserves the privacy
of users to some extent, because it does not track the data ﬂow of
a program. In this dissertation, we use different failure predicting
events for multithreaded bugs (e.g., atomicity violations) than these
systems, to allow developers to differentiate between different types
of concurrency bugs.
Windows Error Reporting (WER) [73] is a large-scale cooperative
error reporting system operating at Microsoft. After a failure, WER
collects snapshots of memory and processes them using a number of
heuristics (e.g., classiﬁcation based on call stacks and error codes) to
cluster reports that likely point to the same bug. Systems like WER
can use root cause diagnosis techniques we develop in this disserta-
tion to improve their clustering of bugs, and help developers ﬁx the
bugs faster.
Symbiosis [136] uses a technique called differential schedule projec-
tions that displays the set of data ﬂows and memory operations that
are responsible for a failure in a multithreaded program. Symbiosis
proﬁles a failing program’s schedule and generates non-failing alter-
nate schedules. Symbiosis then determines the data ﬂow differences
between the failing schedule and the non-failing schedule in order to
help developers identify root causes of failures. Unlike Symbiosis, in
this dissertation, we do not assume that we have access to a failing
program execution that can be reproduced in-house for the purposes
of root cause diagnosis.
PRES [162] records execution sketches, which are abstractions of
real executions (e.g., just an execution log of functions), and performs
state space exploration on those sketches to reproduce failures. The
sketches PRES builds can be used to reason about how a certain fail-
ure occurred and do root cause diagnosis.
Previous work also explored adaptive monitoring schemes for gath-
ering execution information from end users. SWAT [81] adaptively
samples program segments at a rate that is inversely proportional to
their execution frequency. RaceTrack [226] adaptively monitors parts
of a program that are more likely to harbor data races. Bias free
sampling [106] allows a developer to provide an adaptive scheme for
monitoring a program’s behavior. Adaptive bug isolation [10] uses
heuristics to adaptively estimate and track program behaviors that
are likely predictors of failures. In this dissertation, we rely on static
analysis to bootstrap and guide runtime monitoring, thereby achiev-
ing low latency and low overhead in root cause diagnosis.
HOLMES [39] uses path proﬁles to perform root cause diagnosis.
The main motivation behind HOLMES is that path information is
richer and more expressive than other execution information such as
return values or scalar relations among variables that prior work [126]
resorted to. HOLMES does not track any data values when perform-
ing root cause diagnosis, whereas techniques we develop in this dis-
sertation rely on tracking data values for performing root cause di-
agnosis of concurrency bugs. Tracking data values provides richer
debugging information to developers.
SherLog [227] uses a combination of program analysis and execu-
tion logs from a failed production run in order to automatically gen-
erate control and data ﬂow information that aims to help developers
diagnose the root causes of errors. Unlike SherLog, in this
dissertation, we do not assume that logging is always enabled at execution
time.
ConSeq [232] computes static slices to identify shared memory
reads starting from potentially failing statements (e.g., assert). It then
records correct runs and, during replay, it perturbs schedules around
shared memory reads to try to uncover bugs. In this dissertation, we
use static slicing to identify all control and data dependencies to the
failure point and do root cause diagnosis of a given failure, without
relying on record and replay.
Triage [204], Giri [180], and DrDebug [209] use dynamic slicing
for root cause diagnosis. Triage works for systems running on a sin-
gle processor and uses custom checkpointing support [170]. DrDe-
bug and Giri assume that failures can be reproduced in-house by
record/replay and that one knows the inputs that lead to the failure,
respectively. In this dissertation, we do not assume that failures can
be reproduced in-house.
Tarantula [102] and Ochiai [1] record all program statements that
get executed during failing and successful runs, to perform statistical
analysis of the recorded statements for root cause diagnosis. In this
dissertation, we do not rely on an infrastructure to record all program
statements during an execution.
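For reference, the suspiciousness scores usually associated with these two techniques in the fault localization literature can be computed as below. The formulas are not given in this text; they are stated here as commonly defined, and the function and parameter names are illustrative.

```python
import math


def tarantula(failed_s, passed_s, total_failed, total_passed):
    """Tarantula suspiciousness of a statement s.

    failed_s / passed_s: failing / passing runs that executed s.
    total_failed / total_passed: all failing / passing runs.
    """
    fail_ratio = failed_s / total_failed if total_failed else 0.0
    pass_ratio = passed_s / total_passed if total_passed else 0.0
    if fail_ratio + pass_ratio == 0:
        return 0.0
    return fail_ratio / (fail_ratio + pass_ratio)


def ochiai(failed_s, passed_s, total_failed):
    """Ochiai suspiciousness of a statement s."""
    denominator = math.sqrt(total_failed * (failed_s + passed_s))
    return failed_s / denominator if denominator else 0.0
```

A statement executed only by failing runs scores 1.0 under both metrics, while a statement executed by every run scores lower. Both scores need per-statement execution counts from every run, which is the recording infrastructure referred to above.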
Exterminator [156] and Clearview [163] automatically detect and
generate patches for certain types of bugs (e.g., memory errors and
security exploits). The techniques we develop in this dissertation
can help these tools diagnose failures for which they can generate
patches.
2.6 concurrency bug classification
In this section, we will review techniques for classifying concur-
rency bugs based on their consequences.
As mentioned previously, we are mainly aware of classiﬁcation
schemes for data races. Classification schemes mainly
exist for data races because data race detectors (both dynamic and
static) report many data races. Therefore, in practice, developers
need to understand the consequences of data races in order to priori-
tize their ﬁxing.
Prior work on data race classiﬁcation employs record/replay anal-
ysis [152], heuristics [100], detection of ad hoc synchronization pat-
terns [95, 200] or simulation of the memory model [62].
Record/replay analysis [152] records a program execution and tries
to enforce a thread schedule in which the racing threads access a
memory location in the reverse order of the original data race. Then,
it compares the contents of memory and registers, and uses a
difference as an indication of potential harmfulness. In this dissertation,
we do not attempt an exact comparison, but rather use a symbolic
comparison technique for program outputs (as opposed to the low-level
internal memory state), and we explore multiple program paths
and schedules to increase classification accuracy.
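The order-reversal idea behind record/replay-based classification can be sketched on a toy state. This illustrates the exact state comparison of the prior approach, not the symbolic output comparison used in this dissertation; the function `classify_race` and the dictionary-based state are illustrative assumptions.

```python
def classify_race(initial_state, access_a, access_b):
    """Replay two racing accesses in both orders and compare final states."""
    original = dict(initial_state)
    access_a(original)
    access_b(original)

    alternate = dict(initial_state)
    access_b(alternate)   # enforce the reverse order of the racing accesses
    access_a(alternate)

    if original != alternate:
        return "potentially harmful"
    return "likely harmless"


def write_x(state):
    state["x"] = 1


def copy_x_to_y(state):
    state["y"] = state["x"]


def set_done(state):
    state["done"] = True
```

For example, `classify_race({"x": 0, "y": 0}, write_x, copy_x_to_y)` reports a difference because `y` depends on the order, whereas two threads that both run `set_done` leave the same state in either order.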
DataCollider [100] uses heuristics to prune predeﬁned classes of
likely-to-be harmless data races, thus reporting fewer harmless races
overall. DataCollider will consider data races on statistics counters
and on variables known to developers to have intentional data races
as harmless data races. DataCollider’s heuristic-based pruning can
introduce false negatives.
Helgrind+ [95] and Ad Hoc Detector [200] eliminate data race re-
ports due to ad hoc synchronization. Detecting ad hoc synchroniza-
tions or happens-before relationships that are generally not recog-
nized by data race detectors can help further prune harmless data
race reports, as demonstrated recently by ATDetector [97].
Adversarial memory [62] ﬁnds data races that occur in systems
with memory consistency models that are more relaxed than sequen-
tial consistency, such as the Java memory model [137]. This approach
uses a model of memory that returns stale yet valid values for mem-
ory reads (using various heuristics), in an attempt to crash target
programs. If a data race causes the program to crash as a result of us-
ing the adversarial memory approach, that data race will be classiﬁed
as harmful. In this dissertation, we follow a similar approach when
exploring the consequences of data races. We use a special model of
memory that buffers all prior writes to memory to be able to later
return them based on the semantics of a memory model (§5.7). In
this way, we can systematically explore all possible memory model
behaviors.
Prior work employed bounded model checking to formally verify
concurrent algorithms for simple data types under weak memory
models [34]. Formal veriﬁcation of programs under relaxed memory
models (using bounded model checking or any other technique) is a
difﬁcult problem, and is undecidable in the general case [11]. In this
dissertation, we do not aim to formally verify a program under weak
memory consistency. Instead, we focus on determining the combined
effects of data races and weak memory models.
RaceFuzzer [185] generates random schedules from a pair of rac-
ing accesses to determine whether the race is harmful or not. Race-
Fuzzer performs schedule fuzzing with the goal of ﬁnding bugs, not
classifying data races.
Output comparison was used by Pike to ﬁnd concurrency bugs
while fuzzing thread schedules [68]. Pike users can also write state
summaries to expose latent semantic bugs that may not always man-
ifest in the program output.
Frost [205] follows a similar approach to Record/Replay Analyzer
and Pike in that it explores complementary schedules and detects and
avoids potentially harmful races by comparing the program states
after the program executes with these schedules. This detection is based
on state comparison and is therefore prone to false positives, as we
later show (§6).
Part II
ELIMINATING CONCURRENCY BUGS FROM
IN-PRODUCTION SYSTEMS
In this part, we describe the design, implementation, and
evaluation of the techniques and tools we developed for
the detection, root cause diagnosis, and classiﬁcation of
in-production concurrency bugs.
First, we present a technique to accurately detect data
races in production. We then present a general technique
for diagnosing the root causes of in-production failures,
and we primarily focus on failures caused by concurrency
bugs. Our technique for root cause diagnosis allows both
detecting bugs and providing an explanation of how the
failure happened. Then, we discuss how we can perform
data race classiﬁcation under arbitrary memory models.
Finally, we present a comprehensive evaluation of all the
prototypes we developed.

3
RACEMOB: DETECTING DATA RACES IN
PRODUCTION
In this section, we present RaceMob, a new data race detector that
combines static and dynamic data race detection to obtain both good
accuracy and low runtime overhead. For a given program P, Race-
Mob ﬁrst uses a static detection phase with few false negatives to
ﬁnd potential data races; in a subsequent dynamic phase, RaceMob
crowdsources the validation of these alleged data races to user ma-
chines that are anyway running P. RaceMob provides developers
with a dynamically updated list of data races, split into “conﬁrmed
true races”, “likely false positives”, and “unknown”—developers can
use this list to prioritize their debugging attention. To minimize run-
time overhead experienced by users of P, RaceMob adjusts the com-
plexity of data race validation on-demand to balance accuracy and
cost. By crowdsourcing validation, RaceMob amortizes the cost of
validation and (unlike traditional testing) gains access to real user
executions. RaceMob also helps discover user-visible failures like
crashes or hangs, and therefore helps developers reason about the
consequences of data races. To the best of our knowledge, RaceMob
is the ﬁrst data race detector that combines sufﬁciently low overhead
to be always-on with sufﬁciently good accuracy to improve developer
productivity.
3.1 design overview
RaceMob is a crowdsourced, two-phase static–dynamic data race
detector. It ﬁrst statically detects potential data races in a program,
then crowdsources the dynamic task of validating these potential data
races to users’ sites. This validation is done using an on-demand data
race detection algorithm. The benefits of crowdsourcing are twofold:
first, data race validation occurs in the context of real user executions;
second, crowdsourcing amortizes the per-user validation cost. Data
race validation confirms true data races, thereby increasing the data
race detection coverage.
[Margin note: We define data race detection coverage as the ratio of true
data races found in a program by a detector to the total number of true
data races in that program.]
The usage model is presented in Fig. 5. First, developers set up
a “hive” service for their program P; this hive can run centralized
or distributed. The hive performs static data race detection on P
and finds potential data races (§3.2); these go onto P’s list of data
races maintained by the hive, and initially each one is marked as
“Unknown”. Then the hive generates an instrumented binary P′, which
users download ① and use instead of the original P. The
instrumentation in P′ is commanded by the hive, to activate the validation of
specific data races in P ②. Different users will typically be
validating different, albeit potentially overlapping, sets of data races from
P (§3.3). The first phase of validation, called dynamic context
inference (§3.3.1), may decide that a particular racing interleaving for data
race r is feasible, at which point it informs the hive ③. At this point,
the hive instructs all copies of P′ that are validating r to advance
r to the second validation phase ④. This second phase runs
RaceMob’s on-demand detection algorithm (§3.3.2), whose result can be
one of Race, NoRace, or Timeout ⑤. As results come in to the hive,
it updates the list of data races: if a “Race” result came in from the
field for data race r, the hive promotes r from “Unknown” to “True
Race”; the other answers are used to decide whether to promote r
from “Unknown” to “Likely False Positive” or not (§3.5). For data
races with status “Unknown” or “Likely False Positive,” the hive
redistributes “validation tasks” ⑥ among the available users (§3.4). We
now describe each step in further detail.
Figure 5 – RaceMob’s crowdsourced architecture: A static detection phase,
run on the hive, is followed by a dynamic validation phase on
users’ machines.
[Figure 5 diagram: the hive (Static Race Detection, Binary Instrumentation,
Command and Control, List of Races split into True race, Likely FP, and
Unknown) exchanges messages ① to ⑥ with the user site (Dynamic Race
Validation: Dynamic Context Inference, On-demand Detection).]
3.2 static data race detection
RaceMob can use any static data race detector, regardless of whether
it is complete or not. We chose Relay, a lockset-based data race detec-
tor [206]. Locksets describe the locks held by a program at any given
point in the program (as we explained in §2.3.2.2). Relay performs its
analysis bottom-up through the program’s control ﬂow graph while
computing function summaries that summarize which variables are
accessed and which locks are held in each function. Relay then com-
poses these summaries to perform data race detection: it ﬂags a data
race whenever it sees at least two accesses to memory locations that
are the same or may alias, and at least one of the accesses is a write,
and the accesses are not protected by at least one common lock.
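The three conditions above can be written down as a small predicate. This is only an illustrative sketch and not Relay's implementation: the Access record, its field names, and the trivial may_alias oracle are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One statically known memory access (hypothetical record for illustration)."""
    location: str          # abstract memory location named by the analysis
    is_write: bool         # True for a write, False for a read
    locks_held: frozenset  # lockset: locks held when the access executes


def may_alias(a, b):
    # Stand-in alias oracle: two accesses alias only if they name the same
    # abstract location. A real static detector uses a pointer analysis here.
    return a.location == b.location


def is_potential_race(a, b):
    """Lockset-style check: aliasing accesses, at least one write, no common lock."""
    return (may_alias(a, b)
            and (a.is_write or b.is_write)
            and not (a.locks_held & b.locks_held))
```

For instance, a write that holds lock m1 and a read that holds no lock on the same location are flagged, whereas the same pair is not flagged when both hold m1.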
Relay is complete (i.e., does not miss data races) if the program
does not have inline assembly and does not use pointer arithmetic.
Relay may become incomplete if conﬁgured to perform ﬁle-based
alias analysis or aggressive ﬁltering, but we disable these options in
RaceMob. As suggested in [123], it might be possible to make Relay
complete by integrating program analysis techniques for assembly
code [13] and by handling pointer arithmetic [216].
Based on the data race reports from Relay, RaceMob instruments
the suspected-racing memory accesses as well as all synchronization
operations in the program. This instrumentation will later be com-
manded (in production) by RaceMob to perform on-demand data
race detection.
The hive activates parts of the instrumentation on-demand when
the program runs, in different ways for different users. The activa-
tion mechanism aims to validate as many data races as possible by
uniformly distributing the validation tasks across the user popula-
tion.
3.3 dynamic data race validation
The hive tells the instrumented programs for which memory
accesses they should perform data race validation. The validation task sent by
the hive to the instrumented program consists of a data race candi-
date to validate and one desired order (of two possible) of the racing
accesses. We call these possible orders the primary and the alternate.
The dynamic data race validation phase has three stages: dynamic
context inference (§3.3.1), on-demand data race detection (§3.3.2), and
schedule steering (§3.3.3). Instrumentation for each stage is present
in all the programs; however, stages 2 and 3 are toggled on/off sepa-
rately from stage 1, which is always on. Next, we explain each stage
in detail.
3.3.1 Dynamic Context Inference
Dynamic context inference (DCI) is a lightweight analysis that partly
compensates for the inaccuracies of the static data race detection
phase. RaceMob performs DCI to ﬁgure out whether the statically
detected data races can occur in a dynamic program execution con-
text.
DCI validates two assumptions made by the static data race detec-
tor about a race candidate. First, the static detector’s abstract analysis
hypothesizes aliasing as the basis for some of its race candidates, and
DCI looks for concrete instances that can validate this hypothesis. Sec-
ond, the static detector hypothesizes that racing accesses are made by
different threads, and DCI aims to validate this as well. Once these
two hypotheses are conﬁrmed, the user site communicates this to
the hive, and the hive promotes the race candidate to the next phase.
Without a conﬁrmation from DCI, the race remains in the “Unknown”
state.
The motivation for DCI comes from our observation that the major-
ity of the potential data races detected by static data race detection
(53% in our evaluation) are false positives caused solely by alias
analysis inaccuracies and by the inability of static data race detection
to infer multithreaded program contexts.
For every potential data race r with racing instructions r1 and r2,
made by threads T1 and T2, respectively, DCI determines whether
the potentially racing memory accesses to addresses a1 and a2 made
by r1 and r2, respectively, alias with each other (i.e., a1 = a2),
and whether these accesses are indeed made by different threads (i.e.,
T1 ≠ T2). To do this, DCI keeps track of the address that each potentially
racing instruction accesses, along with the accessing thread’s ID
at runtime. Then, the instrumentation checks whether at least one such
pair of accesses (same address, different threads) is executed. If yes,
the instrumented program notifies the
hive, which promotes r to the next stages of validation (on-demand
data race detection and schedule steering) on all user machines where
r is being watched. If no such pair of accesses is executed by any
instrumented instance of the program, DCI continues watching r’s
potentially racing memory accesses until further notice.
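The check described above can be sketched as follows. This is a hedged illustration rather than RaceMob's instrumentation: the DciWatch class, the labels "r1" and "r2", and the choice to keep only the most recent (address, thread ID) record per instruction are assumptions made for the example.

```python
class DciWatch:
    """Tracks one race candidate with racing instructions labeled "r1" and "r2"."""

    def __init__(self):
        self.last = {}          # instruction label -> (address, thread_id)
        self.confirmed = False  # latched once both hypotheses are confirmed

    def record(self, instr, address, thread_id):
        """Called by the instrumentation each time r1 or r2 executes."""
        other = "r2" if instr == "r1" else "r1"
        self.last[instr] = (address, thread_id)
        if other in self.last:
            other_addr, other_tid = self.last[other]
            # Same address (a1 = a2) and different threads (T1 != T2).
            if other_addr == address and other_tid != thread_id:
                self.confirmed = True
        return self.confirmed
```

Once record returns True, the user site would notify the hive, which then promotes the candidate to the next validation stage.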
DCI has negligible runtime overhead (0.01%) on top of the binary
instrumentation overhead (0.77%); therefore, it is feasible to have DCI
always-on. DCI’s memory footprint is small: it requires maintaining
12 bytes of information per potential racing instruction (8 bytes for
the address, 4 bytes for the thread ID). DCI is sound because, for
every access pair that it reports as being made from different threads
and to the same address, DCI has clear concrete evidence from an
actual execution. DCI is of course not guaranteed to be complete.
3.3.2 On-Demand Data Race Detection
In this section, we explain how on-demand data race detection
works; for clarity, we restrict the discussion to a single potential data
race.
On-demand data race detection starts tracking happens-before rela-
tionships once the ﬁrst potentially racing access is made, and it stops
tracking once a happens-before relationship is established between
the ﬁrst accessing thread and all the other threads in the program (in
which case a “NoRace” result is sent to the hive). Tracking also stops
if the second access is made before such a happens-before relation-
ship is found (in which case a “Race” result is sent to the hive).
Intuitively, RaceMob tracks a minimal number of accesses and
synchronization operations. RaceMob needs to track both racing accesses
to validate a potential data race. However, RaceMob does not need
to track any memory accesses other than the target racing accesses,
because any other access is irrelevant to this data race.
[Figure 6 diagram: a timeline of threads T1, T2, and T3, each looping
over barrier_wait(b); firstAccess occurs in T1 (①), the loops follow (②),
and secondAccess occurs in T3 (③).]
Figure 6 – Minimal monitoring in DCI: For this example, DCI stops
tracking synchronization operations as soon as each thread goes once
through the barrier.
Sampling-based data race detection (e.g., Pacer [30]) adopts a sim-
ilar approach to on-demand data race detection by tracking synchro-
nization operations whenever sampling is enabled. The drawback of
Pacer’s approach is that it may start tracking synchronization opera-
tions too soon, even if the program is not about to execute a racing
access. RaceMob avoids this by enabling the tracking of synchronization
operations on demand, only when an access reported by the static data
race detection phase is executed.
RaceMob tracks synchronization operations—and thus, happens-
before relationships—using an efﬁcient, dynamic, vector-clock algo-
rithm similar to DJIT+ [164]. We maintain vector clocks for each
thread and synchronization operation, and the clocks are partially
ordered with respect to each other.
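As a reminder of how vector clocks encode the happens-before partial order, here is a minimal sketch of the two basic operations. It is a generic illustration, not RaceMob's or DJIT+'s code; the dictionary representation and the function names are assumptions.

```python
def vc_join(a, b):
    """Element-wise maximum of two vector clocks (dicts: thread id -> counter)."""
    return {t: max(a.get(t, 0), b.get(t, 0)) for t in a.keys() | b.keys()}


def happens_before(a, b):
    """True if clock a is ordered before clock b (a <= b element-wise, a != b)."""
    keys = a.keys() | b.keys()
    return (all(a.get(t, 0) <= b.get(t, 0) for t in keys)
            and any(a.get(t, 0) < b.get(t, 0) for t in keys))


# T1 performs an access, then releases a lock; T3 later acquires that lock.
first_access = {"T1": 1}
lock_clock = vc_join({}, first_access)       # release copies T1's clock to the lock
t3_clock = vc_join({"T3": 1}, lock_clock)    # acquire joins the lock's clock into T3
t2_clock = {"T2": 1}                         # T2 never synchronized with T1
```

In this example, first_access is ordered before t3_clock, while first_access and t2_clock are unordered in both directions, which is exactly the situation in which two accesses can race.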
We illustrate in Fig. 6 how on-demand data race detection stops
tracking synchronization operations, using a simple example derived
from the fmm program [188]: ﬁrstAccess executes in the beginning of
the program in T1 1©, and the program goes through a few thou-
sand iterations of synchronization-intensive code 2©. Finally, T3 ex-
ecutes secondAccess 3©. It is sufﬁcient to keep track of the vector
clocks of all threads only up until the ﬁrst time they go through the
barrier_wait statement, as this establishes a happens-before relation-
ship between ﬁrstAccess in T1 and any subsequent access in T2 and T3.
Therefore, on-demand data race detection stops keeping track of the
vector clocks of threads T1, T2, and T3 after they each go through
barrier_wait once.
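The start and stop conditions of on-demand detection can be sketched with a simplified model. Instead of full vector clocks, the sketch keeps only the set of threads already known to be ordered after the first access, which is sufficient for a single race candidate. The class and method names are assumptions, and unlike the description above, the sketch reports NoRace when the second access comes from a thread that is already ordered after the first access.

```python
class OnDemandDetector:
    """Simplified on-demand detection for one race candidate."""

    def __init__(self, threads):
        self.threads = set(threads)
        self.ordered = None   # threads ordered after the first access
        self.result = None    # "Race", "NoRace", or None while undecided

    @property
    def tracking(self):
        return self.ordered is not None and self.result is None

    def first_access(self, thread):
        self.ordered = {thread}   # tracking starts here
        self._check_done()

    def sync_edge(self, src, dst):
        """A release by src followed by the matching acquire by dst."""
        if self.tracking and src in self.ordered:
            self.ordered.add(dst)
            self._check_done()

    def barrier(self, participants):
        """All participants pass the same barrier."""
        if self.tracking and self.ordered & set(participants):
            self.ordered |= set(participants)
            self._check_done()

    def second_access(self, thread):
        if self.tracking:
            self.result = "NoRace" if thread in self.ordered else "Race"
        return self.result

    def _check_done(self):
        # Every thread is ordered after the first access: stop tracking.
        if self.ordered >= self.threads:
            self.result = "NoRace"
```

Replaying the scenario of Fig. 6 (firstAccess in T1, one barrier crossed by T1, T2, and T3, then secondAccess in T3) yields NoRace right after the first barrier, at which point tracking stops.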
RaceMob distinguishes between a static racing instruction in the
program and its dynamic instance at runtime, and it can enable on-
demand data race detection for any dynamic instance of a potential
racing access. However, practice shows that going beyond the ﬁrst
dynamic instances adds little value (§6.1).
Our experimental evaluation shows that on-demand data race de-
tection reduces the overhead of dynamic race detection (§6).
3.3.3 Schedule Steering
The schedule steering phase further improves RaceMob’s data race
detection coverage by exploring both the primary and the alternate
executions of potentially racing accesses. This has the beneﬁt of de-
tecting data races that may be hidden by accidental happens-before
relationships (as discussed in §1.2.2 and Fig. 2).
Schedule steering tries to enforce the order of the racing memory
accesses provided by the hive, i.e., either the primary or the alter-
nate. Whenever the intended order is about to be violated (i.e., the
undesired order is about to occur), RaceMob pauses the thread that is
about to make the access, by using a wait operation with a timeout τ,
to enforce the desired order. Every time the hive receives a “Timeout”
from a user, it increments τ for that user (up to a maximum value), to
more aggressively steer it toward the desired order, as described in
the next section.
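The pause-with-timeout mechanism can be illustrated with a short sketch. RaceMob's instrumentation is compiled into the program, so this Python analogue only shows the control flow; the function name steer and the use of threading.Event to signal that the other racing access has executed are assumptions.

```python
import threading


def steer(other_access_done, timeout_s):
    """Pause the calling thread before its racing access.

    other_access_done is set by the thread that should access first in the
    desired order. Returns "Ordered" if that access happened within the
    timeout, and "Timeout" otherwise (the result reported to the hive).
    """
    if other_access_done.wait(timeout=timeout_s):
        return "Ordered"
    return "Timeout"
```

A thread about to violate the desired order would call steer before its access; on "Timeout" it proceeds anyway, so the delay it can introduce is bounded by the timeout.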
Prior work [110, 152, 185] used techniques similar to schedule steer-
ing to detect whether a known data race can cause a failure or not.
RaceMob, however, uses schedule steering to increase the likelihood
of encountering a suspected race and to improve data race detection
coverage.
Our evaluation shows that schedule steering helps RaceMob to ﬁnd
two data races (one in Memcached and the other one in Pfscan) that
would otherwise be missed. It also helps RaceMob uncover failures
(e.g., data corruption, hangs, crashes) that occur as a result of data
races, thereby enabling developers to reason about the consequences
of data races and ﬁx the important ones early, before they affect more
users. However, users who do not wish to help in determining the
consequences of data races can easily turn off schedule steering.
3.4 crowdsourcing the validation
Crowdsourcing is deﬁned as the practice of obtaining needed ser-
vices, ideas, or content by soliciting contributions from a large group
of people and especially from the online community. RaceMob gath-
ers validation results from a large base of users and merges them to
come up with verdicts for potential data races.
RaceMob’s main motivation for crowdsourcing data race detection
is to access real user executions. This enables RaceMob, for instance,
to detect the important but often overlooked class of input-dependent
data races [110], i.e., races that occur only when a program is run with
a particular input. RaceMob found two such races in Aget, and we de-
tail them in §6.1.2. Crowdsourcing also enables RaceMob to leverage
many executions to establish statistical conﬁdence in the detection
verdict. We also believe crowdsourcing is more applicable today than
ever: collectively, software users have overwhelmingly more hardware
than any single software development organization, so leveraging
end-users for data race detection is particularly advantageous.
Furthermore, such distribution helps reduce the per-user overhead to
barely noticeable levels.
Validation is distributed across a population of users, with each
user receiving only a small number of data races to validate. The
hive distributes validation tasks, which contain the locations in the
program of two memory accesses suspected to be racing, along with
a particular order of these accesses. Completing the validation task
consists of conﬁrming, in the end-user’s instance of the program,
whether the indicated ordering of the memory accesses is possible.
If a user site receives more than one data race to validate, it will ﬁrst
validate the data race whose racing instruction is ﬁrst reached during
execution.
There exists a wide variety of possible assignment policies that can
be implemented in RaceMob. By default, if there are more users than
races, RaceMob initially randomly assigns a single data race to each
user for validation. Assigned validation tasks that fail to complete
within a time bound are then distributed to additional users as well,
in order to increase the probability of completing them. Such multi-
ple assignment could be done from the very beginning, in order to
reach a verdict sooner. The number of users asked to validate a data
race r could be based, for example, on the expected severity of r, as
inferred based on heuristics or the static analysis phase. Conversely,
in the unlikely case that there are more data races to validate than
users, the default policy is to initially distribute a single validation
task to each user, thereby leaving a subset of the data races unas-
signed. As users complete their validation tasks, RaceMob assigns
new tasks from among the unassigned data races. Once a data race is
conﬁrmed as a true data race, it is removed from the list of data races
being validated, for all users.
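The default initial assignment described above can be sketched as follows. The function name, the return shape, and the treatment of the case with equally many users and races (handled like the case with more users) are assumptions made for the example.

```python
import random


def initial_assignment(users, races, rng=random):
    """Return (user -> race, list of still-unassigned races)."""
    users, races = list(users), list(races)
    if not races:
        return {}, []
    if len(users) >= len(races):
        # At least as many users as races: one randomly chosen race per user.
        return {u: rng.choice(races) for u in users}, []
    # More races than users: one distinct race per user, the rest wait.
    shuffled = races[:]
    rng.shuffle(shuffled)
    return dict(zip(users, shuffled)), shuffled[len(users):]
```

As users complete their tasks, the hive would hand out races from the unassigned list, and it would drop a race from every assignment once that race is confirmed.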
During schedule steering, whenever a data race candidate is “stub-
born” and does not exercise the desired order despite the wait intro-
duced by the instrumentation, a “Timeout” is sent to the hive. The
hive then instructs the site to increase the timeout τ, to more aggres-
sively favor the alternate order; every subsequent timeout triggers a
new “Timeout” response to the hive and a further increase in τ. Once
τ reaches a conﬁgured upper bound, the hive instructs the site to
abandon the validation task. At this point, or even in parallel with
increasing τ, the hive could assign the same task to additional users.
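The hive's reaction to a “Timeout” report can be sketched as a small function. The text does not specify by how much τ grows on each report, so the fixed step used here is an assumption, as is the choice to abandon the task on the first report received after τ has reached its upper bound.

```python
def on_timeout(tau_ms, step_ms, max_ms):
    """Hive-side reaction to a "Timeout" report.

    Returns (next_tau_ms, abandon): the timeout the site should use next and
    whether the site should abandon the validation task.
    """
    if tau_ms >= max_ms:
        return max_ms, True               # upper bound already reached
    return min(tau_ms + step_ms, max_ms), False
```

With a step of 50 ms and a bound of 200 ms, a site starting at 50 ms would be steered with 100, 150, and 200 ms before being told to abandon the task.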
There are two important steps in achieving low overhead at user
sites. First, the timeout τ must be kept low. For example, to preserve
the low latency of interactive applications, RaceMob uses an upper
bound of 200 ms for τ; for I/O-bound applications, higher timeouts can
be conﬁgured. Second, the instrumentation at the user site disables
schedule steering for a given data race after a first steering attempt,
regardless of whether it succeeded or not; this is
particularly useful when the racing accesses are in a tight loop. Steer-
ing is resumed when a new value of τ comes in from the hive. It
is certainly possible that a later dynamic instance of the potentially
racing instruction might be able to exercise the desired order, had
steering not been disabled; nevertheless, in our evaluation we found
that RaceMob, using a single steering attempt, achieves higher accuracy
than state-of-the-art data race detectors.
RaceMob monitors each user’s validation workload and aims to
balance the global load across the user population. Rebalancing does
not require users to download a new version of the program, but
rather the hive simply toggles validation on/off at the relevant user
sites. In other words, each instance of the instrumented program
P′ is capable of validating, on demand, any of the data races found
during the static phase; the hive tells it which data race(s) is/are
of interest to that instance of P′.
If an instance of P′ spends more time doing data race validation
than the overall average, then the hive redistributes some of that in-
stance’s validation tasks to other instances. Additionally, RaceMob
reshufﬂes tasks whenever one program experiences more timeouts
than the average. In this way, we reduce the number of outliers, in
terms of runtime overhead, during the dynamic phase.
Crowdsourcing offers RaceMob the opportunity to tap into a large
number of executions, which makes it possible to only perform a
small amount of monitoring per user site without harming the preci-
sion of detection. This in turn reduces RaceMob’s runtime overhead,
making it more palatable to users and easier to adopt.
3.5 reaching a verdict
True race. RaceMob decides a data race candidate is a true data
race whenever either the primary or the alternate order is executed
at a user site with no intervening happens-before relationship between
the corresponding memory accesses. Among the true data
races, some can be specification-violating data races in the sense
of [110] (e.g., races that cause a crash or deadlock). In the case of a crash,
the RaceMob instrumentation catches the SIGSEGV signal and sub-
mits a crash report to the hive. In the case of an unhandled SIGINT
(e.g., the user pressed Ctrl-C), RaceMob prompts the user with a di-
alog asking whether the program has failed to meet expectations. If
yes, the hive is informed that the enforced schedule leads to a speciﬁ-
cation violation. Of course, users who ﬁnd this kind of “consequence
reporting” too intrusive can disable schedule steering altogether.
Likely false positive. RaceMob concludes that a potential data
race is likely a false positive if at least one user site reported a NoRace
result to the hive (i.e., on-demand race detection discovered a
happens-before relationship between the accesses in the primary or
alternate). RaceMob cannot provide a definitive verdict on whether
the race is a false positive or not, because there might exist some
other execution in which the purported false positive proves to be a
real data race (e.g., due to an unobserved input dependency). The
“Likely False Positive” verdict, especially if augmented with the number
of corresponding NoRace results received at the hive, can help
developers decide whether to prioritize fixing this particular data
race over others. RaceMob continues validation for “Likely False
Positive” data races for as long as the developers wish.
[Figure 7 diagram: states Unknown, True Race, and Likely FP; transitions
labeled Race, NoRace, Timeout [δ<max], and Timeout [δ≥max].]
Figure 7 – The state machine used by the hive to reach verdicts based on
reports from program instances. Transition edges are labeled
with validation results that arrive from instrumented program
instances; states are labeled with RaceMob’s verdict.
Unknown. As long as the hive receives no results from the validation
of a potential data race r, the hive keeps the status of the data
race “Unknown”. Similarly, if none of the program instances report
that they reached the maximum timeout value, r’s status remains
“Unknown”. However, if at least one instance reaches the maximum
timeout value for r, the corresponding report is promoted to “Likely
False Positive”.
The “True Race” verdict is deﬁnite: RaceMob has proof of the data
race occurring in a real execution of the program. The “Likely False
Positive” verdict is probabilistic: the more NoRace or Timeout reports
are received by the hive as a result of distinct executions, the higher
the probability that a data race report is indeed a false positive, even
though there is no precise probability value that RaceMob assigns to
this outcome.
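The verdict rules of this section can be summarized as a transition function. This is a sketch derived from the prose above, not the hive's code; the function name and the boolean flag that marks a Timeout report at the maximum timeout value are assumptions.

```python
def next_state(state, result, reached_max_timeout=False):
    """Return the hive's verdict for a race after one report from a user site."""
    if result == "Race":
        return "True Race"                    # definite, from any state
    if state == "True Race":
        return state                          # a confirmed race stays confirmed
    if result == "NoRace":
        return "Likely False Positive"
    if result == "Timeout" and reached_max_timeout:
        return "Likely False Positive"
    return state                              # e.g., Timeout below the maximum
```

Note that a “Likely False Positive” verdict is not final: a later Race report still moves the candidate to “True Race”.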
3.6 implementation details
RaceMob can use any data race detector that outputs data race can-
didates; preferably it should be complete (i.e., not miss data races).
We use Relay, which analyzes programs that are turned into CIL, an
intermediate language for C programs [153]. The instrumentation engine
at the hive is based on LLVM [122]. We wrote a 500-LOC plugin
that converts Relay reports to the format required by our instrumen-
tation engine.
The instrumentation engine is an LLVM static analysis pass. It
avoids instrumenting empty loop bodies that have a data race on
a variable in the loop condition (e.g., of the form while(notDone){}).
These loops occur often in ad-hoc synchronization [220]. Not instru-
menting such loops avoids excessive overhead that results from run-
ning the instrumentation frequently. When such loops involve a data
race candidate, they are reported by the hive directly to developers.
We encountered this situation in two of the programs we evaluated,
and both cases were true data races (thus validating prior work that
advocates against ad-hoc synchronization [220]), so this optimization
did not affect RaceMob’s accuracy.
Whereas Fig. 7 indicates three possible results from user sites (Race,
NoRace, and Timeout), our prototype also implements a fourth one
(NotSeen), to indicate that a user site has not witnessed the race it
was expected to monitor. Technically, NotSeen can be inferred by the
hive from the absence of any other results. However, for efficiency
purposes, we have hooks at the exit of main, as well as in the signal
handlers, that send a NotSeen message to the hive whenever the
program terminates without having made progress on the validation
task.
RaceMob uses compiler-based instrumentation, but other techniques
are also possible. For example, we plan to use dynamic binary rewrit-
ing, which would also allow us to dynamically remove instrumen-
tation for data races that are not enabled in a given instance of the
program. The instrumentation is for the most part inactive; the hive
activates part of the instrumentation on-demand, when the program
runs.
4
GIST: ROOT CAUSE DIAGNOSIS OF
IN-PRODUCTION FAILURES
In this chapter, we present failure sketching, a technique that auto-
matically produces a high-level execution trace called the failure sketch
that includes the statements that lead to a failure and the differences
between the properties of failing and successful program executions.
We show in our evaluation (§6) that the differences between failing
and successful executions, which are displayed on the failure sketch,
point to the root causes [127, 180, 231] of failures for the bugs we
evaluated (§6.2). As mentioned earlier in §2.1.3, in the context of our
work we use a statistical definition of a root cause: we consider
events that are primarily correlated with a failure as the root
causes of that failure.
Root cause diagnosis allows detecting bugs as well as providing
an explanation of how the failure associated with the bug occurred:
which branches were taken, which data values were computed, which
thread schedule led to the failure, etc. After all, understanding how
a failure occurred requires detecting what the failure-causing bug is.
Therefore, in the rest of this chapter, when we refer to root cause
diagnosis, we imply both detection of a bug and an explanation of
how a given bug led to a failure. Root cause diagnosis of a failure
subsumes the detection of the bug that caused the failure.
Failure sketching is a general technique for root cause diagnosis of
concurrency bugs as well as some sequential bugs due to failing input
values or rare paths that a program takes. Despite this generality, in
the context of this dissertation, and especially when evaluating failure
sketching, we focus on concurrency bugs.
Fig. 8 shows an example failure sketch for a bug in Pbzip2, a mul-
tithreaded compression tool [72]. Time ﬂows downward, and execu-
tion steps are enumerated along the ﬂow of time. The failure sketch
shows the statements (in this case statements from two threads) that
affect the failure and their order of execution with respect to the enu-
merated steps (i.e., the control ﬂow). The arrow between the two
statements in dotted rectangles indicates the difference between failing
executions and successful ones. In particular, in failing executions,
the statement f->mut = NULL from T1 is executed before the statement
mutex_unlock(f->mut) in T2. In non-failing executions, cons returns before
main sets f->mut to NULL. The failure sketch also shows the value
of the variable f->mut (i.e., the data ﬂow) in a dotted rectangle in step
7, indicating that this value is 0 in step 7 only in failing runs. A devel-
oper can use the failure sketch to ﬁx the bug by introducing proper
synchronization that eliminates the offending thread schedule. This
is exactly how pbzip2 developers fixed this bug, albeit four months
after it was reported [72].
Failure Sketch for Pbzip2 bug #1
Type: Concurrency bug, segmentation fault
Thread T1:
main(){
  queue* f = init(size);
  create_thread(cons, f);
  ...
  free(f->mut);
  f->mut = NULL;
  ...
}
Thread T2:
cons(queue* f){
  ...
  mutex_unlock(f->mut);
}
Value of f->mut at step 7 (failing runs only): 0
Failure (segfault)
Figure 8 – The failure sketch of pbzip2 bug.
The insight behind the work presented in this chapter is that failure
sketches can be built automatically, by using a combination of
static program analysis and cooperative dynamic analysis. The use
of a recent hardware feature in Intel CPUs, Intel Processor Trace [90],
helps keep runtime performance overhead low (3.74% in our evaluation).
We built a prototype, Gist, that automatically generates a failure
sketch for a given failure. Given a failure, Gist ﬁrst statically com-
putes a program slice that contains program statements that can po-
tentially affect the program statement where the failure manifests
itself. Then, Gist performs data and control ﬂow tracking in a co-
operative setup by targeting either multiple executions of the same
program in a data center or users who execute the same program.
Gist uses hardware watchpoints to track the values of data items in
the slice, and uses Intel Processor Trace [90] to track the control ﬂow.
Although Gist relies on Intel Processor Trace and hardware watch-
points for practical and low-overhead control and data ﬂow track-
ing, Gist’s primary novelty is in the combination of static program
analysis and dynamic runtime tracing to judiciously select how and
when to trace program execution in order to best extract the informa-
tion for failure sketches.
In the rest of this chapter, we give an overview of Gist’s design
(§4.1). We then explain each component of Gist’s design in detail:
we ﬁrst describe static slicing in Gist (§4.2), we then explain how Gist
performs slice reﬁnement (§4.3), and we ﬁnally describe how Gist
identiﬁes the root cause of a failure (§4.4).
4.1 design overview
Gist, our system for building failure sketches, has three main components:
the static analyzer, the failure sketch computation engine,
and the runtime that tracks production runs. The static analyzer and
the failure sketch computation engine constitute the server side of
Gist, and they can be centralized or distributed, as needed. The runtime
constitutes the client side of Gist, and it runs in a cooperative
setting such as in a data center or on multiple users’ machines, similar
to RaceMob [111].
[Figure 9 diagram: Gist-server (static analyzer, failure sketch engine)
takes program P (source) and a failure report (coredump, stack trace,
etc.) and exchanges the backward slice, instrumentation, refinement, and
runtime traces with Gist-client (the runtime, which tracks executions of
P with watchpoints and Intel PT), producing the failure sketch.]
Figure 9 – The architecture of Gist
The usage model of Gist is shown in Fig. 9. Gist takes as input a
program (source code and binary) and a failure report ① (e.g., stack
trace, the statement where the failure manifests itself, etc.). Gist, being
a developer tool, has access to the source code. Using these inputs,
Gist computes a backward slice [213] by computing the set of program
statements that potentially affect the statement where the failure
occurs. Gist uses an interprocedural, path-insensitive and flow-sensitive
backward slicing algorithm. Then, Gist instructs its runtime,
running in a data center or at user endpoints, ② to instrument the
programs and gather more traces (e.g., branches taken and values
computed at runtime). Gist then uses these traces to refine the slice
③: refinement removes from the slice the statements that don’t get
executed during the executions that Gist monitors, and it adds to the
slice statements that were not identified as being part of the slice
initially. Refinement also determines the inter-thread execution order
of statements accessing shared variables and the values that the program
computes. Refinement is done using hardware watchpoints for
data flow and Intel Processor Trace (Intel PT) for control flow. Gist’s
failure sketch engine gathers execution information from failing and
successful runs ④. Then, Gist determines the differences between
failing and successful runs and builds a failure sketch ⑤ for the
developer to use.
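The refinement and differencing steps can be sketched with plain set operations. This is a deliberately simplified illustration and not Gist's algorithm: the function names are hypothetical, statements and events are modeled as opaque labels, and the difference is computed as a strict set difference rather than the statistical correlation that Gist's root cause definition relies on.

```python
def refine_slice(static_slice, executed, dynamic_deps):
    """Drop slice statements that never ran; add executed ones the slice missed.

    dynamic_deps stands for statements observed at runtime (for example via
    watchpoints) to touch data that the slice depends on.
    """
    executed = set(executed)
    return (set(static_slice) & executed) | (set(dynamic_deps) & executed)


def differences(failing_runs, successful_runs):
    """Events seen in every failing run and in no successful run.

    Requires at least one failing run.
    """
    in_all_failing = set.intersection(*map(set, failing_runs))
    in_any_success = set().union(*map(set, successful_runs))
    return in_all_failing - in_any_success
```

For the Pbzip2 example, the event that survives such a difference would be the ordering in which f->mut is set to NULL before mutex_unlock(f->mut) executes.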
Gist operates in a cooperative setting [111, 127] where multiple in-
stances of the same software execute in a data center or in multiple
users’ machines. Gist’s server side (e.g., a master node) performs of-
ﬂine analysis and distributes instrumentation to its client side (e.g.,
a node in a data center). Gist incurs low overhead, so it can be kept
always-on and does not need to resort to sampling an execution [127],
thus avoiding missing information that can increase root cause diag-
nosis latency.
Gist operates iteratively: the instrumentation and refinement continue
as long as developers see fit, continuously improving the accuracy
of failure sketches. Gist generates a failure sketch after a first
failure using static slicing. Our evaluation shows that, in some cases,
this initial sketch is sufﬁcient for root cause diagnosis, whereas in
other cases reﬁnement is necessary (§4.3).
We now describe how each component of Gist works and explain
how they solve the challenges presented in §1.2. We ﬁrst describe
how Gist computes the static slice followed by slice reﬁnement via
adaptive tracking of control and data ﬂow information. Then, we
describe how Gist identiﬁes the root cause of a failure using multiple
failing and successful runs.
4.2 static slice computation
Gist uses an interprocedural, path-insensitive and ﬂow-sensitive
backward slicing algorithm to identify the program statements that
may affect the statement where the failure manifests itself at runtime.
We chose to make Gist’s slicing algorithm interprocedural because
failure sketches can span the boundaries of functions. The algorithm
is path-insensitive in order to avoid the cost of path-sensitive analy-
ses that do not scale well [5, 173]. However, this is not a shortcoming,
because Gist can recover precise path information at runtime using
low-cost control ﬂow tracking (§4.3.2). Finally, Gist’s slicing algo-
rithm is ﬂow-sensitive because it traverses statements in a speciﬁc or-
der (backwards) from the failure location. Flow-sensitivity generates
static slices with statements in the order they appear in the program
text (except some out-of-order statements due to path-insensitivity,
which are ﬁxed using runtime tracking), thereby helping the devel-
oper to understand the ﬂow of statements that lead to a failure.
Algorithm 1 describes Gist’s static backward slicing: it takes as
input a failure report (e.g., a coredump, a stack trace) and the pro-
gram’s source code, and it outputs a static backward slice. For clarity,
we deﬁne several terms we use in the algorithm. CFG refers to the
control ﬂow graph of the program (Gist computes a whole-program
CFG as we explain shortly). An item (line 7) is an arbitrary program
element. A source (line 8, 16) is an item that is either a global vari-
able, a function argument, a call, or a memory access. Items that
are not sources are compiler intrinsics, debug information, and inline
assembly. The deﬁnitions for a source and an item are speciﬁc to
LLVM [122], which is what we use for the prototype (§4.5). The func-
tion getItems (line 1) returns all the items in a given statement (e.g.,
the operands of an arithmetic operation). The function getRetValues
(line 11) performs an intraprocedural analysis to compute and return
the set of items that can be returned from a given function call. The
function getArgValues (line 14) computes and returns the set of argu-
ments that can be used when calling a given function. The function
getReadOperand (line 20) returns the item that is read, and the function
getWrittenOperand (line 23) returns the item that is written.
Input : Failure report report, program source code program
Output : Static backward slice slice
 1 workSet ← getItems(failingStmt)
 2 function init ()
 3     failingStmt ← extractFailingStatement(report)
 4 function computeBackwardSlice (failingStmt, program)
 5     cfg ← extractCFG(program)
 6     while !workSet.empty() do
 7         item ← workSet.pop()
 8         if isSource(item) then
 9             slice.push(item)
10         if isCall(item) then
11             retValues ← getRetValues(item, cfg)
12             workSet ← workSet ∪ retValues
13         else if isArgument(item) then
14             argValues ← getArgValues(item, cfg)
15             workSet ← workSet ∪ argValues
16 function isSource (item)
17     if item is (global || argument || call || memory access) then
18         return true
19     else if item is read then
20         workSet ← workSet ∪ item.getReadOperand()
21         return true
22     else if item is write then
23         workSet ← workSet ∪ item.getWrittenOperand()
24         return true
25     return false
Algorithm 1 : Backward slice computation (Simplified)
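The worklist structure of Algorithm 1 can be illustrated with a small Python sketch. This is only a toy model under stated assumptions: the DEPENDS_ON map and the item names are hypothetical stand-ins for the operands, return values, and call arguments that Gist extracts from LLVM, and the function is not Gist's implementation.

```python
from collections import deque

# Hypothetical toy program: each item maps to the items it is computed from.
# In Gist, these edges would come from getItems/getRetValues/getArgValues.
DEPENDS_ON = {
    "assert(p != NULL)": ["p"],
    "p": ["call get_buf()"],
    "call get_buf()": ["g_buf"],  # value returned by the callee
    "g_buf": [],                  # global variable
    "unrelated_counter": [],      # never reached from the failure
}

def backward_slice(failing_stmt, depends_on):
    """Return the items that may affect failing_stmt, in discovery order."""
    work = deque(depends_on.get(failing_stmt, []))
    slice_items, seen = [], set()
    while work:
        item = work.popleft()
        if item in seen:  # skip items already processed (handles cycles)
            continue
        seen.add(item)
        slice_items.append(item)
        work.extend(depends_on.get(item, []))
    return slice_items

slice_items = backward_slice("assert(p != NULL)", DEPENDS_ON)
print(slice_items)
```

The sketch walks backwards from the failing statement only, so items that the failure does not depend on never enter the slice.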
Gist’s static slicing algorithm differs from classic static slicing [213]
in two key ways:
First, Gist addresses a challenge that arises for multithreaded pro-
grams because of the implicit control ﬂow edges that get created due
to thread creations and joins. For this, Gist uses a compiler pass to
build the thread interprocedural control ﬂow graph (TICFG) of the pro-
gram [219]. An interprocedural control ﬂow graph (ICFG) of a program
connects each function’s CFG with function call and return edges.
TICFG then augments ICFG to contain edges that represent thread
creation and join statements (e.g., a thread creation edge is akin to a
callsite with the thread start routine as the target function). TICFG
represents an overapproximation of all the possible dynamic control
ﬂow behaviors that a program can exhibit at runtime. TICFG is use-
ful for Gist to track control ﬂow that is implicitly created via thread
creation and join operations (§4.3.2).
Second, unlike other slicing algorithms [190], Gist does not use
static alias analysis. Alias analysis could determine an overapproxi-
mate set of program statements that may affect the computation of a
given value and augment the slice with this information. Gist does
not employ static alias analysis because, in practice, it can be over
50% inaccurate [111], which would increase the static slice size that
Gist would have to monitor at runtime, thereby increasing its per-
formance overhead. Gist compensates for the lack of alias analysis
with runtime data ﬂow tracking, which adds the statements that Gist
misses to the static slice (§4.3.3).
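The step from an ICFG to a TICFG described above can be sketched as follows. The graph, the node names, and the build_ticfg helper are hypothetical and chosen for illustration; the sketch assumes that each thread start routine has a single entry and a single exit node, which Gist's compiler pass does not necessarily require.

```python
# Hypothetical ICFG as an adjacency map: node -> set of successor nodes.
icfg = {
    "main:entry": {"main:create"},
    "main:create": {"main:join"},  # site that calls pthread_create(..., worker, ...)
    "main:join": {"main:exit"},    # site that calls pthread_join(...)
    "main:exit": set(),
    "worker:entry": {"worker:exit"},
    "worker:exit": set(),
}

def build_ticfg(icfg, thread_sites):
    """Copy the ICFG and add thread creation and join edges.

    thread_sites lists (create_node, join_node, start_routine) triples; the
    start routine is assumed to own '<name>:entry' and '<name>:exit' nodes.
    """
    ticfg = {node: set(succs) for node, succs in icfg.items()}
    for create_node, join_node, routine in thread_sites:
        # A creation edge behaves like a call to the thread start routine.
        ticfg[create_node].add(f"{routine}:entry")
        # A join edge leads from the routine's exit back to the joining site.
        ticfg[f"{routine}:exit"].add(join_node)
    return ticfg

ticfg = build_ticfg(icfg, [("main:create", "main:join", "worker")])
print(sorted(ticfg["main:create"]))
```

Keeping the original ICFG edges and only adding edges matches the overapproximating nature of the TICFG: the result contains every path of the ICFG plus the implicit thread-related ones.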
The static slice that Gist computes has some extraneous items that
do not pertain to the failure, because the slicing algorithm lacks ac-
tual execution information. Gist weeds out this information using
accurate control ﬂow tracking at runtime (§4.3.2).
4.3 slice refinement
Slice reﬁnement removes the extraneous statements from the slice
and adds to the slice the statements that could not be statically iden-
tiﬁed. Together with root cause identiﬁcation (§4.4), the goal of slice
reﬁnement is to build what we call the ideal failure sketch.
We deﬁne an ideal failure sketch to be one that: 1) contains only state-
ments that have data and/or control dependencies to the statement
where the failure occurs; 2) shows the failure predicting events that
have the highest positive correlation with the occurrence of failures.
Different developers may have different standards as to what is the
“necessary” information for root cause diagnosis; nevertheless, we be-
lieve that including all the statements that are related to a failure and
identifying the failure predicting events, constitute a reasonable and
practical set of requirements for root cause diagnosis. Failure pre-
dictors are identiﬁed by determining the difference of key properties
between failing and successful runs.
For example, failure sketches display the partial order of statements
involved in data races and atomicity violations. However, certain de-
velopers may want to know the total order of all the statements in
an ideal failure sketch. In our experience, focusing on the partial or-
der of statements that matter from the point of view of root cause
diagnosis is more useful than having a total order of all statements.
Moreover, obtaining the total order of all the statements in a failure
sketch would be difﬁcult without undue runtime performance over-
head using today’s technology.
We now describe Gist’s slice reﬁnement strategy in detail. We ﬁrst
describe adaptive tracking of a static slice to reduce the overhead
of reﬁnement (§4.3.1), then we describe how Gist tracks the control
ﬂow (§4.3.2) and the data ﬂow (§4.3.3) to 1) add to the slice state-
ments that get executed in production but are missing from the slice,
and 2) remove from the slice statements that don’t get executed in
production.
Figure 10 – Adaptive slice tracking in Gist. Panels: (a) the static
slice, with the failure and the root cause marked; (b) AsT, 1st
iteration, σ1 = 2; (c) AsT, 2nd iteration, σ2 = 4; (d) AsT, 3rd
iteration, σ3 = 8.
4.3.1 Adaptive Slice Tracking
Gist employs Adaptive Slice-Tracking (AsT) to track increasingly
larger portions of the static slice, until it builds a failure sketch that
contains the root cause of the failure that it targets. Gist performs
AsT by dynamically tracking control and data ﬂow while the soft-
ware runs in production. AsT does not track all the control and data
elements in the static slice at once in order to avoid introducing per-
formance overhead.
It is challenging to pick the size of the slice, σ, to monitor at
runtime, because 1) too large a σ would cause Gist to do excessive
runtime tracking and increase overhead; 2) too small a σ may cause Gist
to track too many runs before identifying the root cause, and so
increase the latency of root cause diagnosis.
Based on previous observations that root causes of most bugs are
close to the failure locations [170, 212, 232], Gist initially enables run-
time tracking for a small number of statements (σ = 2 in our ex-
periments) backward from the failure point. We use this heuristic
because even a simple concurrency bug is likely to be caused by two
statements from different threads. This also allows Gist to avoid ex-
cessive runtime tracking if the root cause is close to the failure (i.e.,
the common case). Nonetheless, to reduce the latency of root cause
diagnosis, Gist employs a multiplicative increase strategy for further
tracking the slice in other production runs. More speciﬁcally, Gist
doubles σ for subsequent AsT iterations, until a developer decides
that the failure sketch contains the root cause and instructs Gist to
stop AsT.
Consider the static slice for a hypothetical program in Fig. 10.a,
which displays the failure point (bottom-most solid line) and the root
cause (dashed line). In the ﬁrst iteration (Fig. 10.b), AsT starts track-
ing σ1 = 2 statements back from the failure location. Gist cannot
build a failure sketch that contains the root cause of this failure by
tracking 2 statements, as the root cause lies further backwards in the
slice. Therefore, in the second and third iterations (Fig. 10.c-d), AsT
tracks σ2 =4 and σ3 =8 statements, respectively. Gist can build a
failure sketch by tracking 8 statements.
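The multiplicative increase in this example can be written down as a short Python sketch. It assumes, purely for illustration, that the distance from the failure to the root cause is known in advance; in Gist, the developer decides when the sketch contains the root cause. The function name and its parameters are hypothetical.

```python
def ast_iterations(root_cause_distance, initial_sigma=2):
    """Return the sigma used in each AsT iteration until the tracked window
    (sigma statements back from the failure) covers the root cause."""
    if root_cause_distance < 1:
        raise ValueError("distance must be at least 1")
    sigmas = []
    sigma = initial_sigma
    while True:
        sigmas.append(sigma)
        if sigma >= root_cause_distance:
            return sigmas
        sigma *= 2  # multiplicative increase for the next iteration

# A root cause 7 statements back from the failure needs three iterations.
print(ast_iterations(7))
```

Doubling σ keeps the number of iterations logarithmic in the root-cause-to-failure distance, at the price of tracking up to twice as many statements as strictly necessary in the last iteration.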
In summary, AsT is a heuristic to resolve the tension between per-
formance overhead, root cause diagnosis latency, and failure sketch
accuracy. We elaborate on this tension in our evaluation (§6.2). AsT
does not limit Gist’s ability to track larger slices and build failure
sketches for bugs with greater root-cause-to-failure distances, although
it may increase the latency of root cause diagnosis.
4.3.2 Tracking Control Flow
Gist tracks control ﬂow to increase the accuracy of failure sketches
by identifying which statements from the slice get executed during
the monitored production runs. Static slicing lacks real execution in-
formation such as dynamically computed call targets, therefore track-
ing the dynamic control ﬂow is necessary for high accuracy failure
sketches.
Static slicing and control ﬂow tracking jointly improve the accuracy
of failure sketches: control ﬂow traces identify statements that get
executed during production runs that Gist monitors, whereas static
slicing identiﬁes statements that have a control or data dependency
to the failure. The intersection of these statements represents the
statements that relate to the failure and that actually get executed
in production runs. Gist statically determines the locations where
control ﬂow tracking should start and stop at runtime in order to
identify which statements from the slice get executed.
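The intersection described above amounts to a simple filter. The following Python sketch uses hypothetical statement names, and it assumes that the control flow trace has already been decoded into a set of executed statements.

```python
# Statements in the tracked portion of the static slice, in program order
# (hypothetical names).
static_slice = ["stmt1", "stmt2", "stmt3", "stmt4"]

# Statements observed in the decoded control flow trace of one run
# (hypothetical; stmt9 executed but is unrelated to the failure).
executed = {"stmt1", "stmt3", "stmt9"}

def refine(static_slice, executed):
    """Keep only slice statements that actually executed, preserving order."""
    return [stmt for stmt in static_slice if stmt in executed]

refined = refine(static_slice, executed)
print(refined)
```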
Although control ﬂow can be tracked in a relatively straightforward
manner using software instrumentation [135], hardware facilities of-
fer an opportunity for a design with lower overhead. Our design em-
ploys Intel PT, a set of new hardware monitoring features for debug-
ging. In particular, Intel PT records the execution ﬂow of a program
and outputs a highly-compressed trace (~0.5 bits per retired assembly
instruction) that describes the outcome of all branches executed by a
program. Intel PT can be programmed to trace only user-level code
and can be restricted to certain address spaces. Additionally, with the
appropriate software support, Intel PT can be turned on and off by
writing to processor-speciﬁc registers. Intel PT is currently available
in Broadwell processors, and we control it using our custom kernel
driver (§4.5). Future families of Intel processors are also expected to
provide Intel PT functionality.
We explain how Gist tracks the statements that get executed via
control ﬂow tracking using the example shown in Fig. 11.a. The ex-
ample shows a static slice composed of three statements (stmt1, stmt2,
stmt3). The failure point is stmt3. Let us assume that, as part of AsT,
Gist tracks these three statements. At a high level, Gist identiﬁes all
entry points and exit points to each statement and starts and stops
control-ﬂow tracking at each entry point and at each exit point, re-
spectively. Tracing is started to capture control ﬂow if the statements
in the static slice get executed, and is stopped once those statements
complete execution. We use postdominator analysis to optimize out
unnecessary tracking.
In this example, Gist starts its analysis with stmt1. Gist converts
the branch decision information to statement execution information
using the technique shown in box I in Fig. 11.a. It ﬁrst determines bb1,
the basic block in which stmt1 resides, and then determines the pre-
decessor basic blocks p11...p1n of bb1. The predecessor basic blocks
of bb1 are blocks from which control can ﬂow to bb1 via branches.
As a result, Gist starts control ﬂow tracking in each predecessor basic
block p11...p1n (i.e., entry points). If Gist’s control ﬂow tracking de-
termines at runtime that any of the branches from these predecessor
blocks to bb1 was taken, Gist deduces that stmt1 was executed.
Gist uses an optimization when a statement it already processed
strictly dominates the next statement in the static slice. A statement d
strictly dominates a statement n (written d sdom n) if every path from
the entry node of the control ﬂow graph to n goes through d, and
d = n. In our example, stmt1 sdom stmt2, therefore, Gist will have
already started control ﬂow tracking for stmt1 when the execution
reaches stmt2, and so it won’t need special handling to start control
ﬂow tracking for stmt2.
However, if a statement that Gist processed does not strictly dom-
inate the next statement in the slice, Gist stops control ﬂow tracking.
In our example, after executing stmt2, since the execution may never
reach stmt3, Gist stops control ﬂow tracking after stmt2 gets executed.
Otherwise tracking could continue indeﬁnitely and impose unneces-
sary overhead. Intuitively, Gist stops control ﬂow tracking right after
stmt2 gets executed as shown in box II of Fig. 11.a. More precisely,
Gist stops control ﬂow tracking after stmt2 and before stmt2’s imme-
diate postdominator. A node p is said to strictly postdominate a node
n if all the paths from n to the exit node of the control ﬂow graph pass
through p, and n = p. The immediate postdominator of a node n (ip-
dom(n)) is a unique node that strictly postdominates n and does not
strictly postdominate any other strict postdominators of n.
Finally, as shown in box III in Fig. 11.a, Gist processes stmt3 using
the combination of techniques it used for stmt1 and stmt2. Because
control ﬂow tracking was stopped after stmt2, Gist ﬁrst restarts it
at each predecessor basic block p31...p3n of the basic block bb3 that
contains stmt3, then Gist stops it after the execution of stmt3.
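The start/stop decisions for the three-statement example can be sketched in Python. The CFG below is hypothetical and built so that bb1 dominates bb2 while bb2 does not dominate bb3. The sketch simplifies in two ways that are assumptions of this example: it reasons about basic blocks rather than individual statements, and it records only that tracking stops after a block, leaving out the exact placement before the immediate postdominator.

```python
def dominators(cfg, entry):
    """Iteratively compute dominator sets for a CFG given as node -> successors."""
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for node, succs in cfg.items():
        for succ in succs:
            preds[succ].add(node)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            common = set.intersection(*(dom[p] for p in preds[n])) if preds[n] else set()
            new = {n} | common
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom, preds

def tracking_plan(cfg, entry, slice_blocks):
    """Decide where control flow tracking starts and stops for the slice."""
    dom, preds = dominators(cfg, entry)
    plan, tracking = [], False
    for i, bb in enumerate(slice_blocks):
        if not tracking:
            # Start tracking in every predecessor block (the entry points).
            plan.append(("start", sorted(preds[bb])))
            tracking = True
        nxt = slice_blocks[i + 1] if i + 1 < len(slice_blocks) else None
        # Keep tracking only if this block dominates the next slice block.
        if nxt is None or bb not in dom[nxt]:
            plan.append(("stop_after", bb))
            tracking = False
    return plan

# Hypothetical CFG: bb3 is reachable both through bb2 and around it.
cfg = {
    "entry": ["p11", "p12", "p32"],
    "p11": ["bb1"], "p12": ["bb1"],
    "bb1": ["bb2"],
    "bb2": ["p31", "exit"],
    "p31": ["bb3"], "p32": ["bb3"],
    "bb3": ["exit"],
    "exit": [],
}
plan = tracking_plan(cfg, "entry", ["bb1", "bb2", "bb3"])
print(plan)
```

The resulting plan mirrors boxes I to III: tracking starts at the predecessors of bb1, continues through bb2 without extra work, stops after bb2, and is restarted at the predecessors of bb3.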
4.3.3 Tracking Data Flow
Similar to control flow, data flow can also be tracked in software;
however, this can be prohibitively expensive [204]. Existing hardware
support can be used for low overhead data ﬂow tracking. In this
section, we describe why and how Gist tracks data ﬂow.
Figure 11 – Example of control (a) and data (b) flow tracking in Gist. Solid
horizontal lines are program statements, circles are basic blocks.
Panel (a) marks the start and stop points of control flow tracking for
stmt1, stmt2, and stmt3 (boxes I, II, and III); panel (b) shows
insertWatchpoint(&x) placed between idom2 and stmt2 (read(x)).
Determining the data ﬂow in a program increases the accuracy of
failure sketches in two ways:
First, Gist tracks the total order of memory accesses that it mon-
itors to increase the accuracy of the control ﬂow shown in the fail-
ure sketch. Tracking the total order is important mainly for shared
memory accesses from different threads, for which Intel PT does not
provide order information. Gist uses this order information in failure
sketches to help developers reason about concurrency bugs.
Second, while tracking data ﬂow, Gist discovers statements that
access the data items in the monitored portion of the slice that were
missing from that portion of the slice. Such statements exist because
Gist’s static slicing does not use alias analysis (due to alias analysis’
inaccuracy) for determining all statements that can access a given
data item.
Gist uses hardware watchpoints present in modern processors to
track the data ﬂow (e.g., x86 has 4 hardware watchpoints [88]). They
enable tracking the values written to and read from memory locations
with low runtime overhead.
For a given memory access, Gist inserts a hardware watchpoint for
the address of the accessed variable at a point right before the access
instruction. More precisely, the inserted hardware watchpoint must
be located before the access and after the immediate dominator of
that access. Fig. 11.b shows an example, where Gist places a hardware
watchpoint for the address of variable x, just before stmt2 (read(x)).
Gist employs several optimizations to economically use its budget
of limited hardware watchpoints when tracking the data ﬂow. First,
Gist only tracks accesses to shared variables: it does not place a
hardware watchpoint for the variables allocated on the stack. Second,
Gist maintains a set of active hardware watchpoints to make sure not to
place a second hardware watchpoint at an address that it is already
watching.
If the statements in the slice portion that AsT monitors access more
memory locations than the available hardware watchpoints on a user
machine, Gist uses a cooperative approach to track the memory
locations across multiple production runs. In a nutshell, Gist's
collaborative approach instructs different production runs to monitor
different sets of memory locations in order to monitor all the memory
locations that are in the slice portion that Gist monitors. However,
in practice, we did not encounter this situation (§6.2).
Figure 12 – Four common atomicity violation patterns (RWR, WWR, RWW,
WRW). Adapted from [8]. In each pattern, thread T1 performs the first
and third accesses to x, and thread T2 performs the second.
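The cooperative use of a limited watchpoint budget can be sketched as a partitioning problem. The Python example below is an illustration under assumptions: the addresses are made up, the helper name is hypothetical, and the simple in-order chunking is one possible policy rather than the one Gist necessarily uses.

```python
def assign_watchpoints(addresses, num_watchpoints=4):
    """Split the addresses to monitor into per-run groups that each fit
    into the hardware watchpoint budget."""
    # Drop duplicates while keeping order, so that no address is watched twice.
    unique = list(dict.fromkeys(addresses))
    return [unique[i:i + num_watchpoints]
            for i in range(0, len(unique), num_watchpoints)]

# Six distinct (hypothetical) addresses with a budget of 4 watchpoints per run.
groups = assign_watchpoints(
    [0x1000, 0x1008, 0x1010, 0x1008, 0x1018, 0x1020, 0x1028])
for run, group in enumerate(groups):
    print(run, [hex(a) for a in group])
```

Each group would be handed to a different production run, so that all monitored locations are covered once every group has been observed.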
4.4 identifying the root cause
In this section, we describe how Gist determines the differences of
key execution properties (i.e., control and data ﬂow) between failing
and successful executions in order to do root cause diagnosis and
statistically detect concurrency bugs (e.g., atomicity violations).
For root cause diagnosis, Gist follows a similar approach to cooper-
ative bug isolation [8, 98, 127], which uses statistical methods to cor-
relate failure predictors to failures in programs. A failure predictor
is a predicate that, when true, predicts that a failure will occur [126].
Carefully crafted failure predictors point to failure root causes [127].
Gist-generated failure sketches contain a set of failure predictors
that are both informative and good indicators of failures. A failure
predictor is informative if it contains enough information regarding
the failure (e.g., thread schedules, critical data values). A failure pre-
dictor is a good indicator of a failure if it has high positive correlation
primarily with the occurrence of the failure.
Gist deﬁnes failure predictors for both sequential and multithreaded
programs. For sequential programs, Gist uses branches taken and
data values computed as failure predictors. For multithreaded pro-
grams, Gist uses the same predicates it uses for sequential programs,
as well as special combinations of memory accesses that portend con-
currency failures. In particular, Gist considers the common single-
variable atomicity violation patterns shown in Fig. 12 (i.e., RWR (Read,
Write, Read), WWR, RWW, WRW) and data race patterns (WW, WR,
RW) as concurrency failure predictors.
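A search for these patterns in a totally ordered access log can be sketched in Python. The sketch makes simplifying assumptions that are not claims about Gist: it only considers adjacent accesses to the same variable, it ignores synchronization (so a reported "race" pattern is only a candidate), and the log format and function name are hypothetical.

```python
from collections import defaultdict

ATOMICITY_PATTERNS = {"RWR", "WWR", "RWW", "WRW"}
RACE_PATTERNS = {"WW", "WR", "RW"}

def find_patterns(log):
    """log: totally ordered list of (thread, op, variable), op in {'R', 'W'}.
    Returns (kind, pattern, variable) tuples for adjacent accesses per variable."""
    per_var = defaultdict(list)
    for thread, op, var in log:
        per_var[var].append((thread, op))
    matches = []
    for var, accesses in per_var.items():
        # Data race patterns: two adjacent accesses from different threads.
        for (t1, o1), (t2, o2) in zip(accesses, accesses[1:]):
            if t1 != t2 and o1 + o2 in RACE_PATTERNS:
                matches.append(("race", o1 + o2, var))
        # Atomicity violation patterns: local access, remote access, local access.
        for (t1, o1), (t2, o2), (t3, o3) in zip(accesses, accesses[1:], accesses[2:]):
            if t1 == t3 and t1 != t2 and o1 + o2 + o3 in ATOMICITY_PATTERNS:
                matches.append(("atomicity", o1 + o2 + o3, var))
    return matches

log = [("T1", "R", "x"), ("T2", "W", "x"), ("T1", "R", "x")]
matches = find_patterns(log)
print(matches)
```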
For both failing and successful runs, Gist logs the order of accesses
and the value updates to shared variables that are part of the slice it
tracks at runtime. Then, using an offline analysis, Gist searches for
the aforementioned failure-predicting memory access patterns in these
access logs. Gist associates each match with either a successful run or
a failing run. Gist is not a bug detection tool, but it can understand
common failures, such as crashes, assertion violations, and hangs.
Other types of failures can either be manually given to Gist, or Gist
can be used in conjunction with a bug finding tool.
Figure 13 – A sample execution failing at the second read in T1 (a), and
three potential concurrency errors: a RWR atomicity violation
(b), 2 WR data races (c-d). Panel (b) comprises T1's first read, T2's
write, and T1's read (2); panel (c) pairs T2's write with T1's read (1);
panel (d) pairs T2's write with T1's read (2).
Once Gist has gathered failure predictors from failing and success-
ful runs, it uses a statistical analysis to determine the correlation of
these predictors with the failures. Gist computes the precision P (how
many runs fail among those that are predicted to fail by the predic-
tor?), and the recall R (how many runs are predicted to fail by the
predictor among those that fail?). Gist then ranks all the events by
their F-measure, which is the weighted harmonic mean of their
precision and recall, $F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 \cdot P + R}$, to determine the best failure
predictor. Gist favors precision by setting β to 0.5 (a common strat-
egy in information retrieval [176]), because its primary aim is to not
confuse the developers with potentially erroneous failure predictors
(i.e., false positives).
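The ranking step can be made concrete with a small Python sketch. The predictor names and run counts are invented for illustration, and the data layout (a map from predictor to the number of failing and successful runs in which it held) is an assumption of this example.

```python
def f_measure(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def rank_predictors(stats, total_failing, beta=0.5):
    """stats: predictor -> (failing runs where it held, successful runs where it held).
    Returns (predictor, F-measure) pairs, best predictor first."""
    scored = []
    for name, (in_failing, in_successful) in stats.items():
        predicted = in_failing + in_successful
        precision = in_failing / predicted if predicted else 0.0
        recall = in_failing / total_failing if total_failing else 0.0
        scored.append((name, f_measure(precision, recall, beta)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical observations over 10 failing runs.
stats = {
    "RWR on x": (9, 1),          # precise and frequent in failing runs
    "branch b taken": (10, 30),  # always present in failures, but rarely precise
}
ranked = rank_predictors(stats, total_failing=10)
print(ranked)
```

With β = 0.5, the branch predictor's perfect recall does not compensate for its low precision, so the RWR pattern is ranked first.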
The failure sketch presents the developer with the highest-ranked
failure predictors for each type of failure predictor (i.e., branches,
data values, and statement orders). An example of this is shown
in Fig. 8, where the dotted rectangles show the highest-ranked fail-
ure predictor. Gist’s root cause detection process also enables it to
statistically detect concurrency bugs such as atomicity violations.
As an example, consider the execution trace shown in Fig. 13.a.
Thread T1 reads x, after which thread T2 gets scheduled and writes
to x. Then T1 gets scheduled back and reads x twice in a row, and
the program fails (e.g., the second read could be made as part of an
assertion that causes the failure). This execution trace has three mem-
ory access patterns that can potentially be involved in a concurrency
bug: a RWR atomicity violation in Fig. 13.b and two data races (or
order violations) in Fig. 13.c and 13.d. For this execution, Gist logs
these patterns and their outcomes (i.e., failure or success): 13.b and
13.d fail, whereas the pattern in 13.c succeeds. Gist keeps track of the
outcome of future access patterns and computes their F-measure to
identify the highest ranked failure predictors.
There are two key differences between Gist and cooperative bug iso-
lation (CBI). First, Gist tracks all data values that are part of the slice
that it monitors at runtime, allowing it to diagnose the root cause of
failures caused by a certain input, as opposed to CBI, which tracks
ranges of some variables. Second, Gist uses different failure predic-
tors than CCI [98] and PBI [8], which allow developers to distinguish
between different kinds of concurrency bugs, whereas PBI and CCI
use the same predictors for failures with different root causes (e.g.,
invalid MESI [159] state for all of RWR, WWR, RWW atomicity viola-
tions).
4.5 implementation details
Gist’s static slicing algorithm is built on the LLVM framework [122].
As part of this algorithm, Gist ﬁrst augments the intraprocedural con-
trol ﬂow graphs of each function with function call and return edges
to build the interprocedural control ﬂow graph (ICFG) of the pro-
gram. Then, Gist processes thread creation and join functions (e.g.,
pthread_create, pthread_join) to determine which start routines the
thread creation functions may call at runtime and where those rou-
tines will join back to their callers, using data structure analysis [121].
Gist augments the edges in the ICFGs of the programs using this
information about thread creation/join in order to build the thread
interprocedural control ﬂow graph (TICFG) of the program. Gist uses
the LLVM information ﬂow tracker [99] as a starting point for its slic-
ing algorithm.
Gist currently inserts a small amount of instrumentation into the
programs it runs, mainly to start/stop Intel PT tracking and place
a hardware watchpoint. To distribute the instrumentation, Gist uses
bsdiff [32] to create a binary patch ﬁle that it ships off to user end-
points or to a data center. We plan to investigate more advanced live
update systems such as POLUS [38] or Courgette [76]. Another
alternative is to use binary rewriting frameworks such as DynamoRIO [31]
or Paradyn [145].
Trace collection is implemented via a Linux kernel module which
we refer to as the Intel PT kernel driver [115, 116]. The kernel driver
conﬁgures and controls the hardware using the documented MSR
(Machine Speciﬁc Register) interface. The driver allows ﬁltering of
what code is traced using the privilege level (i.e., kernel vs.
user-space) and CR3 values, thus allowing tracing of individual processes.
The driver uses a memory buffer sized at 2 MB, which is sufﬁcient to
hold traces for all the applications we have tested. The driver relies on
the Intel PT trace decoding library [47] to decode control ﬂow traces.
Finally, Gist-instrumented programs use an ioctl interface that our
driver provides to turn tracing on/off.
Gist’s hardware watchpoint use is based on the ptrace system call.
Once Gist sets the desired hardware watchpoints, it detaches from
the program (using the PTRACE_DETACH request), thereby not incurring
any performance overhead. Gist’s instrumentation handles hardware
watchpoint triggers atomically in order to maintain a total order of
accesses among memory operations. Gist logs the program counter
when a hardware watchpoint is hit, which it later translates into
source line information at the developer site. Gist does not need debug
information to do this mapping: it uses the program counter and the
offset at which the program binary is loaded to compute which location
in the actual program this address corresponds to.
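The address computation described above is a subtraction, shown here as a minimal Python sketch. The function name and the example addresses are hypothetical.

```python
def binary_offset(program_counter, load_base):
    """Offset of a logged program counter within the program binary."""
    if program_counter < load_base:
        raise ValueError("program counter lies below the load address")
    return program_counter - load_base

# Hypothetical values: a binary loaded at 0x55D4C0000000 and a watchpoint
# hit logged at program counter 0x55D4C0001A2F.
offset = binary_offset(0x55D4C0001A2F, 0x55D4C0000000)
print(hex(offset))
```

The resulting offset identifies the instruction independently of where the binary happened to be loaded in a given run.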
5 PORTEND: CLASSIFYING DATA RACES DURING TESTING
Ideally, programs would have no data races at all. In this way,
programs would avoid possible catastrophic effects due to data races.
This either requires programs to be data race-free by design, or it
requires ﬁnding and ﬁxing all data races in a program. However,
modern software still has data races either because it was written
carelessly, or the complexity of the software made it very difﬁcult to
properly synchronize threads, or the beneﬁts of ﬁxing all data races
using expensive synchronization did not justify its costs.
From a programming languages standpoint, attempting to classify
data races is only meaningful for some languages. One such language
is the Java programming language. The Java memory model [138] de-
ﬁnes semantics for programs with data races, because Java must sup-
port the execution of untrusted sandboxed code, and such code could
contain data races. Therefore, attempting to classify data races in Java
is a meaningful endeavor from a programming languages point of
view. Similar arguments apply to assembly languages, where data
races are permitted.
On the other hand, recent C [93] and C++ [92] standards do not
provide meaningful semantics for programs involving data races. In
other words, data races in those languages constitute undefined
behavior. As a consequence, C and C++ compilers are allowed to perform
optimizations on code with data races that may transform seemingly
benign data races into harmful ones [2, 26, 27]. That said, compilers
do not always employ transformations that will break code with data
races [26].
From a practical standpoint, developers may choose to prioritize
the ﬁxing of data races (and they do so) regardless of the implications
of language standards. This happens primarily because modern mul-
tithreaded software tends to have a large number of data races, and it
may be impractical to try to ﬁx all the data races in a given program
at once. For example, Google’s Thread Sanitizer [187] reports over
1,000 unique data races in Firefox (written in C++) when the browser
starts up and loads http://bbc.co.uk.
Another reason why developers sometimes choose not to fix all
data races is that synchronizing all racing memory accesses would
introduce performance overheads that may be considered unacceptable.
For example, developers left a data race that can lead to lost
updates in memcached unfixed for a year (ultimately finding an
alternate solution), because fixing it leads to a 7% drop in
throughput [143]. Performance implications led to 23 data races in
Internet Explorer and
Windows Vista being purposely left unﬁxed [152]. Similarly, several
data races have been left unﬁxed in the Windows 7 kernel, because
ﬁxing those races did not justify the associated costs [100].
Another reason why data races go unﬁxed is that 76%–90% of data
races are actually considered to be harmless [58, 100, 152, 206]—
harmless races are assumed to not affect program correctness, either
fortuitously or by design, while harmful races lead to crashes, hangs,
resource leaks, even memory corruption or silent data loss. Decid-
ing whether a race is harmful or not involves a lot of human labor
(with industrial practitioners reporting that it can take days, even
weeks [75]), so time-pressed developers may not even attempt this
high-investment/low-return activity.
In order to construct programs that are free of data races by design,
novel languages and language extensions that provide a deterministic
programming model have been proposed [199, 24]. Deterministic pro-
grams are data race-free, and therefore, their behavior is not timing
dependent. Even though these models may be an appropriate solu-
tion for the long term, the majority of modern concurrent software is
written in mainstream languages such as C, C++, and Java, and these
languages don’t provide any data race-freedom guarantees.
In order to eliminate all data races in current mainstream software,
developers need to ﬁrst ﬁnd them. This can be achieved using data
race detectors such as RaceMob that we discussed in the previous sec-
tion. However, given the large number of data race reports in mod-
ern software, we argue that data race detectors should also triage
reported data races based on the consequences they could have in
future executions. This way, developers are better informed and can
ﬁx the critical bugs ﬁrst. A data race detector should be capable of in-
ferring the possible consequences of a reported data race: is it a false
positive, a harmful data race, or a data race that has no observable
harmful effects and was perhaps left in the code for performance reasons?
Alas, automated classiﬁers [95, 100, 152, 200] are often inaccurate
(e.g., [152] reports a 74% false positive rate in classifying harmful
races). To our knowledge, no data race detector/classiﬁer can do this
without false positives.
In this chapter, we describe Portend, a technique and tool that,
given a set of data races (e.g., detected using RaceMob), analyzes the
code, infers each data race's potential consequences, and automatically
classifies the races into four categories: “specification violated”,
“output differs”, “k-witness harmless”, and “single ordering”. In
Portend, harmlessness is circumstantial rather than absolute; it
implies that Portend did not witness a harmful effect of the data race
in question for k different executions. For the first two categories,
Portend produces a replayable trace that demonstrates the effect,
making it easier for the developer to fix the race.
Portend supports classifying data races under different memory
models (e.g., a weak memory model [54]) using a technique called
symbolic memory consistency modeling (SMCM).
Portend works in-house, because its analysis is computationally
intensive, and therefore not suited to in-production use.
Portend operates on binaries, not on source code (more speciﬁcally
on LLVM [122] bitcode obtained from a compiler or from a machine-
code-to-LLVM translator like RevGen [40]). Therefore, it can effec-
tively classify both source-code-level data races and assembly-level
data races that are not forbidden by any language-speciﬁc memory
model (e.g., C [93] and C++ [92]).
In the rest of this chapter, we first introduce our classification scheme
(§5.1) and give an overview of Portend’s design (§5.2). We then describe
how Portend performs single-path analysis (§5.3), multi-path analysis
(§5.4), symbolic output comparison (§5.5), and multi-schedule analysis
(§5.6). Finally, we describe symbolic memory consistency modeling (§5.7),
Portend’s classification verdicts (§5.8), its debugging aid output (§5.9),
and its implementation details (§5.10).
5.1 a fine-grained way to classify data races
Deciding whether a data race is harmless or harmful is undecidable
in general (as will be explained below), so prior work typically resorts
to “likely harmless” and/or “likely harmful” verdicts. Alas, in practice,
this is less helpful than it seems (§6.3). We therefore propose a new
scheme that is more precise.
Note that there is a distinction between false positives and harmless
data races: when a purported data race is not a true data race, we say
it is a false positive. When a true data race’s consequences are deemed
to be harmless for all the witnessed executions, we refer to it as a
harmless data race (see footnote 1).
A false positive is harmless in an absolute sense (since it is not a
data race to begin with), but the converse does not hold: harmless data
races are still true data races. Static [58] and lockset [182] data race
detectors typically report false positives.
If a data race does not have any observable effect (crash, hang, data
corruption) for all the witnessed executions, we say that the data race
is harmless with respect to those executions. Harmless data races
are still true data races. Note that our deﬁnition of harmlessness is
circumstantial; it is not absolute. It is entirely possible that a harm-
less data race for some executions can become harmful in another
execution.
Footnote 1: In the rest of the text, whenever we mention harmless data races, we
refer to this definition. If we refer to another definition adopted by prior work,
we make the distinction clear.
Our proposed scheme classiﬁes the true data races into four cate-
gories: “spec violated”, “output differs”, “k-witness harmless”, and
“single ordering”. We illustrate this taxonomy in Fig. 14.
Figure 14 – Portend taxonomy of data races. [Diagram labels: true positives
and false positives; harmful and harmless; specViol, outDiff, k-witness,
singleOrd.]
“Spec violated” corresponds to data races for which at least one
ordering of the racing accesses leads to a violation of the program’s
speciﬁcation. These are, by deﬁnition, harmful. For example, data
races that lead to crashes or deadlocks are generally accepted to vi-
olate the speciﬁcation of any program; we refer to these as “basic”
speciﬁcation violations. Higher level program semantics could also
be violated, such as the number of objects in a heap exceeding some
bound, or a checksum being inconsistent with the checksummed data.
Such semantic properties must be provided as explicit predicates to
Portend, or be embedded as assert statements in the code.
“Output differs” is the set of data races for which the two order-
ings of the racing accesses can cause the program to generate dif-
ferent outputs, thus making the output depend on scheduling that
is beyond the application’s control. Such data races are often con-
sidered harmful: one of those outputs is likely “the incorrect” one.
However, “output differs” data races can also be considered harmless,
whether intentional or not. For example, a debug statement that
prints the ordering of the racing memory accesses is intentionally
order-dependent, thus an intentional and harmless data race. An ex-
ample of an unintentional harmless data race is one in which one or-
dering of the accesses may result in a duplicated syslog entry—while
technically a violation of any reasonable logging speciﬁcation, a de-
veloper may decide that such a benign consequence makes the data
race not worth ﬁxing, especially if they face the risk of introducing
new bugs or degrading performance when ﬁxing the bug.
As with all high level program semantics, automated tools cannot
decide on their own whether an output difference violates some non-
explicit speciﬁcation or not. Moreover, whether the speciﬁcation has
been violated or not might even be subjective, depending on which
developer is asked. It is for this reason that we created the “output
differs” class of data races: we provide developers a clear character-
ization of the output difference and let them decide using the pro-
vided evidence whether that difference matters.
“K-witness harmless” are data races for which the harmless classi-
ﬁcation is performed with some quantitative level of conﬁdence: the
higher the k, the higher the conﬁdence. Such data races are guaran-
teed to be harmless for at least k combinations of paths and sched-
ules; this guarantee can be as strong as covering a virtually inﬁnite
input space (e.g., a developer may be interested in whether the data
race is harmless for all positive inputs, not caring about what hap-
pens for zero or negative inputs). Portend achieves this using a sym-
bolic execution engine [33, 35] to analyze entire equivalence classes
of inputs. Depending on the time and resources available, developers
can choose k according to their needs—in our experiments we found
k = 5 to be sufﬁcient to achieve 99% accuracy (manually veriﬁed) for
all the tested programs. The value of this category will become obvi-
ous in this chapter. We also evaluate the individual contributions of
exploring paths versus schedules in §6.3.
“Single ordering” are data races for which only a single ordering
of the accesses is possible, typically enforced via ad hoc synchro-
nization [220]. In such cases, although no explicit synchronization
primitives are used, the shared memory could be protected using
busy-wait loops that synchronize on a ﬂag. Considering these to be
non-data races is inconsistent with our deﬁnition (§2.1.1) because the
ordering of the accesses is not enforced using non-ad hoc synchro-
nization primitives, even though it may not actually be possible to
exercise both interleavings of the memory accesses (hence the name
of the category). Such ad hoc synchronization, even if bad practice,
is frequent in real-world software [220]. Previous data race detectors
generally cannot tell that only a single order is possible for the memory
accesses, and thus report this as an ordinary data race. Such data
races can turn out to be harmful [220], or they can be a major
source of harmless data races [95, 200]. That is why we have a dedicated
class for such data races.
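As an illustration of how the four categories relate to one another, the following Python sketch folds the outcomes of several explored executions into a single verdict. The Witness record, the combine function, and the precedence among categories are assumptions made for this example; they are not Portend’s implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Verdict(Enum):
    SPEC_VIOLATED = "spec violated"
    OUTPUT_DIFFERS = "output differs"
    K_WITNESS_HARMLESS = "k-witness harmless"
    SINGLE_ORDERING = "single ordering"


@dataclass
class Witness:
    """Hypothetical record of one explored path/schedule combination."""
    alternate_feasible: bool  # could the other ordering of the accesses be enforced?
    spec_violated: bool       # crash, deadlock, violated predicate or assert
    output_differs: bool      # primary and alternate outputs disagree


def combine(witnesses: List[Witness]) -> Tuple[Verdict, Optional[int]]:
    """Return (verdict, k); k is set only for the k-witness harmless verdict."""
    if not witnesses:
        raise ValueError("at least one explored execution is required")
    if any(w.spec_violated for w in witnesses):
        return Verdict.SPEC_VIOLATED, None
    if any(w.output_differs for w in witnesses):
        return Verdict.OUTPUT_DIFFERS, None
    if not any(w.alternate_feasible for w in witnesses):
        return Verdict.SINGLE_ORDERING, None
    # No harmful effect was witnessed: harmless with respect to k executions.
    k = sum(1 for w in witnesses if w.alternate_feasible)
    return Verdict.K_WITNESS_HARMLESS, k
```

Under these assumptions, five clean witnesses yield (K_WITNESS_HARMLESS, 5), while a single witness with a specification violation overrides all others, which mirrors the circumstantial notion of harmlessness used in this chapter.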
5.2 design overview
Portend feeds the target program through its own data race de-
tector or through RaceMob (or even a third party one, if preferred),
analyzes the program and the report automatically, and determines
the potential consequences of the reported data race. The report is
then classiﬁed, based on these predicted consequences, into one of
the four categories in Fig. 14. To achieve the classiﬁcation, Portend
performs targeted analysis of multiple schedules of interest, while
at the same time using symbolic execution [35, 113, 114] to simul-
taneously explore multiple paths through the program; we call this
technique multi-path multi-schedule data race analysis. Portend can thus
reason about the consequences of the two orderings of racing mem-
ory accesses in a richer execution context than prior work. When
comparing program states or program outputs, Portend employs sym-
bolic output comparison, meaning it compares constraints on program
output as well as path constraints when these outputs are made, in
addition to comparing the concrete values of the output, in order to
generalize the comparison to more possible inputs that would bring
the program to the speciﬁc data race and to determine if the data
race affects the constraints on program output. Unlike prior work,
Portend can accurately classify even data races that, given a ﬁxed
ordering of the original racing accesses, are harmless along some
execution paths, yet harmful along others. Later in this section, we go
over one such data race (Fig. 17) and explain how Portend handles it.
Fig. 15 illustrates Portend’s architecture. Portend is based on Cloud9
[33], a parallel symbolic execution engine that supports running multi-
threaded C/C++ programs. Cloud9 is in turn based on KLEE [35],
which is a single-threaded symbolic execution engine. Cloud9 has a
number of built-in checkers for memory errors, overflows, and
division-by-0 errors; on top of these, Portend adds a deadlock
detector. Portend has a built-in data race detector that implements
a dynamic happens-before algorithm [119]. This detector relies on a
component that tracks Lamport clocks [119] at runtime (details are
in §5.7). Portend’s analysis and classiﬁcation engine performs multi-
path multi-schedule data race analysis and symbolic output compar-
ison. This engine also works together with the Lamport clock tracker
and the symbolic memory consistency modeling (SMCM) plugin to
perform classification. The SMCM plugin defines the rules that determine
which of the values previously written to a shared memory location a
read from that location can return. The SMCM plugin is crucial for
classifying data races under different memory consistency models.
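To make the notion of dynamic happens-before detection more concrete, the following Python sketch flags two accesses as racing when they come from different threads, at least one is a write, and no lock release/acquire chain orders them. It uses vector clocks, a common generalization of Lamport clocks, as a stand-in; the HappensBeforeDetector class and its method names are hypothetical and do not reflect Portend’s detector.

```python
from collections import defaultdict


def happens_before(earlier_clock, earlier_thread, later_clock):
    """True if the earlier access is ordered before the later one."""
    return earlier_clock[earlier_thread] <= later_clock.get(earlier_thread, 0)


class HappensBeforeDetector:
    def __init__(self):
        self.clocks = defaultdict(dict)    # thread -> vector clock
        self.lock_clocks = {}              # lock -> clock at its last release
        self.accesses = defaultdict(list)  # address -> [(thread, is_write, clock)]
        self.races = []                    # (address, first thread, second thread)

    def _tick(self, thread):
        clock = self.clocks[thread]
        clock[thread] = clock.get(thread, 0) + 1

    def acquire(self, thread, lock):
        # Acquiring a lock imports everything that preceded its last release.
        clock = self.clocks[thread]
        for other, value in self.lock_clocks.get(lock, {}).items():
            clock[other] = max(clock.get(other, 0), value)
        self._tick(thread)

    def release(self, thread, lock):
        self._tick(thread)
        self.lock_clocks[lock] = dict(self.clocks[thread])

    def access(self, thread, address, is_write):
        self._tick(thread)
        now = dict(self.clocks[thread])
        for other, other_write, other_clock in self.accesses[address]:
            conflicting = other != thread and (is_write or other_write)
            if conflicting and not happens_before(other_clock, other, now):
                self.races.append((address, other, thread))
        self.accesses[address].append((thread, is_write, now))
```

For instance, if T2 writes id while holding lock l and T3 later reads id without taking l, the detector reports a race on id; if T3 acquires l before reading, the release/acquire pair orders the two accesses and nothing is reported.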
Figure 15 – High-level architecture of Portend. The six shaded boxes indicate
new code written for Portend, whereas clear boxes represent reused code from
KLEE [35] and Cloud9 [33]. [Diagram: Portend takes a program and an optional
race report as input and outputs one of specViol, outDiff, k-witness, or
singleOrd. Components shown: KLEE; multi-threaded symbolic execution engine
(Cloud9); POSIX threads model; div-by-0, overflow, memory error, and deadlock
detectors; record & replay engine; dynamic data race detector; Lamport clock
tracker; symbolic memory model plugin; analysis & classification engine.]
When Portend determines that a data race is of the “spec violated”
variety, it provides the corresponding evidence in the form of program
inputs (including system call return values) and a thread schedule that
reproduce the harmful consequences deterministically. Developers
can replay this “evidence” in a debugger to fix the data race.
We now give an overview of our approach and illustrate it with an
example. We then describe the first step, single-path/single-schedule
analysis (§5.3), followed by the second step, multi-path analysis (§5.4)
and symbolic output comparison (§5.5), augmented with multi-schedule
analysis (§5.6). We introduce SMCM and describe how it can be used
to model various memory models while performing data race classification
(§5.7). Finally, we describe Portend’s data race classification (§5.8) and
the generated report that helps developers debug the data race (§5.9).
Portend’s data race analysis starts by executing the target program
and dynamically detecting data races (e.g., developers could run their
existing test suites under Portend). Portend detects data races using a
dynamic happens-before algorithm [119] or using RaceMob. Alterna-
tively, if another detector is used, Portend can start from an existing
execution trace; this trace must contain the thread schedule and an
indication of where in the trace the suspected data race occurred. We
developed a plugin for Thread Sanitizer [187] to create a Portend-
compatible trace; we believe such plugins can be easily developed for
other dynamic data race detectors [82].
Portend has a record/replay infrastructure for orchestrating the
execution of a multi-threaded program; it can preempt and sched-
ule threads before/after synchronization operations and/or racing
accesses. Portend uses Cloud9 to enumerate program paths and to
collect symbolic constraints.
A trace consists of a schedule trace and a log of system call inputs.
The schedule trace contains the thread id and the program counter
at each preemption point. Portend treats all POSIX threads synchro-
nization primitives as possible preemption points and uses a single-
processor cooperative thread scheduler. Portend can also preempt
threads before and after any racing memory access. We use the
following notation for the trace: (T1 : pc0) → (T2 → RaceyAccessT2 : pc1)
→ (T3 → RaceyAccessT3 : pc2) means that thread T1 is preempted after
it performs a synchronization call at program counter pc0; then
thread T2 is scheduled and performs a memory access at program
counter pc1, after which thread T3 is scheduled and performs a memory
access at pc2 that is racing with the previous memory access of
T2. The schedule trace also contains the absolute count of instructions
executed by the program up to each preemption point. This is
needed in order to perform precise replay when an instruction exe-
cutes multiple times (e.g., a loop) before being involved in a data race;
this is not shown as part of the schedule trace, for brevity. The log
of system call inputs contains the non-deterministic program inputs
(e.g., gettimeofday).
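The trace contents described above can be pictured with the following Python sketch. The class and field names (PreemptionPoint, Trace, instr_count, racy_access) and all concrete values are hypothetical, chosen only for illustration; Portend’s actual trace format is not shown in this chapter.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PreemptionPoint:
    thread: str                # thread id scheduled at this point
    pc: int                    # program counter at the preemption point
    instr_count: int           # absolute instruction count, for precise replay
    racy_access: bool = False  # marks one of the two racing accesses


@dataclass
class Trace:
    schedule: List[PreemptionPoint]
    # Non-deterministic inputs as (system call name, returned value) pairs.
    syscall_log: List[Tuple[str, object]] = field(default_factory=list)

    def racing_pair(self) -> Optional[Tuple[PreemptionPoint, PreemptionPoint]]:
        racy = [p for p in self.schedule if p.racy_access]
        return (racy[0], racy[1]) if len(racy) >= 2 else None

    def render(self) -> str:
        """Print the schedule in the notation used in the text."""
        parts = []
        for p in self.schedule:
            if p.racy_access:
                parts.append(f"({p.thread} → RaceyAccess{p.thread} : pc{p.pc})")
            else:
                parts.append(f"({p.thread} : pc{p.pc})")
        return " → ".join(parts)


# Made-up instruction counts and system call value, for illustration only.
trace = Trace(
    schedule=[
        PreemptionPoint("T1", 0, instr_count=120),
        PreemptionPoint("T2", 1, instr_count=340, racy_access=True),
        PreemptionPoint("T3", 2, instr_count=415, racy_access=True),
    ],
    syscall_log=[("gettimeofday", 1449273600)],
)
```

Storing the instruction count next to each program counter is what lets a replayer distinguish, say, the third iteration of a loop from the first when the same pc is reached repeatedly.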
In a first analysis step (illustrated in Fig. 16.a), Portend replays the
schedule in the trace up to the point where the data race occurs. Then
it explores two different executions: one in which the original schedule
is followed (the primary) and one in which the alternate ordering
of the racing accesses is enforced (the alternate). As described in §2.6,
some classifiers compare the primary and alternate program state
immediately after the data race; if the states differ, they flag the data
race as potentially harmful, and if the states are the same, they flag it
as potentially harmless. Even if program outputs are compared rather than
states, “single-pre/single-post” analysis (Fig. 16.a) may not be accurate,
as we will show below. Portend uses “single-pre/single-post” analysis
mainly to determine whether the alternate schedule is possible at all.
In other words, this stage identifies any ad hoc synchronization that
might prevent the alternate schedule from occurring.
Figure 16 – Increasing levels of completeness in terms of paths and
schedules: [a. single-pre/single-post], [b. single-pre/multi-post],
[c. multi-pre/multi-post]. [Diagram: three execution trees (a), (b), and
(c), each marking the race; tree (a) shows the primary and the alternate
executions.]
If there is a difference between the primary and alternate post-data
race states, we do not consider the data race as necessarily harmful.
Instead, we allow the primary and alternate executions to run inde-
pendently of each other, and we observe the consequences. If, for
instance, the alternate execution crashes, the data race is harmful. Of
course, even if the primary and alternate executions behave identi-
cally, it is still not certain that the data race is harmless: there may be
some unexplored pair of primary and alternate paths with the same
pre-data race preﬁx as the analyzed pair, but which does not behave
the same. This is why single-pre/single-post analysis is insufﬁcient,
and we need to explore multiple post-data race paths. This motivates
“single-pre/multi-post” analysis (Fig. 16.b), in which multiple post-
data race execution possibilities are explored—if any primary/alter-
nate mismatch is found, the developer must be notiﬁed.
Even if all feasible post-data race paths are explored exhaustively
and no mismatch is found, one still cannot conclude that the data race
is harmless: it is possible that the absence of a mismatch is an artifact
of the speciﬁc pre-data race execution preﬁx, and that some different
preﬁx would lead to a mismatch. Therefore, to achieve higher con-
ﬁdence in the classiﬁcation, Portend explores multiple feasible paths
even in the pre-data race stage, not just the one path witnessed by
the data race detector. This is illustrated as “multi-pre/multi-post”
analysis in Fig. 16.c. The advantage of doing this, rather than
considering these as different data races, is the ability to
systematically explore these paths.
 1:  int id = 0, MAX_SIZE = 32;
 2:  bool useHashTable;
 3:  int stats_array[MAX_SIZE];
 4:
 5:  int main(int argc, char *argv[]){          // Thread T1
 6:    pthread_t t1, t2;
 7:    useHashTable = getOption(argc, argv);
 8:    pthread_create (&t1, 0, reqHandler, 0);
 9:    pthread_create (&t2, 0, updateStats, 0);
10:    ...
11:
12:  void * reqHandler(void *arg){              // Thread T2
13:    while(1){ ...
14:      lock(l);
15:      id++;
16:      unlock(l);
17:      ...
18:  void * updateStats(void* arg){             // Thread T3
19:    if(useHashTable){
20:      update1();
21:      printf(..., hash_table[id]);
22:    } else {
23:      update2();
24:    ...
25:  void update1(){
26:    int tmp = id;
27:    if (hash_table.contains(tmp))
28:      hash_table[tmp] = getStats();
29:  void update2(){
30:    if (id < MAX_SIZE)
31:      stats_array[id] = getStats();
Figure 17 – Simplified example of a harmful data race from Ctrace [141] that
would be classified as harmless by classic data race classifiers. Closing
braces and unrelated code are elided.
Finally, we combine multi-path analysis with multi-schedule analysis,
since the same path through a program may generate different out-
puts depending on how its execution segments from different threads
are interleaved. The branches of the execution tree in the post-race
execution in Fig. 16.c correspond to different paths that stem from
both multiple inputs and schedules, as we detail in §5.6.
Of course, exploring all possible paths and schedules that expe-
rience the data race is impractical, because their number typically
grows exponentially with the number of threads, branches, and pre-
emption points in the program. Instead, we provide developers a
“dial” to control the number k of path/schedule alternatives explored
during analysis, allowing them to control the “volume” of paths and
schedules in Fig. 16. If Portend classiﬁes a data race as “k-witness
harmless”, then a higher value of k offers higher conﬁdence that the
data race is harmless for all executions (i.e., including the unexplored
ones), but it entails longer analysis time. We found k = 5 to be suf-
ﬁcient for achieving 99% accuracy in our experiments in less than 5
minutes per data race on average.
To illustrate the beneﬁt of multi-path multi-schedule analysis over
“single-pre/single-post” analysis, consider the code snippet in Fig. 17,
adapted from a real data race bug. This code has racing accesses to
the global variable id. Thread T1 spawns threads T2 and T3; thread T2
updates id (line 15) in a loop and acquires a lock each time. However,
thread T3, which maintains statistics, reads id without acquiring the
lock—this is because acquiring a lock at this location would hurt per-
formance, and statistics need not be precise. Depending on program
input, T3 can update the statistics using either the update1 or update2
functions (lines 20-23).
Say the program runs in Portend with input --use-hash-table, which
makes useHashTable=true. Portend records the primary trace (T1 :
pc9) → ... (T2 → RaceyAccessT2 : pc15) → (T3 → RaceyAccessT3 : pc26) →
... T1. This trace is fed to the first analysis step, which replays the trace
with the same input, except it enforces the alternate schedule (T1 :
pc9) → ... (T3 → RaceyAccessT3 : pc26) → (T2 → RaceyAccessT2 : pc15) →
... T1. Since the printed value of hash_table[id] at line 21 would be the
same for the primary and alternate schedules, a “single-pre/single-post”
classifier would deem the data race harmless.
However, in the multi-path multi-schedule step, Portend explores
additional paths through the code by marking program input as symbolic,
i.e., allowing it to take on any permitted value. When the trace
is replayed and Portend reaches line 19 in T3 in the alternate schedule,
useHashTable could be either true or false, so Portend splits the
execution into two executions, one in which useHashTable is set to true and
one in which it is false. Consider the execution in which it is false, so
that T3 calls update2, and assume, for example, that id = 31 when checking
the if condition at line 30. Due to the data race, T2 increments id to 32
before T3 reaches line 31, so the write overflows the statically allocated
buffer (line 31). Note that in this alternate path, there are two racing
accesses on id, and we are referring to the access at line 31.
Portend detects the overflow (via Cloud9), which leads to a crashed
execution, flags the data race as “spec violated”, and provides the
developer the execution trace in which the input is --no-hash-table, and
the schedule is (T1 : pc9) → ... (T3 → RaceyAccessT3 : pc30) → (T2 →
RaceyAccessT2 : pc15) → (T3 : pc31). The developer can replay this trace
in a debugger and fix the race.
Note that this data race is harmful only if the program input is
--no-hash-table, the given thread schedule occurs, and the value of id is
31; therefore the crash is likely to be missed by a traditional
single-path/single-schedule data race detector.
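The scenario above can be reproduced with a small Python model of Fig. 17, given here as a rough sketch rather than as Portend’s analysis. Threads are modeled as generators driven by a cooperative scheduler, a Python IndexError stands in for the buffer overflow, and the simulate function and its schedule encoding are assumptions of this example.

```python
MAX_SIZE = 32


def t2(state):
    state["id"] += 1  # line 15; the lock is omitted because T3 never takes it
    yield


def t3(state, use_hash_table, stats_array, hash_table):
    if use_hash_table:                      # line 19
        tmp = state["id"]                   # line 26 (racing read)
        yield
        if tmp in hash_table:               # line 27
            hash_table[tmp] = "stats"       # line 28
    else:
        in_bounds = state["id"] < MAX_SIZE  # line 30 (racing read)
        yield
        if in_bounds:
            stats_array[state["id"]] = 1    # line 31; IndexError if id == MAX_SIZE


def simulate(use_hash_table, initial_id, schedule):
    """Run one step of the named thread per schedule entry."""
    state = {"id": initial_id}
    stats_array = [0] * MAX_SIZE
    hash_table = {}
    threads = {
        "T2": t2(state),
        "T3": t3(state, use_hash_table, stats_array, hash_table),
    }
    try:
        for name in schedule:
            next(threads[name], None)  # a finished thread is simply skipped
    except IndexError:
        return "overflow"
    return "ok"


PRIMARY = ["T2", "T3", "T3"]    # T2 increments id before T3 reads it
ALTERNATE = ["T3", "T2", "T3"]  # T3 reads id, T2 increments, T3 continues
```

In this model, both schedules return "ok" when use_hash_table is true, and the primary schedule returns "ok" even when it is false; only the combination of use_hash_table=False, initial_id=31, and the alternate schedule returns "overflow", which is the three-way conjunction of input, schedule, and value that the text describes.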
We now describe Portend’s data race analysis in detail: §5.3–§5.6
focus on the exploration part of the analysis, in which Portend
looks for paths and schedules that reveal the nature of the data
race, and §5.8 focuses on the classification part.
Input : Primary execution trace primary
Output : Classification result ∈ {specViol, outDiff, outSame, singleOrd}
 1  current ← execUntilFirstThreadRacyAccess(primary)
 2  preDataRaceCkpt ← checkpoint(current)
 3  execUntilSecondThreadRacyAccess(current)
 4  postDataRaceCkpt ← checkpoint(current)
 5  current ← preDataRaceCkpt
 6  preemptCurrentThread(current)
 7  alternate ← execWithTimeout(current)
 8  if alternate.timedOut then
 9      if detectInfiniteLoop(alternate) then
10          return specViol
11      else
12          return singleOrd
13  else
14      if detectDeadlock(alternate) then
15          return specViol
16  primary ← exec(postDataRaceCkpt)
17  if detectSpecViol(primary) ∨ detectSpecViol(alternate) then
18      return specViol
19  if primary.output ≠ alternate.output then
20      return outDiff
21  else
22      return outSame
Algorithm 2 : Single-Pre/Single-Post Analysis (singleClassify)
5.3 single-path analysis
The goal of this ﬁrst analysis step is to identify cases in which the
alternate schedule of a data race cannot be pursued, and to make a
ﬁrst classiﬁcation attempt based on a single alternate execution. Al-
gorithm 2 describes the approach.
Portend starts from a trace of an execution of the target program,
containing one or more data races, along with the program inputs
that generated the trace. For example, in the case of the pbzip2 ﬁle
compressor used in our evaluation, Portend needs a ﬁle to compress
and a trace of the thread schedule.
As mentioned earlier, such traces are obtained from running, for
instance, the developers’ test suites (as done in CHESS [147]) with a
dynamic data race detector enabled.
Portend takes the primary trace and plays it back (line 1). Note
that current represents the system state of the current execution. Just
before the ﬁrst racing access, Portend takes a checkpoint of system
state; we call this the pre-data race checkpoint (line 2). The replay is
then allowed to continue until immediately after the second racing
access of the data race we are interested in (line 3), and the primary
execution is suspended in this post-data race state (line 4).
Portend then primes a new execution with the pre-data race checkpoint
(line 5) and attempts to enforce the alternate ordering of the
racing accesses. To enforce this alternate order, Portend preempts
the thread that did the first racing access (Ti) in the primary
execution and allows the other thread (Tj) involved in the data race
to be scheduled (line 6). In other words, an execution with the
trace ... (Ti → RaceyAccessTi : pc1) → (Tj → RaceyAccessTj : pc2) ... is
steered toward the execution ... (Tj → RaceyAccessTj : pc2) → (Ti →
RaceyAccessTi : pc1) ...
This attempt could fail for one of three reasons: (a) Tj gets sched-
uled, but Ti cannot be scheduled again; or (b) Tj gets scheduled but
RaceyAccessTj cannot be reached because of a complex locking scheme
that requires a more sophisticated algorithm [103] than Algorithm 2
to perform careful scheduling of threads; or (c) Tj cannot be sched-
uled, because it is blocked by Ti. Case (a) is detected by Portend via a
timeout (line 8) and is classiﬁed either as “spec violated”, correspond-
ing to an inﬁnite loop (i.e., a loop with a loop-invariant exit condition)
in line 10 or as ad hoc synchronization in line 12. Portend does not im-
plement the more complex algorithm mentioned in (b), and this may
cause it to have false positives in data race classiﬁcation. However,
we have not seen this limitation impact the accuracy of data race clas-
siﬁcation for the programs in our evaluation. Case (c) can correspond
to a deadlock (line 15) and is detected by Portend by keeping track
of the lock graph. Both the inﬁnite loop and the deadlock case cause
the data race to be classiﬁed as “spec violated”, while the ad hoc syn-
chronization case classiﬁes the data race as “single ordering” (more
details in §5.8). While it may make sense to not stop if the alternate
execution cannot be enforced, under the expectation that other paths
with other inputs might permit the alternate ordering, our evaluation
suggests that continuing adds little value (§6).
If the alternate schedule succeeds, Portend executes it until it com-
pletes, and then records its outputs. Then, Portend allows the pri-
mary to continue (while replaying the input trace) and also records
its outputs. During this process, Portend watches for “basic” speci-
ﬁcation violations (crashes, deadlocks, memory errors, etc.) as well
as “high level” properties given to Portend as predicates—if any of
these properties are violated, Portend immediately classiﬁes (line 18)
the data race as “spec violated”. If the alternate execution completes
with no speciﬁcation violation, Portend compares the outputs of the
primary and the alternate; if they differ, the data race is classiﬁed as
“output differs” (line 20), otherwise the analysis moves to the next
step. This is in contrast to replay-based classiﬁcation [152], which
compares the program state immediately after the data race in the
primary and alternate interleavings.
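The decision logic of Algorithm 2 (lines 8 to 22) can be sketched in Python as follows. The checkpointing, preemption, and replay steps (lines 1 to 7 and 16) are abstracted away: the sketch assumes that the primary and alternate executions have already been run and summarized in a hypothetical Execution record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Execution:
    """Hypothetical summary of one finished (or abandoned) execution."""
    output: Optional[str] = None  # everything passed to output system calls
    timed_out: bool = False       # the alternate ordering could not complete
    infinite_loop: bool = False   # loop with a loop-invariant exit condition
    deadlocked: bool = False      # cycle found in the lock graph
    spec_violated: bool = False   # crash, memory error, violated predicate


def single_classify(primary: Execution, alternate: Execution) -> str:
    if alternate.timed_out:
        # Either a real hang or ad hoc synchronization forcing one ordering.
        return "specViol" if alternate.infinite_loop else "singleOrd"
    if alternate.deadlocked:
        return "specViol"
    if primary.spec_violated or alternate.spec_violated:
        return "specViol"
    if primary.output != alternate.output:
        return "outDiff"
    return "outSame"  # handed to the multi-path step for further analysis
```

Note that "outSame" is an intermediate result rather than a final verdict: a race only earns a harmless classification after the later analysis steps have explored additional paths and schedules.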
5.4 multi-path analysis
The goal of this step is to explore variations of the single paths
found in the previous step (i.e., the primary and the alternate) in
order to expose Portend to a wider range of execution alternatives.
First, Portend ﬁnds multiple primary paths that satisfy the input
trace, i.e., they (a) all experience the same thread schedule (up to the
data race) as the input trace, and (b) all experience the target data
race condition. These paths correspond to different inputs from the
ones in the initial race report. Second, Portend uses Cloud9 to record
the “symbolic” outputs of these paths—that is, the constraints on the
output, rather than the concrete output values themselves—as well as
path constraints when these outputs are made, and compares them to
the outputs and path constraints of the corresponding alternate paths;
we explain this below. Algorithm 3 describes the functions invoked
by Portend during this analysis in the following order: 1) on initial-
ization, 2) when encountering a thread preemption, 3) on a branch
that depends on symbolic data, and 4) on ﬁnishing an execution.
Unlike in the single-pre/single-post step, Portend now executes
the primary symbolically. This means that the target program is given
symbolic inputs instead of regular concrete inputs. Cloud9 relies in
large part on KLEE [35] to interpret the program and propagate these
symbolic values to other variables, corresponding to how they are
read and operated upon. When an expression with symbolic content
is involved in the condition of a branch, both options of the branch
are explored, if they are feasible. The resulting path(s) are annotated
with a constraint indicating that the branch condition holds true (re-
spectively false). Thus, instead of a regular single-path execution, we
get a tree of execution paths, similar to the one in Fig. 18. Conceptu-
ally, at each such branch, program state is duplicated and constraints
on the symbolic parameters are updated to reﬂect the decision taken
at that branch (line 11 of Algorithm 3). Describing the various techniques
for performing symbolic execution efficiently [35, 33] is beyond the scope
of this dissertation.
An important concern in symbolic execution is “path explosion,”
i.e., that the number of possible paths is large. Portend provides two
parameters to control this growth: (a) an upper bound Mp on the
number of primary paths explored; and (b) the number and size of
symbolic inputs. These two parameters allow developers to trade per-
formance vs. classiﬁcation conﬁdence. For parameter (b), the fewer
inputs are symbolic, the fewer branches will depend on symbolic in-
put, so less branching will occur in the execution tree.
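The forking behavior and the bound on explored paths can be illustrated with a toy explorer in Python. This is a minimal sketch under strong assumptions: every branch is treated as feasible (no constraint solving), constraints are kept as plain strings, and the explore function with its max_paths parameter (playing the role of Mp) is hypothetical rather than part of Cloud9 or KLEE.

```python
from typing import Callable, List, Tuple

Path = Tuple[List[str], str]  # (path constraints, label of the path's outcome)


def explore(program: Callable, max_paths: int) -> List[Path]:
    """Enumerate up to max_paths paths, forking at every symbolic branch."""
    worklist: List[List[bool]] = [[]]  # forced branch decisions per pending state
    finished: List[Path] = []
    while worklist and len(finished) < max_paths:
        prefix = worklist.pop()
        decisions: List[bool] = []
        constraints: List[str] = []

        def branch(condition: str) -> bool:
            i = len(decisions)
            if i < len(prefix):
                taken = prefix[i]  # replay a decision made before the fork
            else:
                taken = True       # follow the true side now
                worklist.append(decisions + [False])  # fork the false side
            decisions.append(taken)
            constraints.append(condition if taken else f"not ({condition})")
            return taken

        finished.append((constraints, program(branch)))
    return finished


def update_stats(branch) -> str:
    """Branch structure of updateStats and update2 from Fig. 17."""
    if branch("useHashTable"):
        return "update1"
    if branch("id < MAX_SIZE"):
        return "update2 writes stats_array[id]"
    return "update2 skips the write"
```

With max_paths=10 the explorer finds all three paths of update_stats; with max_paths=2 it stops after two, trading classification confidence for analysis time in the way the upper bound Mp does. Making fewer inputs symbolic corresponds to fewer calls to branch.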
Determining the optimal values for these parameters may require
knowledge of the target system as well as a good sense of how much
confidence is required by the system’s users. Reasonable (i.e., good
but not necessarily optimal) values can be found through trial and
error relatively easily; we expect development teams using Portend
to converge onto values that are a good fit for their code and user
community, and then make these values the defaults for their testing
and triage processes. We empirically study in §6.3 the impact of these
parameters on classification accuracy on a diverse set of programs
and find that relatively small values achieve high accuracy for a broad
range of programs.
Figure 18 – Portend prunes paths during symbolic execution. [Diagram: an
execution tree that marks the data race, the branch instructions that
depend on symbolic data, the pruned execution paths, and the complete
execution paths ending in states S1 and S2.]
During symbolic execution, Portend prunes (Fig. 18) the paths that
do not obey the thread schedule in the trace (line 8), thus excluding
the (many) paths that do not enable the target data race. Moreover,
Portend attempts to follow the original trace only until the second
racing access is encountered; afterward, it allows execution to di-
verge from the original schedule trace. This enables Portend to ﬁnd
more executions that partially match the original schedule trace (e.g.,
cases in which the second racing access occurs at a different program
counter, as in Fig. 17). Tolerating these divergences signiﬁcantly in-
creases Portend’s accuracy over the state of the art [152], as will be
explained in §6.3.5.
Once the desired paths are obtained (at most Mp, line 14), the con-
junction of branch constraints accumulated along each path is solved
by KLEE using an SMT solver [71] in order to ﬁnd concrete inputs
that drive the program down the corresponding path. For example,
in the case of Fig. 18, two successful leaf states S1 and S2 are reached,
and the solver provides the inputs corresponding to the path from
the root of the tree to S1, respectively S2. Thus, we now have Mp = 2
different primary executions that experience the data race.
5.5 symbolic output comparison
Portend now records the output of each of the Mp executions, as
in the single-pre/single-post case, and it also records the path
constraints when these outputs are made. However, this time, in
addition to simply recording concrete outputs, Portend propagates the
constraints on symbolic state all the way to the outputs, i.e., the
outputs of each primary execution contain a mix of concrete values and
symbolic constraints (i.e., symbolic formulae). Note that by output,
we mean all arguments passed to output system calls.
Input : Schedule trace trace, initial program state S0, set of states
        S = ∅, upper bound Mp on the number of primary paths
Output : Classification result ∈ {specViol, outDiff, singleOrd, k-witness}
 1 function init ()
 2   S ← S ∪ S0
 3   current ← S.head()
 4   pathsExplored ← 0
 5 function onPreemption ()
 6   ti ← scheduleNextThread(current)
 7   if ti ≠ nextThreadInTrace(trace, current) then
 8     S ← S.remove(current)
 9     current ← S.head()
10 function onSymbolicBranch ()
11   S ← S ∪ current.fork()
12 function onFinish ()
13   classification ← classification ∪ classify(current)
14   if pathsExplored < Mp then
15     pathsExplored ← pathsExplored + 1
16   else
17     return classification
18 function classify (primary)
19   result ← singleClassify(primary)
20   if result = outSame then
21     alternate ← getAlternate(primary)
22     if symbolicMatch(primary.symState, alternate.symState) then
23       return k-witness
24     else
25       return outDiff
26   else
27     return result
Algorithm 3 : Multi-path Data Race Analysis (Simplified)
Next, for each of the Mp executions, Portend produces a corre-
sponding alternate (analogously to the single-pre/single-post case).
The alternate executions are fully concrete, but Portend records con-
straints on the alternate’s outputs (lines 19-21) as well as the path
constraints when these outputs are made. The function singleClassify
in Algorithm 3 performs the analysis described in Algorithm 2. Por-
tend then checks whether the constraints on outputs of each alter-
nate and the path constraints when these outputs are made match
the constraints of the corresponding primary’s outputs and the path
constraints when primary’s outputs are made. This is what we refer
to as symbolic output comparison (line 22). The purpose behind com-
paring symbolic outputs is that Portend tries to ﬁgure out if the data
race caused the constraints on the output to be modiﬁed or not, and
the purpose behind comparing path constraints is to be able to
generalize the output comparison to more possible executions with
different inputs.
1:  int globalx = 0;
2:  int i = 0;
3:  void* work0 (void* arg) {             // Thread T1
4:    globalx = 1;
5:    if(i >= 0)
6:      printf("10\n");
7:    return 0;
8:  }
9:  void* work1 (void* arg) {             // Thread T2
10:   globalx = 2;
11:   return 0;
12: }
13: int main (int argc, char *argv[]){    // Main Thread
14:   pthread_t t0, t1;
15:   int rc;
16:   i = getInput(argc, argv);
17:   rc = pthread_create(&t0, 0, work0, 0);
18:   rc = pthread_create(&t1, 0, work1, 0);
19:   pthread_join(t0, 0);
20:   pthread_join(t1, 0);
21:   return 0;
22: }
Figure 19 – A program to illustrate the benefits of symbolic output
comparison
This symbolic comparison enables Portend's analysis to extend over
more possible primary executions. To see why this is the case,
consider the example in Figure 19. In this example, the Main thread
reads input into the shared variable i and then spawns two threads T1
and T2, which perform racing writes to globalx. T1 prints 10 if the
input is non-negative. Let us assume that during the symbolic execution
of the primary, the write to globalx in T1 is performed before the
write to globalx in T2. Portend records that the output at line 6 is
10 if the path constraint is i ≥ 0. Let us further assume that Portend
runs the program while enforcing the alternate schedule with input
1. The output of the program will still be 10 (since i ≥ 0), and the
path constraint when the output is made will still be i ≥ 0. The
output and the path constraint when the output is made are therefore
the same regardless of the order in which the accesses to globalx
are performed (i.e., the primary or the alternate order). Therefore,
Portend can assert that the program output for i ≥ 0 will be 10
regardless of the way the data race goes, even though it only explored
a single alternate ordering with input 1.
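The comparison in this example can be illustrated with a small C sketch. It is only a toy model under stated assumptions: a path constraint is reduced to a closed interval on the single input i, whereas Portend compares arbitrary constraints with an SMT solver. The names constraint, output_rec, symbolic_match, and covers are hypothetical and do not come from Portend.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Toy stand-in for a path constraint: lo <= i <= hi on the single input i.
 * A real implementation would hand arbitrary formulas to an SMT solver. */
typedef struct {
    int lo;
    int hi;
} constraint;

/* One recorded output: the value that was written and the path constraint
 * under which the output statement was reached. */
typedef struct {
    int value;
    constraint path;
} output_rec;

bool constraint_equal(constraint a, constraint b) {
    return a.lo == b.lo && a.hi == b.hi;
}

/* The outputs match when both the value and the path constraint agree. */
bool symbolic_match(output_rec primary, output_rec alternate) {
    return primary.value == alternate.value &&
           constraint_equal(primary.path, alternate.path);
}

/* True when a concrete input lies inside the set the constraint describes. */
bool covers(constraint c, int input) {
    assert(c.lo <= c.hi);  /* the interval must be non-empty */
    return c.lo <= input && input <= c.hi;
}
```

For Figure 19, both the primary and the alternate would record the value 10 under the interval [0, INT_MAX], so symbolic_match holds, and every input accepted by covers is accounted for by the single alternate run with input 1.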
This comes at the price of potential false negatives because path
constraints can be modiﬁed due to a different thread schedule; de-
spite this theoretical shortcoming, we have not encountered such a
case in practice, but we plan to investigate this further in future work.
False negatives can also arise because determining semantic equiva-
lence of output is undecidable, and our comparison may still wrongly
classify as “output differs” a sequence of outputs that are equivalent
at some level (e.g., <print ab; print c> vs. <print abc>).
When executing the primaries and recording their outputs and the
path constraints, Portend relies on Cloud9 to track all symbolic con-
straints on variables. To determine if the path constraints and con-
straints on outputs match for the primary and the alternates, Portend
directly employs an SMT solver [71].
As will be seen in §6.3, using symbolic comparison in conjunction
with multi-path multi-schedule analysis leads to substantial improve-
ments in classiﬁcation accuracy.
We do not detail here the case when the program reads input after
the data race—it is a natural extension of the algorithm above.
5.6 multi-schedule analysis
The goal of multi-schedule analysis is to further augment the set of
analyzed executions by diversifying the thread schedule.
We mentioned earlier that, for each of the Mp primary executions,
Portend obtains an alternate execution. Once the alternate ordering
of the racing accesses is enforced, Portend randomizes the schedule
of the post-race alternate execution: at every preemption point in the
alternate, Portend randomly decides which of the runnable threads
to schedule next. This means that every alternate execution will most
likely have a different schedule from the original input trace (and
thus from the primary).
Consequently, for every primary execution $P_i$, we obtain multiple
alternate executions $A_i^1, A_i^2, \ldots$ by running up to $M_a$
instances of the alternate execution. Since the scheduler is random,
we expect practically every alternate execution to have a schedule
that differs from all others. Recently proposed techniques [146] can
be used to quantify the probability of these alternate schedules
discovering the harmful effects of a data race.
Portend then uses the same symbolic comparison technique as in
§5.5 to establish equivalence between the constraints on outputs and
path constraints of $A_i^1, A_i^2, \ldots, A_i^{M_a}$ and the symbolic
outputs and path constraints of $P_i$.
Schedule randomization can be employed also in the pre-data race
stage of the alternate-execution generation as well as in the genera-
tion of the primary executions. We did not implement these options,
because the level of multiplicity we obtain with the current design
appears to be sufﬁcient in practice to achieve high accuracy. Note
however that, as we show in §6.3, multi-path multi-schedule analysis
is indeed crucial to attaining high classiﬁcation accuracy.
In summary, multi-path multi-schedule analysis explores Mp pri-
mary executions and, for each such execution, Ma alternate execu-
tions with different schedules, for a total of Mp ×Ma path-schedule
combinations. For data races that end up being classiﬁed as “k-
witness harmless”, we say that k = Mp ×Ma is the lower bound
on the number of concrete path-schedule combinations under which
this data race is harmless.
Note that the k executions can be simultaneously explored in par-
allel: if a developer has p machines with q cores each, she could
explore p × q parallel executions in the same amount of time as a
single execution. Given that Portend is “embarrassingly parallel,” it
is appealing for cluster-based automated bug triage systems.
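The arithmetic behind this summary can be written down as a short C sketch. It assumes that every execution takes roughly the same time and that each core runs one execution at a time; the function names witness_count and rounds_needed are hypothetical.

```c
#include <assert.h>

/* k = Mp * Ma path-schedule combinations. */
unsigned long witness_count(unsigned long mp, unsigned long ma) {
    return mp * ma;
}

/* Number of sequential rounds needed to run k executions on p machines
 * with q cores each, one execution per core (ceiling division). */
unsigned long rounds_needed(unsigned long k, unsigned long p, unsigned long q) {
    unsigned long slots = p * q;
    assert(slots > 0);  /* at least one core must be available */
    return (k + slots - 1) / slots;
}
```

For instance, with the illustrative values Mp = 5 and Ma = 2, k is 10; on 2 machines with 4 cores each this takes 2 rounds, and any k of at most p × q fits in a single round.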
5.7 symbolic memory consistency modeling
Modern processor architectures rarely assume sequential consis-
tency as this would hurt program performance. Instead, they adopt
relaxed memory consistency models like weak ordering [54] and rely
on programmers to explicitly specify orderings among program state-
ments using synchronization primitives.
Previous work has shown that subtle bugs may arise in code with
data races because programmers make assumptions based on sequential
consistency, despite the fact that no modern processor provides
sequential consistency [62]. Such assumptions may be violated un-
der relaxed consistency models, and bugs that are deemed unlikely
may appear when the program is running on various CPUs causing
programs to crash, hang or violate some given speciﬁcation of a pro-
gram.
Therefore, a program analysis tool should ideally have the capabil-
ity to reason about different memory models and their effects on the
performed analysis. The effect of the memory model on the
consequences of a data race can be serious: code written with the assumption
of a particular memory model may end up computing wrong results;
or worse, it can crash or cause data loss [62].
Why Does the Memory Model Matter?
In order to better show why reasoning about relaxed memory con-
sistency models matters while performing program analysis and test-
ing, let us consider the example in Fig 20. There are two shared vari-
ables globalx and globaly that both have an initial value of 0. There
is a thread Main that spawns two threads T1 and T2. T1 writes 2 to a
global variable globalx and 1 to another global variable globaly. T2
writes 2 to globalx. Then, the Main thread reads the value of the
global variables. If the read values of globalx and globaly are 0 and
1 respectively, the program crashes on line 18.
Programmers naturally expect the program statements to be exe-
cuted in the order as they appear in the program text. A programmer
making that assumption expects that the value of globaly being 1 im-
plies the value of globalx being 2. This assumption is equivalent to
assuming sequential consistency as the underlying memory model: if
sequential consistency is assumed as the underlying memory model
for the execution of this program, the value of globalx cannot be 0
when the value of globaly is 1. This is simply because the order of
the program text would require globalx to be 2.
1:  int volatile globalx = 0;
2:  int volatile globaly = 0;
3:  void* work0(void* arg) {
4:    globalx = 2;
5:    globaly = 1;
6:    return 0;
7:  }
8:  void* work1(void* arg) {
9:    globalx = 2;
10:   return 0;
11: }
12: int main (int argc, char* argv[]){
13:   pthread_t t0, t1;
14:   int rc;
15:   rc = pthread_create(&t0, 0, work0, 0);
16:   rc = pthread_create(&t1, 0, work1, 0);
17:   if(globalx == 0 && globaly == 1)
18:     abort(); //crash!
19:   pthread_join(t0, 0);
20:   pthread_join(t1, 0);
21:   return 0;
22: }
Figure 20 – Simple multithreaded program (work0 runs in Thread T1,
work1 in Thread T2, and main in the Main Thread)
Under a different memory model such as weak ordering [54], nothing
prevents the write to globalx on line 4 and the write to globaly
on line 5 from swapping places. This stems from the fact that, under weak
consistency, if instructions are not conﬂicting, and they are not or-
dered by synchronization operations, then any reordering is allowed.
In such a scenario, it is possible for T1 to write 1 to globaly while the
value of globalx is still 0. Furthermore, there is a data race between
the write to globalx in T1 and the read from it in Main. This means
that T1 can be preempted right after setting globaly to 1 and globaly
and globalx can be equal to 1 and 0 respectively. This can cause the
program to crash on line 18.
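The difference between the two memory models can be made concrete with a small enumeration in C. This is a sketch under a simplifying assumption that must be stated loudly: it checks whether the memory state (globalx, globaly) = (0, 1) is ever reached, i.e., it treats the observation by the Main thread as a single snapshot rather than as two separate reads. The function name crash_state_reachable and the encoding of the three stores are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stores of Fig. 20: index 0 is T1's globalx = 2, index 1 is T1's
 * globaly = 1, and index 2 is T2's globalx = 2. */
static const int store_var[3] = {0, 1, 0};   /* 0 = globalx, 1 = globaly */
static const int store_val[3] = {2, 1, 2};

/* Returns true if some ordering of the three stores passes through the
 * state globalx == 0 && globaly == 1. When keep_program_order is set,
 * T1's store to globalx must precede its store to globaly. */
bool crash_state_reachable(bool keep_program_order) {
    static const int orders[6][3] = {
        {0, 1, 2}, {0, 2, 1}, {1, 0, 2}, {1, 2, 0}, {2, 0, 1}, {2, 1, 0}
    };
    for (int o = 0; o < 6; o++) {
        int pos[3];
        for (int s = 0; s < 3; s++) pos[orders[o][s]] = s;
        if (keep_program_order && pos[0] > pos[1]) continue;

        int mem[2] = {0, 0};  /* both variables start at 0 */
        for (int s = 0; s < 3; s++) {
            int idx = orders[o][s];
            assert(idx >= 0 && idx < 3);
            mem[store_var[idx]] = store_val[idx];
            if (mem[0] == 0 && mem[1] == 1) return true;
        }
    }
    return false;
}
```

With keep_program_order set, the crashing state is never reached; once T1's two stores may swap places, the ordering that performs globaly = 1 first reaches it immediately.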
Limitations of Cloud9 as a Multithreaded Testing Platform
Portend is built on Cloud9, which is a multithreaded parallel sym-
bolic execution engine [33]. Cloud9 is essentially an LLVM interpreter
that can concretely interpret programs compiled to LLVM, and dur-
ing this interpretation, it can also keep track of symbolic values and
constraints.
Cloud9 makes the following sequential consistency assumptions: 1)
uniprocessor scheduler: Cloud9's scheduler picks threads in a
round-robin fashion and runs them by interpreting their text until an
opportunity for scheduling arises (such as a sleep or a synchronization
operation); 2) immediate updates to shared memory: shared memory
is modeled as a flat structure with no cache model; therefore, any
update to a shared memory location is immediately visible to all other
threads; 3) no instruction reordering: Cloud9's interpretation engine
works by fetching instructions from the LLVM binary and executing
them sequentially without any instruction reordering.
Since shared memory updates are not reordered, and they are di-
rectly visible to all threads, and threads are scheduled one after the
other, one at a time, any analysis that builds on Cloud9 is bound
to perform the analysis within the conﬁnes of sequential consistency.
However, as it was previously demonstrated, such an analysis may
be unable to expose insidious bugs.
Symbolic Memory Consistency Modeling in Portend
Previously, we showed how Portend explores multiple paths and
schedules in order to observe the consequences of a data race. The
goal of symbolic memory consistency modeling (SMCM) is to further
augment multi-path multi-schedule analysis to factor in the effects
of the underlying memory model. SMCM
is in essence similar to multi-path analysis: multi-path analysis ex-
plores multiple execution paths that the execution could take due to
different program input values; SMCM explores multiple paths that
the execution could take due to different values that could be read
from the shared memory.
SMCM has two main components: the Lamport clock tracker and
the SMCM plugin.
The ﬁrst component is the Lamport clock tracker. Lamport clocks
are logical counters that maintain a partial order among synchroniza-
tion operations in order to determine the relative occurrence sequence
of events in a concurrent program [119]. This order is partial because
an order is only present among related synchronization operations
(e.g., a lock and an unlock on the same lock).
Lamport clocks are maintained per synchronization operation. A
thread’s Lamport clock is equal to the greatest of the clocks of all
the events that occur in the thread. Lamport clocks are incremented
under the following conditions:
- Each thread increments the clock of an event before the occurrence
  of that event in that thread.
- When threads communicate, they also communicate their clocks
  (upon fork/join or wait/signal).
- The thread that receives a clock sets its own clock to be greater
  than the maximum of its own clock and that of the received message.
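The three rules above can be sketched in C as follows. This is a minimal illustration, not Portend's tracker: the type lamport_clock and the functions lamport_tick, lamport_send, and lamport_receive are hypothetical names, and a "message" stands for any clock value handed over on fork/join, wait/signal, or through a lock.

```c
#include <assert.h>

/* Hypothetical logical clock of one thread or one synchronization object. */
typedef struct {
    unsigned long time;
} lamport_clock;

/* Rule 1: advance the clock before a local event; returns its timestamp. */
unsigned long lamport_tick(lamport_clock *c) {
    c->time += 1;
    return c->time;
}

/* Rule 2: communicating is itself an event, and the resulting timestamp
 * travels with the message. */
unsigned long lamport_send(lamport_clock *c) {
    return lamport_tick(c);
}

/* Rule 3: the receiver moves past both its own time and the received one. */
unsigned long lamport_receive(lamport_clock *c, unsigned long received) {
    unsigned long base = c->time > received ? c->time : received;
    c->time = base + 1;
    assert(c->time > received);
    return c->time;
}
```

Starting from two clocks at 0, a send by the first thread yields timestamp 1, and the matching receive moves the second thread's clock to 2.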
The graph that captures the relative order of events in a concur-
rent program using Lamport clocks is called a happens-before graph.
Figure 21 shows an example happens-before graph and the Lam-
port clocks associated with a given execution. Note that locks and
unlocks on lock l induce a happens-before edge denoted by the ar-
row between Thread 1 and Thread 2. On the other hand, the lock/un-
lock block on lock k does not have any partial order with any of the
events in Thread 2; therefore, although the current timeline shows it
as occurring before the lock/unlock block in Thread 2, in some other
execution it can occur after that lock/unlock block.
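Figure 21 – Lamport clocks and a happens-before graph. (Timeline with
two columns, Thread 1 and Thread 2, showing lock/unlock blocks on l
in both threads and a lock/unlock block on k in Thread 1. The clock
annotations read, in order: T1: 0, l: 0, k: 0; then T1: 1, l: 1, k: 0;
then T1: 1, l: 1, k: 1; and finally T2: 2, l: 2, k: 1.)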
The Lamport clock tracker needs to monitor the synchronization
operations performed by each thread in order to construct the happens-
before graph. All synchronization events are intercepted, and the
happens-before graph is constructed behind the scenes according to
the rules that were previously stated, while the program is being ex-
ecuted.
The Lamport clock tracker is a critical component of Portend since
it actually forms a well deﬁned ordering among events during pro-
gram execution that stem from synchronization operations. This is
important because different memory models and reordering constraints
are deﬁned using synchronization operations in programs.
The second component, namely the SMCM plugin, deﬁnes the
memory model protocol according to which a read returns previously-
written values.
The memory model protocol encodes the rules of the particular
memory model that Portend uses. In our case, we deﬁne two such
protocols: one default protocol for sequential consistency and another
one for weak consistency. We previously described the semantics of
sequential consistency. Under Portend’s weak memory consistency, a
read R may see a previous write A, provided that there is no other
write B such that B happened before R and A happened before B, with
the exception that within the same thread, a sequence of reads from
the same variable with no intervening writes to that variable will read
the same value as the ﬁrst read. We call this exception in Portend’s
weak memory model write buffering. Write buffering is responsible
for keeping a write history for each shared memory location. Write
buffering enables Portend to compute a subset of the values written
to a memory location, when that location is read. That subset is
computed considering the happens-before graph that is generated
during program execution by the Lamport clock tracker.
Similar forms of weak memory consistency have been implemented
in architectures such as SPARC [210] and Alpha [189].
To see how write buffering and the memory model protocol work,
consider the example given in Figure 22. Again, the vertical order of
events implies the order of events in time. In this execution, Thread
2 writes 0 to globalx, then execution switches to Thread 3, which
writes 2 to globalx and 1 to globaly before the execution switches
to Thread 1. Then, Thread 1 writes 3 to globalx after a lock/unlock
region on l and ﬁnally execution switches back to Thread 2 which
reads both globalx and globaly while holding the same lock l.
So what values does Thread 2 read? Note that since both globalx
and globaly are shared variables, the CPU can buffer all the values
that were written to globalx (0, 2, 3) and globaly (1). For globaly,
the only value that can be read is 1. Now, when the value of globalx is
read, Portend knows that, under its weak memory consistency model,
the values that can be read are 0, 2 and 3. This is because there is
no ordering constraint (a happens-before edge) that prevents any of
those three written values from being readable at the point of the read.
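The visibility rule stated earlier (a read R may see a write A unless some other write B satisfies A happened before B and B happened before R) can be sketched in C as a filter over a write history. The sketch makes several assumptions: the happens-before relation is supplied as a transitively closed boolean matrix, the history holds only writes to one location that were already performed, and all names (hb_graph, write_rec, readable_values) are hypothetical rather than Portend's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_EVENTS 16

/* hb[a][b] is true when event a happens before event b. The caller is
 * assumed to supply a transitively closed relation. */
typedef struct {
    bool hb[MAX_EVENTS][MAX_EVENTS];
} hb_graph;

typedef struct {
    int event;  /* index of the write in the happens-before graph */
    int value;  /* value stored by the write */
} write_rec;

/* Collects into out the values a read may return: write A is visible unless
 * another write B to the same location satisfies A -> B -> read.
 * Returns the number of values stored in out. */
size_t readable_values(const hb_graph *g, const write_rec *writes, size_t n,
                       int read_event, int *out) {
    assert(read_event >= 0 && read_event < MAX_EVENTS);
    size_t count = 0;
    for (size_t a = 0; a < n; a++) {
        bool hidden = false;
        for (size_t b = 0; b < n && !hidden; b++) {
            if (b == a) continue;
            hidden = g->hb[writes[a].event][writes[b].event] &&
                     g->hb[writes[b].event][read_event];
        }
        if (!hidden) out[count++] = writes[a].value;
    }
    return count;
}
```

Encoding Figure 22 with only the program-order edge from Thread 2's write of 0 to its later read leaves all three values (0, 2, 3) readable, which matches the discussion above; adding edges that order one write between another write and the read hides the older write.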
Then, Portend will use these multiple possible reads to augment
multi-path analysis: Portend will split the execution into as many
states as there are possible “read” values, by checkpointing the
execution state prior to the read and binding each one of those
possible “read”s to one such checkpointed state’s thread. By binding
a read value, we mean copying the value in question into the
checkpointed state’s memory. Therefore, in this case there will be
three such forked states: one with values (0, 1), one with (2, 1), and
one with (3, 1), corresponding to (globalx, globaly). Portend will
continue exploring the forked states, forking further if the threads
in the states read global variables that can potentially return
multiple values.
If an already-bound global value is read by the same thread in a
state without being altered after the last time it had been read, Por-
tend makes sure to return the already-bound value. This is a neces-
sary mechanism to avoid false positives (a thread reading two differ-
ent values in a row with no intervening writes) due to write buffering.
This is achieved by maintaining a last reader thread ID ﬁeld per write
buffer.
Write buffering and the optimization we use to bind a read value
to a thread are performed for a given thread schedule that Portend
explores at a time. For example, in Figure 22, if Thread 2 were to read
globalx twice, it would have been possible for the ﬁrst read to return
2 and the second read to return 3 (or vice versa) if there had been
an intervening write between the two reads. Portend relies on multi-
schedule data race analysis to handle this case, rather than relying on
SMCM to reason about potential prior thread schedules that would
lead to such a behavior.
Figure 22 – Write Buffering. (Timeline, top to bottom: Thread 2
executes globalx = 0; Thread 3 executes globalx = 2 and globaly = 1;
Thread 1 executes lock(l), unlock(l), and globalx = 3; Thread 2
executes lock(l), ... = globalx, ... = globaly, and unlock(l).)
This example demonstrates the power of SMCM in reasoning about
weak ordering. Note that, if sequential consistency were assumed for
the given sequence of events, there would not be a scenario
where the value of globaly is 1 whereas the value of globalx is 0.
This is because the given sequence of events would imply that
writing 2 to globalx in Thread 3 occurs before writing 1 to globaly in
the same thread. However, this is not the case under weak
consistency. Since there is no synchronization enforcing the ordering of
the writes to globalx and globaly in Thread 3, these accesses can be
reordered. Therefore it is perfectly possible for Thread 2 to see the
value of globaly as 1 and globalx as 0.
5.8 classification verdicts
We showed how Portend explores paths and schedules to give the
classiﬁer an opportunity to observe the effects of a data race. We now
provide details on how the classiﬁer makes its decisions.
“Spec violated” data races cause a program’s explicit speciﬁcation
to be violated; they are guaranteed to be harmful and thus should
have highest priority for developers. To detect violations, Portend
watches for them during exploration.
First, Portend watches for basic properties that can be safely as-
sumed to violate any program’s speciﬁcation: crashes, deadlocks, in-
ﬁnite loops, and memory errors. Since Portend already controls the
program’s schedule, it also keeps track of all uses of synchroniza-
tion primitives (i.e., POSIX threads calls); based on this, it determines
when threads are deadlocked. Inﬁnite loops are diagnosed as in [220],
by detecting loops for which the exit condition cannot be modiﬁed.
For memory errors, Portend relies on the mechanism already pro-
vided by KLEE inside Cloud9. Even when Portend runs the program
concretely, it still interprets it in Cloud9.
Second, Portend watches for “semantic” properties, which are pro-
vided to it by developers in the form of assert-like predicates. Devel-
opers can also place these assertions inside the code.
Whenever an alternate execution violates a basic or a semantic
property (even though the primary may not), Portend classiﬁes the
corresponding data race as “spec violated”.
“Output differs” data races cause a program’s output to depend
on the ordering of the racing accesses. As explained previously, a dif-
ference between the post-data race memory or register states of the
primary and the alternate is not necessarily indicative of a harmful
race (e.g., the difference may just be due to dynamic memory alloca-
tion). Instead, Portend compares the outputs of the primary and the
alternate, and it does so symbolically, as described earlier. In case of
a mismatch, Portend classiﬁes the race as “output differs” and gives
the developer detailed information to decide whether the difference
is harmful or not.
“K-witness harmless” data races: If, for every primary execution
$P_i$, the constraints on the outputs of alternate executions
$A_i^1, A_i^2, \ldots, A_i^{M_a}$ and the path constraints when these
outputs are made match $P_i$’s output and path constraints, then
Portend classifies the data race as “k-witness harmless”, where
$k = M_p \times M_a$, because there exist k
executions witnessing the conjectured harmlessness. The value of
k is often an underestimate of the number of different executions for
which the data race is guaranteed to be harmless; as suggested earlier
in §5.1, symbolic execution can even reason about a virtually inﬁnite
number of executions.
Theoretical insights into how k relates to the confidence a developer
can have that a “k-witness harmless” race will not cause harm in
practice are beyond the scope of this dissertation. One can think of k in
ways similar to code coverage in testing: 80% coverage is better than
60%, but does not exactly predict the likelihood of bugs not being
present. For all our experiments, k = 5 was shown to be sufﬁcient for
achieving 99% accuracy. We consider “k-witness harmless” analyses
to be an intriguing topic for future work, in a line of research akin
to [146]. Note that Portend explores many more executions before
ﬁnding the required k path-schedule combinations that match the
trace, but the paths that do not match the trace are pruned early
during the analysis.
“Single ordering” data races may be harmless data races if the ad
hoc synchronization is properly implemented. In that case, one might
even argue they are not data races at all. Yet, dynamic data race
detectors are not aware of the implicit happens-before relationship
and do report a data race, and our deﬁnition of a data race (§2.1.1)
considers these reports as data races.
When Portend cannot enforce an alternate interleaving in the single-
pre/single-post phase, this can either be due to ad hoc synchroniza-
tion that prevents the alternate ordering, or the other thread in ques-
tion cannot make progress due to a deadlock or an inﬁnite loop. If
none of the previously described inﬁnite-loop and deadlock detection
mechanisms trigger, Portend simply waits for a conﬁgurable amount
of time and, upon timeout, classiﬁes the data race as “single order-
ing.” Note that it is possible to improve this design with a heuristic-
based static analysis that can in some cases identify ad hoc synchro-
nization [220, 200].
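The four verdicts described in this section can be summarized as a small decision procedure in C. This is a simplified sketch rather than Portend's classifier: it assumes that each alternate execution has already been reduced to three flags, and the names verdict, alt_result, and classify_race are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    SPEC_VIOLATED,
    OUTPUT_DIFFERS,
    K_WITNESS_HARMLESS,
    SINGLE_ORDERING
} verdict;

/* Summary of one alternate execution (a hypothetical reduction). */
typedef struct {
    bool enforced;        /* the alternate ordering could be enforced */
    bool spec_violation;  /* crash, deadlock, infinite loop, memory error,
                             or a failed developer-provided predicate */
    bool output_matches;  /* outputs and path constraints match the primary */
} alt_result;

/* Scans the alternates in order. An unenforceable ordering or a violation
 * decides the verdict immediately; an output mismatch is remembered and
 * reported only if no later alternate decides otherwise. On a harmless
 * verdict, *k is set to the number of witnessing executions. */
verdict classify_race(const alt_result *alts, size_t n, size_t *k) {
    assert(n > 0);
    bool differs = false;
    *k = 0;
    for (size_t i = 0; i < n; i++) {
        if (!alts[i].enforced) return SINGLE_ORDERING;
        if (alts[i].spec_violation) return SPEC_VIOLATED;
        if (!alts[i].output_matches) differs = true;
    }
    if (differs) return OUTPUT_DIFFERS;
    *k = n;
    return K_WITNESS_HARMLESS;
}
```

The sketch omits the timeout-based detection of ad hoc synchronization and the deadlock and infinite-loop checks that precede a “single ordering” verdict; they are folded into the enforced and spec_violation flags.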
5.9 portend’s debugging aid output
To help developers decide what to do about an “output differs”
data race, Portend dumps the output values and the program loca-
tions where the output differs. Portend also aims to help in ﬁxing
harmful data races by providing for each data race two items: a tex-
tual report and a pair of execution traces that evidence the effects of
the data race and can be played back in a debugger, using Portend’s
runtime replay environment. A simpliﬁed report is shown in Fig. 23.
Data Race during access to: 0x2860b30
current thread id: 3: READ
racing thread id: 0: WRITE
Current thread at:
/home/eval/pbzip/pbzip2.cpp:702
Previous at:
/home/eval/pbzip/pbzip2.cpp:389
size of the accessed field: 4 offset: 0
Figure 23 – Example debugging aid report for Portend.
In the case of an “output differs” data race, Portend reports the
stack traces of system calls where the program produced different
output, as well as the differing outputs. This simpliﬁes the debugging
effort (e.g., if the difference occurs while printing a debug message,
the data race could be classiﬁed as benign with no further analysis).
5.10 implementation details
Portend works on programs compiled to LLVM [122] bitcode and
can run C/C++ programs for which there exists a sufﬁciently com-
plete symbolic POSIX environment [33]. We have tested Portend on
C programs as well as C++ programs that do not link to libstdc++;
we leave linking programs against an implementation of a standard
C++ library for LLVM [46] as future work. Portend uses Cloud9 [33]
to interpret and symbolically execute LLVM bitcode; we suspect any
path exploration tool will do (e.g., CUTE [186], SAGE [74], EXE [36],
ESD [229], S2E [41, 42, 43]), as long as it supports multi-threaded
programs.
Portend intercepts various system calls, such as write, under the as-
sumption that they are the primary means by which a program com-
municates changes in its state to the environment. A separate Portend
module is responsible for keeping track of symbolic outputs in the
form of constraints, as well as of concrete outputs. Portend hashes
program outputs (when they are concrete) and can either maintain
hashes of all concrete outputs or compute a hash chain of all outputs
to derive a single hash code per execution. This way, Portend can
deal with programs that have a large amount of output.
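The hash chain mentioned above can be sketched in C as follows. The text does not say which hash function Portend uses; the sketch assumes 64-bit FNV-1a purely for illustration, and the names fnv1a and chain_extend are hypothetical. The resulting value depends on the byte order of the machine, which is acceptable when all executions being compared run on the same platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FNV_OFFSET 14695981039346656037ULL  /* FNV-1a 64-bit offset basis */
#define FNV_PRIME  1099511628211ULL         /* FNV-1a 64-bit prime */

/* Folds len bytes into a running 64-bit FNV-1a hash. */
uint64_t fnv1a(uint64_t h, const void *data, size_t len) {
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= FNV_PRIME;
    }
    return h;
}

/* Extends the chain with one output: the output is hashed on its own and
 * the digest is folded into the previous link, so that the boundaries
 * between outputs influence the final hash code. */
uint64_t chain_extend(uint64_t chain, const void *output, size_t len) {
    assert(output != NULL || len == 0);
    uint64_t digest = fnv1a(FNV_OFFSET, output, len);
    return fnv1a(chain, &digest, sizeof digest);
}
```

Starting from FNV_OFFSET and calling chain_extend once per output yields a single hash code per execution; two executions that emit the same sequence of outputs obtain the same code, so only one 64-bit value per execution needs to be stored and compared.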
Portend keeps track of Lamport clocks per execution state it ex-
plores on the ﬂy. Note that it is essential to maintain the happens-
before graph per execution state because threads may get scheduled
differently depending on the flow of execution in each state and
therefore synchronization operations may end up being performed in
a different order.
The state space exploration in Portend is exponential in the num-
ber of values that “read”s can return in a program. Therefore the im-
plementation needs to handle bookkeeping as efﬁciently as possible.
There are several optimizations that are in place to enable a more scal-
able exploration. The most important ones are: 1) copy-on-write for
keeping the happens-before graph and 2) write buffer compression.
Portend can use other techniques for taming state space explosion
such as State Merging [117] in the future.
Portend employs copy-on-write for tracking the happens-before
graphs in various states. Initially, there is a single happens-before
graph that gets constructed during program execution before any
state is forked due to a read with multiple possible return values.
Then, when a state is forked, the happens-before graph is not du-
plicated. The forking state rather maintains a reference to the old
graph. Then, when a new synchronization operation is recorded in
either one of the forked states, this event is recorded as an
incremental update to the previously saved happens-before graph. In this way,
maximum sharing of the happens-before graph is achieved among
forked states.
The copy-on-write scheme for states can be further improved if
one checks whether two different states perform the same updates
to the happens-before graph. If that is the case, these updates can
be merged and saved as part of the common happens-before graph.
This feature is not implemented in the current prototype, but it is a
potential future optimization.
The second optimization is write buffer compression. This is per-
formed whenever the same value is written to the same shared vari-
able’s buffer and the constraints imposed by the happens-before rela-
tionship allow these same values to be returned upon a read. Then,
in such a case, these two writes are compressed into one, as return-
ing two of them would be redundant from the point of view of state
exploration. For example, if a thread writes 1 to a shared variable
globalx twice before this value is read by another thread, the write
buffer will be compressed to behave as if the initial thread has written
1 once.
Portend clusters the data races it detects in order to ﬁlter out similar
races; the clustering criterion is whether the racing accesses are made
to the same shared memory location by the same threads, and the
stack traces of the accesses are the same. Portend provides developers
with a single representative data race from each cluster.
The timeout used in discovering ad hoc synchronization is conser-
vatively deﬁned as 5 times what it took Portend to replay the primary
execution, assuming that reversing the access sequence of the racing
accesses should not cause the program to run for longer than that.
In order to run multi-threaded programs in Portend, we extended
the POSIX threads support found in Cloud9 to cover almost the entire
POSIX threads API, including barriers, mutexes and condition vari-
ables, as well as thread-local storage. Portend intercepts calls into the
POSIX threads library to maintain the necessary internal data struc-
tures (e.g., to detect data races and deadlocks) and to control thread
scheduling.

6
EVALUATION
In this chapter, we evaluate the prototypes we built for the
techniques presented in the three previous chapters. For each prototype,
we first describe the experimental setup, followed by a
description of prototype-specific experiments. We first present
the evaluation results of RaceMob (§6.1). Next we present the evalu-
ation results of Gist (§6.2), followed by the evaluation results of Por-
tend (§6.3).
6.1 racemob’s evaluation
In this section, we address the following questions about RaceMob:
Can it effectively detect true races in real code (§6.1.2)? Is it efﬁ-
cient (§6.1.3)? How does it compare to state-of-the-art data race detec-
tors (§6.1.4) and interleaving-based concurrency testing tools (§6.1.5)?
Finally, how does RaceMob scale with the number of threads (§6.1.6)?
6.1.1 Experimental Setup
We evaluated RaceMob using a mix of server, desktop and scientiﬁc
software: Apache httpd is a Web server that serves around 35% of the
Web [6]—we used the mpm-worker module of Apache to operate it in
multi-threaded server mode and detected data races in this speciﬁc
module. SQLite [192] is an embedded database used in Firefox, iOS,
Chrome, and Android, and has 100% branch coverage with devel-
oper’s tests. Memcached [61] is a distributed memory-object caching
system, used by Internet services like Twitter, Flickr, and YouTube.
Knot [17] is a web server. Pbzip2 [72] is a parallel implementation of
the popular bzip2 ﬁle compressor. Pfscan [60] is a parallel ﬁle scan-
ning tool that provides the combined functionality of find, xargs,
and fgrep in a parallel way. Aget is a parallel variant of wget. Fmm,
Ocean, and Barnes are applications from the SPLASH2 suite [188,
218]. Fmm and Barnes simulate interactions of bodies (n-body simu-
lation), and Ocean simulates eddy currents in oceans.
Our evaluation results are obtained primarily using a test environ-
ment simulating a crowdsourced setting, and we also have a small
scale, real deployment of RaceMob on our laptops. For the experi-
ments, we use a mix of workloads derived from actual program runs,
test suites, and test cases devised by us and other researchers [224].
We configured the hive to assign a single dynamic validation task per
user at a time. Altogether, we have execution information from 1,754
simulated user sites. Our test bed consists of a 2.3 GHz 48-core AMD
Opteron 6176 machine with 512 GB of RAM running Ubuntu Linux
11.04 and a 2 GHz 8-core Intel Xeon E5405 machine with 20 GB of
RAM running Ubuntu Linux 11.10. The hive is deployed on the 8-core
machine, and the simulated users on both machines. The real
deployment uses ThinkPad laptops with Intel 2620M processors and
8 GB of RAM, running Ubuntu Linux 12.04.

Program                       Apache   SQLite   Memcached  Fmm    Barnes  Ocean  Pbzip2  Knot   Aget   Pfscan
Size (LOC)                    138,456  113,326  19,397     9,126  7,580   6,551  3,521   3,586  2,053  2,033
Race candidates               118      88       7          176    166     115    65      65     24     17
True Race   Causes hang       0        3        0          0      0       0      0       0      0      0
            Causes crash      0        0        0          0      0       0      3       0      0      0
            Both orders       0        0        1          5      10      0      2       0      0      0
            Single order      8        0        0          53     6       3      4       2      4      2
Likely FP   Not aliasing      10       31       0          33     65      13     0       18     2      0
            Context           61       10       2          61     28      42     21      28     10     4
            Synchronization   1        37       3          10     49      47     34      13     7      11
Unknown                       38       7        1          14     8       10     1       4      1      0

Table 1 – Data race detection with RaceMob. The static phase reports
data race candidates (row 2). The dynamic phase reports verdicts
(rows 3-10). Causes hang and Causes crash are data races that
caused the program to hang or crash. Single order are true data
races for which either the primary or the alternate executed (but
not both) with no intervening synchronization; Both orders are
data races for which both executed without intervening
synchronization.
We used C programs in our evaluation because RELAY operates on
CIL, which does not support C++ code. Pbzip2 is a C++ program,
but we converted it to C by replacing references to STL vector with
an array-based implementation. We also replaced calls to new/delete
with malloc/free.
6.1.2 Effectiveness
To investigate whether RaceMob provides an effective way to detect
data races, we look at whether RaceMob can detect true data races,
and whether its false positive and false negative rates are sufﬁciently
low.
RaceMob’s data race detection results are shown in Table 1. RaceMob
detected a total of 106 data races in ten programs. Three data
races in Pbzip2 caused the program to crash, three data races in
SQLite caused the program to hang, and one data race in Aget caused
a data corruption (that we conﬁrmed manually). The other data races
did not lead to any observable failure. We manually conﬁrmed that
the “True Race” verdicts are correct, and that RaceMob has no false
positives in our experiments.
The “Likely FP” row represents the data races that RaceMob identi-
ﬁed as likely false positives: (1) Not aliasing are reports with accesses
that do not alias to the same memory location at runtime; (2) Con-
text are reports whose accesses are only made by a single thread at
runtime; (3) Synchronization are reports for which the accesses are
synchronized, an artifact that the static detector missed. The first two
sources of likely false positives (53% of all static reports) are identiﬁed
using DCI, whereas the last source (24% of all static reports) is iden-
tiﬁed using on demand race detection. In total, 77% of all statically
detected data races are likely false positives.
As we discussed in §3.5, RaceMob’s false negative rate is deter-
mined by its static data race detector. We rely on prior work’s results
to partially conﬁrm the absence of false negatives in RaceMob. In
particular, Chimera [123], a deterministic record/replay system, re-
lies on RELAY; for deterministic record/replay to work, all data races
must be detected; in Chimera’s evaluation (which included Apache,
Pbzip2, Knot, Ocean, Pfscan, Aget), RELAY did not have any false neg-
atives [123]. We therefore cautiously conclude that RaceMob’s static
phase had no false negatives in our evaluation. However, this does
not exclude the possibility that for other programs there do exist false
negatives.
For all the programs, we initially set the timeout for schedule
steering to τ = 1 ms. As timeouts fired during validation, the hive
increased the timeout by 50 ms at a time, up to a maximum of 200 ms.
Developers may choose to modify this basic scheme depending on
the characteristics of their programs. For instance, the timeout could
be increased multiplicatively instead of linearly.
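
The two escalation policies mentioned above can be sketched as follows. The function names and the multiplicative factor of 2 are assumptions made for illustration, as is the choice to clamp the last step to the 200 ms maximum; the text only fixes the initial value (1 ms), the linear step (50 ms), and the maximum (200 ms).

```python
from typing import List


def linear_schedule(start_ms: int = 1, step_ms: int = 50,
                    max_ms: int = 200) -> List[int]:
    """Timeouts tried in order when each one fires: add step_ms each time."""
    timeouts = [start_ms]
    while timeouts[-1] < max_ms:
        # Clamping to max_ms is an assumption of this sketch.
        timeouts.append(min(timeouts[-1] + step_ms, max_ms))
    return timeouts


def multiplicative_schedule(start_ms: int = 1, factor: int = 2,
                            max_ms: int = 200) -> List[int]:
    """Alternative policy: multiply the timeout by factor each time."""
    timeouts = [start_ms]
    while timeouts[-1] < max_ms:
        timeouts.append(min(timeouts[-1] * factor, max_ms))
    return timeouts


linear = linear_schedule()
multiplicative = multiplicative_schedule()
```

With these parameters the linear policy reaches the maximum in four increases, while the multiplicative one needs eight but spends more of them on short timeouts, which may suit programs whose racing accesses are usually close together in time.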
In principle, false negatives may also arise from τ being too low or
from there being insufﬁcient executions to prove a true data race. We
increased τ in our experiments by 4×, to check if this would alter our
results, and the ﬁnal verdicts were the same. After manually examin-
ing data races that were not encountered during dynamic validation,
we found that they were either in functions that are never called but
are nonetheless linked to the programs, or they are not encountered
at runtime due to the workloads used in our evaluation.
6.1.3 Efﬁciency
The more efﬁcient a detector is, the less runtime overhead it intro-
duces, i.e., the less it slows down a user’s application (as a percentage
of uninstrumented execution). The static detection phase is ofﬂine,
and it took less than 3 minutes for all programs, except Apache and
SQLite, for which it took less than 1 hour. Therefore, in this section,
we focus on the dynamic phase.
Apache  SQLite  Memcached  Fmm   Barnes  Ocean  Pbzip2  Knot  Aget  Pfscan
1.74    1.60    0.10       4.54  2.98    2.05   2.90    1.27  3.00  3.03

Table 2 – Runtime overhead of data race detection as a percentage of
uninstrumented execution. Average overhead is 2.32%, and maximum
overhead is 4.54%.
Table 2 shows that runtime overhead of RaceMob is typically less
than 3%. The static analysis used to remove instrumentation from
empty loop bodies reduced our worst case overhead from 25% to
4.54%. The highest runtime overhead is 4.54%, in the case of Fmm, a
memory-intensive application that performs repetitive computations,
which gives the instrumentation more opportunity to introduce over-
head. Our results suggest that there is no correlation between the
number of data race candidates (row 2 in Table 1) and the runtime
overhead (Table 2)—overhead is mostly determined by the frequency
of execution of the instrumentation code.
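
As a quick check of the summary figures quoted for Table 2, the following sketch recomputes them from the per-program overheads. The values are copied from Table 2; the variable names are ours.

```python
# Runtime overhead (% of uninstrumented execution), copied from Table 2.
overhead = {
    "Apache": 1.74, "SQLite": 1.60, "Memcached": 0.10, "Fmm": 4.54,
    "Barnes": 2.98, "Ocean": 2.05, "Pbzip2": 2.90, "Knot": 1.27,
    "Aget": 3.00, "Pfscan": 3.03,
}

average = sum(overhead.values()) / len(overhead)
worst_program = max(overhead, key=overhead.get)
below_three = [name for name, value in overhead.items() if value < 3.0]

print(f"average: {average:.2f}%")
print(f"maximum: {overhead[worst_program]:.2f}% ({worst_program})")
print(f"programs strictly below 3%: {len(below_three)} of {len(overhead)}")
```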
Program  Apache  SQLite  Memcached  Fmm  Barnes  Ocean  Pbzip2  Knot  Aget  Pfscan
RaceMob  8       3       1          58   16      3      9       2     4     2
TSAN     8       3       0          58   16      3      9       2     2     1
RELAY    118     88      7          176  166     115    65      157   256   17

Table 3 – Data race detection results with RaceMob, ThreadSanitizer
(TSAN), and RELAY. Each cell shows the number of reported data
races. The data races reported by RaceMob and TSAN are all true
data races. The only true data races among the ones detected by
RELAY are the ones in the row “RaceMob”. To the best of our
knowledge, two of the data races that cause a hang in SQLite were
not previously reported.
The overhead introduced by RaceMob is due to the instrumenta-
tion plus the overhead introduced by validation (DCI, on-demand
detection, and schedule steering). Fig. 24 shows the breakdown of
overhead for our ten target programs. We ﬁnd that the runtime over-
head without detection is below 1% for all cases, except the memory-
intensive Fmm application, for which it is 2.51%. We conclude that, in
the common case when a program is instrumented by RaceMob but
no detection is performed, the runtime overhead is negligible; this
property is what makes RaceMob suitable for always-on operation.
[Figure 24: stacked bar chart. x-axis: Apache, SQLite, Memcached, Fmm,
Barnes, Ocean, Pbzip2, Knot, Aget, Pfscan; y-axis: Overhead (%), from 0
to 6; series: instrumentation overhead, detection-induced overhead.]
Figure 24 – Breakdown of average overhead into instrumentation-induced
overhead and detection-induced overhead.
The dominant component of the overhead of data race detection
(the black portion of the bars in Fig. 24) is due to dynamic data race
validation. The effect of DCI is negligible: it is below 0.1% for all
cases; thus, we don’t show it in Fig. 24. Therefore, it is feasible to
leave DCI on for all executions. This can help RaceMob to promote a
data race from “Likely FP” to “True Race” with low overhead.
If RaceMob assigns more than one validation task at a time per
user, the aggregate overhead that a user experiences will increase. In
such a scenario, the user site would pick a validation candidate at
runtime depending on which potentially racing access is executed.
This scheme introduces a lookup overhead to determine at runtime
which racing access is executed; however, it would not affect the
per-race overhead, because of RaceMob’s on-demand data race detection
algorithm.
6.1.4 Comparison to Other Detectors
In this section, we compare RaceMob to state-of-the-art dynamic,
static, and sampling-based race detectors.
We compare RaceMob to the RELAY static data race detector [206]
and to ThreadSanitizer [187] (TSAN), an open-source dynamic data
race detector developed by Google. We also compare RaceMob to
Pacer [30], a sampling-based data race detector. Our comparison is
in terms of detection results and runtime overhead. We do not com-
pare to LiteRace, which is another sampling-based data race detector,
because LiteRace has higher overhead and lower data race detection
coverage than PACER [139]. The detection results are shown in Ta-
ble 3.
6.1.4.1 Comparative Accuracy
We ﬁrst compared RaceMob to TSAN by detecting data races for
all the test cases that were available to us, except for the program
executions from the real deployment of RaceMob, because we do not
record real user executions. RaceMob detected 4 extra data races rel-
ative to TSAN: For Memcached and Pfscan, RaceMob detected, with
the help of schedule steering, 2 data races missed by TSAN. Race-
Mob also detected 2 input-dependent data races in Aget that were
missed by TSAN (of which one causes Aget to corrupt data), because
RaceMob had access to executions from the real deployment, which
were not accessible to TSAN. These data races required the user to
manually abort and restart Aget. For 3 data races in pbzip2, Race-
Mob triggered a particular interleaving that caused the program to
crash as a result of schedule steering, which did not happen in the
case of TSAN. Furthermore, we have not observed any crash during
detection with TSAN; this shows that, without schedule steering, the
consequences of a detected data race may remain unknown.
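
The count of extra data races can be recomputed from the RaceMob and TSAN rows of Table 3, as in the sketch below. The numbers are copied from Table 3; the variable names are ours.

```python
programs = ["Apache", "SQLite", "Memcached", "Fmm", "Barnes",
            "Ocean", "Pbzip2", "Knot", "Aget", "Pfscan"]

# True data races reported per program, copied from Table 3.
racemob = dict(zip(programs, [8, 3, 1, 58, 16, 3, 9, 2, 4, 2]))
tsan = dict(zip(programs, [8, 3, 0, 58, 16, 3, 9, 2, 2, 1]))

# Programs where RaceMob reports more true races than TSAN.
extra = {name: racemob[name] - tsan[name]
         for name in programs if racemob[name] > tsan[name]}

print(extra)
print("extra races found by RaceMob:", sum(extra.values()))
print("total races found by RaceMob:", sum(racemob.values()))
```

The differences fall on Memcached and Pfscan (one race each) and on Aget (two races), which matches the breakdown given in the text.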
Note that we give TSAN the beneﬁt of access to all executions that
RaceMob has access to (except the executions from the real users).
This is probably overly generous, because in reality, dynamic data
race detection is not crowdsourced, so one would run TSAN on fewer
executions and obtain lower data race detection coverage than shown
here. We did not use TSAN’s hybrid data race detection algorithm,
because it is known to report false positives and therefore reduces
the accuracy of data race detection.
RELAY typically reports at least an order of magnitude more data
races than the real data races reported by RaceMob, with no indica-
tion of whether they are true data races or not. Consequently, de-
velopers would not have information on how to prioritize their bug
ﬁxing. This would in turn impact the users, because it might take
longer to remove the data races with severe consequences. The ben-
eﬁt of tolerating a 2.32% average detection overhead with RaceMob
is that data race detection results are more detailed and helpful. To
achieve a similar effect as RaceMob, static data race detectors use un-
sound heuristics to prune some data race reports, and thus introduce
false negatives.
Program     Aggregate overhead with RaceMob            TSAN user-perceived
            [# of race candidates × # of users] in %   overhead in %
Apache      339.30                                     25,207.79
SQLite      281.60                                     1,428.57
Memcached   2.20                                       3,102.32
Fmm         1,598.08                                   47,888.07
Barnes      989.36                                     30,640.00
Ocean       360.70                                     3,069.39
Pbzip2      377.00                                     3,001.00
Knot        165.10                                     751.47
Aget        144.00                                     184.22
Pfscan      103.20                                     13,402.15

Table 4 – RaceMob aggregate overhead vs. TSAN’s average overhead,
relative to uninstrumented execution. RaceMob’s aggregate overhead
is across all the executions for all users. For TSAN, we report the
average overhead of executing all the available test cases.
6.1.4.2 Comparative Overhead
RELAY’s static data race detection is ofﬂine, and the longest detec-
tion we measured was below 1 hour.
We compared the overheads of dynamic data race detection in Race-
Mob and TSAN. We chose TSAN because it is freely available, actively
maintained, and works for C programs. The results are shown in
Table 4. The average overhead of TSAN ranged from almost 49× for
Fmm to 1.84× for Aget. The average overhead of RaceMob per user
is about three orders of magnitude less than that of TSAN for all three
programs.
The aggregate overhead of RaceMob represents the sum of all the
overheads of all the executions at all the user sites. It represents Race-
Mob’s overall overhead for detecting the data races in row 2 of Table 3.
We compare RaceMob’s aggregate overhead to TSAN’s overhead be-
cause these overheads represent what both tools incur for all the data
races they detect. The aggregate overhead of RaceMob is an order
of magnitude less than the overhead of TSAN. This demonstrates that
mere crowdsourcing of TSAN would not be enough to reduce its over-
head (it would still be one order of magnitude higher than RaceMob),
and so the other techniques proposed in RaceMob are necessary too.
In particular, there are two other factors that contribute to lower
overhead: the static data race detection phase and the lightweight
dynamic validation phase. The contribution of each such phase de-
pends on whether the application for which RaceMob performs data
race detection is synchronization-intensive or not. To show the
benefit of each phase, we picked Ocean (synchronization-intensive) and
Pbzip2 (uses less synchronization), and measured the contribution of
each phase.

[Figure 25: bar chart. x-axis: Pbzip2, Ocean; y-axis: Percentage of
aggregate overhead (%), from 0 to 100; series: Dynamic detection (TSAN),
RaceMob without DCI and on-demand detection, RaceMob.]
Figure 25 – Contribution of each technique to lowering the aggregate
overhead of RaceMob. Dynamic detection represents detection with
TSAN. RaceMob without DCI and on-demand detection just
uses static data race detection to prune the number of accesses
to monitor.
The results are shown in Fig. 25. This graph shows how the overhead
of full dynamic detection decreases with each phase. The contribution
of static data race detection is more significant for Pbzip2 than
for Ocean. This is because, for Pbzip2, narrowing down
the set of accesses to be monitored is already sufficient to lower
the overhead substantially.
On the other hand, Ocean beneﬁts more from DCI and on-demand
data race detection, because static data race detection is inaccurate
in this case (and is mitigated by DCI), and Ocean employs heavy syn-
chronization (mitigated by on-demand data race detection). Thus, we
conclude that both the static data race detection phase and DCI fol-
lowed by on-demand data race detection are essential to lowering the
overhead of aggregate data race detection in the general case.
We also compared the runtime overhead with PACER, a sampling-
based data race detector. We do not have access to a PACER imple-
mentation for C/C++ programs; therefore, we modiﬁed RaceMob to
operate like PACER. We allow PACER to have access to the static data
race detection results from RELAY, and we assumed PACER starts sam-
pling whenever a potential racing access is performed (as in Race-
Mob) rather than at a random time. We refer to our version of PACER
as PACER-SA.
PACER-SA’s runtime overhead is an order of magnitude larger than
that of RaceMob for non-synchronization-intensive programs: 21.56%
on average for PACER-SA vs. 2.32% for RaceMob. RaceMob has lower
overhead mainly because it performs data race detection selectively:
it does not perform on-demand data race detection for every potential
data race detected statically; rather, it only does so after DCI has
proven that the relevant accesses can indeed alias and that they in-
deed can occur in a multithreaded context. Table 1 shows that DCI
excludes on this basis more than half the data race candidates from
further analysis.
For synchronization-intensive programs, like Fmm, Ocean and Bar-
nes, PACER-SA’s overhead can be up to two orders of magnitude
higher than that of RaceMob. This is due to the combined effect
of DCI and on-demand data race detection. The latter factor is more
prominent for synchronization-intensive applications. To illustrate
this, we picked Fmm and used RaceMob and PACER-SA to detect data
races. For typical executions of 200 ms, where we ran Fmm with its
default workload, Fmm performed around 15,000 synchronization
operations, which incur a 200% runtime overhead with PACER-SA
compared to 4.54% with RaceMob.
We conclude that, even if PACER-SA’s overhead might be considered
acceptable for production use for non-synchronization-intensive
programs, it is prohibitively high in the case of synchronization-intensive
programs. This is despite giving vanilla PACER the benefit of a static
data race detection phase. PACER could have lower overhead than
RaceMob if it stopped sampling soon after having started and before
even detecting a data race, but it would of course also detect fewer
data races.
This section showed that RaceMob detects more true data races
than state-of-the-art detectors while not introducing additional false
negatives relative to what the static race detectors already do. It also
showed that RaceMob’s runtime overhead is lower than that of
state-of-the-art detectors.
6.1.5 Comparison to Concurrency Testing Tools
A concurrency testing tool can be viewed as a type of data race
detector, and vice versa. In this vein, one could imagine using Race-
Mob for testing, by using schedule steering (§3.3.3) to explore data
races that may otherwise be hard to witness and that could lead to
failures. As a simple test, we ran SQLite with the test cases used in
our evaluation 10,000 times and never encountered any hang when
not instrumented. When running it under RaceMob, we encountered
3 hangs within 176 executions. Similarly, we ran the Pbzip2 test cases
10,000 times and never encountered a crash, but RaceMob caused
the occurrence of 4 crashes within 130 executions. This suggests that
RaceMob could also be used as a testing tool to quickly identify and
prioritize data races.
Some existing concurrency testing tools perform an analysis similar
to schedule steering to detect and explore data races. In the rest of
this section we compare RaceMob to two such state-of-the-art tools:
RaceFuzzer [185] and Portend [110] (whose design we described in
detail in Chapter 5). These tools were not intended for use in
production, and thus have high overheads (up to 200× for RaceFuzzer
and up to 5,000× for Portend), so we do not compare overhead, but
focus instead on comparing their respective data race detection
coverage.

[Figure 26: thread timelines for bench2 (threads T1 to T3), bench3
(threads T1 to T4), and bench4 (threads T1 and T2), showing the accesses
to x, y, and z, the lock/unlock and signal/wait operations, the sleeps,
and the resulting happens-before (HB) edges.]
Figure 26 – Concurrency testing benchmarks: bench1 is shown in Fig. 2,
thus not repeated here. In bench2, the accesses to x in thread T1
and T3 can race, but the long sleep in T3 and T4 causes the
signal-wait and lock-unlock pairs to induce a happens-before edge
between T1 and T4. bench3 has a similar situation to bench2. In
bench4, the accesses to variables x, y, z from T1 and T2 are racing
if the input is either in1, in2, or in3.
RaceFuzzer works in two stages: First, it uses imprecise hybrid data
race detection [157] to detect potential data races in a program and
instrument them. Second, it uses a randomized analysis to determine
whether these potential data races are actual races. Portend uses pre-
cise happens-before dynamic data race detection and explores a de-
tected data race’s consequences along multiple paths and schedules.
To compare data race detection coverage, we use benchmarks bench1,
bench2, bench3 (taken from Google TSAN) and bench4 (taken from
[110]). The bench4 benchmark has three data races that only mani-
fest under speciﬁc inputs in1, in2, and in3. Simpliﬁed versions of the
benchmarks are shown in Fig. 26 and Fig. 2.
The RaceFuzzer implementation is not available, so we simulate it:
we use TSAN in imprecise hybrid mode, as done in RaceFuzzer, and
then implement RaceFuzzer’s random scheduler. The results appear
in Table 5. For bench1, bench2, and bench3, RaceFuzzer performs as
well as RaceMob in terms of data race detection coverage. For bench4,
RaceFuzzer’s data race detection coverage varies between 0/3 and 3/3.
To understand this variation, we run the following experiment: we
assume that initially neither tool has access to any test case with input
in1, in2, or in3. Thus, RaceFuzzer cannot detect any data race, so it
cannot instrument the racing accesses, and generates an instrumented
version of bench4 we call RaceFuzzer0. RaceMob, however, detects
all three potential data races in bench4, thanks to static data race
detection, and instruments bench4 at the potentially racing accesses.
If we allow RaceFuzzer to see a test with input in1, then it generates a
version of bench4 we call RaceFuzzer1; if we allow it to see both a test
with input in1 and in2, then it generates RaceFuzzer2. RaceFuzzer3
corresponds to having seen all three inputs.

Tool        bench1  bench2  bench3  bench4
RaceMob     1 / 1   1 / 1   1 / 1   3 / 3
RaceFuzzer  1 / 1   1 / 1   1 / 1   0 – 3 / 3
Portend     0 / 1   0 / 1   0 / 1   3 / 3

Table 5 – RaceMob vs. concurrency testing tools: Ratio of data races
detected in each benchmark to the total number of data races in that
benchmark.

[Figure 27: line plot. x-axis: inputs for bench4 (in1; in1, in2; in1,
in2, in3); y-axis: data race detection coverage (1/3, 2/3, 3/3);
series: RaceMob, RaceFuzzer0, RaceFuzzer1, RaceFuzzer2, RaceFuzzer3.]
Figure 27 – Data race detection coverage for RaceMob vs. RaceFuzzer. To
do as well as RaceMob, RaceFuzzer must have a priori access to
all test cases (the RaceFuzzer3 curve).
We run both RaceFuzzer’s and RaceMob’s versions of the instru-
mented benchmark and plot data race detection coverage in Fig. 27.
When run on random inputs different from in1, in2, and in3, neither
tool ﬁnds any data race (0/3), as expected. When given input in1,
RaceMob ﬁnds the data race, RaceFuzzer0 doesn’t, but RaceFuzzer1,
RaceFuzzer2, and RaceFuzzer3 do. And so on.
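
The experiment can be modeled with the small sketch below. It is a simplified model, not a simulation of either tool: the mapping from inputs to racing variables (in1 to x, in2 to y, in3 to z) and all function names are assumptions made for illustration. A race counts as detected only if its accesses were instrumented and the run's inputs exercise it.

```python
from typing import Iterable, Set

# Assumed mapping for bench4: each input exposes one racing variable.
RACE_FOR_INPUT = {"in1": "x", "in2": "y", "in3": "z"}


def racefuzzer_instrumented(seen_inputs: Iterable[str]) -> Set[str]:
    """Stage 1 is dynamic, so only races exercised by already-seen
    test inputs get instrumented."""
    return {RACE_FOR_INPUT[i] for i in seen_inputs if i in RACE_FOR_INPUT}


def racemob_instrumented() -> Set[str]:
    """Static detection flags every candidate, independent of test inputs."""
    return set(RACE_FOR_INPUT.values())


def detected(instrumented: Set[str], run_inputs: Iterable[str]) -> int:
    """Number of races both instrumented and exercised by the run inputs."""
    exercised = {RACE_FOR_INPUT[i] for i in run_inputs if i in RACE_FOR_INPUT}
    return len(instrumented & exercised)


racefuzzer0 = racefuzzer_instrumented([])
racefuzzer1 = racefuzzer_instrumented(["in1"])
racefuzzer3 = racefuzzer_instrumented(["in1", "in2", "in3"])
racemob = racemob_instrumented()
```

For example, `detected(racefuzzer0, ["in1"])` is 0 while `detected(racemob, ["in1"])` is 1, mirroring the first point of each curve in Fig. 27.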
Of course, giving RaceFuzzer the beneﬁt of access in advance to
all test cases is overly generous, but this experiment serves to illus-
trate how the tool works. In contrast, RaceMob achieves data race
detection coverage proportional to the number of runs with different
inputs in1, in2, in3, irrespective of which test cases were available ini-
tially, since it performs static data race detection to identify potential
data races. RaceFuzzer could potentially miss all input-dependent
data races even when the program under test is run with the inputs
that expose such data races, because it may have missed those data
races in its initial instrumentation stage. However, this is not a fun-
damental shortcoming: it is possible to mitigate it by replacing Race-
Fuzzer’s dynamic data race detection phase with a static data race
detector.
The results of the comparison with Portend appear in Table 5. RaceMob
detects all three data races in bench4, as well as all the data races
in all the other benchmarks. On the other hand, Portend discovered
all the input-dependent data races in bench4, but failed to detect the
data races in the other benchmarks, because it employs a precise dy-
namic data race detector that does not do schedule steering. However,
Portend is able to explore the consequences of a data race more thor-
oughly than RaceMob, and in that regard RaceMob and Portend are
complementary.
6.1.6 Scalability with Application Threads
RaceMob uses atomic operations to update internal shared struc-
tures related to dynamic data race validation and signal-wait synchro-
nization to perform schedule steering; in this section, we analyze the
effect these operations have on RaceMob’s scalability as the number
of application threads increases.
We conﬁgured multiple clients to concurrently request a 10 MB ﬁle
from Apache and Knot using the Apache benchmarking tool ab. For
SQLite and Memcached, we inserted, modified, and removed 5,000
items from the database and the object cache, respectively. We used
Pbzip2 to decompress a 100 MB file. For Ocean, we simulated currents
in a 256 × 256 ocean grid. For Barnes, we simulated interactions
of 16,384 bodies (default number for Barnes). We varied the number
of threads from 2 to 32. For all programs, we ran the instrumented
versions of the programs while performing data race detection and
measured the overhead relative to uninstrumented versions on the
8-core machine.
Fig. 28 shows the results. We expected RaceMob’s overhead to be-
come less visible after the thread count reached the core count. We
wanted to verify this, and that is why we used the 8-core machine. For
instance, for Apache the overhead is 1.16% for 2 threads, it slightly
rises to its largest value of 2.31% for 8 threads, and then it decreases
as the number of threads exceeds the number of cores. We observe a
similar trend for all other applications. We conclude that RaceMob’s
runtime overhead remains low as the number of threads in the test
programs increases.
[Figure 28: grouped bar chart. x-axis: Apache, SQLite, Memcached,
Barnes, Ocean, Pbzip2, Knot; y-axis: Overhead [% of uninstrumented
execution], from 0 to 8; series: 2, 4, 8, 16, and 32 threads.]
Figure 28 – RaceMob scalability: Induced overhead as a function of the
number of application threads.
6.2 gist’s evaluation
In this section we aim to answer the following questions about Gist
and failure sketching: Is Gist capable of automatically computing
failure sketches (§6.2.2)? Are these sketches accurate (§6.2.3)? How
efﬁcient is the computation of failure sketches in Gist (§6.2.4)?
6.2.1 Experimental Setup
To answer these questions, we benchmark Gist with several real-world
programs: Apache httpd, SQLite, Memcached, and Pbzip2
were previously described in §6.1.1. Cppcheck [140] is a C/C++ static
analysis tool integrated with popular development tools such as Vi-
sual Studio, Eclipse, and Jenkins. Curl [193] is a data transfer tool
for network protocols such as FTP and HTTP, and it is part of most
Linux distributions and many programs, like LibreOfﬁce and CMake.
Transmission [203] is the default BitTorrent client in Ubuntu and Fe-
dora Linux, as well as Solaris.
We developed an extensible framework called Bugbase [16] in or-
der to reproduce the known bugs in the aforementioned software.
Bugbase can also be used to do performance benchmarking of vari-
ous bug ﬁnding tools. We used Bugbase to obtain our experimental
results.
We benchmark Gist on bugs (from the corresponding bug repositories)
that were used by other researchers to evaluate their bug finding
and failure diagnosis tools [9, 110, 162]. Apart from bugs in Cppcheck
and Curl, all bugs are concurrency bugs (e.g., data races and atomicity
violations). We use a mixture of workloads from actual program runs, test
suites, test cases devised by us and other researchers [225], Apache’s
benchmarking tool ab, and SQLite’s test harness. We gathered execution
information from a total of 11,360 executions.

[Figure 29: failure sketch titled "Failure Sketch for Curl bug #965",
"Type: Sequential bug, data-related". Over time steps 1 to 7, it lists
statements from operate(...), namely the loop
for(i = 0; (url = next_url(urls)); i++), and from next_url(urls* urls),
namely len = strlen(urls->current);, with a horizontal line separating
different functions. Tracked values: url = “{}{”, urls->current = 0.
The sketch ends with "Failure (segmentation fault)".]
Figure 29 – The failure sketch of Curl bug #965.
The distributed cooperative setting of our test environment is sim-
ulated, as opposed to employing real users, because CPUs with In-
tel PT support are still scarce, having become available only recently.
In the future we plan to use a real-world deployment. Altogether
we gathered execution information from 1,136 simulated user endpoints.
Client-side experiments were run on a 2.4 GHz 4-core Intel
i7-5500U (Broadwell) machine running a Linux kernel with an Intel
PT driver [128]. The server side of Gist ran on a 2.9 GHz 32-core Intel
Xeon E5-2690 machine with 256 GB of RAM running Linux kernel
3.13.0-44.
6.2.2 Automated Generation of Sketches
For all the failures shown in Table 6, Gist successfully computed the
corresponding failure sketches after gathering execution information
from 11,360 runs in roughly 35 minutes. The results are shown in the
rightmost two columns. We veriﬁed that, for all sketches computed
by Gist, the failure predictors with the highest F-measure indeed cor-
respond to the root causes that developers chose to ﬁx.
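To make the ranking criterion concrete, the following minimal sketch ranks candidate failure predictors by F-measure. It assumes the standard balanced F-measure (the harmonic mean of precision and recall), which may differ in detail from the definition Gist uses; the predictor names and run counts are hypothetical and are not taken from Gist’s data.

```python
from dataclasses import dataclass


@dataclass
class PredictorStats:
    """Observation counts for one candidate failure predictor (hypothetical)."""
    name: str
    failing_with: int      # failing runs in which the predictor was observed
    successful_with: int   # successful runs in which the predictor was observed
    failing_total: int     # all failing runs


def f_measure(stats: PredictorStats) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    observed = stats.failing_with + stats.successful_with
    if observed == 0 or stats.failing_total == 0:
        return 0.0
    precision = stats.failing_with / observed
    recall = stats.failing_with / stats.failing_total
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rank_predictors(candidates):
    """Return the candidates sorted from best to worst predictor."""
    return sorted(candidates, key=f_measure, reverse=True)


# Hypothetical counts: the first predictor appears in every failing run
# and in no successful run, so it gets the highest possible score.
candidates = [
    PredictorStats("urls->current == 0", failing_with=5, successful_with=0, failing_total=5),
    PredictorStats("i == 0", failing_with=5, successful_with=20, failing_total=5),
    PredictorStats("len > 16", failing_with=1, successful_with=3, failing_total=5),
]
best = rank_predictors(candidates)[0]
print(best.name, round(f_measure(best), 2))
```

A predictor that occurs in all failing runs but also in many successful ones (the second candidate) scores low because its precision is poor, which is why such events are not highlighted as likely root causes.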
In the rest of this section, we present two failure sketches computed
by Gist, to illustrate how developers can use them for root cause diag-
nosis and for ﬁxing bugs. These two complement the failure sketch
for the Pbzip2 bug already described in Fig. 8. Aside from some for-
matting, the sketches shown in this section are exactly the output of
Gist. We renamed some variables and functions to save space in the
ﬁgures. The statements or variable values in dotted rectangles denote
failure predicting events with the highest F-measure values. We inte-
grated Gist with KCachegrind [211], a call graph viewer that allows
easy navigation of the statements in the failure sketch.
Fig. 29 shows the failure sketch for Curl bug #965, a sequential bug
caused by a specific program input: passing the string “{}{” (or any
other string with unbalanced curly braces) to Curl causes the variable
urls->current in function next_url to be NULL in step 6. The value of
url in step 2 (“{}{”) and the value of urls->current in step 6 (0) are
the best failure predictors. This failure sketch suggests that fixing
the bug consists of either disallowing unbalanced curly braces in the
input url, or not calling strlen when urls->current is NULL. Developers
chose the former solution to fix this bug [194].

Bug name /    Software  Software    Bug ID   Static slice      Ideal sketch   Gist-computed  Duration of failure sketch
software      version   size [LOC]  from     size, in source   size, in       sketch size,   computation by Gist:
                                    bug DB   [LOC] (LLVM       source [LOC]   in source      # failure recurrences
                                             instructions)     (LLVM instrs)  [LOC] (LLVM    <time> (offline
                                                                              instrs)        analysis time)
Apache-1      2.2.9     224,533     45605    7 (23)            8 (23)         8 (23)         5 <4m:22s> (1m:28s)
Apache-2      2.0.48    169,747     25520    35 (137)          4 (16)         4 (16)         4 <3m:53s> (0m:55s)
Apache-3      2.0.48    169,747     21287    354 (968)         6 (6)          8 (8)          3 <4m:17s> (1m:19s)
Apache-4      2.0.46    168,574     21285    335 (805)         9 (12)         13 (16)        4 <5m:34s> (1m:23s)
Cppcheck-1    1.52      86,215      3238     3,662 (10,640)    11 (16)        11 (16)        4 <5m:14s> (2m:32s)
Cppcheck-2    1.48      76,009      2782     3,028 (8,831)     3 (8)          3 (8)          3 <3m:21s> (1m:40s)
Curl          7.21      81,658      965      15 (46)           6 (17)         6 (17)         5 <1m:31s> (0m:40s)
Transmission  1.42      59,977      1818     680 (1,681)       2 (7)          3 (8)          3 <0m:23s> (0m:17s)
SQLite        3.3.3     47,150      1672     389 (1,011)       3 (4)          3 (4)          2 <2m:47s> (1m:43s)
Memcached     1.4.4     8,182       127      237 (1,003)       6 (13)         8 (16)         4 <0m:56s> (0m:02s)
Pbzip2        0.9.4     1,492       N/A      8 (14)            6 (13)         9 (14)         4 <1m:12s> (0m:03s)
Table 6 – Bugs used to evaluate Gist. Bug IDs come from the
corresponding official bug database. Source lines of code are measured
using sloccount [214]. We report slice and sketch sizes in both source
code lines and LLVM instructions. Time is reported in minutes:seconds.

Figure 30 – The failure sketch of Apache bug #21287. The grayed-out
components are not part of the ideal failure sketch, but they appear
in the sketch that Gist automatically computes. (Type: concurrency bug,
double-free. Threads T1 and T2 both execute decrement_refcount(obj);
the sketch tracks the value of obj->refcnt, which goes from 1 to 0, and
ends in the failure, a double free.)
Fig. 30 shows the failure sketch for Apache bug #21287, a concurrency
bug causing a double free. The failure sketch shows two threads
executing the decrement_refcount function with the same obj value.
The dec function decrements obj->refcount (abbreviated as obj->refcnt
in the figure). The call to dec, the check of the if condition
!obj->refcount, and the call to free are not atomic, and this can
cause a double free if obj->refcount is 0 in step 6 in T1 and step 8
in T2. The values of obj->refcount in steps 4 and 5 (1 and 0
respectively), and the double call to free(obj) are the best failure
predictors. Developers fixed this bug by ensuring that the
decrement-check-free triplet is executed atomically [195].
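As an illustration of what executing the decrement-check-free triplet atomically means, here is a minimal sketch that places all three steps in one critical section. It is a toy model, not Apache code: the RefCounted class, its free_calls counter, and the use of a lock are assumptions made for the example, and the actual fix [195] may use a different mechanism.

```python
import threading


class RefCounted:
    """Toy stand-in for a reference-counted object (hypothetical, not Apache code)."""

    def __init__(self, refcnt: int) -> None:
        self.refcnt = refcnt
        self.free_calls = 0              # how many times the "free" step ran
        self._lock = threading.Lock()

    def _free(self) -> None:
        # Stands in for free(obj); a second call would be the double free.
        self.free_calls += 1

    def decrement_refcount(self) -> None:
        # Decrement, check, and free form a single critical section, so no
        # other thread can run between the decrement and the zero check.
        with self._lock:
            self.refcnt -= 1
            if self.refcnt == 0:
                self._free()


obj = RefCounted(refcnt=2)
threads = [threading.Thread(target=obj.decrement_refcount) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(obj.refcnt, obj.free_calls)
```

Without the lock, both threads could decrement first and then both observe a count of zero, which is the interleaving the failure sketch highlights.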
The grayed-out statements in the failure sketch in Fig. 30 are not
part of the ideal failure sketch. The adaptive slice tracking in Gist
tracks them during slice reﬁnement, because Gist does not know the
statements in the ideal failure sketch a priori. For the Curl bug in
Fig. 29, we do not show any grayed-out statements, because adaptive
slice tracking happens to track only the statements that are in the
ideal failure sketch.
6.2.3 Accuracy of Failure Sketches
In this section, we measure the accuracy (A) of failure sketches
computed by Gist (ΦG), as compared to ideal failure sketches that
we computed by hand (ΦI), according to our ideal failure sketch def-
inition (§4.3). We deﬁne two components of failure sketch accuracy:
1) Relevance measures the extent to which a failure sketch contains
all the statements from the ideal sketch and no other statements. We
define relevance as the ratio of the number of LLVM instructions in
ΦG ∩ ΦI to the number of LLVM instructions in ΦG ∪ ΦI. We compute
relevance accuracy as a percentage, and define it as
$A_R = 100 \cdot \frac{|\Phi_G \cap \Phi_I|}{|\Phi_G \cup \Phi_I|}$.
2) Ordering measures the extent to which a failure sketch correctly
represents the partial order of LLVM memory access instructions in
the ideal sketch. To measure the similarity in ordering between the
Gist-computed failure sketches and their ideal counterparts, we use
the normalized Kendall tau distance [112], where τ counts the number
of pairwise disagreements between two ordered lists. For example, for
ordered lists <A, B, C> and <A, C, B>, the pairs (A, B) and (A, C)
have the same ordering, whereas the pair (B, C) has different
orderings in the two lists, hence τ = 1. We compute the ordering
accuracy as a percentage defined by
$A_O = 100 \cdot \left(1 - \frac{\tau(\Phi_G, \Phi_I)}{\#\text{ of pairs in } \Phi_G \cap \Phi_I}\right)$.
Note that # of pairs in ΦG ∩ ΦI can’t be zero, because both failure
sketches will at least contain the failing instruction as a common
instruction.
We define overall accuracy as $A = \frac{A_R + A_O}{2}$, which equally
favors AO and AR. Of course, different developers may have different
subjective opinions on which one matters more.
Figure 31 – Accuracy of Gist, broken down into relevance accuracy and
ordering accuracy.
We show Gist’s accuracy results in Fig. 31. Average relevance accu-
racy is 92%, average ordering accuracy is 100%, and average overall
accuracy is 96%, which leads us to conclude that Gist can compute
failure sketches with high accuracy. The accuracy results are deter-
ministic from one run to the next.
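The accuracy definitions above can be sketched in a few lines. This is a simplified illustration, not Gist’s implementation: it assumes each sketch is a totally ordered list of unique instruction identifiers (the thesis compares partial orders of memory access instructions), and the example sketches are hypothetical.

```python
from itertools import combinations


def relevance_accuracy(gist, ideal):
    """A_R: size of the intersection over size of the union, as a percentage."""
    g, i = set(gist), set(ideal)
    return 100.0 * len(g & i) / len(g | i)


def ordering_accuracy(gist, ideal):
    """A_O: 100 * (1 - disagreeing pairs / all pairs of common instructions)."""
    common = set(gist) & set(ideal)
    gist_order = [x for x in gist if x in common]
    ideal_pos = {x: k for k, x in enumerate(ideal)}
    pairs = list(combinations(gist_order, 2))
    if not pairs:
        # Assumption for this sketch: with fewer than two common
        # instructions there is nothing to disagree on.
        return 100.0
    # A pair (a, b) is listed in Gist's order; it disagrees with the ideal
    # sketch if the ideal sketch places a after b.
    tau = sum(1 for a, b in pairs if ideal_pos[a] > ideal_pos[b])
    return 100.0 * (1 - tau / len(pairs))


def overall_accuracy(a_r, a_o):
    """A: the unweighted mean of relevance and ordering accuracy."""
    return (a_r + a_o) / 2


# Hypothetical sketches: one excess prefix statement P and one swapped pair.
ideal = ["A", "B", "C"]
gist = ["P", "A", "C", "B"]
a_r = relevance_accuracy(gist, ideal)
a_o = ordering_accuracy(gist, ideal)
print(round(a_r, 2), round(a_o, 2), round(overall_accuracy(a_r, a_o), 2))
```

With the reported averages, overall_accuracy(92, 100) evaluates to 96, which matches the average overall accuracy stated above.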
Note that, for all cases when relevance accuracy is below 100%, it
is because Gist’s failure sketches have (relative to the ideal sketches)
some excess statements in the form of a preﬁx to the ideal failure
sketch, as shown in gray in Fig. 30. We believe that developers ﬁnd
it signiﬁcantly easier to visually discard excess statements clustered
as a preﬁx than excess statements that are sprinkled throughout the
failure sketch, so this inaccuracy is actually not of great consequence.
Figure 32 – Contribution of various techniques to Gist’s accuracy.
We show in Fig. 32 the contribution of Gist’s three analysis and
tracking techniques to overall sketch accuracy. To obtain these mea-
surements, we ﬁrst measured accuracy when using just static slicing,
then enabled control ﬂow tracking and re-measured, and ﬁnally en-
abled also data ﬂow tracking and re-measured. While the accuracy re-
sults are consistent across runs, the individual contributions may vary
if, for example, workload non-determinism causes different paths to
be exercised through the program.
A small contribution of a particular technique does not necessar-
ily mean that it does not perform well for a given program, but it
means that the other techniques that Gist had enabled prior to this
technique “stole its thunder” by being sufﬁcient to provide high ac-
curacy. For example, in the case of Apache-1, static analysis performs
well enough that control ﬂow tracking does not need to further reﬁne
the slice. However, in some cases (e.g., for SQLite), tracking the inter-
thread execution order of statements that access shared variables us-
ing hardware watchpoints is crucial for achieving high accuracy.
We observe that the amount of individual contribution varies
substantially from one program to the next, which means that none of
these techniques would achieve high accuracy for all programs on its
own, and so they are all necessary if we want high accuracy across a
broad spectrum of software.
6.2.4 Efﬁciency
Now we turn our attention to the efﬁciency of Gist: how long does
it take to compute a failure sketch, how much runtime performance
overhead does it impose on clients, and how long does it take to
perform its ofﬂine static analysis. We also look at how these measures
vary with different parameters.
The last column of Table 6 shows Gist’s failure sketch computation
latency broken down into three components. We show the number of
failure recurrences required to reach the best sketch that Gist can
compute, and this number varies from 2 to 5 recurrences. We then
show the total time it took in our simulated environment to find this
sketch; this time is always less than 6 minutes, varying from <0m:23s>
to <5m:34s>. Not surprisingly, this time is dominated by how long it
takes the target failure to recur, and in practice this depends on the
number of deployed clients and the variability of execution
circumstances. Nevertheless, we present the values for our simulated
setup to give an idea as to how long it took to build a failure sketch
for each bug in our evaluation. Finally, in parentheses we show Gist’s
offline analysis time, which consists of computing the static slice
plus generating instrumentation patches. This time is always less than
3 minutes, varying between (0m:02s) and (2m:32s). We therefore
conclude that, compared to the debugging latencies experienced by
developers today, Gist’s automated approach to root cause diagnosis
presents a significant advantage.
Figure 33 – Gist’s average runtime performance overhead across all
runs as a function of tracked slice size.
In the context of adaptive slice tracking, the overhead incurred on
the client side increases monotonically with the size of the tracked
slice, which is not surprising. Fig. 33 conﬁrms this experimentally.
The portion of the overhead curve between the slice sizes 16 and 22
is relatively ﬂat compared to the rest of the curve. This is because,
within that interval, Gist only tracks a few control ﬂow events for
Apache-1 and Curl (these programs have no additional data ﬂow ele-
ments in that interval), which introduces negligible overhead.
The majority of the overhead incurred on the client side stems from
control ﬂow tracking. In particular, the overhead of control ﬂow track-
ing varies from a low of 2.01% to a high of 3.43%, whereas the over-
head of data ﬂow tracking varies from a low of 0.87% to a high of
1.04%.
What is perhaps not immediately obvious is the trade-off between
initial slice size σ and the resulting accuracy and latency. In Fig. 34,
we show the average failure sketch accuracy across all programs we
measured (right y-axis) and Gist’s latency in # of recurrences (left
y-axis) as a function of the σ that Gist starts with (x-axis). As long
as the initial slice size is less than the one for the best sketch that
Gist can find, Gist’s adaptive approach is capable of guiding the
developer to the highest accuracy sketch. Of course, the smaller the
starting slice size, the longer it takes to find the sketch, because
the necessary # of recurrences is higher. There is thus an incentive to
start with a larger slice size. Unfortunately, if this size overshoots
the size of the highest accuracy sketch, then the accuracy of the
outcome suffers, because the larger slice includes extraneous elements.
Figure 34 – Tradeoff between slice size and the resulting accuracy and
latency. Accuracy is in percentage, latency is in the number of
failure recurrences.
As we mentioned in §6.2.3, the extraneous statements that can
lower Gist’s accuracy are clustered as a preﬁx to the ideal failure
sketch, allowing developers to easily ignore them. Therefore, if lower
root cause diagnosis latency is paramount to the developers, they are
comfortable ignoring the preﬁx of extraneous statements, and they
can tolerate the slight increase in Gist’s overhead, it is reasonable to
conﬁgure Gist to start with a large σ (e.g., σ = 23 achieves a latency
of one failure recurrence for all our benchmarks).
For the benchmarks in our evaluation, starting AsT at σ = 4 would
achieve the highest average accuracy at the lowest average latency of
3, with an average overhead of 3.98%.
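The latency side of this trade-off can be illustrated with a small model. The sketch below assumes that the tracked slice size σ is measured in LLVM instructions and doubles after each failure recurrence until it covers the sketch; neither assumption is stated in this section, although the model does reproduce the recurrence counts listed in Table 6 when starting from σ = 2. The sketch sizes and recurrence counts are copied from Table 6.

```python
def recurrences_needed(sketch_size: int, sigma: int = 2, growth: int = 2) -> int:
    """Failure recurrences until the tracked slice covers the sketch.

    Assumption: the slice size is multiplied by `growth` after every
    recurrence that did not yet cover the whole sketch.
    """
    recurrences = 1
    while sigma < sketch_size:
        sigma *= growth
        recurrences += 1
    return recurrences


# From Table 6: (Gist-computed sketch size in LLVM instructions,
#                reported number of failure recurrences)
TABLE_6 = {
    "Apache-1": (23, 5),
    "Apache-2": (16, 4),
    "Apache-3": (8, 3),
    "Apache-4": (16, 4),
    "Cppcheck-1": (16, 4),
    "Cppcheck-2": (8, 3),
    "Curl": (17, 5),
    "Transmission": (8, 3),
    "SQLite": (4, 2),
    "Memcached": (16, 4),
    "Pbzip2": (14, 4),
}

for name, (size, reported) in TABLE_6.items():
    print(f"{name}: model={recurrences_needed(size)} reported={reported}")

# Starting with a large slice trades accuracy and overhead for latency.
worst_case_large_start = max(
    recurrences_needed(size, sigma=23) for size, _ in TABLE_6.values()
)
print("worst-case recurrences when starting at sigma = 23:", worst_case_large_start)
```

Under the same assumption, starting at σ = 23 needs a single recurrence for every benchmark, in line with the latency quoted above for that setting.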
Finally, Fig. 35 compares Intel PT, the hardware-based control flow
tracking mechanism we use in Gist, to Mozilla rr, a software-based
state-of-the-art record & replay system. In particular, we compare
the performance overhead imposed by the two tracking mechanisms
on the client application. The two extremes are Cppcheck, where
Mozilla rr is on par with Intel PT, and Transmission and SQLite,
where Mozilla rr’s overhead is many orders of magnitude higher than
Intel PT’s (footnote 1). For the benchmarks in our evaluation, full
tracing using Intel PT incurs an average overhead of 11%, whereas full
program record & replay incurs an average runtime overhead of 984%.
Unlike Intel PT, Mozilla rr also gathers data flow information, but
with Gist we have shown that full program tracing is not necessary
for automating root cause diagnosis.
1. Full tracing overheads of Transmission and SQLite for Intel PT are
too low to be reliably measured, thus they are shown as 0%, and the
corresponding Mozilla rr/Intel PT overhead ratios for these systems
are shown as ∞.
Figure 35 – Comparison of the full tracing overheads of Mozilla rr and
Intel PT.
In conclusion, our empirical evaluation shows that Gist is capable
of automatically computing failure sketches for failures caused by
real bugs in real systems (§6.2.2), these sketches have a high accuracy
of 96% on average (§6.2.3), and the average performance overhead
of failure sketching is low at 3.74% with σ = 2 (§6.2.4). We there-
fore believe failure sketching to be a promising approach for helping
developers debug elusive bugs that occur only in production.
6.3 Portend’s Evaluation
In this section, we answer the following questions: Is Portend effec-
tive in telling developers which data races are true bugs and in help-
ing them ﬁx buggy data races (§6.3.2)? How accurately does it classify
data race reports into the four categories of data races (§6.3.3)? How
long does classiﬁcation take, and how does it scale (§6.3.4)? How
does Portend compare to the state of the art in data race classiﬁca-
tion (§6.3.5)? How effectively and efﬁciently does Portend implement
symbolic memory consistency modeling and what is its memory over-
head (§6.3.6, §6.3.7)? Throughout this section, we highlight the syn-
ergy of the techniques used in Portend: in particular §6.3.2 shows
how symbolic output comparison allows more accurate data race clas-
siﬁcation compared to post-data race state comparison, and §6.3.3
shows how the combination of multi-path multi-schedule analysis
improves upon traditional single-path analysis.
6.3.1 Experimental Setup
We apply Portend to 7 applications: SQLite, Pbzip2, Memcached,
Ocean, and Fmm, which we previously described in §6.1.1; Ctrace [141],
a multi-threaded debug library; Bbuf [223], a shared buffer implemen-
tation with a conﬁgurable number of producers and consumers.
We additionally evaluate Portend on homegrown micro-benchmarks
that capture most classes of data races considered as harmless in the
literature [187, 152]: “redundant writes” (RW), where racing threads
write the same value to a shared variable, “disjoint bit manipulation”
(DBM), where disjoint bits of a bit-field are modified by racing
threads, “all values valid” (AVV), where the racing threads write
different values that are nevertheless all valid, and “double checked
locking” (DCL), a method used to reduce the locking overhead by
first testing the locking criterion without actually acquiring a lock.
Additionally, we have 4 other micro-benchmarks that we used to
evaluate the SMCM. We detail those micro-benchmarks in §6.3.6. Table 7
summarizes the properties of our 15 experimental targets.
Margin note: These “harmless” data races are anti-patterns for some
languages and platforms, because their behavior is highly dependent on
the compiler and the hardware [144].
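To illustrate the double checked locking pattern described above, here is a minimal sketch. It only shows the shape of the pattern: the micro-benchmarks themselves are written in C++, the LazyResource class and its init_count counter are hypothetical, and in C or C++ the unsynchronized first check is exactly the racing access whose behavior depends on the compiler and the hardware.

```python
import threading


class LazyResource:
    """Lazily initialized shared resource guarded by double checked locking."""

    _instance = None
    _lock = threading.Lock()
    init_count = 0   # counts how many times initialization actually ran

    @classmethod
    def get(cls):
        # First check, without the lock: cheap, but it reads shared state
        # that another thread may be writing at the same time.
        if cls._instance is None:
            with cls._lock:
                # Second check, under the lock: only one thread initializes.
                if cls._instance is None:
                    cls.init_count += 1
                    cls._instance = object()
        return cls._instance


results = []
threads = [
    threading.Thread(target=lambda: results.append(LazyResource.get()))
    for _ in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(LazyResource.init_count, len({id(r) for r in results}))
```

The lock is only contended while the resource is still uninitialized; afterwards every caller returns after the first, lock-free check.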
Program           Size (LOC)   Language   # Forked threads
SQLite 3.3.0      113,326      C          2
ocean 2.0         11,665       C          2
fmm 2.0           11,545       C          3
memcached 1.4.5   8,300        C          8
pbzip2 2.1.1      6,686        C++        4
ctrace 1.2        886          C          3
bbuf 1.0          261          C          8
AVV               49           C++        3
DCL               45           C++        5
DBM               45           C++        3
RW                42           C++        3
no-sync           45           C++        3
no-sync-bug       46           C++        3
sync              47           C++        3
sync-bug          48           C++        3
Table 7 – Programs analyzed with Portend. Source lines of code are
measured with the cloc utility.
We ran Portend on several other systems (e.g., HawkNL, swarm),
but no races were found in those programs with the test cases we ran,
so we do not include them here. For all experiments, the Portend
parameters were set to Mp = 5, Ma = 2, and the number of symbolic
inputs to 2. We found these numbers to be sufﬁcient to achieve high
accuracy in a reasonable amount of time. To validate Portend’s re-
sults, we used manual investigation, analyzed developer change logs,
and consulted with the applications’ developers when possible. All
experiments were run on a 2.4 GHz Intel Core 2 Duo E6600 CPU with
4 GB of RAM running Ubuntu Linux 10.04 with kernel version 2.6.33.
The reported numbers are averages over 10 experiments.
6.3.2 Effectiveness
Of the 93 distinct data races detected in 7 real-world applications,
Portend classiﬁed 5 as deﬁnitely harmful by watching for “basic”
properties (Table 8): one hangs the program and four crash it.
            Total # of     # of “Spec violated” races
Program     data races     Deadlock   Crash   Semantic
SQLite      1              1          0       0
pbzip2      31             0          3       0
ctrace      15             0          1       0
Manually inserted errors
fmm         13             0          0       1
memcached   18             0          1       0
Table 8 – “Spec violated” data races and their consequences.
To illustrate the checking for “high level” semantic properties, we
instructed Portend to verify that all timestamps used in fmm are pos-
itive. This caused it to identify the 6th “harmful” data race in Table 8;
without this semantic check, this data race turns out to be harmless,
as the negative timestamp is eventually overwritten.
To illustrate a “what-if analysis” scenario, we turned an arbitrary
synchronization operation in the memcached binary into a no-op, and
then used Portend to explore the question of whether it is safe to
remove that particular synchronization point (e.g., we may be inter-
ested in reducing lock contention). Removing this synchronization
induces a data race in memcached; Portend determined that the data
race could lead to a crash of the server for a particular interleaving,
so it classiﬁed it as “spec violated”.
Portend’s main contribution is the classiﬁcation of data races. If one
wanted to eliminate all harmful data races from their code, they could
use a static data race detector (one that is complete, and, by necessity,
prone to false positives) and then use Portend to classify these reports.
For every harmful data race, Portend’s comprehensive report and
replayable traces (i.e., inputs and thread schedule) allowed us to con-
ﬁrm the harmfulness of the data races within minutes. Portend’s
report includes the stack traces of the racing threads along with the
address and size of the accessed memory ﬁeld; in the case of a seg-
mentation fault, the stack trace of the faulting instruction is provided
as well—this information can help in automated bug clustering. Ac-
cording to developers’ change logs and our own manual analysis, the
data races in Table 8 are the only known harmful data races in these
applications.
            Number of data races                          K-witness harmless
Program     Distinct     Data race   Spec       Output    states   states   Single
            data races   instances   violated   differs   same     differ   ordering
SQLite      1            1           1          0         0        0        0
ocean       5            14          0          0         0        1        4
fmm         13           517         0          0         0        1        12
memcached   18           104         0          2         0        0        16
pbzip2      31           97          3          3         0        0        25
ctrace      15           19          1          10        0        4        0
bbuf        6            6           0          6         0        0        0
AVV         1            1           0          0         1        0        0
DCL         1            1           0          0         1        0        0
DBM         1            1           0          0         1        0        0
RW          1            1           0          0         1        0        0
Table 9 – Summary of Portend’s classification results. We consider two
data races to be distinct if they involve different accesses to shared
variables; the same data race may be encountered multiple times during
an execution. These two different aspects are captured by the Distinct
data races and Data race instances columns, respectively. Portend uses
the stack traces and the program counters of the threads making the
racing accesses to identify distinct data races. The last 5 columns
classify the distinct data races. The states same/differ columns show
for how many data races the primary and alternate states were
different after the data race, as computed by the Record/Replay
Analyzer [152].
6.3.3 Accuracy and Precision
To evaluate Portend’s accuracy and precision, we had it classify
all 93 data races in our target applications and micro-benchmarks.
Table 9 summarizes the results. The ﬁrst two columns show the num-
ber of distinct data races and the number of respective instances, i.e.,
the number of times those data races manifested during data race
detection. The “spec violated” column includes all data races from
Table 8 minus the semantic data race in fmm and the data race we
introduced in memcached. In the “k-witness harmless” column, we
show for which data races the post-data race states differed vs. not.
By accuracy, we refer to the correctness of classiﬁcation: the higher
the accuracy, the higher the ratio of correct classiﬁcation. Precision on
the other hand, refers to the reproducibility of experimental results:
the higher the precision, the higher the ratio with which experiments
are repeated with the same results.
To determine accuracy, we manually classiﬁed each data race and
found that Portend had correctly classiﬁed 92 of the 93 data races
(99%) in our target applications: all except one of the data races clas-
siﬁed “k-witness harmless” by Portend are indeed harmless in an
absolute sense, and all “single ordering” data races indeed involve
ad-hoc synchronization.
To measure precision, we ran the classification 10 times for each
data race. Portend consistently reported the same data set shown in
Table 9, which indicates that, for these data races and applications,
it achieves full precision.
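The two notions defined above can be written down as a short sketch. The helper names are hypothetical; the counts are the figures quoted in this section (92 of 93 correct classifications) and the per-category totals obtained by summing the classification columns of Table 9.

```python
def classification_accuracy(correct: int, total: int) -> float:
    """Accuracy: percentage of data races whose classification is correct."""
    return 100.0 * correct / total


def is_fully_precise(runs) -> bool:
    """Precision in the sense used here: repeated runs give identical results."""
    first = runs[0]
    return all(run == first for run in runs[1:])


# Column totals of Table 9 (distinct data races per category).
table_9_totals = {
    "spec violated": 5,
    "output differs": 21,
    "k-witness harmless": 10,
    "single ordering": 57,
}

# Ten repetitions that all report the same classification, as in the text.
runs = [dict(table_9_totals) for _ in range(10)]

print(sum(table_9_totals.values()))                # total number of data races
print(round(classification_accuracy(92, 93), 1))   # accuracy in percent
print(is_fully_precise(runs))
```

Accuracy is judged against a manually established ground truth, whereas precision only compares Portend’s runs with each other, so a tool can be fully precise and still inaccurate.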
As can be seen in the “k-witness harmless” column, for each and
every one of the 7 real-world applications, a state difference (as used
in [152]) does not correctly predict harmfulness, while our “k-witness
harmless” analysis correctly predicts that the data races are harmless
with one exception.
This suggests that differencing of concrete state is a poor classiﬁ-
cation criterion for data races in real-world applications with large
memory states, but may be acceptable for simple benchmarks. This
also supports our choice of using symbolic output comparison.
Multi-path multi-schedule exploration proved to be crucial for
Portend’s accuracy. Fig. 36 shows the breakdown of the contribution
of each technique used in Portend: ad-hoc synchronization detection,
multi-path analysis, and multi-schedule analysis. In particular,
for 16 out of 21 “output differs” data races (6 in Bbuf, 9 in Ctrace,
1 in Pbzip2) and for 1 “spec violated” data race (in Ctrace),
single-path analysis revealed no difference in output; it was only
multi-path multi-schedule exploration that revealed an output
difference (9 data races required multi-path analysis for
classification, and 8 data races also required multi-schedule
analysis). Without multi-path multi-schedule analysis, it would have
been impossible for Portend to accurately classify those data races by
just using the available test cases. Moreover, there is a high
variance in the contribution of each technique for different programs,
which means that none of these techniques alone would have achieved
high accuracy for a broad range of programs.
Figure 36 – Breakdown of the contribution of each technique toward
Portend’s accuracy. We start from single-path analysis and enable
one by one the other techniques: ad-hoc synchronization detection,
multi-path analysis, and finally multi-schedule analysis. (The plot
shows accuracy [%] for Ctrace, Pbzip2, Memcached, and Bbuf under four
configurations: single-path, ad-hoc synch detection, multi-path, and
multi-path + multi-schedule.)
We also wanted to evaluate Portend’s ability to deal with false posi-
tives, i.e., false data race reports. Data race detectors, especially static
ones, may report false positives for a variety of reasons, depending
on which technique they employ. To simulate an imperfect detector
for our applications, we deliberately removed from Portend’s data
race detector its awareness of mutex synchronizations. We then elim-
inated the data races in our micro-benchmarks by introducing mutex
synchronizations. When we re-ran Portend with the erroneous data
race detector on the micro-benchmarks, all four were falsely reported
as data races by the detector, but Portend ultimately classiﬁed all of
them as “single ordering”. This suggests Portend is capable of prop-
erly handling false positives.
Fig. 37 shows examples of real data races for each category: (a) a
“spec violated” data race in which resources are freed twice, (b) a “k-
witness harmless” data race due to redundant writes, (c) an “output
differs” data race in which the schedule-sensitive value of the shared
variable inﬂuences the output, and (d) a “single ordering” data race
showing ad-hoc synchronization implemented via busy wait.
6.3.4 Efﬁciency
We evaluate the performance of Portend in terms of efficiency and
scalability. Portend’s performance is mostly relevant if it is to be
used interactively, as a developer tool, and also if it is used in a
large-scale bug triage tool, such as in Microsoft’s Windows Error
Reporting system [73].
Figure 37 – Simplified examples for each data race class from real
systems. (a) and (b) are from ctrace, (c) is from memcached and (d) is
from pbzip2. The arrows indicate the pair of racing accesses. (Panel
contents, for threads T0 and T1:
(a) if(_initialized){ for(i=0; i<tNum; ++i) free(threads[i])
    _initialized = 0; }
(b) trc_on = 1
(c) current_time = (rel_time_t) (timer.tv_sec - process_started);
    settings.oldest_live = current_time - 1; ...
    APPEND_STAT(..., settings.oldest_live, ...); ... PRINT_STAT(...)
(d) OutputBuffer[blockNum].buf = DecompressedData; allDone = 1; ...
    while (allDone == 0) usleep(50000);
    ret = write(..., OutputBuffer[currBlock], ...); ...)
We measure the time it takes Portend to classify the 93 data races;
Table 10 summarizes the results. We ﬁnd that Portend classiﬁes all
detected data races in a reasonable amount of time, the longest taking
less than 11 minutes. For Bbuf, Ctrace, Ocean and Fmm, the slowest
classiﬁcation time is due to a data race from the “k-witness harmless”
category, since classiﬁcation into this category requires multi-path
multi-schedule analysis.
The second column reports the time it took Cloud9 to interpret
the programs with concrete inputs. This provides a sense of the
overhead incurred by Portend compared to regular LLVM interpre-
tation in Cloud9. Both data race detection and classiﬁcation are dis-
abled when measuring baseline interpretation time. In summary, the
overhead introduced by classiﬁcation ranges from 1.1× to 49.9× over
Cloud9’s interpreter’s overhead.
In order to get a sense of how classification time scales with
program characteristics, we measured it as a function of program size,
number of preemption points, number of branches that depend (directly
or indirectly) on symbolic inputs, and number of threads. We
found that program size plays almost no role in classification time.
Instead, the other three characteristics play an important role. We
show in Fig. 38 how classification time varies with the number of
dependent branches and the number of preemptions in the schedule
(which is roughly proportional to the number of preemption points
and the number of threads). Each vertical bar corresponds to the
classification time for the indicated data race. We see that, as the
number of preemptions and branches increases, so does classification
time.
Figure 38 – Change in classification time with respect to number of
preemptions and number of dependent branches for some of the data
races in Table 9. Each sample point is labeled with data race id.
(Axes: classification time [sec] (log scale), # dependent branches,
and # preemption points (log scale). Sample points: sqlite1, bbuf1,
ctrace1, fmm1, memcached1, ocean1, memcached2, memcached3.)
            Cloud9 running   Portend classification time (sec)
Program     time (sec)       Avg       Min       Max
SQLite      3.10             4.20      4.09      4.25
ocean       19.64            60.02     19.90     207.14
fmm         24.87            64.45     65.29     72.83
memcached   73.87            645.99    619.32    730.37
pbzip2      15.30            360.72    61.36     763.43
ctrace      3.67             24.29     5.54      41.08
bbuf        1.81             4.47      4.77      5.82
AVV         0.72             0.83      0.78      1.02
DCL         0.74             0.85      0.83      0.89
DBM         0.72             0.81      0.79      0.83
RW          0.74             0.81      0.81      0.82
Table 10 – Portend’s classification time for the 93 data races in Table 9.
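The classification overhead relative to plain interpretation can be recomputed from Table 10 with a short sketch. How the range quoted earlier in this section (1.1× to 49.9×) was derived is not stated, so the choice below is an assumption: the lower end is taken as the smallest ratio of average classification time to Cloud9 running time, and the upper end as the largest ratio of maximum classification time to Cloud9 running time.

```python
# (Cloud9 running time, average and maximum classification time), in
# seconds, copied from Table 10.
TABLE_10 = {
    "SQLite": (3.10, 4.20, 4.25),
    "ocean": (19.64, 60.02, 207.14),
    "fmm": (24.87, 64.45, 72.83),
    "memcached": (73.87, 645.99, 730.37),
    "pbzip2": (15.30, 360.72, 763.43),
    "ctrace": (3.67, 24.29, 41.08),
    "bbuf": (1.81, 4.47, 5.82),
    "AVV": (0.72, 0.83, 1.02),
    "DCL": (0.74, 0.85, 0.89),
    "DBM": (0.72, 0.81, 0.83),
    "RW": (0.74, 0.81, 0.82),
}


def slowdown(classification_time: float, baseline_time: float) -> float:
    """Classification time as a multiple of the interpretation baseline."""
    return classification_time / baseline_time


# Assumed derivation of the range endpoints (see the lead-in).
lowest = min(slowdown(avg, base) for base, avg, _ in TABLE_10.values())
highest = max(slowdown(mx, base) for base, _, mx in TABLE_10.values())
print(f"{lowest:.1f}x to {highest:.1f}x")
```

Under this reading, the low end comes from the RW micro-benchmark and the high end from the slowest pbzip2 data race.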
We analyzed Portend’s accuracy with increasing values of k and
found that k = 5 is sufﬁcient to achieve overall 99% accuracy for all
the programs in our evaluation. Fig. 39 shows the results for Ctrace,
Pbzip2, Memcached, and Bbuf. We therefore conclude that it is possi-
ble to achieve high classiﬁcation accuracy with relatively small values
of k.
Figure 39 – Portend’s accuracy with increasing values of k. (The plot
shows accuracy [%] against the value of k for Pbzip2, Ctrace,
Memcached, and Bbuf.)
6.3.5 Comparison to Existing Data Race Detectors
We compare Portend to the Record/Replay-Analyzer technique [152],
Helgrind+’s technique [95], and Ad-Hoc-Detector [200] in terms of
the accuracy with which data races are classiﬁed. We implemented
the Record/Replay-Analyzer technique in Portend and compared ac-
curacy empirically. For the ad-hoc synchronization detection tech-
niques, since we do not have access to the implementations, we an-
alytically derive the expected classiﬁcation based on the published
algorithms. We do not compare to RaceFuzzer [185], because it is
primarily a bug ﬁnding tool looking for harmful data races that occur
due to exceptions and memory errors; it therefore does not provide
a ﬁne-grained classiﬁcation of data races. Similarly, no comparison
is provided to DataCollider [100], since data race classiﬁcation in this
tool is based on heuristics that pertain to data races that we rarely
encountered in our evaluation.
In Table 11 we show the accuracy, relying on manual inspection
as “ground truth”. Record/Replay-Analyzer does not tolerate replay
failures and classiﬁes data races that exhibit a post-data race state
mismatch as harmful (shown as specViol), causing it to have low
accuracy (10%) for that class. When comparing to Helgrind+ and
Ad-Hoc-Detector, we conservatively assume that these tools incur no
false positives when ad-hoc synchronization is present, even though
this is unlikely, given that both tools rely on heuristics. This notwith-
standing, both tools are focused on weeding out data races due to
ad-hoc synchronization, so they cannot properly classify the other
data races (36 out of 93). In contrast, Portend classiﬁes a wider range
of data races with high accuracy.
                             specViol   k-witness   outDiff   singleOrd
Ground Truth                 100%       100%        100%      100%
Record/Replay-Analyzer       10%        95%         - (not-classified)
Ad-Hoc-Detector, Helgrind+   - (not-classified)               100%
Portend                      100%       99%         99%       100%
Table 11 – Accuracy for each approach and each classification category,
applied to the 93 data races in Table 9. “Not-classified” means that
an approach cannot perform classification for a particular class.

The main advantage of Portend over Record/Replay-Analyzer is
that it is immune to replay failures. In particular, for all the data
races classified by Portend as “single ordering”, there was a replay
divergence (that caused replay failures in Record/Replay-Analyzer),
which would cause Record/Replay-Analyzer to classify the
corresponding data races as harmful despite them exhibiting no apparent
harmfulness; this accounts for 57 of the 84 misclassifications. Note
that even if Record/Replay-Analyzer were augmented with a phase
that pruned “single ordering” data races (57/93), it would still
diverge on 32 of the remaining 36 data races and classify them as “spec
violated”, whereas only 5 are actually “spec violated”. Portend, on
the other hand, correctly classifies 35/36 of those remaining data
races. Another advantage is that Portend classifies based on symbolic
output comparison, not concrete state comparison, and therefore, its
classification results can apply to a range of inputs rather than a
single input.
We manually veriﬁed and, when possible, checked with developers
that the data races in the “k-witness harmless” category are indeed
harmless. Except for one data race, we concluded that developers
intentionally left these data races in their programs because they con-
sidered them harmless. These data races match known patterns [152,
100], such as redundant writes to shared variables (e.g., we found
such patterns in Ctrace). However, for one data race in Ocean, we
conﬁrmed that Portend did not ﬁgure out that the data race belongs
in the “output differs” category (the data race can produce different
output if a certain path in the code is followed, which depends indi-
rectly on program input). Portend was not able to ﬁnd this path even
with k = 10 after one hour. Manual investigation revealed that this
path is hard to ﬁnd because it requires a very speciﬁc and complex
combination of inputs.
6.3.6 Efficiency and Effectiveness of Symbolic Memory Consistency Modeling
The previous evaluation results were using the SMCM plugin in
the sequential consistency mode. The sequential memory consistency
mode is the default in Cloud9 as well as in Portend. In this section,
we answer the following questions while operating the SMCM plugin
in Portend’s weak consistency mode: (1) Is Portend effective in dis-
covering bugs that may surface under the weak consistency model?,
(2) What is Portend’s efﬁciency and (3) memory usage while operat-
ing the SMCM plugin in Portend’s weak memory consistency mode?
We use simple micro-benchmarks that we have constructed to test
the basic functionality of SMCM. The simpliﬁed source code for these
micro-benchmarks can be seen in Figs. 40–43. These benchmarks are:
1:  int volatile globalx = 0;
2:  int volatile globaly = 0;
3:  void* work0(void *arg) {              // Thread T1
4:    globalx = 2;
5:    globaly = 1;
6:    return 0;
7:  }
8:  void* work1(void *arg) {              // Thread T2
9:    globalx = 2;
10:   return 0;
11: }
12: int main(int argc, char *argv[]){     // Thread Main
13:   pthread_t t0, t1;
14:   int rc;
15:   rc = pthread_create(&t0, 0, work0, 0);
16:   rc = pthread_create(&t1, 0, work1, 0);
17:   printf("%d,%d", globalx, globaly);
18:   pthread_join(t0, 0);
19:   pthread_join(t1, 0);
20:   return 0;
21: }
Figure 40 – A program with potential write reordering.
#include <pthread.h>

int volatile globalx = 0;
int volatile globaly = 0;

// Thread T1: two writes with no synchronization between them.
void* work0(void* arg) {
  globalx = 2;
  globaly = 1;
  return 0;
}

// Thread T2
void* work1(void* arg) {
  globalx = 2;
  return 0;
}

// Thread Main
int main(int argc, char* argv[]){
  pthread_t t0, t1;
  int rc;
  rc = pthread_create(&t0, 0, work0, 0);
  rc = pthread_create(&t1, 0, work1, 0);
  if(globalx == 0 && globaly == 2)
    ; //crash!
  pthread_join(t0, 0);
  pthread_join(t1, 0);
  return 0;
}
Figure 41 – A program with potential write reordering that leads to a crash.
#include <pthread.h>
#include <stdio.h>

int volatile globalx = 0;
int volatile globaly = 0;
pthread_barrier_t barr;  // shared by T1 and T2; initialized in main

// Thread T1: the barrier separates the two writes.
void* work0(void* arg) {
  globalx = 2;
  pthread_barrier_wait(&barr);
  globaly = 1;
  return 0;
}

// Thread T2
void* work1(void* arg) {
  globalx = 2;
  pthread_barrier_wait(&barr);
  return 0;
}

// Thread Main
int main(int argc, char* argv[]){
  pthread_t t0, t1;
  int rc;
  pthread_barrier_init(&barr, NULL, 2);
  rc = pthread_create(&t0, 0, work0, 0);
  rc = pthread_create(&t1, 0, work1, 0);
  printf("%d,%d", globalx, globaly);
  pthread_join(t0, 0);
  pthread_join(t1, 0);
  return 0;
}
Figure 42 – A program with no potential for write reordering.
#include <pthread.h>

int volatile globalx = 0;
int volatile globaly = 0;
pthread_barrier_t barr;  // shared by T1 and T2; initialized in main

// Thread T1: both writes happen before the barrier.
void* work0(void* arg) {
  globalx = 2;
  globaly = 1;
  pthread_barrier_wait(&barr);
  return 0;
}

// Thread T2
void* work1(void* arg) {
  globalx = 2;
  pthread_barrier_wait(&barr);
  return 0;
}

// Thread Main
int main(int argc, char* argv[]){
  pthread_t t0, t1;
  int rc;
  pthread_barrier_init(&barr, NULL, 2);
  rc = pthread_create(&t0, 0, work0, 0);
  rc = pthread_create(&t1, 0, work1, 0);
  if(globalx == 0 && globaly == 2)
    ; //crash!
  pthread_join(t0, 0);
  pthread_join(t1, 0);
  return 0;
}
Figure 43 – A program that uses barriers and has a potential write
reordering that leads to a crash.
— no-sync (Fig. 40): A program with opportunities for write
reordering. Reorderings cause the printf statement on line 17 to
produce different program outputs.
— no-sync-bug (Fig. 41): A program with opportunities for write
reordering. A particular write reordering causes the program to
crash; however, the program does not crash under sequential
consistency.
— sync (Fig. 42): A program with no opportunities for write
reordering. There is a data race on both globalx and globaly. Since
both threads 1 and 2 write the same value 2 to globalx, the output
of the program is the same for any execution, assuming writes
are atomic (see footnote 2).
— sync-bug (Fig. 43): A program with opportunities for write
reordering. The barrier synchronization does not prevent the writes
to globalx and globaly from reordering. A particular write
reordering causes the program to crash; however, the program
does not crash under sequential consistency.
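To make the effect of write reordering on the no-sync benchmark concrete, the following is a minimal sketch that enumerates which (globalx, globaly) pairs the main thread could observe. It is a toy model and not Portend's SMCM: the function name count_outcomes is hypothetical, and the sketch assumes that main reads both variables in a single atomic snapshot, which the real program does not guarantee. The counts it produces are not the state coverage figures reported for Portend.

```c
#include <stdbool.h>
#include <stddef.h>

/* The three writes in the no-sync benchmark (Fig. 40). */
enum { W0_X, W0_Y, W1_X };  /* work0: globalx = 2; work0: globaly = 1; work1: globalx = 2 */

/* Counts the distinct (globalx, globaly) snapshots that main can observe.
 * Assumption (not from the thesis): main reads both variables atomically.
 * allow_reorder == false: work0's writes become visible in program order.
 * allow_reorder == true:  work0's two writes may become visible in either order.
 * seen[x][y] is set to true for every observable pair. */
int count_outcomes(bool allow_reorder, bool seen[3][2])
{
    /* All orders in which the three writes can become visible.
     * The first three keep W0_X before W0_Y; the last three swap them. */
    static const int perms[6][3] = {
        {W0_X, W0_Y, W1_X}, {W0_X, W1_X, W0_Y}, {W1_X, W0_X, W0_Y},
        {W0_Y, W0_X, W1_X}, {W0_Y, W1_X, W0_X}, {W1_X, W0_Y, W0_X},
    };
    int distinct = 0;

    for (int x = 0; x < 3; x++)
        for (int y = 0; y < 2; y++)
            seen[x][y] = false;

    for (size_t p = 0; p < 6; p++) {
        if (!allow_reorder && p >= 3)
            continue;  /* skip orders that violate work0's program order */
        /* The snapshot can be taken after 0, 1, 2, or 3 visible writes. */
        for (int cut = 0; cut <= 3; cut++) {
            int x = 0, y = 0;
            for (int i = 0; i < cut; i++) {
                if (perms[p][i] == W0_Y)
                    y = 1;
                else
                    x = 2;
            }
            if (!seen[x][y]) {
                seen[x][y] = true;
                distinct++;
            }
        }
    }
    return distinct;
}
```

Under these assumptions, the pair (0, 1) appears only when reordering is allowed, which is the kind of additional behavior that an analysis restricted to in-order writes would miss.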
To evaluate Portend’s effectiveness in ﬁnding bugs that may only
arise under Portend’s weak ordering, we ran the micro-benchmarks
with Portend’s SMCM plugin conﬁgured in two modes: sequential
consistency (Portend-seq) and Portend’s weak consistency (Portend-
weak). We provide the number of bugs found by each conﬁgura-
tion of Portend and also the percentage of possible execution states
that each conﬁguration covers if Portend’s weak consistency were as-
sumed. Note that ground truth (that is the total number of states
that can be covered under Portend’s weak consistency) in this case is
manually identiﬁed, because the number of possible states is small.
Effectively identifying this percentage for arbitrarily large programs
is undecidable.
We present the results in Table 12. As can be seen, Portend-weak
discovers the bugs that can only be discovered under Portend’s weak
consistency, whereas Portend-seq cannot find those bugs because of
its sequential consistency assumptions. A similar reasoning applies to
state exploration: Portend covers all the possible states that may arise
from returning multiple values at “read”s, whereas Cloud9 simply
returns the last value that was written to a memory location and
hence has lower coverage.
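The difference between returning the last written value and returning multiple values at a read can be sketched as follows. This is an illustration only, not the SMCM implementation: the write_history structure and both function names are hypothetical, and the sketch assumes that none of the recorded writes is ordered before the read by happens-before, so every recorded value remains a candidate.

```c
#include <stddef.h>

#define MAX_WRITES 8

/* Hypothetical per-location write history (illustrative only). */
struct write_history {
    int values[MAX_WRITES];  /* values in the order they were written */
    size_t count;            /* number of recorded writes */
};

/* Records a write; returns 0 on success, -1 if the history is full. */
int record_write(struct write_history *h, int value)
{
    if (h->count == MAX_WRITES)
        return -1;
    h->values[h->count++] = value;
    return 0;
}

/* Sequential-consistency style read: only the most recent write is visible.
 * Stores the candidate in out[0] and returns the number of candidates (0 or 1). */
size_t read_candidates_seq(const struct write_history *h, int *out)
{
    if (h->count == 0)
        return 0;
    out[0] = h->values[h->count - 1];
    return 1;
}

/* Weak-consistency style read: every recorded write is a candidate.
 * Assumption: no recorded write is ordered before the read by happens-before.
 * Stores up to cap candidates in out and returns how many were stored. */
size_t read_candidates_weak(const struct write_history *h, int *out, size_t cap)
{
    size_t n = h->count < cap ? h->count : cap;
    for (size_t i = 0; i < n; i++)
        out[i] = h->values[i];
    return n;
}
```

A multi-path analysis could fork one exploration state per candidate returned by the weak read, which is why the weak mode explores more states than the sequential one.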
                 Number of bugs                State coverage (%)
System           Portend-seq   Portend-weak    Portend-seq   Portend-weak
no-sync          0/0           0/0             50            100
no-sync-bug      0/1           1/0             50            100
sync             0/0           0/0             100           100
sync-bug         0/1           1/1             50            100
Table 12 – Portend’s effectiveness in bug finding and state coverage for two
memory model configurations: sequential memory consistency
mode and Portend’s weak memory consistency mode.

2. If writes are non-atomic, even a seemingly benign data race, where two threads
write the same value to a shared variable, may end up producing unexpected results.
Details of how this can happen can be found in [26].

We also evaluate the performance of Portend-weak for our
micro-benchmarks and compare its running time to that of Portend-seq. The
results of this comparison can be seen in Fig. 44. The running times of
the benchmarks under Portend-seq essentially represent the “native”
LLVM interpretation time. For the no-sync benchmark we can see
that the running time of Portend-weak is about 2 seconds more than
that of Portend-seq. This is expected as Portend-weak covers more states
compared to Portend-seq.
However, it should be noted that in the case of the no-sync-bug
benchmark, the running times are almost the same for Portend-weak
and Portend-seq (although not visible on the graph, the running time
of Portend-weak is slightly larger than that of Portend-seq, on the
order of a few milliseconds). This is simply due to the fact that the
bug in no-sync-bug is immediately discovered after the exploration
state has been forked in Portend. The bug is printed at the program
output, and the exploration ends for that particular state. Similar
reasoning applies to the other benchmark pair, namely sync and sync-
bug.
Figure 44 – Running time of Portend-weak and Portend-seq. (Bar chart of execution time [sec] for no-sync, no-sync-bug, sync, and sync-bug under Portend-seq and Portend-weak.)
6.3.7 Memory Consumption of Symbolic Memory Consistency Modeling
In this final section of the evaluation, we measure the peak
memory consumption of Portend-weak and Portend-seq for the
micro-benchmarks we have tested. The results can be seen in Fig. 45.
Memory consumption is higher under Portend-weak than under
Portend-seq for all the benchmarks. This is because, for all the
benchmarks, Portend-weak always forks off more states and/or
performs more bookkeeping than Portend-seq, even though it does
not always explore those states.
Although the memory consumption consistently increases for Portend-
weak, it does not increase proportionally with the state forking. This
is possible due to the copy-on-write mechanism employed for explor-
ing states and keeping track of the happens-before graph. However,
when we ran Portend-weak on real world programs, the memory con-
sumption quickly exceeded the memory capacity of the workstation
we used for our experiments. We plan on incorporating techniques
like partial order reduction [67] from model checking in order to re-
duce the number of states that SMCM needs to explore and improve
its scalability as part of future work.
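The reason copy-on-write keeps memory growth below the growth in forked states can be illustrated with a small sketch. It is not the mechanism used inside Portend or Cloud9; the page and state types and all function names are hypothetical, and a single fixed-size page stands in for a whole address space.

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_WORDS 64

/* Reference-counted memory page that forked states can share. */
struct page {
    int refs;
    int words[PAGE_WORDS];
};

/* An exploration state that holds a reference to one page (illustrative). */
struct state {
    struct page *mem;
};

/* Creates a state with a zeroed private page; mem is NULL on allocation failure. */
struct state state_new(void)
{
    struct state s;
    s.mem = calloc(1, sizeof *s.mem);
    if (s.mem)
        s.mem->refs = 1;
    return s;
}

/* Forking shares the parent's page: no data is copied at this point. */
struct state state_fork(struct state *parent)
{
    struct state child = { parent->mem };
    parent->mem->refs++;
    return child;
}

/* Writes one word, copying the page first only if another state shares it.
 * Returns 0 on success, -1 on a bad index or allocation failure. */
int state_write(struct state *s, size_t idx, int value)
{
    if (idx >= PAGE_WORDS)
        return -1;
    if (s->mem->refs > 1) {
        struct page *copy = malloc(sizeof *copy);
        if (!copy)
            return -1;
        memcpy(copy, s->mem, sizeof *copy);
        copy->refs = 1;
        s->mem->refs--;
        s->mem = copy;
    }
    s->mem->words[idx] = value;
    return 0;
}

/* Drops this state's reference and frees the page when it was the last one. */
void state_free(struct state *s)
{
    if (s->mem && --s->mem->refs == 0)
        free(s->mem);
    s->mem = NULL;
}
```

In this sketch a fork costs one pointer and a counter increment, and a full copy is paid only by states that actually write, so states that are forked but never explored add little memory.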
Figure 45 – Memory usage of Portend-weak and Portend-seq. (Bar chart of peak memory consumption [KBytes] for no-sync, no-sync-bug, sync, and sync-bug under Portend-seq and Portend-weak.)
In summary, Portend is able to classify with 99% accuracy and full
precision all the 93 data races into four data race classes deﬁned
in §5.1 in under 5 minutes per data race on average. Furthermore,
Portend correctly identiﬁes 6 serious harmful data races. Compared
to previously published data race classiﬁcation methods, Portend per-
forms more accurate classiﬁcation (92 out of 93, 99%) and is able
to correctly classify up to 89% more data races than existing replay-
based tools (9 out of 93, 10%). Portend also correctly handles false
positive data race reports. Furthermore, SMCM allows Portend to ac-
curately perform data race classiﬁcation under relaxed memory con-
sistency models, with low overhead.
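The percentages in this summary follow from the counts reported earlier, and the arithmetic can be checked with a short sketch. The helper names are hypothetical. Reading the 89% figure as the difference between the two rounded accuracies is an interpretation on our part, not something stated explicitly in the text.

```c
/* Percentage of correctly classified data races, rounded to the nearest integer. */
int accuracy_percent(int correct, int total)
{
    return (100 * correct + total / 2) / total;
}

/* Number of races a tool classified correctly, given its misclassification count. */
int correct_from_misclassified(int total, int misclassified)
{
    return total - misclassified;
}
```

With 93 data races, accuracy_percent(92, 93) gives 99 for Portend. The 84 misclassifications reported for Record/Replay-Analyzer leave correct_from_misclassified(93, 84) = 9 correct classifications, and accuracy_percent(9, 93) gives 10. The difference between the two rounded accuracies is 89.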
Part III
WRAPPING UP
In this ﬁnal part, we discuss ongoing and future work,
and we present concluding remarks.

7
ONGOING AND FUTURE WORK
Our effort to better understand concurrent programs is ongoing.
This section describes how various techniques we developed for root
cause diagnosis can be extended to improve the security of software
systems (7.1); how we can tackle privacy challenges of collaborative
approaches that work in production (7.2); how mixed static-dynamic
analysis can be used to expose deadlocks and other concurrency bugs
(7.3); and how our detection, root cause diagnosis and classiﬁcation
techniques can be applied to large-scale distributed systems (7.4).
7.1 enhancing security through path profiling
Gist allows gathering information about the control ﬂow of a pro-
gram, which can be used to enhance the security of a software system.
Prior work on root cause diagnosis showed that path proﬁles [10] em-
body richer execution information than mere branches, data values
or constraints on values. We speculate that we can use path proﬁles
to also enhance the security of real-world software.
The ﬁrst way in which we can improve security through path pro-
ﬁles is by speeding up the security auditing cycles for critical security
exploits like control ﬂow hijacks. We can ﬁrst build a proﬁle of se-
cure control ﬂows by monitoring multiple user executions. We can
then identify stray executions that deviate from secure control ﬂow
(à la [59] or using machine learning) and examine whether such exe-
cutions exhibit any violations of security properties.
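The two steps described above (building a profile from monitored executions, then flagging executions that deviate from it) can be sketched as a simple set-membership check. This is only one possible realization under strong simplifying assumptions: each executed path is assumed to be summarized by a single numeric identifier, and the path_profile structure and function names are hypothetical rather than part of Gist.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PATHS 128

/* Set of path identifiers observed in executions assumed to be secure. */
struct path_profile {
    unsigned long ids[MAX_PATHS];
    size_t count;
};

/* Returns true if the path identifier is already part of the profile. */
bool profile_contains(const struct path_profile *p, unsigned long id)
{
    for (size_t i = 0; i < p->count; i++)
        if (p->ids[i] == id)
            return true;
    return false;
}

/* Adds a path seen in a monitored execution; returns -1 if the profile is full. */
int profile_learn(struct path_profile *p, unsigned long id)
{
    if (profile_contains(p, id))
        return 0;
    if (p->count == MAX_PATHS)
        return -1;
    p->ids[p->count++] = id;
    return 0;
}

/* A stray execution follows a path that no profiled execution followed. */
bool is_stray(const struct path_profile *p, unsigned long id)
{
    return !profile_contains(p, id);
}
```

A stray path is only a candidate for auditing; whether it actually violates a security property still has to be examined separately, as the text notes.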
The second way in which we can use path proﬁles is a general-
ization of the idea in the previous paragraph. In particular, we are
interested in ﬁnding answers to the following questions as part of fu-
ture work: Can we identify good paths versus bad paths? Can we au-
tomatically infer properties about paths (e.g., performance behavior)
and help developers better structure their code based on such prop-
erties? What are the meaningful boundaries of programs to monitor
when gathering path information? Can we have intelligent strategies
to sample path behaviors of programs (e.g., strategies better than ran-
dom sampling)?
7.2 privacy implications of collaborative approaches
In this dissertation we introduce collaborative approaches for de-
tecting data races and for ﬁnding the root causes of failures. Both
approaches rely on gathering execution information from end users;
therefore, the information gathered from user endpoints may leak
sensitive user information.
We believe that privacy implications for data race detection (i.e.,
RaceMob) are minimal, because data race detection does not have
access to the actual data values of the variables involved in a data
race.
Root cause diagnosis (i.e., Gist) on the other hand, has access to
actual data values that it monitors as part of the static slice that it
tracks at user sites; therefore, it can potentially leak more private
information.
We believe that both RaceMob and Gist can beneﬁt from ways of
quantifying [196] and limiting the amount of execution information
extracted from user endpoints. It is also possible to forego monitoring
data values as part of root cause diagnosis: while this will improve
the level of privacy preservation, privacy can still leak through control
ﬂow events. One possible way to anonymize control ﬂow could be
through computing hashes describing the control ﬂow of a program.
However, it remains to be seen what are the right boundaries for
computing hashes of control ﬂow events. Effective computation of
execution hashes in the presence of nondeterminism is also an open
question.
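As a rough illustration of hashing control flow, the sketch below folds a sequence of branch outcomes into a 64-bit FNV-1a hash. The function name and the encoding of outcomes as one byte per branch are assumptions made for the example. FNV-1a is not a cryptographic hash, so this shows only the mechanics; a deployment that relies on hashing for privacy would likely need a keyed cryptographic hash, and the questions of hashing boundaries and nondeterminism raised above remain open.

```c
#include <stddef.h>
#include <stdint.h>

/* Summarizes a sequence of branch outcomes (one byte each, e.g., 0 = not taken,
 * 1 = taken) as a 64-bit FNV-1a hash. Illustrative only: FNV-1a is fast but
 * not cryptographically secure. */
uint64_t hash_control_flow(const unsigned char *outcomes, size_t len)
{
    uint64_t h = 14695981039346656037ULL;  /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= outcomes[i];
        h *= 1099511628211ULL;             /* FNV-1a 64-bit prime */
    }
    return h;
}
```

Two executions that take the same branches produce the same hash, so the endpoint could report the hash instead of the branch sequence itself. Any nondeterministic difference in the sequence changes the hash, which is one way to see why nondeterminism complicates this approach.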
7.3 exposing concurrency bugs
An effective strategy for detecting concurrency bugs is to expose
them by increasing the probability of their occurrence. In this disser-
tation we showed how RaceMob uses schedule steering to increase
the probability of manifestation of concurrency bugs (§3). Prior work
used various approaches for systematically exposing concurrency bugs
in the user space [161] and in the kernel space [69]. As a starting
point, we have looked at program transformations to alter a pro-
gram’s thread schedule to expose deadlock bugs [4].
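The idea of altering a thread schedule to expose a deadlock can be illustrated with a small sketch. It is not RaceMob's schedule steering nor the transformation used in [4]: the function names are hypothetical, a barrier stands in for the steering mechanism, and pthread_mutex_trylock is used so that the example reports the hazard instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t steer;  /* steering point shared by both threads */
static int blocked[2];           /* blocked[i] == 1: thread i could not take its second lock */

/* The order in which one thread acquires the two locks. */
struct lock_order {
    int id;
    pthread_mutex_t *first;
    pthread_mutex_t *second;
};

static void *worker(void *arg)
{
    struct lock_order *o = arg;

    pthread_mutex_lock(o->first);
    pthread_barrier_wait(&steer);  /* steer: both threads now hold their first lock */
    if (pthread_mutex_trylock(o->second) == 0)
        pthread_mutex_unlock(o->second);
    else
        blocked[o->id] = 1;        /* a blocking lock() would wait here */
    pthread_barrier_wait(&steer);  /* hold the first lock until both threads have tried */
    pthread_mutex_unlock(o->first);
    return NULL;
}

/* Runs two threads that take lock_a and lock_b in opposite orders.
 * Returns the number of threads that would have blocked (2 indicates a
 * deadlock with blocking locks), or -1 if the threads could not be set up. */
int expose_lock_inversion(void)
{
    struct lock_order o0 = {0, &lock_a, &lock_b};
    struct lock_order o1 = {1, &lock_b, &lock_a};
    pthread_t t0, t1;
    int t1_started;

    blocked[0] = blocked[1] = 0;
    if (pthread_barrier_init(&steer, NULL, 2) != 0)
        return -1;
    if (pthread_create(&t0, NULL, worker, &o0) != 0) {
        pthread_barrier_destroy(&steer);
        return -1;
    }
    t1_started = (pthread_create(&t1, NULL, worker, &o1) == 0);
    if (!t1_started)
        worker(&o1);  /* run in the calling thread so t0 is not left waiting */
    pthread_join(t0, NULL);
    if (t1_started)
        pthread_join(t1, NULL);
    pthread_barrier_destroy(&steer);
    return blocked[0] + blocked[1];
}
```

Without the steering barrier, the two threads would usually run one after the other and the inverted lock order would rarely manifest; with it, both threads hold their first lock before either tries the second, so the hazard shows up on every run.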
In the future, we would like to explore whether we can use legit-
imate compiler transformations to increase the likelihood of a pro-
gram to violate a speciﬁcation (e.g., cause a crash or a hang). Prior
work [207, 208] used static analysis to identify places in the code
where compilers leveraged undeﬁned behavior to their advantage
and unwittingly introduced bugs. We would like to explore whether
we can infer potential compiler transformations (i.e., not being per-
formed today) that might introduce bugs in the future.
7.4 concurrency in large-scale distributed systems
Finally, many of the problems we attacked in this dissertation have
equivalents in large-scale distributed systems like search engines, dis-
tributed databases, and social media platforms. For instance, data
races and atomicity violations manifest themselves as process-level
races [118] that cause correctness and performance problems as well
as resource leaks. We would like to adapt our techniques for the de-
tection and root cause diagnosis of concurrency bugs in the context
of large-scale distributed systems.

8
CONCLUSIONS
Concurrency bugs are some of the nastiest bugs that affect modern
software. As hardware becomes increasingly parallel, concurrency
bugs become more relevant. Concurrency bugs are hard to detect ef-
ﬁciently, because existing concurrency bug detection techniques rely
on heavyweight analyses. Even when concurrency bugs are detected,
it is challenging to understand which ones are truly harmful, and
how they can manifest during a program’s execution. Concurrency
bugs that rarely occur in production are even harder to tackle,
because detection, root cause diagnosis, and classification are even
more challenging in production.
In this dissertation, we develop techniques for the detection, root
cause diagnosis and classiﬁcation of in-production concurrency bugs.
In particular, this dissertation introduces:
— The ﬁrst highly-accurate data race detection technique that
can be used always-on in production. The key idea behind
this technique is to use a combination of in-house static analysis
and a new in-production dynamic analysis that is adaptive and
crowdsourced.
— Failure sketching, the ﬁrst in-production root cause diagno-
sis technique that does not rely on custom hardware or sys-
tem checkpointing infrastructure. The key idea behind failure
sketching is to combine in-house static analysis with lightweight
in-production dynamic analysis that gathers execution events
from user endpoints and build failure sketches that point devel-
opers to the root causes.
— The ﬁrst highly accurate data race classiﬁcation technique. The
key idea behind the approach is to explore multiple program
paths and schedules while observing the effects of data races
on programs’ externally visible outputs (rather than programs’
internal state) in order to perform classiﬁcation.
We built prototypes for all the techniques we introduce in this dis-
sertation and showed that our prototypes are effective, efﬁcient, accu-
rate and precise.
The techniques we develop also help developers reason about sub-
tle behaviors of concurrent programs and avoid concurrent program-
ming pitfalls. We believe that, in the future, our techniques can be
extended to better reason about large-scale distributed systems and
security properties of software systems.

BIBLIOGRAPHY
[1] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. “An
Evaluation of Similarity Coefﬁcients for Software Fault Local-
ization.” In: IEEE Paciﬁc Rim Intl. Symp. on Dependable Comput-
ing. 2006.
[2] Sarita Adve. “Data Races Are Evil with No Exceptions: Tech-
nical Perspective.” In: 2010.
[3] Rahul Agarwal, Amit Sasturkar, Liqiang Wang, and Scott D.
Stoller. “Optimized Run-time Race Detection and Atomicity
Checking Using Partial Discovered Types.” In: Intl. Conf. on
Automated Software Engineering. 2005.
[4] Ali Kheradmand, Baris Kasikci, and George Candea. “Lockout:
Efﬁcient Testing for Deadlock Bugs.” In: Workshop on Determin-
ism and Correctness in Parallel Programming. 2014.
[5] Glenn Ammons and James R. Larus. “Improving Data-ﬂow
Analysis with Path Proﬁles.” In: Intl. Conf. on Programming Lan-
guage Design and Implem. 1994.
[6] Apache httpd. http://httpd.apache.org. 2013.
[7] Cyrille Artho, Klaus Havelund, and Armin Biere.
“High-Level Data Races.” In: STVR. 2003.
[8] Joy Arulraj, Po-Chun Chang, Guoliang Jin, and Shan Lu. “Production-
run Software Failure Diagnosis via Hardware Performance
Counters.” In: Intl. Conf. on Architectural Support for Program-
ming Languages and Operating Systems. 2013.
[9] Joy Arulraj, Guoliang Jin, and Shan Lu. “Leveraging the Short-
term Memory of Hardware to Diagnose Production-run Soft-
ware Failures.” In: Intl. Conf. on Architectural Support for Pro-
gramming Languages and Operating Systems. 2014.
[10] Piramanayagam Arumuga Nainar and Ben Liblit. “Adaptive
Bug Isolation.” In: Intl. Conf. on Software Engineering. 2010.
[11] Mohamed Faouzi Atig, Ahmed Bouajjani, Sebastian Burckhardt,
and Madanlal Musuvathi. “On the Veriﬁcation Problem for
Weak Memory Models.” In: Symp. on Principles of Programming
Languages. 2010.
[12] Amittai Aviram, Shu-Chun Weng, Sen Hu, and Bryan Ford.
“Efﬁcient system-enforced deterministic parallelism.” In: Symp.
on Operating Sys. Design and Implem. 2010.
[13] Gogul Balakrishnan and Thomas Reps. “Analyzing Memory
Accesses in x86 Executables.” In: Intl. Conf. on Compiler Con-
struction. 2004.
[14] Utpal Banerjee, Brian Bliss, Zhiqiang Ma, and Paul Petersen.
“A Theory of Data Race Detection.” In: Proceedings of the
Workshop on Parallel and Distributed Systems: Testing and
Debugging. 2006.
[15] Baris Kasikci. Are "data races" and "race condition" actually the
same thing in context of concurrent programming.
http://stackoverflow.com/questions/11276259/are-data-races-and-race-condition-actually-the-same-thing-in-context-of-conc/. 2013.
[16] Baris Kasikci, Benjamin Schubert, and George Candea. Gist.
http://dslab.epfl.ch/proj/gist/. 2015.
[17] Rob von Behren, Jeremy Condit, Feng Zhou, George C. Nec-
ula, and Eric Brewer. “Capriccio: Scalable threads for Internet
services.” In: Symp. on Operating Systems Principles. 2003.
[18] Tom Bergan, Owen Anderson, Joseph Devietti, Luis Ceze, and
Dan Grossman. “CoreDet: a compiler and runtime system for
deterministic multithreaded execution.” In: Intl. Conf. on Archi-
tectural Support for Programming Languages and Operating Sys-
tems. 2010.
[19] Emery D. Berger, Ting Yang, Tongping Liu, and Gene No-
vark. “Grace: Safe Multithreaded Programming for C/C++.”
In: Conf. on Object-Oriented Programming, Systems, Languages,
and Applications. 2009.
[20] Al Bessey, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton,
Seth Hallem, Charles Henri-Gros, Asya Kamsky, Scott McPeak,
and Dawson Engler. “A Few Billion Lines of Code Later: Using
Static Analysis to Find Bugs in the Real World.” In: Commun.
ACM (2010).
[21] Swarnendu Biswas, Jipeng Huang, Aritra Sengupta, and Michael
D. Bond. “DoubleChecker: Efﬁcient Sound and Precise Atom-
icity Checking.” In: Intl. Conf. on Programming Language De-
sign and Implem. 2014.
[22] Swarnendu Biswas, Minjia Zhang, and Michael D. Bond. Light-
weight Data Race Detection for Production Runs. Tech. rep. OSU-
CICRC-1/15-TR01. Ohio State University, 2015.
[23] Burton H. Bloom. “Space/Time Trade-offs in Hash Coding
with Allowable Errors.” In: Commun. ACM (1970).
[24] Robert L. Bocchino Jr., Vikram S. Adve, Danny Dig, Sarita V.
Adve, Stephen Heumann, Rakesh Komuravelli, Jeffrey Over-
bey, Patrick Simmons, Hyojin Sung, and Mohsen Vakilian. “A
type and effect system for deterministic parallel Java.” In: Conf.
on Object-Oriented Programming, Systems, Languages, and Appli-
cations. 2009.
[25] Hans Boehm. Programming with Threads: Questions Frequently
Asked by C and C++ Programmers.
http://www.hboehm.info/c++mm/user-faq.html.
[26] Hans-J. Boehm. “How to miscompile programs with "benign"
data races.” In: USENIX Workshop on Hot Topics in Parallelism.
2011.
[27] Hans-J. Boehm. “Position paper: nondeterminism is unavoid-
able, but data races are pure evil.” In: ACM Workshop on Relax-
ing Synchronization for Multicore and Manycore Scalability. 2012.
[28] Hans-J. Boehm and Sarita V. Adve. “Foundations of the C++
concurrency memory model.” In: Intl. Conf. on Programming
Language Design and Implem. 2008.
[29] Hans-J. Boehm and Sarita V. Adve. “You Don’t Know Jack
About Shared Variables or Memory Models.” In: Commun. ACM
(2012).
[30] Michael D. Bond, Katherine E. Coons, and Kathryn S. McKin-
ley. “PACER: Proportional detection of data races.” In: Intl.
Conf. on Programming Language Design and Implem. 2010.
[31] Derek Bruening, Timothy Garnett, and Saman Amarasinghe.
“An Infrastructure for Adaptive Dynamic Optimization.” In:
Intl. Symp. on Code Generation and Optimization. 2003.
[32] Bsdiff. http://www.daemonology.net/bsdiff/. 2015.
[33] Stefan Bucur, Vlad Ureche, Cristian Zamﬁr, and George Can-
dea. “Parallel Symbolic Execution for Automated Real-World
Software Testing.” In: ACM EuroSys European Conf. on Com-
puter Systems. 2011.
[34] Sebastian Burckhardt, Rajeev Alur, and Milo M. K. Martin.
“Bounded Model Checking of Concurrent Data Types on Re-
laxed Memory Models: A Case Study.” In: Intl. Conf. on Com-
puter Aided Veriﬁcation. 2006.
[35] Cristian Cadar, Daniel Dunbar, and Dawson R. Engler. “KLEE:
Unassisted and Automatic Generation of High-Coverage Tests
for Complex Systems Programs.” In: Symp. on Operating Sys.
Design and Implem. 2008.
[36] Cristian Cadar, Vijay Ganesh, Peter M. Pawlowski, David L.
Dill, and Dawson R. Engler. “EXE: Automatically Generating
Inputs of Death.” In: Conf. on Computer and Communication Se-
curity. 2006.
[37] Subhachandra Chandra and Peter M. Chen. “Whither Generic
Recovery from Application Faults? A Fault Study Using Open-
Source Software.” In: Intl. Conf. on Dependable Systems and Net-
works. 2000.
[38] Haibo Chen, Jie Yu, Rong Chen, Binyu Zang, and Pen-Chung
Yew. “POLUS: A POwerful Live Updating System.” In: Intl.
Conf. on Software Engineering. 2007.
[39] Trishul M. Chilimbi, Ben Liblit, Krishna Mehra, Aditya V. Nori,
and Kapil Vaswani. “HOLMES: Effective Statistical Debugging
via Efﬁcient Path Proﬁling.” In: Intl. Conf. on Software Engineer-
ing. 2009.
[40] V. Chipounov and G. Candea. “Enabling sophisticated analy-
ses of x86 binaries with RevGen.” In: Intl. Conf. on Dependable
Systems and Networks. 2011.
[41] Vitaly Chipounov, Vlad Georgescu, Cristian Zamﬁr, and George
Candea. “Selective Symbolic Execution.” In: Workshop on Hot
Topics in Dependable Systems. 2009.
[42] Vitaly Chipounov, Volodymyr Kuznetsov, and George Candea.
“S2E: A Platform for In-Vivo Multi-Path Analysis of Software
Systems.” In: Intl. Conf. on Architectural Support for Program-
ming Languages and Operating Systems. 2011.
[43] Vitaly Chipounov, Volodymyr Kuznetsov, and George Can-
dea. “The S2E Platform: Design, Implementation, and Appli-
cations.” In: ACM Transactions on Computer Systems 30.1 (2012).
Special issue: Best papers of ASPLOS.
[44] Jong-Deok Choi, Keunwoo Lee, Alexey Loginov, Robert O’Callahan,
Vivek Sarkar, and Manu Sridharan. “Efﬁcient and precise datarace
detection for multithreaded object-oriented programs.” In: SIG-
PLAN Notices 37.5 (2002), pp. 258–269.
[45] Jong-Deok Choi and Andreas Zeller. “Isolating Failure-inducing
Thread Schedules.” In: Intl. Symp. on Software Testing and Analysis.
2002.
[46] Chris Lattner. libc++. http://libcxx.llvm.org/. 2012.
[47] Intel Corporation. Intel(R) Processor Trace Decoder Library.
https://github.com/01org/processor-trace. 2015.
[48] Heming Cui, Jingyue Wu, Chia che Tsai, and Junfeng Yang.
“Stable Deterministic Multithreading through Schedule Mem-
oization.” In: Symp. on Operating Sys. Design and Implem. 2010.
[49] CVE’s related to races.
http://www.cvedetails.com/vulnerability-list/cweid-362/vulnerabilities.html.
[50] Joseph Devietti, Brandon Lucia, Luis Ceze, and Mark Oskin.
“DMP: deterministic shared memory multiprocessing.” In: Intl.
Conf. on Architectural Support for Programming Languages and
Operating Systems. 2009.
[51] A. Dinning and E. Schonberg. “An Empirical Comparison of
Monitoring Algorithms for Access Anomaly Detection.” In:
Symp. on Principles and Practice of Parallel Computing. 1990.
[52] Anne Dinning and Edith Schonberg. “Detecting Access Anoma-
lies in Programs with Critical Sections.” In: ACM SIGPLAN
Not. (1991).
[53] DRD. http://valgrind.org/docs/manual/drd-manual.html.
2015.
[54] Michel Dubois, Christoph Scheurich, and Faye Briggs. “Mem-
ory Access Buffering in Multiprocessors.” In: Proc. 13th Ann.
Intl. Symp. on Computer Architecture (1986), pp. 374–442.
[55] Laura Efﬁnger-Dean, Brandon Lucia, Luis Ceze, Dan Gross-
man, and Hans-J. Boehm. “IFRit: interference-free regions for
dynamic data-race detection.” In: Conf. on Object-Oriented
Programming, Systems, Languages, and Applications. 2012.
[56] Tayfun Elmas, Shaz Qadeer, and Serdar Tasiran. “Goldilocks:
A race and transaction-aware Java runtime.” In: Intl. Conf. on
Programming Language Design and Implem. San Diego, Califor-
nia, USA, 2007.
[57] Tayfun Elmas, Shaz Qadeer, and Serdar Tasiran. “Goldilocks:
Efﬁciently Computing the Happens-Before Relation Using Lock-
sets.” In: Intl. Conf. on Runtime Veriﬁcation. 2006.
[58] Dawson Engler and Ken Ashcraft. “RacerX: Effective, Static
Detection of Race Conditions and Deadlocks.” In: Symp. on
Operating Systems Principles. 2003.
[59] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou,
and Benjamin Chelf. “Bugs as Deviant Behavior: A General
Approach to Inferring Errors in Systems Code.” In: Symp. on
Operating Systems Principles. 2001.
[60] Peter Eriksson. Parallel File Scanner. http://ostatic.com/pfscan.
2013.
[61] Brad Fitzpatrick. Memcached. http://memcached.org. 2013.
[62] Cormac Flanagan and Stephen N. Freund. “Adversarial mem-
ory for detecting destructive races.” In: Intl. Conf. on Program-
ming Language Design and Implem. 2010.
[63] Cormac Flanagan and Stephen N Freund. “Atomizer: A dy-
namic atomicity checker for multithreaded programs.” In: SIG-
PLAN Notices 39.1 (2004), pp. 256–267.
[64] Cormac Flanagan and Stephen N. Freund. “FastTrack: Efﬁ-
cient and precise dynamic race detection.” In: Intl. Conf. on
Programming Language Design and Implem. 2009.
[65] Cormac Flanagan and Stephen N. Freund. “Type-based Race
Detection for Java.” In: Intl. Conf. on Programming Language
Design and Implem. 2000.
[66] Cormac Flanagan, Stephen N. Freund, and Jaeheon Yi. “Velo-
drome: A sound and complete dynamic atomicity checker for
multithreaded programs.” In: Intl. Conf. on Programming Lan-
guage Design and Implem. 2008.
[67] Cormac Flanagan and Patrice Godefroid. “Dynamic partial-
order reduction for model checking software.” In: Symp. on
Principles of Programming Languages. 2005.
[68] Pedro Fonseca, Cheng Li, and Rodrigo Rodrigues. “Finding
complex concurrency bugs in large multi-threaded applica-
tions.” In: ACM EuroSys European Conf. on Computer Systems.
2011.
[69] Pedro Fonseca, Rodrigo Rodrigues, and Björn B. Brandenburg.
“SKI: exposing kernel concurrency bugs through systematic
schedule exploration.” In: Symp. on Operating Sys. Design
and Implem. 2014.
[70] Marco Galluzzi, Ramón Beivide, Valentin Puente, José-Ángel
Gregorio, Adrian Cristal, and Mateo Valero. “Evaluating Kilo-
instruction Multiprocessors.” In: Workshop on Memory Per-
formance Issues. 2004.
[71] Vijay Ganesh and David L. Dill. “A decision procedure for
bit-vectors and arrays.” In: Intl. Conf. on Computer Aided Veriﬁ-
cation. 2007.
[72] Jeff Gilchrist. Parallel BZIP2. http://compression.ca/pbzip2.
2013.
[73] Kirk Glerum, Kinshuman Kinshumann, Steve Greenberg, Gabriel
Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle,
and Galen Hunt. “Debugging in the (very) large: ten years of
implementation and experience.” In: Symp. on Operating Sys-
tems Principles. 2009.
[74] Patrice Godefroid, Michael Y. Levin, and David Molnar. “Au-
tomated Whitebox Fuzz Testing.” In: Network and Distributed
System Security Symp. 2008.
[75] Patrice Godefroid and Nachiappan Nagappan. “Concurrency
at Microsoft – An Exploratory Survey.” In: Intl. Conf. on Com-
puter Aided Veriﬁcation. 2008.
[76] Google Courgette. https://chromium.googlesource.com/chromium/
src/courgette/+/master.
[77] Jim Gray. Why do computers stop and what can be done about it?
Tech. rep. TR-85.7. Cupertino, CA: Tandem Computers, 1985.
[78] Hacking Starbucks for unlimited coffee. http://sakurity.com/
blog/2015/05/21/starbucks.html.
[79] Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom,
John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo
Wijaya, Christos Kozyrakis, and Kunle Olukotun. “Transac-
tional Memory Coherence and Consistency.” In: Intl. Symp.
on Computer Architecture. 2004.
[80] Steven Hand. “An experiment in determinism.” In: Communi-
cations of the ACM (2012).
[81] Matthias Hauswirth and Trishul M. Chilimbi. “Low-overhead
Memory Leak Detection Using Adaptive Statistical Proﬁling.”
In: Intl. Conf. on Architectural Support for Programming Languages
and Operating Systems. 2004.
[82] Helgrind. http://valgrind.org/docs/manual/hg-manual.html.
2012.
[83] Maurice Herlihy and J. Eliot B. Moss. “Transactional memory:
Architectural support for lock-free data structures.” In: Intl.
Symp. on Computer Architecture. 1993.
[84] Maurice P. Herlihy and Jeannette M. Wing. “Linearizability: A
Correctness Condition for Concurrent Objects.” In: TOPLAS
(1990).
[85] C. A. R. Hoare. “Monitors: An Operating System Structuring
Concept.” In: Communications of the ACM 17.10 (1974).
[86] Jeff Huang, Patrick O’Neil Meredith, and Grigore Rosu. “Max-
imal Sound Predictive Race Detection with Control Flow Ab-
straction.” In: SIGPLAN Not. (2014).
[87] IEEE. “1003.1 Standard for Information Technology Portable
Operating System Interface (POSIX) Rationale (Informative).”
In: IEEE Std 1003.1-2001. Rationale (Informative) (2001).
[88] Intel. Intel 64 and IA-32 Architectures Software Developer’s Man-
ual. Vol. 2. 325383-038US. 2015.
[89] Intel Corp. Parallel Inspector. http://software.intel.com/en-us/articles/intel-parallel-inspector.
2012.
[90] Intel Corporation. Intel Processor Trace. https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing.
2013.
[91] Intel TSX. https://software.intel.com/en-us/tags/20581.
2015.
[92] ISO/IEC 14882:2011: Information technology – Programming lan-
guages – C++. International Organization for Standardization.
2011.
[93] ISO/IEC 9899:2011: Information technology – Programming lan-
guages – C. International Organization for Standardization. 2011.
[94] Nicholas Jalbert, Cristiano Pereira, Gilles Pokam, and Koushik
Sen. “RADBench: A Concurrency Bug Benchmark Suite.” In:
USENIX Workshop on Hot Topics in Parallelism. 2011.
[95] Ali Jannesari and Walter F. Tichy. “Identifying Ad-hoc Syn-
chronization for Enhanced Race Detection.” In: Intl. Parallel
and Distributed Processing Symp. 2010.
[96] Java Synchronized Methods. https://docs.oracle.com/javase/
tutorial/essential/concurrency/syncmeth.html.
[97] Jiaqi Zhang, Weiwei Xiong, Yang Liu, Soyeon Park, Yuanyuan
Zhou, and Zhiqiang Ma. “ATDetector: Improving the Accu-
racy of a Commercial Data Race Detector by Identifying Ad-
dress Transfer.” In: IEEE/ACM International Symposium on Mi-
croarchitecture. 2011.
[98] Guoliang Jin, Aditya Thakur, Ben Liblit, and Shan Lu. “In-
strumentation and sampling strategies for cooperative concur-
rency bug isolation.” In: SIGPLAN Not. (2010).
[99] John Criswell. The Information Flow Compiler. https://llvm.
org/svn/llvm-project/giri/. 2011.
[100] John Erickson, Madanlal Musuvathi, Sebastian Burckhardt, and
Kirk Olynyk. “Effective Data-Race Detection for the Kernel.”
In: Symp. on Operating Sys. Design and Implem. 2010.
[101] John Regehr. Race Condition vs. Data Race. http://blog.regehr.
org/archives/490. 2011.
[102] James A. Jones and Mary Jean Harrold. “Empirical Evaluation
of the Tarantula Automatic Fault-localization Technique.” In:
Intl. Conf. on Automated Software Engineering. 2005.
[103] Vineet Kahlon, Franjo Ivančić, and Aarti Gupta. “Reasoning
About Threads Communicating via Locks.” In: Intl. Conf. on
Computer Aided Veriﬁcation. 2005.
[104] Vineet Kahlon, Nishant Sinha, Erik Kruus, and Yun Zhang.
“Static Data Race Detection for Concurrent Programs with Asyn-
chronous Calls.” In: FSE. 2009.
[105] Vineet Kahlon, Yu Yang, Sriram Sankaranarayanan, and Aarti
Gupta. “Fast and Accurate Static Data-race Detection for Con-
current Programs.” In: CAV. 2007.
[106] Baris Kasikci, Thomas Ball, George Candea, John Erickson,
and Madanlal Musuvathi. “Efﬁcient Tracing of Cold Code Via
Bias-Free Sampling.” In: USENIX Annual Technical Conf. 2014.
[107] Baris Kasikci, Cristiano Pereira, Gilles Pokam, Benjamin Schu-
bert, Madan Musuvathi, and George Candea. “Failure Sketches:
A Better Way to Debug.” In: Workshop on Hot Topics in Operat-
ing Systems. 2015.
[108] Baris Kasikci, Benjamin Schubert, Cristiano Pereira, Gilles Pokam,
and George Candea. “Failure Sketching: A Technique for Au-
tomated Root Cause Diagnosis of In-Production Failures.” In:
Symp. on Operating Systems Principles. 2015.
[109] Baris Kasikci, Cristian Zamﬁr, and George Candea. “Automated
Classiﬁcation of Data Races Under Both Strong and Weak Mem-
ory Models.” In: ACM Transactions on Programming Languages
and Systems 37.3 (2015).
[110] Baris Kasikci, Cristian Zamﬁr, and George Candea. “Data Races
vs. Data Race Bugs: Telling the Difference with Portend.” In:
Intl. Conf. on Architectural Support for Programming Languages
and Operating Systems. 2012.
[111] Baris Kasikci, Cristian Zamﬁr, and George Candea. “RaceMob:
Crowdsourced Data Race Detection.” In: Symp. on Operating
Systems Principles. 2013.
[112] M. G. Kendall. “A New Measure of Rank Correlation.” In:
Biometrika (1938).
[113] James C. King. “A new approach to program testing.” In: Intl.
Conf. on Reliable Software. 1975.
[114] James C. King. “Symbolic execution and program testing.” In:
Communications of the ACM (1976).
[115] Andi Kleen. Announcing simple-pt - A simple Processor Trace im-
plementation. http://halobates.de/blog/p/344. 2015.
[116] Andi Kleen. simple-pt Linux driver. https://github.com/andikleen/simple-pt.
2015.
[117] Volodymyr Kuznetsov, Johannes Kinder, Stefan Bucur, and
George Candea. “Efﬁcient state merging in symbolic execu-
tion.” In: Intl. Conf. on Programming Language Design and Im-
plem. 2012.
[118] Oren Laadan, Nicolas Viennot, Chia-Che Tsai, Chris Blinn, Jun-
feng Yang, and Jason Nieh. “Pervasive detection of process
races in deployed systems.” In: Symp. on Operating Systems
Principles. 2011.
[119] Leslie Lamport. “Time, clocks, and the ordering of events in a
distributed system.” In: Communications of the ACM 21.7 (1978).
[120] Butler W. Lampson and David D. Redell. “Experience with
Processes and Monitors in Mesa.” In: Communications of the
ACM 23.2 (1980).
[121] Chris Lattner. “Macroscopic Data Structure Analysis and Opti-
mization.” PhD thesis. University of Illinois at Urbana-Champaign,
May 2005.
[122] Chris Lattner and Vikram Adve. “LLVM: A Compilation Frame-
work for Lifelong Program Analysis and Transformation.” In:
Intl. Symp. on Code Generation and Optimization. 2004.
[123] Dongyoon Lee, Peter M. Chen, Jason Flinn, and Satish Narayanasamy.
“Chimera: Hybrid program analysis for determinism.” In: Intl.
Conf. on Programming Language Design and Implem. 2012.
[124] Nancy G. Leveson and Clark S. Turner. “An Investigation of
the Therac-25 Accidents.” In: IEEE Computer (1993).
[125] Ben Liblit, Alex Aiken, Alice X. Zheng, and Michael I. Jordan.
“Bug isolation via remote program sampling.” In: Intl. Conf. on
Programming Language Design and Implem. 2003.
[126] Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael
I. Jordan. “Scalable Statistical Bug Isolation.” In: Intl. Conf. on
Programming Language Design and Implem. 2005.
[127] Benjamin Robert Liblit. “Cooperative Bug Isolation.” PhD the-
sis. University of California, Berkeley, Dec. 2004.
[128] Linux branch with Intel PT support. https://github.com/virtuoso/
linux-perf/tree/intel_pt. 2015.
[129] Richard J. Lipton. “Reduction: A Method of Proving Properties
of Parallel Programs.” In: Commun. ACM (1975).
[130] Tongping Liu, Charlie Curtsinger, and Emery D. Berger. “Dthreads:
efﬁcient deterministic multithreading.” In: Symp. on Operating
Systems Principles. 2011.
[131] Shan Lu. “Understanding, Detecting and Exposing Concur-
rency Bugs.” PhD thesis. UIUC, 2008.
[132] Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. “Learn-
ing from Mistakes – A Comprehensive Study on Real World
Concurrency Bug Characteristics.” In: Intl. Conf. on Architec-
tural Support for Programming Languages and Operating Systems.
2008.
[133] Shan Lu, Joseph Tucek, Feng Qin, and Yuanyuan Zhou. “AVIO:
Detecting Atomicity Violations via Access Interleaving Invari-
ants.” In: Intl. Conf. on Architectural Support for Program-
ming Languages and Operating Systems. 2006.
[134] Brandon Lucia, Joseph Devietti, Karin Strauss, and Luis Ceze.
“Atom-Aid: Detecting and Surviving Atomicity Violations.” In:
Intl. Symp. on Computer Architecture. 2008.
[135] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Ar-
tur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi,
and Kim Hazelwood. “PIN: building customized program analysis
tools with dynamic instrumentation.” In: Intl. Conf. on Program-
ming Language Design and Implem. 2005.
[136] Nuno Machado, Brandon Lucia, and Luís Rodrigues. “Concur-
rency Debugging with Differential Schedule Projections.” In:
Intl. Conf. on Programming Language Design and Implem. 2015.
[137] Jeremy Manson, William Pugh, and Sarita V. Adve. “The Java
memory model.” In: Symp. on Principles of Programming Lan-
guages. 2005.
[138] Jeremy Manson, William Pugh, and Sarita V. Adve. “The Java
Memory Model.” In: Symp. on Principles of Programming
Languages. 2005.
[139] Daniel Marino, Madanlal Musuvathi, and Satish Narayanasamy.
“LiteRace: Effective sampling for lightweight data-race detec-
tion.” In: Intl. Conf. on Programming Language Design and Im-
plem. 2009.
[140] Daniel Marjamäki. Cppcheck. http://cppcheck.sourceforge.
net/. 2015.
[141] Cal McPherson. Ctrace. http://ctrace.sourceforge.net. 2012.
[142] John Mellor-Crummey. “On-the-ﬂy detection of data races for
programs with nested fork-join parallelism.” In: Supercomput-
ing. 1991.
[143] Memcached issue 127. http://code.google.com/p/memcached/
issues/detail?id=127.
[144] Scott Meyers and Andrei Alexandrescu. C++ and The Perils of
Double-Checked Locking. http://www.drdobbs.com/184405772.
[145] Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jef-
frey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Kr-
ishna Kunchithapadam, and Tia Newhall. “The Paradyn Par-
allel Performance Measurement Tool.” In: Computer (1995).
[146] Madanlal Musuvathi, Sebastian Burckhardt, Pravesh Kothari,
and Santosh Nagarakatte. “A Randomized Scheduler with Prob-
abilistic Guarantees of Finding Bugs.” In: Intl. Conf. on Architec-
tural Support for Programming Languages and Operating Systems.
2010.
[147] Madanlal Musuvathi, Shaz Qadeer, Thomas Ball, Gérard Basler,
Piramanayagam Arumuga Nainar, and Iulian Neamtiu. “Find-
ing and Reproducing Heisenbugs in Concurrent Programs.”
In: Symp. on Operating Sys. Design and Implem. 2008.
[148] Abdullah Muzahid, Dario Suárez, Shanxiang Qi, and Josep
Torrellas. “SigRace: Signature-based Data Race Detection.” In:
Intl. Symp. on Computer Architecture. 2009.
[149] Mayur Naik and Alex Aiken. “Conditional Must Not Aliasing
for Static Race Detection.” In: SIGPLAN Not. (2007).
[150] Mayur Naik, Alex Aiken, and John Whaley. “Effective static
race detection for Java.” In: Intl. Conf. on Programming Language
Design and Implem. 2006.
[151] Mayur Naik, Alex Aiken, and John Whaley. “Effective static
race detection for Java.” In: Intl. Conf. on Programming Language
Design and Implem. 2006.
[152] Satish Narayanasamy, Zhenghao Wang, Jordan Tigani, Andrew
Edwards, and Brad Calder. “Automatically classifying benign
and harmful data races using replay analysis.” In: Intl. Conf.
on Programming Language Design and Implem. (2007).
[153] George C. Necula, Scott McPeak, S.P. Rahul, and Westley Weimer.
“CIL: Intermediate Language and Tools for Analysis and Trans-
formation of C Programs.” In: Intl. Conf. on Compiler Construc-
tion. 2002.
[154] Robert H. B. Netzer and Barton P. Miller. “What are race con-
ditions?: Some issues and formalizations.” In: ACM Letters on
Programming Languages and Systems (1992).
[155] Hiroyasu Nishiyama. “Detecting Data Races Using Dynamic
Escape Analysis Based on Read Barrier.” In: Conference on
Virtual Machine Research And Technology Symposium. 2004.
[156] Gene Novark, Emery D. Berger, and Benjamin G. Zorn. “Ex-
terminator: Automatically correcting memory errors with high
probability.” In: Intl. Conf. on Programming Language Design and
Implem. 2007.
[157] Robert O’Callahan and Jong-Deok Choi. “Hybrid dynamic data
race detection.” In: Symp. on Principles and Practice of Parallel
Computing. 2003.
[158] Christos H. Papadimitriou. “The Serializability of Concurrent
Database Updates.” In: Journal of the ACM (1979).
[159] Mark S. Papamarcos and Janak H. Patel. “A Low-overhead Co-
herence Solution for Multiprocessors with Private Cache Mem-
ories.” In: Intl. Symp. on Computer Architecture. 1984.
[160] Chang-Seo Park and Koushik Sen. “Randomized Active Atom-
icity Violation Detection in Concurrent Programs.” In: Symp.
on the Foundations of Software Eng. 2008.
[161] Soyeon Park, Shan Lu, and Yuanyuan Zhou. “CTrigger: Ex-
posing Atomicity Violation Bugs from Their Hiding Places.”
In: Intl. Conf. on Architectural Support for Programming Lan-
guages and Operating Systems. 2009.
[162] Soyeon Park, Weiwei Xiong, Zuoning Yin, Rini Kaushik, Kyu
H. Lee, Shan Lu, and Yuanyuan Zhou. “Do You Have to Re-
produce the Bug at the First Replay Attempt? – PRES: Proba-
bilistic Replay with Execution Sketching on Multiprocessors.”
In: Symp. on Operating Systems Principles. 2009.
[163] Jeff H. Perkins, Sunghun Kim, Sam Larsen, Saman Amaras-
inghe, Jonathan Bachrach, Michael Carbin, Carlos Pacheco, Frank
Sherwood, Stelios Sidiroglou, Greg Sullivan, Weng-Fai Wong,
Yoav Zibin, Michael D. Ernst, and Martin Rinard. “Automati-
cally Patching Errors in Deployed Software.” In: Symp. on Op-
erating Sys. Design and Implem. 2010.
[164] Eli Pozniansky and Assaf Schuster. “Efﬁcient on-the-ﬂy data
race detection in multithreaded C++ programs.” In: Symp. on
Principles and Practice of Parallel Computing. 2003.
[165] Eli Pozniansky and Assaf Schuster. “MultiRace: Efﬁcient On-
the-ﬂy Data Race Detection in Multithreaded C++ Programs:
Research Articles.” In: Concurrency and Computation: Practice
and Experience (2007).
[166] Polyvios Pratikakis, Jeffrey S. Foster, and Michael Hicks. “LOCK-
SMITH: context-sensitive correlation analysis for race detec-
tion.” In: Intl. Conf. on Programming Language Design and Im-
plem. 2006.
[167] Christoph von Praun and Thomas R. Gross. “Object Race De-
tection.” In: SIGPLAN Not. (2001).
[168] Christoph von Praun and Thomas R. Gross. “Static Conﬂict
Analysis for Multi-threaded Object-oriented Programs.” In: Intl.
Conf. on Programming Language Design and Implem. 2003.
[169] Christoph von Praun and Thomas R. Gross. “Static Detection
of Atomicity Violations in Object-Oriented Programs.” In: Jour-
nal of Object Technology 3.6 (2004), pp. 103–122.
[170] Feng Qin, Joseph Tucek, Yuanyuan Zhou, and Jagadeesan Sun-
daresan. “Rx: Treating bugs as allergies – a safe method to
survive software failures.” In: ACM Transactions on Computer
Systems 25.3 (2007).
[171] Quora. What is a coder’s worst nightmare? http://www.quora.
com/What-is-a-coders-worst-nightmare.
[172] Ravi Rajwar and James R. Goodman. “Speculative Lock Eli-
sion: Enabling Highly Concurrent Multithreaded Execution.”
In: IEEE/ACM International Symposium on Microarchitecture.
2001.
[173] Rastislav Bodík and Sadun Anik. “Path-sensitive value-flow analysis.”
In: Symp. on Principles of Programming Languages. 1998.
[174] David D. Redell, Yogen K. Dalal, Thomas R. Horsley, Hugh C.
Lauer, William C. Lynch, Paul R. McJones, Hal G. Murray, and
Stephen C. Purcell. “Pilot: An Operating System for a Personal
Computer.” In: Comm. of the ACM (1980).
[175] Mozilla Research. Rust Programming Language. https://www.
rust-lang.org/.
[176] C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann,
1979.
[177] Martin C. Rinard and Monica S. Lam. “The Design, Implemen-
tation, and Evaluation of Jade.” In: ACM Trans. Program. Lang.
Syst. (1998).
[178] Caitlin Sadowski and Jaeheon Yi. “How Developers Use Data
Race Detection Tools.” In: Proceedings of the 5th Workshop on
Evaluation and Usability of Programming Languages and Tools. PLATEAU.
2014.
[179] Swarup Kumar Sahoo, John Criswell, and Vikram Adve. “An
Empirical Study of Reported Bugs in Server Software with Im-
plications for Automated Bug Diagnosis.” In: Intl. Conf. on
Software Engineering. 2010.
[180] Swarup Kumar Sahoo, John Criswell, Chase Geigle, and Vikram
Adve. “Using Likely Invariants for Automated Software Fault
Localization.” In: Intl. Conf. on Architectural Support for Programming Languages
and Operating Systems. 2013.
[181] Amit Sasturkar, Rahul Agarwal, Liqiang Wang, and Scott D.
Stoller. “Automated Type-based Analysis of Data Races and
Atomicity.” In: Symp. on Principles and Practice of Parallel Com-
puting. 2005.
[182] Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobal-
varro, and Thomas Anderson. “Eraser: A dynamic data race
detector for multithreaded programs.” In: ACM Transactions
on Computer Systems 15.4 (1997).
[183] D. Schonberg. “On-the-ﬂy Detection of Access Anomalies.” In:
SIGPLAN Not. (1989).
[184] Edith Schonberg. “On-the-ﬂy detection of access anomalies
(with retrospective).” In: SIGPLAN Notices 39.4 (2004).
[185] Koushik Sen. “Race directed random testing of concurrent pro-
grams.” In: Intl. Conf. on Programming Language Design and Im-
plem. (2008).
[186] Koushik Sen, Darko Marinov, and Gul Agha. “CUTE: a con-
colic unit testing engine for C.” In: Symp. on the Foundations of
Software Eng. 2005.
[187] Konstantin Serebryany and Timur Iskhodzhanov. “ThreadSan-
itizer - Data race detection in practice.” In: Workshop on Binary
Instrumentation and Applications. 2009.
[188] Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta.
SPLASH: Stanford Parallel Applications for Shared Memory. Tech.
rep. CSL-TR-92-526. Stanford University Computer Systems
Laboratory, 1992.
[189] Richard L. Sites, ed. Alpha architecture reference manual. 1992.
[190] Jiri Slaby. LLVM Slicer. https://github.com/jirislaby/LLVMSlicer/.
2014.
[191] Yannis Smaragdakis, Jacob Evans, Caitlin Sadowski, Jaeheon
Yi, and Cormac Flanagan. “Sound Predictive Race Detection
in Polynomial Time.” In: Symp. on Principles of Programming
Languages. 2012.
[192] SQLite. http://www.sqlite.org/. 2013.
[193] Daniel Stenberg. Curl. http://curl.haxx.se/. 2015.
[194] Daniel Stenberg. Curl bug 965. http://sourceforge.net/p/curl/
bugs/965/. 2013.
[195] Bill Stoddard. Apache bug 21287. https://bz.apache.org/bugzilla/show_bug.cgi?id=21287.
2003.
[196] Latanya Sweeney. “K-Anonymity: A Model for Protecting Pri-
vacy.” In: Intl. Journal on Uncertainty, Fuzziness and Knowledge-
based Systems. 2002.
[197] Takamitsu Tahara, Katsuhiko Gondow, and Seiya Ohsuga. “DRAC-
ULA: Detector of Data Races in Signals Handlers.” In: Asia-
Paciﬁc Software Engineering Conference. 2008.
[198] The Associated Press. General Electric Acknowledges Northeast-
ern Blackout Bug. http://www.securityfocus.com/news/8032.
Feb. 12, 2004.
[199] William Thies, Michal Karczmarek, and Saman P. Amarasinghe.
“StreamIt: A Language for Streaming Applications.” In: CC.
2002.
[200] Chen Tian, Vijay Nagarajan, Rajiv Gupta, and Sriraman Tal-
lam. “Dynamic recognition of synchronization operations for
improved data race detection.” In: Intl. Symp. on Software Test-
ing and Analysis. 2008.
[201] TIOBE Programming Community Index. http://www.tiobe.com/
tiobe_index/. Nov. 2004.
[202] Tom Bergan, Joseph Devietti, Nicholas Hunt, and Luis Ceze.
“The Deterministic Execution Hammer: How Well Does it Ac-
tually Pound Nails?” In: Workshop on Determinism and Correct-
ness in Parallel Programming. 2011.
[203] Transmission. http://www.transmissionbt.com/. 2015.
[204] Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos, and
Yuanyuan Zhou. “Triage: diagnosing production run failures
at the user’s site.” In: Symp. on Operating Systems Principles.
2007.
[205] Kaushik Veeraraghavan, Peter M. Chen, Jason Flinn, and Satish
Narayanasamy. “Detecting and surviving data races using com-
plementary schedules.” In: Symp. on Operating Systems Princi-
ples. 2011.
[206] Jan Wen Voung, Ranjit Jhala, and Sorin Lerner. “RELAY: Static
race detection on millions of lines of code.” In: Symp. on the
Foundations of Software Eng. 2007.
[207] Xi Wang, Nickolai Zeldovich, M. Frans Kaashoek, and Ar-
mando Solar-Lezama. “A Differential Approach to Undeﬁned
Behavior Detection.” In: ACM Transactions on Computer Systems
(2015).
[208] Xi Wang, Nickolai Zeldovich, M. Frans Kaashoek, and Ar-
mando Solar-Lezama. “Towards Optimization-safe Systems: An-
alyzing the Impact of Undeﬁned Behavior.” In: Symp. on Op-
erating Systems Principles. 2013.
[209] Yan Wang, Harish Patil, Cristiano Pereira, Gregory Lueck, Ra-
jiv Gupta, and Iulian Neamtiu. “DrDebug: Deterministic Re-
play Based Cyclic Debugging with Dynamic Slicing.” In: Intl.
Symp. on Code Generation and Optimization. 2014.
[210] David L. Weaver and Tom Germond, eds. The SPARC Architec-
ture Manual, Version 9. 1994.
[211] Josef Weidendorfer. KCachegrind. http://kcachegrind.sourceforge.
net/html/Home.html. 2015.
[212] Weining Gu, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Zhen-
Yu Yang. Characterization of Linux Kernel Behavior under Errors.
2003.
[213] Mark Weiser. “Program slicing.” In: Intl. Conf. on Software En-
gineering. 1981.
[214] David Wheeler. SLOCCount. http://www.dwheeler.com/sloccount/.
2010.
[215] P. F. Wilson, L. D. Dell, and G. F. Anderson. Root Cause Analysis:
A Tool for Total Quality Management. American Society for
Quality, 1993.
[216] Robert P. Wilson and Monica S. Lam. “Efﬁcient context-sensitive
pointer analysis for C programs.” In: Intl. Conf. on Programming
Language Design and Implem. La Jolla, CA, 1995.
[217] Windows Process and Thread Functions. https://msdn.microsoft.
com/en-us/library/windows/desktop/ms684847(v=vs.85)
.aspx.
[218] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder
Pal Singh, and Anoop Gupta. “The SPLASH-2 programs: char-
acterization and methodological considerations.” In: Intl. Symp.
on Computer Architecture (1995).
[219] Jingyue Wu, Heming Cui, and Junfeng Yang. “Bypassing races
in live applications with execution ﬁlters.” In: Symp. on Oper-
ating Sys. Design and Implem. 2010.
[220] Weiwei Xiong, Soyeon Park, Jiaqi Zhang, Yuanyuan Zhou, and
Zhiqiang Ma. “Ad-Hoc Synchronization Considered Harmful.”
In: Symp. on Operating Sys. Design and Implem. 2010.
[221] Min Xu, Rastislav Bodík, and Mark D. Hill. “A Serializability
Violation Detector for Shared-memory Server Programs.” In:
Intl. Conf. on Programming Language Design and Implem.
2005.
[222] Junfeng Yang, Ang Cui, Sal Stolfo, and Simha Sethumadhavan.
“Concurrency Attacks.” In: USENIX Workshop on Hot Topics
in Parallelism. 2012.
[223] Yu Yang, Xiaofang Chen, Ganesh Gopalakrishnan, and Robert
M. Kirby. “Distributed dynamic partial order reduction based
veriﬁcation of threaded software.” In: Intl. SPIN Workshop. 2007.
[224] Jie Yu and Satish Narayanasamy. “A case for an interleaving
constrained shared-memory multi-processor.” In: Intl. Symp.
on Computer Architecture. Austin, TX, USA, 2009.
[225] Jie Yu and Satish Narayanasamy. “A Case for an Interleaving
Constrained Shared-Memory Multi-Processor.” In: Intl. Symp.
on Computer Architecture. 2009.
[226] Yuan Yu, Tom Rodeheffer, and Wei Chen. “RaceTrack: Efﬁcient
detection of data race conditions via adaptive tracking.” In:
Symp. on Operating Systems Principles. 2005.
[227] Ding Yuan, Haohui Mai, Weiwei Xiong, Lin Tan, Yuanyuan
Zhou, and Shankar Pasupathy. “SherLog: error diagnosis by
connecting clues from run-time logs.” In: Intl. Conf. on Architec-
tural Support for Programming Languages and Operating Systems.
2010.
[228] Cristian Zamﬁr, Gautam Altekar, George Candea, and Ion Sto-
ica. “Debug Determinism: The Sweet Spot for Replay-Based
Debugging.” In: Workshop on Hot Topics in Operating Systems.
2011.
[229] Cristian Zamﬁr and George Candea. “Execution Synthesis: A
Technique for Automated Debugging.” In: ACM EuroSys Euro-
pean Conf. on Computer Systems. 2010.
[230] Cristian Zamﬁr, Baris Kasikci, Johannes Kinder, Edouard Bugnion,
and George Candea. “Automated Debugging for Arbitrarily
Long Executions.” In: Workshop on Hot Topics in Operating Sys-
tems. 2013.
[231] Andreas Zeller and Ralf Hildebrandt. “Simplifying and Isolat-
ing Failure-Inducing Input.” In: IEEE Transactions on Software
Engineering (2002).
[232] Wei Zhang, Junghee Lim, Ramya Olichandran, Joel Scherpelz,
Guoliang Jin, Shan Lu, and Thomas Reps. “ConSeq: Detecting
Concurrency Bugs through Sequential Errors.” In: Intl. Conf.
on Architectural Support for Programming Languages and Operat-
ing Systems. 2011.
[233] Pin Zhou, Radu Teodorescu, and Yuanyuan Zhou. “HARD:
Hardware Assisted Lockset-based Race Detection.” In: Inter-
national Symposium on High-Performance Computer Architecture.
2007.
Baris Kasikci
Research Assistant, Ph.D. Candidate
Ecole Polytechnique Fédérale de Lausanne (EPFL)
EPFL - IC - DSLAB
Station 14, Office INN-321
1015 Lausanne, Switzerland
+41 (78) 707 19 13
baris.kasikci@epfl.ch
http://www.bariskasikci.org
Research Interests
My research is centered around developing techniques, tools and environments that help build more reliable
and secure software. I am interested in ﬁnding solutions that allow programmers to better reason about
their code, and that efficiently detect bugs, classify them, and diagnose their root cause. I especially focus
on bugs that manifest in production, because they are hard and time-consuming to diagnose. I am also interested in
eﬃcient runtime instrumentation, hardware and runtime support for enhancing system security, and program
analysis under various memory models.
Education
Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne, Switzerland
Ph.D. in Computer Science Sep. 2010–present
Thesis: Techniques for Detection, Root Cause Diagnosis,
and Classiﬁcation of In-Production Concurrency Bugs
Advisor: Prof. George Candea
Middle East Technical University (METU) Ankara, Turkey
M.Sc. in Electrical and Electronics Engineering Sep. 2006–Jun. 2009
Thesis: Variability Modeling in Software Product Lines
Graduated with the top grade
Advisor: Prof. Semih Bilgen
B.Sc. in Electrical and Electronics Engineering Sep. 2002–Jun. 2006
Project: Embedded Target Estimation, Detection, and Tracking
Graduated with High Honors
Advisor: Prof. Arzu Koc
Awards and Honors
Intel Corp. Software and Services Group, Grant 2014–2016
VMware Inc., Doctoral Fellowship 2014–2015
EPFL, Doctoral Fellowship 2010–2011
Scientiﬁc and Technological Research Council of Turkey, Master Scholarship 2006–2008
Middle East Technical University, Dean’s High Honor List 2006
Middle East Technical University,
Award for Best Team Performance, Undergraduate Final Project 2006
Turkish Customs Association, Scholarship 2002–2006
Research and Work Experience
Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne, Switzerland
Research Assistant Sep. 2010–present
Research on software reliability with an emphasis on concurrent software
• I developed Gist, the ﬁrst technique for accurately, eﬃciently, and automatically diagnosing the root
causes of in-production failures, by using a combination of static and dynamic program analysis.
• I developed RaceMob, the ﬁrst automated in-production data race detection technique that can be
kept always-on and provides high accuracy, by combining static data race detection with adaptive,
crowdsourced dynamic data race detection.
• I developed Portend, a high-accuracy technique to classify data races according to their potential
consequences under arbitrary memory models, by using symbolic program analysis to explore multiple
program paths and schedules to determine the eﬀects of data races.
• I developed Bias-Free Sampling, a technique that allows eﬃcient sampling of rarely executed code
(where bugs often lurk) without over-sampling frequently executed code, by using a new sampling
algorithm and existing hardware support.
Intel Corp. Santa Clara, CA, USA
Research Intern Jul. 2015–Sep. 2015
Automated root cause diagnosis of failures and security enhancements using hardware support
• I developed a tool that allows developers to determine which program statements operate on a given
data type at runtime using a mix of static program analysis and hardware support. In our experiments,
this tool reduces the number of statements to examine during debugging by an order of magnitude.
This tool is being extended internally at Intel.
• I began investigating hardware support for enhancing system security, in particular, eﬃcient path
proﬁling for auditing and detecting control ﬂow hijack attacks.
VMware Inc. Palo Alto, CA, USA
Research and Development Intern Jun. 2014–Sep. 2014
Automated debugging and runtime control ﬂow tracking
• I implemented a runtime for eﬃcient control ﬂow tracking in software. This work formed the basis of
my Gist work on root cause diagnosis.
• I designed and implemented an infrastructure to remotely debug and proﬁle VMware VCenter virtual
machine management software, while it is running in production. This infrastructure is used by
VCenter developers at VMware.
Microsoft Research Redmond, WA, USA
Research Intern Jun. 2013–Sep. 2013
Eﬃcient runtime execution sampling technique and low overhead coverage measurement
• I worked on the design of the Bias-Free Sampling framework for eﬃcient runtime sampling. I
designed and implemented the bias-free sampling framework for managed code (i.e., C#). This tool
is internally used at Microsoft.
• I designed and implemented a fault injection tool to detect resource leakage problems using dynamic
binary instrumentation.
Siemens Corporate Technology Istanbul, Turkey
Senior Software Engineer Mar. 2008–May 2010
Embedded home and industrial automation software
• I designed and implemented real-time embedded gateway software between Siemens communication
processors and a building automation system using C++ on top of VxWorks.
Aselsan Electronic Industries Ankara, Turkey
Software Engineer May 2006–Mar. 2008
Embedded motor control and functional testing infrastructure
• I was responsible for real-time embedded control software for turret motor control. I also designed
and implemented full-system functional testing software using C++ on top of VxWorks for PowerPC
architectures.
Student Intern Jun. 2005–Jul. 2005
Embedded software development
• I developed embedded control software for a night vision camera using C++ and PIC assembly on a
PIC microcontroller.
Peer-reviewed Publications
[1] Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures. Baris
Kasikci, Benjamin Schubert, Cristiano Pereira, Gilles Pokam, and George Candea. Symp. on Operating
Systems Principles (SOSP), Monterey, CA, October 2015.
[2] Failure Sketches: A Better Way to Debug. Baris Kasikci, Benjamin Schubert, Cristiano Pereira, Gilles
Pokam, Madanlal Musuvathi, and George Candea. Workshop on Hot Topics in Operating Systems
(HotOS), Kartause Ittingen, Switzerland, May 2015.
[3] Automated Classification of Data Races Under Both Strong and Weak Memory Models. Baris Kasikci,
Cristian Zamfir, and George Candea. ACM Transactions on Programming Languages and Systems
(TOPLAS), May 2015.
[4] Efficient Tracing of Cold Code Via Bias-Free Sampling. Baris Kasikci, Thomas Ball, George Candea,
John Erickson, and Madanlal Musuvathi. USENIX Annual Technical Conf. (USENIX ATC), Philadelphia,
PA, June 2014.
[5] Lockout: Efficient Testing for Deadlock Bugs. Ali Kheradmand, Baris Kasikci, and George Candea.
5th Workshop on Determinism and Correctness in Parallel Programming (WoDet), Salt Lake City, UT,
March 2014.
[6] RaceMob: Crowdsourced Data Race Detection. Baris Kasikci, Cristian Zamfir, and George Candea.
Symp. on Operating Systems Principles (SOSP), Farmington, PA, November 2013.
[7] Automated Debugging for Arbitrarily Long Executions. Cristian Zamfir, Baris Kasikci, Johannes
Kinder, Edouard Bugnion, and George Candea. Workshop on Hot Topics in Operating Systems (HotOS),
Santa Ana Pueblo, NM, May 2013.
[8] CoRD: A Collaborative Framework for Distributed Data Race Detection. Baris Kasikci, Cristian
Zamfir, and George Candea. Workshop on Hot Topics in System Dependability (HotDep), Hollywood,
CA, October 2012.
[9] Data Races vs. Data Race Bugs: Telling the Difference with Portend. Baris Kasikci, Cristian Zamfir,
and George Candea. Intl. Conf. on Architectural Support for Programming Languages and Operating
Systems (ASPLOS), London, UK, March 2012.
[10] Scalable Modeling of Software Product Line Variability. Baris Kasikci and Semih Bilgen. Workshop on
Scalable Modeling Techniques for Software Product Lines (SCALE), San Francisco, CA, August 2009.
Talks
Automated Root Cause Diagnosis of In-Production Failures
• Symposium on Operating Systems Principles (SOSP) Oct. 2015
• Intel Corp. Sep. 2015
• Google Sep. 2015
• VMware Inc. Sep. 2015
Failure Sketches: A Better Way to Debug
• EcoCloud Annual Event Jun. 2015
• Hot Topics in Operating Systems (HotOS) May 2015
Efficient Tracing of Cold Code via Bias-Free Sampling
• USENIX Annual Technical Conference (USENIX ATC) Jun. 2014
Lockout: Efficient Testing for Deadlock Bugs
• Workshop on Determinism and Correctness in Parallel Programming (WoDet) Mar. 2014
RaceMob: Crowdsourced Data Race Detection
• Symposium on Operating Systems Principles (SOSP) Oct. 2013
• EPFL Systems Seminar Oct. 2013
CoRD: A Collaborative Framework for Distributed Data Race Detection
• Workshop on Hot Topics in System Dependability (HotDep) Oct. 2012
Data Races vs. Data Race Bugs: Telling the Difference with Portend
• International Conference on Architectural Support for Programming Languages and
Operating Systems (ASPLOS) Mar. 2012
How to Build Reliable Software?
• Seminar talk to the incoming undergraduate students at EPFL Sep. 2011
Professional Service
Reviewer
Transactions on Software Engineering 2015
Transactions on Software Engineering and Methodology 2015
PC Member
International Symposium on Software Testing and Analysis, Artifact Evaluation Committee 2014
Shadow PC Member
EuroSys Conference on Computer Systems (EuroSys) 2013, 2015
External Reviewer
Conference on Innovative Data Systems Research (CIDR) 2013
Intl. Conf. on Dependable Systems and Networks (DSN) 2011, 2013
EuroSys Conference on Computer Systems (EuroSys) 2011, 2012
Workshop on Hot Topics in Operating Systems (HotOS) 2011, 2013
USENIX Annual Technical Conference (USENIX ATC) 2011
Symposium on Cloud Computing (SOCC) 2012
Symp. on Operating Systems Principles (SOSP) 2011, 2013
Intl. SPIN Workshop on Model Checking of Software (SPIN) 2011
Committee Member
EPFL Doctoral School of Computer and Communication Sciences Audit Committee 2015
Professional Membership
ACM: student member
USENIX: student member
EuroSys: student member
Teaching Assistantship
Principles of Computer Systems (graduate level, EPFL) 2014
Software Engineering (3rd year undergraduate level, EPFL) 2011, 2012
In 2012, I was the head teaching assistant
Programming II (1st year undergraduate level, EPFL) 2010
Research Mentoring
Lisa Zhou (1st year Master’s) Sep. 2015–present
• Lisa and I are working on using hardware support for improving the security of software systems. In
that regard, Lisa and Benjamin (see below) are building a framework for reproducing security bugs
in large applications (e.g., Chrome).
Benjamin Schubert (3rd year undergraduate) Feb. 2015–present
• Benjamin and I have been working on a framework that enables reliably reproducing failures in
systems software like Apache and MySQL. We used this framework to evaluate my Gist work on
root cause diagnosis. We are now extending this framework to encompass security vulnerabilities.
Ali Kheradmand (3rd year undergraduate) Jul. 2013–Sep. 2013
• Ali and I worked on the Lockout project and developed a technique to systematically perturb program
executions (without modifying program semantics) to increase the probability of deadlock
manifestation. Ali is currently pursuing his Ph.D. at UIUC.
Radu Coman (Master’s thesis) Jan. 2012–Sep. 2012
• Radu and I surveyed common concurrency bug patterns in open source software. After we identified
data races as a common bug pattern among the 100 bugs we looked at in Google Code, we built a
static data race detector, which I used in my RaceMob project. Radu is currently a senior software
engineer at Ixia.
Languages
English: fluent
French: fluent
Turkish: native
German: beginner
References
Available upon request
